Designing for Scale: How eG Enterprise Manages Millions of Metrics with AIOps-driven Self-Monitoring

Customers evaluate a modern observability and monitoring solution by the ROI they get, and self-monitoring capabilities ultimately improve both scalability and quality. The value of any observability solution lies in its ability to proactively detect and alert customers to issues before they cause a business-impacting outage. IT infrastructures and applications can fail in many different ways. A runaway process, a memory leak in an application, an application that fails to release database connections, or a misconfigured system can all result in IT outages, so it is imperative that a modern observability solution collects and analyzes a wide range of metrics. For a single system, thousands of metrics may have to be collected and analyzed in real time. For a large infrastructure with thousands of systems and applications, it is no surprise that eG Enterprise has to collect and analyze millions of metrics.

Figure 1: eG Enterprise collecting over 12 million metrics in a real-world infrastructure

Figure 1 shows a snapshot of the eG Enterprise console of a production deployment managing over 6,000 components. 12 million measurements is a lot of metrics, logs, traces and tests! It is far more than a human operator can process or oversee, and it is only possible because of the powerful AIOps engine at the heart of eG Enterprise. Without AIOps capabilities, the burden of configuring thresholds for each metric, manually setting alerts and tracking them over time falls on the human administrator, which is an impossible job!

Scaling an observability platform to process tens of millions of events, metrics, logs, and traces daily is no small feat. eG Enterprise achieves this with a combination of efficient design principles, AIOps-powered automation, and self-monitoring capabilities that ensure continuous optimization.

We also have to pay considerable attention to how this data is processed and presented as human-consumable overviews. This blog covers how eG Enterprise scales to collect, analyze and report on millions of metrics.

Step 1: Design to Minimize Inefficiencies

eG Enterprise uses a distributed monitoring approach to scale. As the target infrastructure grows, you can add more “remote” monitors for agentless monitoring of the infrastructure. Multi-threading is used to ensure that the monitoring can scale as required. You can also configure the remote monitors to operate in a highly available cluster to avoid a single point of failure.
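
As a rough illustration of how multi-threading lets one remote monitor cover many targets, here is a minimal Python sketch. It is not eG Enterprise code: `poll_target` and `poll_all` are hypothetical names, and the "check" is a stand-in for a real agentless query.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_target(target):
    """Hypothetical stand-in for one agentless check against a device or VM."""
    # A real remote monitor would query the target over the network here.
    # This placeholder just derives a number from the target's name.
    return target, len(target)


def poll_all(targets, max_workers=8):
    """Poll every target concurrently and return {target: value}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs poll_target on worker threads and preserves input order.
        return dict(pool.map(poll_target, targets))


if __name__ == "__main__":
    print(poll_all(["router-1", "vm-042", "switch-7"]))
```

Because most of the time in a check like this is spent waiting on the network, a thread pool lets a single monitor keep many checks in flight at once.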

Although the management server is the heart of the eG Enterprise system, where metrics are stored and analyzed, much of the actual analysis of the metrics is performed by the agents. This distributes the workload across the infrastructure, minimizing the chance of the eG Enterprise manager becoming a bottleneck.

eG Enterprise also uses a concept of management by exception. Its state analysis and alarm correlation engine only analyzes problematic metrics and correlates them, thereby minimizing the analysis effort that it performs.
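
The idea of management by exception can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Measurement` type, its fixed `threshold` field and both function names are hypothetical, and the real correlation engine is certainly more sophisticated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    component: str
    metric: str
    value: float
    threshold: float  # hypothetical upper bound for "normal"


def exceptions_only(measurements):
    """Keep only the measurements that breach their threshold."""
    return [m for m in measurements if m.value > m.threshold]


def group_by_component(abnormal):
    """Group abnormal metrics per component so related symptoms sit together."""
    grouped = {}
    for m in abnormal:
        grouped.setdefault(m.component, []).append(m.metric)
    return grouped


if __name__ == "__main__":
    samples = [
        Measurement("db-1", "cpu_pct", 97.0, 90.0),
        Measurement("db-1", "disk_pct", 40.0, 85.0),
        Measurement("web-1", "latency_ms", 120.0, 500.0),
    ]
    print(group_by_component(exceptions_only(samples)))
```

Only the single breaching measurement reaches the grouping step here, so the expensive correlation work grows with the number of problems rather than with the number of metrics collected.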

The data processing, analysis and user interface components are segregated from one another, ensuring minimal interference between them. Modern data processing mechanisms such as data partitioning are used to ensure that even if the database grows to many terabytes, the monitoring system keeps functioning at its peak.
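
To show why partitioning helps at this size, here is a small Python sketch of time-based partitioning. The daily granularity, the `metrics_YYYY_MM_DD` naming scheme and the function names are assumptions made for illustration; the article does not describe the actual partitioning layout.

```python
from datetime import date, timedelta


def partition_name(day):
    """Name of the (hypothetical) daily partition that holds this day's metrics."""
    return f"metrics_{day:%Y_%m_%d}"


def expired_partitions(existing_days, today, retention_days):
    """List partitions that are entirely older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [partition_name(d) for d in sorted(existing_days) if d < cutoff]


if __name__ == "__main__":
    days = [date(2024, 1, 1), date(2024, 1, 9)]
    print(expired_partitions(days, today=date(2024, 1, 10), retention_days=7))
```

With a layout like this, old data can be removed by dropping whole partitions instead of deleting rows one by one, and queries for recent data only need to touch recent partitions.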

To minimize data collection and storage overheads, the default models of the system are designed with scalability in mind. For example, there is no point sampling Azure billing metrics every 3 seconds as they are only updated a few times a day.
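
A minimal Python sketch of per-test sampling intervals follows. The test names, the interval values and the `due_tests` function are hypothetical; the point is only that slow-changing metrics get long intervals by default.

```python
def due_tests(intervals, last_run, now):
    """Return the names of tests whose sampling interval has elapsed."""
    return sorted(
        name
        for name, every in intervals.items()
        # A test that has never run is always due.
        if now - last_run.get(name, float("-inf")) >= every
    )


if __name__ == "__main__":
    # Intervals in seconds (illustrative values only).
    intervals = {"cpu_usage": 60, "azure_billing": 6 * 3600}
    last_run = {"cpu_usage": 0, "azure_billing": 0}
    print(due_tests(intervals, last_run, now=120))
```

In this example only `cpu_usage` is due after two minutes; the billing test waits for its six-hour interval, so no collection or storage effort is spent on values that cannot have changed.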

While the scalability of the management system and the agents is important, it is equally important for the communication between them to be efficient. Inefficiencies can overload the network, impacting the performance of the very infrastructure that the monitoring system is supposed to oversee. eG Enterprise leverages caching where appropriate to ensure that configurations are not repeatedly communicated. Compression techniques are used to minimize bandwidth usage when metrics are sent for storage, accelerating performance and lowering costs.
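
The two techniques mentioned here, caching and compression, can be sketched with the Python standard library. This is an assumption-laden illustration, not the eG Enterprise protocol: the digest-based cache check, the JSON payloads and every function name are hypothetical.

```python
import hashlib
import json
import zlib


def config_digest(config):
    """Stable fingerprint of a configuration dictionary."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def config_for_agent(config, agent_digest):
    """Return None if the agent's cached copy is current, else the full config."""
    return None if agent_digest == config_digest(config) else config


def pack_metrics(samples):
    """Serialize and compress a batch of metric samples before sending."""
    return zlib.compress(json.dumps(samples).encode("utf-8"))


def unpack_metrics(blob):
    """Reverse of pack_metrics, as the receiving side would do."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    cfg = {"tests": {"cpu_usage": 60}}
    print(config_for_agent(cfg, config_digest(cfg)))  # cached copy is current
    batch = [{"metric": "cpu_usage", "value": 12.5}] * 500
    raw_size = len(json.dumps(batch).encode("utf-8"))
    print(raw_size, len(pack_metrics(batch)))
```

The agent sends only a short digest of its cached configuration, and the full configuration travels over the network only when it has changed. Metric batches tend to be repetitive, which is why general-purpose compression shrinks them well.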

eG Enterprise self-monitors every deployment, so we get continual feedback and a deep understanding of how the platform behaves under different customer use cases. The insights from self-monitoring are fed into iterations of the solution’s architecture.

Step 2: Scale Horizontally as well as Vertically

For a scalable architecture, it is recommended to have the management server and database on separate systems. In a cloud deployment, a database service like Amazon RDS or Azure SQL can be used as well. Compute and data processing resources in the eG Enterprise architecture can be distributed and scaled as required, by allocating additional compute, memory and storage resources. A sizing calculator is available to compute the resources needed for the management server and database.

The “remote” monitors used for agentless monitoring can also be scaled up if required to monitor hundreds of devices or virtual machines. Multi-threading is used extensively on the manager and agents to achieve scale up.

Horizontal scaling is supported by enabling multiple eG Managers to operate in parallel and report to a central Super Manager. This hierarchical setup allows organizations to monitor expansive IT environments efficiently by distributing workloads across several eG Managers, each handling specific segments of the infrastructure. The Super Manager aggregates data from all eG Managers, providing a unified view of performance and health metrics without overloading individual components.

Horizontal scaling (also described as scaling out) is fundamental to how cloud platforms scale. We cover some history and background on the fundamentals of scaling methodologies here: Understanding Scale Up vs. Scale Out – And Why You Need to Understand Scale Up vs. Scale Out to Be a Nutanix or HCI Guru | eG Innovations.

Note that data collection, analysis and retention are still performed by the individual managers, to ensure that the Super Manager does not become a bottleneck. The Super Manager configuration can be used for scalability, to handle tens of thousands of systems and hundreds of thousands of devices and VMs. This architecture can also be used for efficiency when handling geographically distributed environments. The agents can report to the manager local to them, ensuring that data does not leave their local region and minimizing bandwidth usage. At the same time, this architecture allows for distributed administration: each region administers and manages its local deployment, and the Super Manager is mainly for consolidated monitoring across regions.
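
The division of labour in this hierarchy can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the state names, the per-region dictionaries and both function names are hypothetical, and a real deployment exchanges far richer data.

```python
from collections import Counter


def manager_summary(component_states):
    """Summarize one manager's components as a count per health state."""
    return Counter(component_states.values())


def consolidated_view(summaries_by_region):
    """Merge per-manager summaries into one cross-region overview."""
    total = Counter()
    for summary in summaries_by_region.values():
        total.update(summary)
    return dict(total)


if __name__ == "__main__":
    emea = manager_summary({"web-1": "normal", "db-1": "critical"})
    apac = manager_summary({"web-7": "normal", "vm-3": "normal"})
    print(consolidated_view({"emea": emea, "apac": apac}))
```

Each manager keeps its detailed data locally and shares only a compact summary, so the consolidating tier does work proportional to the number of managers, not the number of raw measurements.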

Step 3: Code-level Efficiency – Adhere to Best Practice and then Self-Monitor

eG Enterprise makes extensive use of Java technologies, and we adhere to Java best practices for scalability and efficiency. Best practices we follow are well documented in our earlier blogs and include:

JVM tuning, optimizing garbage collection and leveraging thread pooling to avoid bottlenecks: https://www.eginnovations.com/blog/how-to-enhance-performance-java-applications/
Java coding best practices for scalability: https://www.eginnovations.com/blog/6-tips-fast-java-applications/
How to enhance database access performance from Java applications: https://www.eginnovations.com/blog/java-application-performance-tips/
Java application performance monitoring best practices, including detecting JVM-related issues such as memory leaks or CPU spikes: see the Java Application Performance Monitoring White Paper.
Configuring full stack visibility for proactive monitoring and troubleshooting: How to Get Full-Stack Visibility for Your Java Applications – A Comprehensive Guide | eG Innovations

In How to make Java run faster – 6 Tips | eG Innovations, we covered some common best-practice Java coding advice and showed how simple code optimizations directly affect JVM resource usage and, in turn, cloud computing costs. See Figure 2 below.

Figure 2: A single code change can reduce your CPU usage by 60%, not only saving you money in the cloud but also increasing your capacity to scale.

We also make extensive use of eG Enterprise’s APM capabilities. For instance, using eG Enterprise’s transaction tracing capabilities, we can detect inefficiencies in our code or queries and address them at the earliest opportunity in the development lifecycle.

You can read more of our recent Java best practice blogs here: Java Monitoring Archives | eG Innovations.

Step 4: Enable Complete Visibility and Self-Monitoring

Over the years, we have built extensive self-monitoring capabilities into eG Enterprise. Almost every aspect of eG Enterprise is monitored. This includes its AIOps engine, the database purging process, the integration with ITSM systems like ServiceNow, the database storage, interfaces with email systems, and so on. Self-monitoring of eG Enterprise is enabled by default in every deployment and allows us to monitor the workings of the eG Enterprise management system and our agents in depth. This self-monitoring capability allows our team to pinpoint inefficiencies, identify scaling challenges, and apply timely fixes.

Self-monitoring also helps us troubleshoot issues quickly in customer deployments. For example, from the metrics eG Enterprise collects, we can highlight if a customer’s database is not working well, or if their email system has become slower than normal, thereby impacting eG Enterprise’s performance. This helps our customer support teams troubleshoot issues quickly and resolve them.

Customers can also review the eG Enterprise metrics and detect and resolve issues themselves.

Figure 3: eG Enterprise allows our customers to get deep insights into the computation the eG Manager is performing, even down to details such as how the automatic database rebuild process is performing.

A wonderful thing about eG Enterprise’s APM (Application Performance Monitoring) is that you can use it to monitor all your other IT applications and management tools, beyond just end-user apps. You can get visibility into how efficient all those Java, .NET, Node.js and PHP tools really are. This is particularly important if those apps are being run as native cloud apps backed by cloud resources that you are paying for. We recently covered a case study analyzing the performance and efficiency of a .NET app popular for managing AVD (Azure Virtual Desktop) deployments, showing how an enterprise monitoring tool can help you avoid scaling the database unnecessarily (and paying the additional costs). See: Monitoring and Troubleshooting Nerdio | eG Innovations.

This type of monitoring is also important if you are looking to migrate any applications to a cloud. A good overview of how to leverage APM to shift left in such a migration is given in Shift Left Monitoring: A Pathway to Optimized Cloud Applications | eG Innovations.

Step 5: Leverage AIOps for Continuous Optimization

AIOps automation powers eG Enterprise’s ability to process millions of metrics seamlessly. By analyzing this volume of data in real-time, the platform automatically identifies anomalies, automates root-cause analysis, and can recommend improvements or even automate self-healing remediation. This approach in our self-monitoring ensures that eG Enterprise remains resilient and agile, even within dynamic and ephemeral modern IT systems where auto-scale events are now a normal part of operations (environments such as OpenShift, Kubernetes). Moreover, it ensures that every application and infrastructure component benefits from the same comprehensive analysis – even if that necessitates monitoring 12 million measures!

Figure 4: AIOps means eG Enterprise can build rich topology maps which are overlaid with the diagnostics from the AIOps-powered root-cause analysis to differentiate the primary root-cause from secondary alerts. Here the root-cause issue is with a Java application that is affecting the performance of the IIS Web Server. The emulated client trying to access the IIS server is also affected but the interface guides the operator to the primary issue first.

Importantly, AIOps removes manual effort, which makes a significant difference in large-scale systems where manual deployment is unfeasible. Even within auto-scaled environments, eG Enterprise can ensure day-0 coverage via auto-discovery, auto-deploy and universal agent technologies. Moreover, AIOps auto-baselining combined with domain-aware intelligence means you get metric thresholds, alerting and anomaly detection out of the box. Even at moderate scales, a human operator simply can’t spot problems, let alone anomalies, and must rely on their monitoring and observability tooling to do so proactively.
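
To give a feel for what auto-baselining replaces, here is a deliberately simple Python sketch that derives a threshold band from a metric's own history. This is a generic mean-and-standard-deviation technique chosen for illustration, not the algorithm eG Enterprise uses; the function names and the factor `k` are assumptions.

```python
from statistics import mean, pstdev


def baseline(history, k=3.0):
    """Return a (low, high) band of mean +/- k standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Flag a value that falls outside the band learned from history."""
    low, high = baseline(history, k)
    return value < low or value > high


if __name__ == "__main__":
    cpu_history = [10, 11, 9, 10, 10, 11, 9, 10]
    print(baseline(cpu_history))
    print(is_anomalous(25, cpu_history), is_anomalous(10.5, cpu_history))
```

No one had to type a threshold for this metric: the band comes from the data itself. Applied per metric, the same idea removes the need to hand-configure millions of thresholds, although a production engine would also need to account for time-of-day and seasonal patterns.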

Scalability in Monitoring needs to be Affordable

If you are monitoring 12 million metrics, it needs to be cost-effective. Many observability solutions charge per metric and per alert; Azure Monitor, for example, charges $0.10 per alert per month. As a result, many organizations are unable to scale their monitoring coverage, not because of architectural limitations in their monitoring tools or the infrastructure resources needed to process data at scale, but simply because they can’t afford to turn monitoring on! eG Enterprise differentiates itself by avoiding licensing models that scale unaffordably; see: eG Enterprise IT Monitoring Licensing – Cost-Effective & Flexible. Scalability shouldn’t mean that it is your costs that are scaling!
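
A quick back-of-the-envelope calculation in Python shows how per-alert pricing behaves at this scale. It takes the $0.10 per alert per month figure quoted above at face value and assumes, purely hypothetically, one alert for each of the 12 million metrics; real pricing rules and alert counts will differ.

```python
def monthly_alert_cost_usd(alert_count, cents_per_alert=10):
    """Monthly cost in dollars, computed in whole cents to avoid float drift."""
    return alert_count * cents_per_alert / 100


if __name__ == "__main__":
    # Hypothetical worst case: one alert per monitored metric.
    print(monthly_alert_cost_usd(12_000_000))
```

Under these assumptions the bill would be $1.2 million per month, which illustrates why per-alert pricing can push teams to switch monitoring off rather than extend it.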

Summary

By following the best practices described above and by leveraging the same AIOps-driven insights we offer to customers, eG Enterprise ensures smooth operation of our own solution at scale. Whether monitoring thousands or millions of metrics, the platform remains agile, resilient, and ready to evolve with business needs (and at an affordable cost!).
