A Comprehensive Guide for SREs
Build Foundational Knowledge on How etcd fits into the Kubernetes Ecosystem

etcd is the single source-of-truth data store for the Kubernetes cluster. As a key-value store with advanced features, etcd stores mission-critical Kubernetes data: configuration data, cluster state, and metadata. This information is what allows the Kubernetes cluster to scale and self-heal.

If etcd malfunctions, it can cause failures across the Kubernetes cluster. You won’t be able to deploy changes to the cluster, and business SLAs could be severely impacted.

etcd is also complex because it has several moving parts. etcd’s sub-systems, such as BoltDB, MVCC, Raft, the WAL (Write-Ahead Log) and gRPC, can be the cause of varied performance issues, ranging from leader-election inconsistencies and network partitions to fsync latency and the etcd database running out of space. Each of these could put the stability of the Kubernetes cluster at risk.

This blog (and the rest of the articles in this series) will help you establish the foundational knowledge of etcd you need to craft the right observability and monitoring strategy for top etcd and Kubernetes performance.

In this article, I will walk you through:

  • etcd in Kubernetes – context setting
    • What is etcd and what are its key features?
    • The types of Kubernetes data stored in etcd
    • What are Kubernetes’ requirements of a backing store?
    • What makes etcd an ideal fit for Kubernetes?
  • An outline of etcd’s sub-systems
    • Raft consensus algorithm
    • gRPC for cluster communication
    • Write-Ahead Logging (WAL) & Snapshotting
    • BoltDB
  • High-level overviews of potential issues in etcd
  • Key Takeaways for Site Reliability Engineers (SREs)

What is etcd?

Let’s kick things off with a description that is a bit more digestible than the official etcd definition. I’ve highlighted the essential features of etcd.

etcd (pronounced “ett-see-dee”) is a key-value store for distributed systems such as Kubernetes. etcd itself is distributed across a cluster of machines.

With etcd, you get a reliable way of storing critical data, with strong consistency guarantees, even when machines fail. You also get reactive watch functionality, enabling real-time updates on changes to the data.

etcd is an open source and CNCF project.

The key phrase here is “strong consistency guarantees”, which is exactly what Kubernetes needs to store critical cluster data.
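To make the key-value model concrete, here is a minimal sketch using etcd’s official Go client (go.etcd.io/etcd/client/v3). The endpoint, key, and value are illustrative assumptions, and error handling is kept deliberately simple.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to an etcd endpoint (illustrative address).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Store and read back a key-value pair (illustrative key and value).
	if _, err := cli.Put(ctx, "/demo/config/max-replicas", "5"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "/demo/config/max-replicas")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```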

The Benefits of etcd

There is a lot going on in the above definition. Let’s unpack it into its individual parts and focus on the benefits as they pertain to Kubernetes.


Figure 1: Key features and properties of etcd

etcd has a robust architecture tailored to meet the demands of distributed systems such as Kubernetes. Let’s look at each piece of functionality and their benefits in greater detail.

  • Key-Value Store
    • Explanation: etcd stores data as key-value pairs, making it simple to store and retrieve Kubernetes data.
    • Benefits: A simple, flat key-value data model that allows for efficient storage and retrieval of data. Each resource in a Kubernetes cluster has its own key in the database.
  • Distributed, Replicated
    • Explanation: etcd is spread across multiple machines, ensuring no single point of failure. etcd operates with a leader-follower model, and every etcd node has access to the complete data store and is eligible to become the primary data source.
    • Benefits: Fault tolerance, scalability, data redundancy, high availability and a consistent data view. With no single point of failure, you get robustness against hardware failures and network partitions.
  • Strongly Consistent
    • Explanation: Ensures strict “serializability” and a global ordering of events. Every Kubernetes client sees the latest data after a successful write operation.
    • Benefits: A reliable and ordered view of events, eliminating stale or outdated data reads.
  • Reliable
    • Explanation: Ensures all nodes agree on a single source of truth. Data integrity and cluster agreement are guaranteed by the Raft consensus algorithm.
    • Benefits: etcd remains operational even when machines fail. You could have hardware failures or network partitions and etcd will continue to function as long as there is quorum. (Quorum refers to the minimum number of active etcd members needed for consensus to modify the cluster state.)
  • Watch Notifications (the killer feature of etcd)
    • Explanation: Real-time data change notifications. etcd monitors changes to specific keys, enabling real-time, event-driven workflows in Kubernetes.
    • Benefits: Enables real-time updates and notifications on data changes, allowing Kubernetes to react immediately to cluster state changes. (A short watch sketch follows this list.)
  • Multi-Version Concurrency Control (MVCC)
    • Explanation: etcd maintains multiple versions of a key-value pair, with each revision representing a different version of the data.
    • Benefits: Avoids data conflicts in a distributed environment and allows multiple clients to access and modify data simultaneously.
  • Fast
    • Explanation: High write performance. etcd has been benchmarked at 30,000 writes per second under heavy load with a standard cluster configuration.
    • Benefits: etcd is written in the Go language, which is consistent with the Kubernetes technology stack, has low resource consumption, and is extremely easy to deploy. However, SREs have to ensure fast disks such as SSDs and allocate enough memory for etcd to cache key-value data and track watchers.
  • Open Source
    • Explanation: The etcd project is open source; etcd is also a graduated Cloud Native Computing Foundation (CNCF) project.
    • Benefits: Community involvement and transparency in development, and a mature product with CNCF’s backing among other cloud-native technologies.
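To make the watch feature above concrete, here is a brief sketch, again using the Go client, that subscribes to every change under an illustrative key prefix. etcd pushes events over the returned channel as soon as the data changes, so there is no polling; the endpoint and prefix are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Watch every key under an illustrative prefix. The call returns a
	// channel; etcd pushes events to it as soon as the data changes.
	watchChan := cli.Watch(context.Background(), "/demo/pods/", clientv3.WithPrefix())
	for wresp := range watchChan {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```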

What’s with the name?

The name “etcd” is a clever nod to the Linux directory structure, with a twist. In UNIX, the “/etc” directory is the place for system configuration files.

etcd creators tacked on a “d” to form “etcd”, to signify “distributed”. This tells us that the intention of etcd is to manage configuration data across a distributed system (such as Kubernetes).

We’ve explored the key functional features of etcd. Let’s now turn our focus to how Kubernetes utilizes etcd to organize and manage cluster data.

3 Types of Kubernetes Data Persisted in etcd

As a container orchestrator, Kubernetes needs to persist a variety of data types. Specifically, we will delve into the three essential types of data stored within etcd — state data, configuration data and metadata.

Figure 2: How Kubernetes state data, configuration data and metadata feed into each other

All of these are stored in key-value format. Let’s unpack each of these categories.

  • State Data
    • Description: Reflects the current status of cluster resources. Crucial for monitoring and managing the health and performance of the Kubernetes cluster.
    • Examples: Pod statuses, node availability, service endpoints.
  • Configuration Data
    • Description: Defines the desired configurations (behaviors and properties) for cluster resources. Kubernetes uses declarative configuration management: you simply state the desired state of the cluster.
    • Examples: Deployment configurations, network policies, resource limits.
  • Metadata
    • Description: Provides additional information about resources. Allows for easy identification, selection, and organization of resources based on their attributes and behavior.
    • Examples: Labels, annotations, resource names such as UUIDs, and controller details.
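To tie these three categories together, below is a purely illustrative Go sketch that bundles metadata, a desired spec, and a current status into one object and stores it under a single key. In a real cluster the API server writes such objects for you (protobuf-encoded by default); the struct, key path, and JSON encoding here are assumptions made for readability.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// A toy object combining the three categories of data discussed above.
type demoDeployment struct {
	Metadata map[string]string `json:"metadata"` // labels, annotations, names
	Spec     struct {
		Replicas int `json:"replicas"` // desired (configuration) data
	} `json:"spec"`
	Status struct {
		ReadyReplicas int `json:"readyReplicas"` // current (state) data
	} `json:"status"`
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	d := demoDeployment{Metadata: map[string]string{"name": "web", "team": "payments"}}
	d.Spec.Replicas = 5
	d.Status.ReadyReplicas = 3

	raw, err := json.Marshal(d)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	// Illustrative key; not the path the API server actually uses.
	if _, err := cli.Put(ctx, "/demo/deployments/default/web", string(raw)); err != nil {
		panic(err)
	}
	fmt.Println("stored desired and current state under /demo/deployments/default/web")
}
```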

Below is a typical YAML manifest that illustrates the kinds of Kubernetes object data that are stored in etcd.


Figure 3: Example of a Kubernetes deployment manifest. Different parts of this data are stored in etcd as key-value pairs.

To give you a sense of the wide variety of data that’s stored in etcd, I’ve outlined the core purpose of each Kubernetes object along with the way the API server interacts with etcd for it.

  • Pods: Run containers within the cluster. The API server watches for changes in Pod status through etcd and updates scheduling accordingly.
  • Services: Group pods and define network access policies. Service details are stored in etcd for service discovery and load-balancing configurations.
  • Ingress: Manages external HTTP access to services. Ingress configurations are stored in etcd and used by the API server to route external traffic.
  • ConfigMaps: Store configuration data as key-value pairs. Configurations are retrieved from etcd by the API server to set up environment-specific application settings.
  • Secrets: Securely store sensitive information. Sensitive data is accessed through the API server, which pulls from etcd, ensuring secure handling.
  • Namespaces: Isolate and segment cluster resources. Namespace definitions in etcd help the API server enforce access controls and quotas.
  • Deployments: Maintain and update the desired state of applications. Deployment states are stored in etcd, with the API server orchestrating updates and rollbacks.
  • StatefulSets: Manage stateful applications. The API server uses etcd to track the unique identifiers and state of each pod in a StatefulSet.
  • DaemonSets: Ensure pods run on all or certain nodes. DaemonSet configurations are kept in etcd, guiding the API server to schedule pods on specific nodes.
  • ReplicaSets: Keep a stable set of replica pods running. Replica counts and pod statuses are synced with etcd, allowing the API server to manage scaling.
  • Jobs: Execute batch jobs to completion. Job definitions and completion status are tracked in etcd for the API server to manage execution.
  • CronJobs: Schedule periodic jobs. Scheduling data is stored in etcd, and the API server triggers CronJobs as configured.
  • PersistentVolumes: Provide storage independent of the pod lifecycle. Storage details are held in etcd for the API server to manage persistent storage allocations.
  • PersistentVolumeClaims: Request storage for users. Claims are tracked in etcd, with the API server matching them to available PersistentVolumes.
  • Roles: Define namespace-specific permissions. Role definitions are stored in etcd for the API server to enforce access within namespaces.
  • ClusterRoles: Define cluster-wide permissions. ClusterRole definitions in etcd inform the API server’s cluster-wide permission enforcement.
  • RoleBindings: Assign roles to users within namespaces. Binding details are kept in etcd and used by the API server to grant user permissions.
  • ClusterRoleBindings: Assign cluster roles to users. The API server reads from etcd to bind users to cluster-wide roles.
  • NetworkPolicies: Specify pod communication rules. Policies are stored in etcd, with the API server applying them to control traffic flow.
  • Endpoints: Map IP addresses and ports to services. Endpoint information in etcd is used by the API server for network connectivity to services.
  • CRDs: Extend Kubernetes resources. CRD specifications are stored in etcd, with the API server using them to extend the Kubernetes API.

Each object type serves a unique role, ranging from operational configurations to security measures, and plays a critical part in the orchestrated management of containerized applications in a Kubernetes environment.

As a key-value store, etcd is an ideal fit for Kubernetes to store these varied types of objects. At its core, etcd’s responsibility is quite simple (a short sketch of browsing these keys follows the list):

  • Create and delete key/value pairs
  • Watch keys (more on this in a later section).
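If you have direct access to a cluster’s etcd, you can browse these per-resource keys yourself. The sketch below lists the key names under the /registry/ prefix that kubeadm-style clusters use by default; the endpoint is illustrative, a real control plane normally requires client TLS certificates (omitted here), and the values are protobuf-encoded by the API server, so only the key names are printed.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Illustrative endpoint; production etcd usually requires client TLS.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List only the key names under the default Kubernetes prefix.
	// Values are protobuf-encoded objects written by the API server.
	resp, err := cli.Get(ctx, "/registry/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key)) // e.g. /registry/pods/<namespace>/<pod-name>
	}
}
```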

Now that we have an understanding of the types of Kubernetes objects that need to be persisted, let’s take a look at Kubernetes architecture and where etcd fits into it.

What does Kubernetes expect of a data store?
Requirement #1: Strong consistency. Kubernetes needs to know exactly how many pods to scale and cannot afford to rely on outdated data. etcd prioritizes consistency over latency and availability, ensuring accurate and reliable state information to manage service scales effectively.
Requirement #2: “Watch” and change notification. Kubernetes requires a “watch” feature to monitor cluster changes like scaling, configuration updates, or failure recovery. Polling the data store frequently would be inefficient. When a change is detected in etcd—such as a new Pod being scheduled, a deployment being updated, or a node changing its status—etcd sends a real-time update to Kubernetes. Kubernetes can get notified of key-value store modifications, enabling prompt responses to events like pod scheduling or node status updates.
Requirement #3: Availability. etcd helps the Kubernetes control plane remain operational even when individual control-plane or worker nodes fail, ensuring that existing workloads on worker nodes continue to serve requests.
etcd can be configured in a multi-node setup where data is replicated across several nodes. This setup not only helps in achieving high availability but also in preserving data integrity and consistency. If one etcd node goes down, the others continue to serve the cluster’s data needs without loss of information.

Requirement #4: High volume reads and writes. etcd allows reads from any cluster member, not just the leader, optimizing read performance. There is also the ability to cache reads at the API server level. From a write perspective, Kubernetes needs the ability to manage write operations to prevent slowdowns during critical updates, maintaining cluster stability. While writes are slower due to the Raft-driven consensus process, etcd ensures they are reliable.
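Requirement #4 notes that reads can be served by any member. In the Go client this is the difference between the default linearizable read and a serializable read, which any member can answer from its local data at the cost of possibly returning slightly stale results. Endpoints and the key are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"}, // illustrative
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Default read: linearizable. It goes through consensus, so it always
	// reflects the latest committed write, at a small latency cost.
	strong, err := cli.Get(ctx, "/demo/config/max-replicas")
	if err != nil {
		panic(err)
	}

	// Serializable read: served from the local member's data without
	// consensus. Faster and cheaper, but may be slightly stale.
	fast, err := cli.Get(ctx, "/demo/config/max-replicas", clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}

	fmt.Println("linearizable read at revision:", strong.Header.Revision)
	fmt.Println("serializable read at revision:", fast.Header.Revision)
}
```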

Where Does etcd Fit into the Kubernetes Architecture?

Kubernetes is a container orchestrator. Let’s start by looking at a high-level conceptual view of Kubernetes. Later, we’ll expand it into a full architectural view.

Schematic of how worker Kubelets are controlled via etcd within the control plane

Figure 4: The control plane and worker nodes in Kubernetes

At a very high level, you can divide Kubernetes into 3 categories:

  • Hardware Infrastructure: Comprises the physical and virtual resources necessary for the cluster, including compute servers, network infrastructure, storage systems, and a container registry.
  • Kubernetes Worker Nodes: Serve as the fundamental computational units within a Kubernetes cluster, executing applications and managing workloads.
  • Kubernetes Control Plane (aka Kubernetes Master): Acts as the command center for the cluster, encompassing the API server, scheduler, controller manager, and various control loops that regulate state and manage resources.

The master node acts as the control plane, managing the state of the cluster. The worker nodes host the actual applications in the form of containers, managed by the master.

There are four key sub-systems to understand within a Kubernetes cluster:

  • Within the worker and control plane nodes:
    • Kubelet: This is a node-specific process, or agent, running on each node. Kubelet looks at the desired configuration and sets up pods, attaching containers to the network and mounting volumes.
  • Exclusively within the control plane nodes:
    • Scheduler: Responsible for allocating pods to nodes based on resource availability and other scheduling constraints.
    • Controller Managers (KCM and CCM): The Kubernetes Controller Manager (KCM) oversees a collection of controllers. These controllers are background processes that handle routine tasks in the cluster, ensuring that the current state of the system matches the desired state specified by the user. The Cloud Controller Manager (CCM) manages cloud provider-dependent logic, such as managing services and routes within a Kubernetes cluster.
    • API Server: Serves as the frontend to the cluster’s shared state, allowing users and internal components to interact with and configure Kubernetes resources. The API server acts as the “glue” or “hub”, facilitating communication between control plane processes and nodes, and connects the internal controllers to the kubelet. The API server is the only component that talks to etcd, as you will see below.

With the above in mind, we can now look at a more technical Kubernetes logical architecture diagram.


Figure 5: Kubernetes cluster components adapted from Kubernetes.io

As you can see, the API server is the only component on the control plane that communicates with etcd. As an SRE, you must monitor the health, availability and performance of both API server and etcd.

Kubernetes needs a strongly consistent and reliable data store to manage distributed containerized workloads. These tasks can be visualized as follows.

etcd acts as a centralized repository for all cluster information. This includes tracking the current and desired states, resource configurations, and runtime data. For each task, etcd holds vital information and via the API server, enables Kubernetes to achieve these tasks.

Kubernetes Task How etcd Supports This Task
Configuration etcd stores all configuration data, ensuring consistent settings across the entire cluster. This includes network configurations, resource limits, and other operational parameters.
Deployment etcd tracks the desired state of applications, including versions and replicas. It updates this state as new rollouts occur, enabling Kubernetes to automate deployment processes effectively.
Service Discovery etcd holds records of all services and their endpoints, which Kubernetes queries to enable dynamic discovery and connectivity among distributed applications.
Load Balancing By storing the status and performance metrics of pods and nodes, etcd helps Kubernetes distribute loads efficiently, matching resource requests with the current capacity of the cluster.
Job Scheduling etcd maintains a queue of scheduled jobs and their statuses, allowing Kubernetes to plan and execute jobs intelligently based on available resources and current cluster needs.
Health Monitoring etcd continuously receives and updates health metrics from the Kubernetes control plane, enabling quick detection of failed components and triggering of automatic recovery processes.

Is etcd the only game in town for Kubernetes?

etcd has long been the default choice for Kubernetes’ data storage needs. However, new entrants are emerging as viable substitutes.

  • Kine: an etcd API shim that supports various SQL databases such as MySQL, PostgreSQL, SQLite, and Dqlite. PostgreSQL is a particularly interesting alternative: both PostgreSQL and etcd prioritize I/O performance in their storage design, and both systems optimize batch operations for sequential disk reads and writes. A good comparison is authored by Jinhua Luo.
  • Dqlite/SQLite: MicroK8s uses Dqlite (distributed SQLite) as its default datastore instead of etcd. Dqlite extends SQLite with high availability and network-replication features. However, the community has reported problems with production workloads.

Kubernetes and etcd: Operational Scenarios for SREs

etcd maintains the desired state and configuration of the entire cluster. Any changes or updates to the cluster state, whether initiated by administrators or automated controllers, are first recorded in etcd. This ensures that the actual state of the cluster can be reconciled with the desired state stored in etcd.

Let’s explore some real-world scenarios where etcd’s functionalities are pivotal within a Kubernetes context.

Scenario Description Real-world Example
Pod Creation When an SRE creates a new pod using the Kubernetes API, the request is received by the Kube API Server, which validates the request and stores the desired state in etcd. An SRE deploys a new microservice by creating a pod manifest. The Kubernetes API Server stores the pod specification in etcd, and the cluster’s scheduler uses this data to schedule the pod onto a suitable node.
Scaling Deployments When an SRE scales a deployment, such as increasing the number of replicas, the Deployment controller updates the desired state in etcd, ensuring that the actual state converges to match the desired state. In response to increased traffic, an SRE scales up a web application deployment from three replicas to five replicas. The Deployment controller updates the desired state in etcd, and the cluster automatically creates additional pods to meet the new desired state.
Cluster Configuration All configuration data for the Kubernetes cluster, including information about nodes, pods, services, and namespaces, is stored in etcd. Any changes to the cluster configuration are reflected in etcd. An SRE updates the network policy for a specific namespace using kubectl. The Kube API Server validates the request and updates the corresponding configuration in etcd, ensuring that the new network policy is enforced across the cluster.
Cluster State Recovery In the event of a cluster failure or restart, Kubernetes relies on etcd to reconstruct the cluster state. All cluster data stored in etcd is used to bring the cluster back to its last consistent state. Following a sudden power outage, the Kubernetes cluster restarts. The Kube API Server retrieves the cluster state from etcd, allowing the cluster to resume its operations from where it left off before the outage.

4 Key Sub-systems of etcd

etcd meets Kubernetes’ requirements through its robust architecture, which is built around four key sub-systems.

Figure 6: The four sub-systems of etcd and how they interact: gRPC, the Raft engine, the WAL, and BoltDB (bbolt)

Each of these sub-systems plays a vital role in ensuring that etcd is not just a data store, but a dynamic, reliable, and integral part of the Kubernetes ecosystem. In this section, we’ll delve into these four sub-systems, exploring how they contribute to the performance, reliability, and consistency of etcd, and by extension, the entire Kubernetes cluster.

Sub-system #1: Raft Consensus Algorithm

TL;DR: Data consistency across the cluster.

Raft is a consensus algorithm designed for high availability and fault tolerance in distributed systems such as etcd. etcd uses Raft to ensure data consistency across multiple nodes. Raft manages leadership elections and replicates entries across all nodes to maintain a highly available system.

Key problem that Raft solves: Nodes and network failures happen all the time in distributed systems. Loss of node machines or network should not result in loss of critical service configurations and state data. Raft ensures that a consistent view of the etcd data is available across the cluster, even in the presence of node failures or network partitions.

How Raft Solves This Problem: Raft handles this through leader elections and log replication. etcd has the concept of a leader and one or more followers. Ideally, all changes to the database are made by a single leader within a cluster. When the leader fails, Raft triggers a new election to choose a new leader, which then manages the data replication process to ensure all active nodes are synchronized and up to date.

Raft has the concept of a “proposal”: a request that is sent to the Raft cluster for processing. The cluster receives proposals from members, commits them, and applies them to the local store.
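A quick way to internalize quorum, the majority mentioned earlier, is to compute it for typical cluster sizes. The snippet below is plain arithmetic for illustration and is not part of any etcd API.

```go
package main

import "fmt"

func main() {
	// Quorum is a strict majority of members: floor(n/2) + 1.
	// The cluster stays writable as long as quorum members are healthy.
	for _, n := range []int{1, 3, 5, 7} {
		quorum := n/2 + 1
		tolerated := n - quorum
		fmt.Printf("%d-member cluster: quorum=%d, tolerates %d failure(s)\n", n, quorum, tolerated)
	}
}
```

This is why production etcd clusters typically run with three or five members: a single-member cluster tolerates no failures at all, while even-sized clusters add cost without improving fault tolerance.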

Sub-system #2: Write-Ahead Log (WAL)

TL;DR: Ensures data integrity and aids in recovery. Every change is logged before execution to support etcd’s reliability in Kubernetes environments.

We saw the concept of how Raft processes proposals. These proposals need to be stored somewhere for processing and tracking.

The Write-Ahead Log (WAL) in etcd serves as a Raft persistent storage mechanism for storing proposals. When a leader receives a proposal, it first appends this proposal to its own log in the WAL. Simultaneously, the leader replicates this proposal to its followers using the Raft consensus protocol. Each follower then independently appends the proposal to its own WAL before acknowledging it back to the leader. This concurrent replication process ensures that the proposal is safely stored in the WAL of the leader and its followers.

Key problem that WAL solves: When unexpected failures occur (like crashes or power outages), etcd should not lose critical Kubernetes state data. Ideally, you need a durable log of transactions in the etcd cluster that can be used to rebuild the etcd state if needed. The key idea is that state changes should be logged before any heavy update to permanent storage.

How WAL solves the problem: WAL ensures durability, consistency and recovery of Raft changes. WAL records every change before it is committed to etcd’s datastore. This log plays a critical role in recovery by providing a replay of events to bring etcd back to its last known good state before the failure, thus preventing data loss.
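The sketch below illustrates the general write-ahead principle in Go: append a record to a log file and fsync it to disk before applying the change to the in-memory state. It is a deliberately simplified illustration of the idea, not etcd’s actual WAL format or code; the file name and record format are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// applyToState stands in for updating the in-memory key-value store.
func applyToState(state map[string]string, key, value string) {
	state[key] = value
}

func main() {
	state := map[string]string{}

	// Open the log in append-only mode (illustrative file name).
	wal, err := os.OpenFile("demo.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		panic(err)
	}
	defer wal.Close()

	key, value := "/demo/config/max-replicas", "5"

	// 1. Record the intended change in the log first...
	if _, err := fmt.Fprintf(wal, "PUT %s %s\n", key, value); err != nil {
		panic(err)
	}
	// 2. ...and force it to durable storage. This is the fsync that so many
	//    etcd metrics and alerts revolve around...
	if err := wal.Sync(); err != nil {
		panic(err)
	}
	// 3. ...only then apply the change to the live state. After a crash,
	//    replaying the log rebuilds the state up to the last synced entry.
	applyToState(state, key, value)

	fmt.Println("applied:", state)
}
```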

Sub-system #3: BoltDB (BBolt)

TL;DR: Store data on disk efficiently for fast, durable access. Crucial for Kubernetes’ rapid data retrieval needs.

BoltDB is the embedded key-value storage engine used by etcd for storing data on disk. It provides high performance, efficiency, and ACID (Atomicity, Consistency, Isolation, Durability) transactions, making it suitable for etcd’s needs. You get a high-performance key-value storage engine with BoltDB.

Internally, BoltDB uses MVCC (Multi-Version Concurrency Control). MVCC is a mechanism used by etcd to manage concurrent access to data: it maintains multiple versions of data to ensure consistency during transactions. This also allows the key-value store to be viewed at past revisions, since historical key revisions are retained.

This model is particularly useful in etcd as it ensures that the key-value store can be safely accessed and modified by multiple clients without conflicts or data loss.
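Here is a brief Go-client sketch of MVCC in action: after two writes to the same key, the older value can still be read back by asking for an earlier revision (as long as that revision has not been compacted away). The key and values are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// must panics on error so the example stays focused on MVCC.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	cli := must(clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	}))
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	key := "/demo/config/max-replicas"

	// Two writes to the same key create two revisions.
	first := must(cli.Put(ctx, key, "3"))
	must(cli.Put(ctx, key, "5"))

	// A default read returns the latest version.
	latest := must(cli.Get(ctx, key))
	fmt.Println("latest:", string(latest.Kvs[0].Value))

	// Reading at the revision of the first write returns the older value,
	// provided that revision has not yet been compacted.
	old := must(cli.Get(ctx, key, clientv3.WithRev(first.Header.Revision)))
	fmt.Println("at old revision:", string(old.Kvs[0].Value))
}
```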

Key problem that BoltDB solves: Efficient and reliable storage is needed for the key-value pairs that etcd manages, accommodating high throughput and low latency access to data.
High performance and efficient data access are critical for the responsiveness and stability of applications that rely on Kubernetes. Slow or inefficient data operations can lead to increased latency and reduced application performance.

How BoltDB solves the problem: BoltDB uses a B+ tree indexing structure which enables efficient read/write operations and helps in compact data storage. This structure is particularly effective in handling large datasets with minimal performance overhead, making it suitable for the high-performance requirements of etcd.
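To get a feel for the storage engine itself, here is a minimal standalone sketch using the bbolt library (go.etcd.io/bbolt) directly: open a database file, write a key inside a read-write transaction, and read it back inside a read-only transaction. This shows the library in isolation, not how etcd wires it in internally; the file, bucket, and key names are illustrative.

```go
package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a single-file B+ tree database.
	db, err := bolt.Open("demo.db", 0o600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Writes happen inside ACID read-write transactions.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("config"))
		if err != nil {
			return err
		}
		return b.Put([]byte("max-replicas"), []byte("5"))
	})
	if err != nil {
		panic(err)
	}

	// Reads use read-only transactions that can run concurrently with a
	// single writer, thanks to the copy-on-write B+ tree design.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("config")).Get([]byte("max-replicas"))
		fmt.Println("max-replicas =", string(v))
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```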

Sub-system #4: gRPC

TL;DR: Efficient and scalable client-server communication, both between etcd nodes and between the API server and etcd.

gRPC is utilized for client-server communication within etcd, facilitating efficient, high-performance remote procedure calls (RPCs) that support the cluster’s distributed nature.

Key Problem Solved: Effective communication between distributed etcd nodes and between clients and etcd servers must be maintained to handle high volumes of requests efficiently.

Why It Is Important to Solve: Robust and scalable communication is essential for distributed systems to ensure that data updates and requests are processed reliably and without significant delays, which are critical for maintaining the operational integrity of Kubernetes clusters.

How It Solves This Problem: gRPC facilitates this communication through its support for advanced features like streaming and multiplexing. This enables etcd to manage a high volume of network traffic and data distribution demands efficiently, ensuring that communication between nodes and clients remains fast and reliable.

A client request to write data arrives at etcd’s gRPC server, which hands it to the etcd server process. The etcd server uses the Raft consensus algorithm to replicate the write across the cluster, and ensures the write is logged in the WAL before being applied to the store.
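You rarely speak gRPC to etcd by hand; the Go client wraps the gRPC plumbing for you. The sketch below simply highlights the two gRPC-backed behaviors discussed here: unary RPCs such as Put, and a watch served as a long-lived server stream, with the client balancing requests across several (illustrative) endpoints.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Each unary call (Put, Get, ...) and each watch stream travels over
	// gRPC; the client multiplexes them across the configured endpoints.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"}, // illustrative
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// A unary RPC: one request, one response.
	if _, err := cli.Put(context.Background(), "/demo/flag", "on"); err != nil {
		panic(err)
	}

	// A streaming RPC: the server keeps pushing events over one stream.
	for wresp := range cli.Watch(context.Background(), "/demo/flag") {
		for _, ev := range wresp.Events {
			fmt.Printf("stream event: %s %s\n", ev.Type, ev.Kv.Key)
		}
	}
}
```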

High-level Walkthrough of Potential Issues in etcd

We’ve now covered the foundational components that form etcd’s robust architecture. However, each of these sub-systems can malfunction in ways that disrupt etcd’s performance. Understanding these potential issues is key for SREs to put in place preemptive measures and effective troubleshooting.

While a comprehensive exploration of these issues would be best suited for a dedicated blog post, I’ll provide a high-level walkthrough of common issues in etcd.

etcd Internal Components Potential Problems
Raft

Problems revolve around leader elections:

  • Consensus delays or failures
  • Election timeouts
  • Frequent leader elections
  • Heartbeat failures
WAL (Write-Ahead Log)

Problems revolve around the fsync() operation in etcd, which is responsible for flushing information to disk:

  • fsync Latency
  • WAL grows excessively large
  • File System Overheads
  • OS-level write and cache settings
BoltDB

Problems around storage inefficiencies, disk space utilization or corruption:

  • Data fragmentation within the storage file
  • Uncontrolled Growth of BoltDB
  • Slow disks and disk activity
gRPC

Problems revolve around communication bottlenecks between nodes.

  • Slow network latency or packets being dropped between etcd members
  • Slow disk performance, CPU starvation, or slow network can also cause issues with applying entries in etcd
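Many of the problems above surface first in etcd’s own status and alarm data. The sketch below uses the Go client’s Maintenance API to print per-endpoint database size, leader, and Raft term, and to list any active alarms (for example, the NOSPACE alarm raised when the database quota is exceeded). Endpoints are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"}, // illustrative
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Per-endpoint status: database size on disk, current leader, Raft term.
	for _, ep := range cli.Endpoints() {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: dbSize=%d bytes leader=%x raftTerm=%d\n",
			ep, st.DbSize, st.Leader, st.RaftTerm)
	}

	// Active alarms, e.g. NOSPACE when the backend quota is exceeded.
	alarms, err := cli.AlarmList(ctx)
	if err != nil {
		panic(err)
	}
	for _, a := range alarms.Alarms {
		fmt.Printf("alarm: %v\n", a)
	}
}
```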

Key Takeaways on etcd in Kubernetes

Perfect fit: We’ve seen that etcd is a perfect fit for Kubernetes. It is the single source of truth for the Kubernetes cluster’s state: configuration data, state information, and metadata. etcd offers strong consistency, high availability through the Raft consensus algorithm, and the ability to stream changes to clients efficiently. All of these make etcd an ideal match for Kubernetes’ distributed system architecture.

etcd’s moving parts can trip you up: However, etcd has many moving parts which can trip up SREs. Performance issues can originate from any of etcd’s sub-systems. These issues range from leader-election inconsistencies, high latencies between nodes, and network partitions to fsync problems and the etcd database running out of space. Each of these could put Kubernetes cluster stability at risk: no new changes can be made, and new pods cannot be scheduled.

Important to monitor etcd health: SREs need to build a strong foundational knowledge of etcd’s inner sub-systems. This can help you better understand where issues originate from and help detect and address problems before they impact the Kubernetes cluster.

Next Steps – Key Observability Signals to Monitor etcd

As an SRE, your next step would be to get a more in-depth view of etcd’s internals and the issues that arise from each of etcd’s sub-systems.

In our forthcoming blog series, we’ll delve deeper into common etcd problems, explore key performance metrics, and discuss strategies for effective monitoring and troubleshooting.

eG Enterprise is an Observability solution for Modern IT. Monitor digital workspaces, web applications, SaaS services, cloud and containers from a single pane of glass.


About the Author

Arun is Head of Products, Container & Cloud Performance Monitoring at eG Innovations. Over a 20+ year career, Arun has worked in roles including development, architecture and ops across multiple verticals such as banking, e-commerce and telco. An early adopter of APM products since the mid 2000s, his focus has predominantly been on performance tuning and monitoring of large-scale distributed applications.