K8s Garbage Collections Test

The Kubernetes project is written in the Go programming language (also known as Golang). Go is a statically typed, compiled programming language designed at Google. Go is syntactically similar to C, but with memory safety, garbage collection, structural typing, and communicating sequential processes (CSP)-style concurrency.

Garbage collectors have the responsibility of tracking heap memory allocations, freeing up allocations that are no longer needed, and keeping allocations that are still in-use. The Go programming language uses a non-generational concurrent tri-color mark and sweep collector.

When a collection starts, the collector runs through four phases of work:

  • Mark Setup
  • Marking
  • Mark Termination
  • Sweeping

The Mark Setup phase is where the Write Barrier is turned on. The purpose of the Write Barrier is to allow the collector to maintain data integrity on the heap during a collection since both the collector and application goroutines will be running concurrently. In order to turn the Write Barrier on, every application goroutine running must be stopped. The only way to do that is for the collector to watch and wait for each goroutine to make a function call. Function calls guarantee the goroutines are at a safe point to be stopped.

Once the Write Barrier is turned on, the collector commences with the Marking phase. The Marking phase consists of marking values in heap memory that are still in-use. This work starts by inspecting the stacks for all existing goroutines to find root pointers to heap memory. Then the collector must traverse the heap memory graph from those root pointers. The first thing the collector does in this phase is take 25% of the available CPU capacity for itself. For example, if an application uses 4 CPUs, then the collector will hog an entire CPU while at this phase. In this case typically, the collector will use the 25% CPU capacity that it has set aside for this phase, to do the marking work, allowing normal application work to continue on the remaining 75%.

Once the Marking work is done, the next phase is Mark Termination. This is when the Write Barrier is turned off, various clean up tasks are performed, and the next collection goal is calculated.

Once the collection is finished, the full CPU capacity is released for the use of the application Goroutines again, thus bringing the application back to full throttle.

Sweeping typically happens after the collection is finished. Sweeping is when the memory associated with values in heap memory that were not marked as in-use are reclaimed. This activity occurs when application Goroutines attempt to allocate new values in heap memory.

In summary, by performing garbage collection, Golang ensures that applications make optimal use of available heap memory. While this improves application performance at one end, at the other, every collection also inflicts certain latencies on the running application that may slow down application work. For instance, at the Mark Setup phase, the garbage collector stops all application Goroutines, so it can turn on the Write Barrier. This imposes a Stop the World (STW) latency on the running application. Likewise, the application Goroutines are stopped at the Mark Termination phase as well, once again inflicting an STW latency on the applications. Also, sometimes, garbage collection steals CPU capacity to stay alive, and degrades application performance in the bargain. For instance, in the Marking phase, if the Goroutine dedicated to the collector is unable to finish the marking work before the heap memory in-use reaches its limit, the collector will recruit the application Goroutines to assist with the Marking work. This is called a Mark Assist. When this happens, the application will be forced to compete with the collector for the available CPU resources. This contention can occasionally choke application performance! 

To optimize garbage collection and eliminate its ill effects, administrators must ensure that the collector does more work, while consuming minimum time and resources. For this purpose, administrators must first study the garbage collection activity closely, and figure out how much time and resources the collector typically invests in this process. This is where, the Kube Garbage Collection test helps!

This test monitors the garbage collection activity of Golang, and reports the time the Golang collector spends collecting garbage. Administrators will be alerted if too much time is being spent in garbage collection. The test also reveals the number of threads and Goroutines presently engaged in garbage collection, thus revealing how resource-intensive the garbage collection is. This way, the test enables administrators to periodically review the garbage collection activity, assess its impact on application performance, and figure out if it needs to be fine-tuned to reduce application latencies.

Target of the test : A Kubernetes/OpenShift Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the Kubernetes/OpenShift cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which this test is to be configured.

Port

Specify the port at which the specified Host listens. By default, this is 6443.

Load Balancer / Master Node IP

To run this test and report metrics, the eG agent needs to connect to the Kubernetes API on the master node and run API commands. To enable this connection, the eG agent has to be configured with either of the following:

  • If only a single master node exists in the cluster, then configure the eG agent with the IP address of the master node.
  • If the target cluster consists of more than one master node, then you need to configure the eG agent with the IP address of the load balancer that is managing the cluster. In this case, the load balancer will route the eG agent's connection request to any available master node in the cluster, thus enabling the agent to connect with the API server on that node, run API commands on it, and pull metrics.

By default, this parameter will display the Load Balancer / Master Node IP that you configured when manually adding the Kubernetes/OpenShift cluster for monitoring, using the Kubernetes Cluster Preferences page in the eG admin interface (see Figure 3). The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

Whenever the eG agent runs this test, it uses the IP address that is displayed (by default) against this parameter to connect to the Kubernetes API. If there is any change in this IP address at a later point in time, then make sure that you update this parameter with it, by overriding its default setting.

K8s Cluster API Prefix

By default, this parameter is set to none. Do not disturb this setting if you are monitoring a Kubernetes/OpenShift Cluster.

To run this test and report metrics for Rancher clusters, the eG agent needs to connect to the Kubernetes API on the master node of the Rancher cluster and run API commands. The Kubernetes API of Rancher clusters is of the default format: http(s)://{IP Address of kubernetes}/{api endpoints}. The Server section of the kubeconfig.yaml file downloaded from the Rancher console helps in identifying the Kubernetes API of the cluster. For e.g., https://{IP address of Kubernetes}/k8s/clusters/c-m-bznxvg4w/ is usually the URL of the Kubernetes API of a Rancher cluster.

For the eG agent to connect to the master node of a Rancher cluster and pull out metrics, the eG agent should be made aware of the API endpoints in the Kubernetes API of the Rancher cluster. To aid this, you can specify the API endpoints available in the Kubernetes API of the Rancher cluster against this parameter. In our example, this parameter can be specified as: /k8s/clusters/c-m-bznxvg4w/.

SSL

By default, the Kubernetes cluster is SSL-enabled. This is why, the eG agent, by default, connects to the Kubernetes API via an HTTPS connection. Accordingly, this flag is set to Yes by default.

If the cluster is not SSL-enabled in your environment, then set this flag to No.

Authentication Token

The eG agent requires an authentication bearer token to access the Kubernetes API, run API commands on the cluster, and pull metrics of interest. The steps for generating this token have been detailed in How Does eG Enterprise Monitor a Kubernetes/OpenShift Cluster?

The steps for generating this token for a Rancher cluster has been detailed in How Does eG Enterprise Monitor a Rancher Cluster?

Typically, once you generate the token, you can associate that token with the target Kubernetes cluster, when manually adding that cluster for monitoring using the eG admin interface. The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

By default, this parameter will display the Authentication Token that you provided in the Kubernetes Cluster Preferences page of the eG admin interface, when manually adding the cluster for monitoring (see Figure 3).

Whenever the eG agent runs this test, it uses the token that is displayed (by default) against this parameter only accessing the API and pulling metrics. If for any reason, you generate a new authentication token for the target cluster at a later point in time, then make sure you update this parameter with the change. For that, copy the new token and paste it against this parameter.

Proxy Host

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the IP address of the proxy server here. If no proxy is used, then the default setting -none - of this parameter, need not be changed,

Proxy Port

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the port number at which that proxy server listens here. If no proxy is used, then the default setting -none - of this parameter, need not be changed,

Proxy Username, Proxy Password, Confirm Password

These parameters are applicable only if the eG agent uses a proxy server to connect to the Kubernetes/OpenShift cluster, and that proxy server requires authentication. In this case, provide a valid user name and password against the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

If no proxy server is used, or if the proxy server used does not require authentication, then the default setting - none - of these parameters, need not be changed.

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Avg GC invocations duration

Indicates the average time spent in garbage collection.

Seconds

A low value is desired for this measure. A very high value or a consistent increase in the value of this measure is a cause for concern, as it indicates that the garbage collector is probably taking too long to complete collections.

Since garbage collection often triggers stop-the-world latencies in applications, prolonged garbage collection activities can adversely impact application availability and performance. In short, the longer GC runs, poorer will be application performance.

One way to reduce GC time, is to fine-tune the configuration option called GC Percentage at runtime. This is set to 100 by default. This value represents a ratio of how much new heap memory can be allocated before the next collection has to start. Setting the GC Percentage to 100 means, based on the amount of heap memory marked as live after a collection finishes, the next collection has to start at or before 100% more new allocations are added to heap memory. You could decide to change the GC Percentage value to something larger than 100. This will increase the amount of heap memory that has to be allocated before the next collection can start, thus delaying the start of the next collection.

On the flip side though, increasing the GC percentage will slow down the pace of the collector. The collector has a pacing algorithm which is used to determine when a collection is to start. The algorithm depends on a feedback loop that the collector uses to gather information about the running application and the stress the application is putting on the heap. Stress can be defined as how fast the application is allocating heap memory within a given amount of time. It’s that stress that determines the pace at which the collector needs to run.

One misconception is thinking that slowing down the pace of the collector is a way to improve performance. In reality though, application performance truly improves only when more work is getting done between collections or during a collection. This can be achieved only by reducing the amount or the number of allocations any piece of work is adding to heap memory.

Increasing the GC percentage in fact, increases the workload of collections by adding more to the heap memory after every collection. In the long run, this may degrade application performance than improve it.

OS threads created

Indicates the number of threads spawned by the garbage collection process.

Number

A large value for this measure is indicative of resource-intensive garbage collections.

Goroutines

Indicates the number of Goroutines used for garbage collection.

Number

An unusually high value for this measure could indicate that the garbage collector is probably recruiting application Goroutines as well to do the Marking work on the collections. This in turn could be because of of Marking workloads that the collector is unable to complete using just its dedicated Goroutines. Such workloads are usually imposed by applications that consume heap memory significantly.