Redis Cluster Availability Test

Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.

Redis Cluster also provides some degree of availability during partitions; in practical terms, this is the ability to continue operating when some nodes fail or are unable to communicate. However, the cluster stops operating in the event of larger failures (for example, when the majority of masters are unavailable).

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot.
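As a rough illustration of how a key maps to a hash slot: the Redis Cluster specification computes the slot as CRC16(key) mod 16384, using the XMODEM variant of CRC16. The Python sketch below is not part of the test. The function names are illustrative, and hash tags (the {...} portion of a key) are deliberately ignored.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots (hash tags are ignored here)."""
    return crc16_xmodem(key.encode("utf-8")) % 16384


print(key_slot("foo"))  # 12182
```

On a live cluster, the CLUSTER KEYSLOT command returns the same slot number for a given key.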

Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:

Node A contains hash slots from 0 to 5500.
Node B contains hash slots from 5501 to 11000.
Node C contains hash slots from 11001 to 16383.
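The lookup implied by this layout can be sketched in a few lines of Python. The ranges are taken from the example above; the names RANGE_ENDS, NODES and node_for_slot are illustrative only.

```python
from bisect import bisect_left

# Inclusive upper bound of each node's slot range, from the example above.
RANGE_ENDS = [5500, 11000, 16383]
NODES = ["A", "B", "C"]


def node_for_slot(slot: int) -> str:
    """Return the example node that owns the given hash slot."""
    if not 0 <= slot <= 16383:
        raise ValueError("hash slots are numbered 0 to 16383")
    return NODES[bisect_left(RANGE_ENDS, slot)]


print(node_for_slot(0), node_for_slot(5501), node_for_slot(12182))  # A B C
```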

This makes it easy to add and remove nodes in the cluster.

In order to remain available when a subset of master nodes fail or cannot communicate with the majority of nodes, Redis Cluster uses a master-slave model in which every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).

In our example cluster with nodes A, B, C, if node B fails the cluster is not able to continue, since we no longer have a way to serve hash slots in the range 5501-11000.

However when the cluster is created (or at a later time) we add a slave node to every master, so that the final cluster is composed of A, B, C that are master nodes, and A1, B1, C1 that are slave nodes. This way, the system is able to continue if node B fails.

Because node B1 replicates B, if B fails the cluster will promote node B1 as the new master and will continue to operate correctly.

However, note that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.
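The failure scenarios described above can be checked with a small Python sketch. It models only slot coverage for the example topology (A, B, C with replicas A1, B1, C1) and deliberately ignores the majority-of-masters requirement mentioned earlier; the names are illustrative.

```python
# Example topology from the text: each master has exactly one replica.
REPLICA_OF = {"A": "A1", "B": "B1", "C": "C1"}


def cluster_can_continue(failed: set[str]) -> bool:
    """True if every slot range is still served by its master or its replica."""
    return all(
        master not in failed or replica not in failed
        for master, replica in REPLICA_OF.items()
    )


print(cluster_can_continue({"B"}))        # True: B1 can be promoted
print(cluster_can_continue({"B", "B1"}))  # False: slots 5501-11000 are lost
```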

To avoid this, administrators must monitor the Redis cluster, understand how many master nodes it is composed of, track the status of the hash slots assigned to each node, and be promptly alerted if any hash slot fails. To achieve this, administrators can use the Redis Cluster Availability Test.

For a cluster-enabled Redis instance, this test reports the composition of the cluster in terms of the number of master nodes and hash slots assigned to the cluster. In addition, the test tracks the status of the hash slots, and notifies administrators if any hash slot fails. Moreover, the test also alerts administrators if any node is added to or removed from the cluster.

Target of the test : A Redis Cluster

Agent deploying the test : An external agent

Outputs of the test : One set of results for the cluster-enabled instance being monitored.

Configurable parameters for the test
Parameters	Description
Test period	How often the test should be executed.
Host	The host for which the test is to be configured.
Port	The port at which the specified HOST listens.
Username	This parameter is applicable only if the target server is Redis v6 or above. The eG agent has to be configured with the credentials of a user who has read-only privileges on the monitored Redis Cluster server. The general form of the command for creating a user is: acl setuser <username> on ><password> allcommands allkeys (note that this form grants access to all commands and all keys). For example, to create a user named eguser with read-only privileges, run the following command: acl setuser eguser allkeys -@all +client|list +cluster|info +cluster|nodes +config|get +info +memory|usage +ping +scan +slowlog|get +time +ttl +xinfo|groups +xinfo|stream resetchannels on >password
Redis Password and Confirm Password	If the target server is Redis v6 or above, specify the password that corresponds to the above-mentioned Username in this text box. For servers earlier than v6, a password may have been set in some high-security environments to protect the Redis server from unauthorized access or abuse. If such a password has been set for the monitored Redis server, specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD. If the Redis server is not password protected, do not disturb the default setting of this parameter. To determine whether the target Redis server is password-protected, do the following: Log in to the system hosting the Redis server. Open the redis.conf file in the <REDIS_INSTALL_DIR>. Look for the requirepass parameter in the file. If this parameter exists and is not preceded by a # (hash) symbol, password protection is enabled for the Redis server, and the string that follows the requirepass parameter is the password. For instance, say that the requirepass specification reads as follows: requirepass red1spr0 According to this specification, the Redis server is protected using the password red1spr0, so you need to specify red1spr0 against REDIS PASSWORD. On the other hand, if the requirepass parameter is prefixed by the # (hash) symbol as shown below, password protection is disabled: # requirepass red1spr0 In this case, leave the REDIS PASSWORD parameter at its default setting.
SSL	By default, the SSL flag is set to No, indicating that the target Redis cluster server is not SSL-enabled. To enable the test to connect to an SSL-enabled Redis cluster server, set the SSL flag to Yes.
Cluster Nodes	By default, the Cluster Nodes parameter is set to auto-discover, indicating that this test will auto-discover the nodes available in the cluster and report metrics for all the discovered nodes. However, if a node fails, the eG agent needs to be able to connect to any of the remaining available nodes to collect metrics. To enable this, provide a comma-separated list of nodes in this text box, in the following format: HOSTNAME1#PORT1,HOSTNAME2#PORT2,... For example: 172.16.8.81#30071,172.16.8.81#30072,...
DD Frequency	Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.
Detailed Diagnosis	To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: (a) the eG manager license should allow the detailed diagnosis capability; (b) both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
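Two of the parameters above follow simple textual conventions that can be checked before the test is configured. The Python sketch below is purely illustrative and is not how the eG agent itself works: parse_cluster_nodes and active_requirepass are hypothetical helper names, the first splitting a HOSTNAME#PORT list and the second looking for an uncommented requirepass line in the text of a redis.conf file.

```python
from typing import Optional


def parse_cluster_nodes(value: str) -> list[tuple[str, int]]:
    """Split 'HOST1#PORT1,HOST2#PORT2' into (host, port) pairs."""
    nodes = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition("#")
        if not host or not port.isdigit():
            raise ValueError(f"expected HOSTNAME#PORT, got {item!r}")
        nodes.append((host, int(port)))
    return nodes


def active_requirepass(conf_text: str) -> Optional[str]:
    """Return the requirepass value, or None if it is absent or commented out."""
    password = None
    for line in conf_text.splitlines():
        parts = line.strip().split(None, 1)
        # A line starting with '#' yields a first token that is not 'requirepass'.
        if len(parts) == 2 and parts[0].lower() == "requirepass":
            password = parts[1].strip().strip('"')
    return password


print(parse_cluster_nodes("172.16.8.81#30071,172.16.8.81#30072"))
print(active_requirepass("requirepass red1spr0"))    # red1spr0
print(active_requirepass("# requirepass red1spr0"))  # None
```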

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Cluster enabled for this instance?

Indicates whether/not the cluster feature is enabled for the target Redis instance.

If the instance is cluster-enabled, then this measure will report the value Yes. For a cluster-disabled instance, this measure will report the value No.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Yes	1
No	0

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the target instance is cluster-enabled. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Hash slots in OK state

Indicates the number of hash slots in the cluster that are in the OK state.

Number

If the value of this measure is the same as the value of the Hash slots assigned to cluster measure, it means that all hash slots mapped to all nodes in the cluster are working correctly.

On the other hand, if the value of this measure is lower than the value of the Hash slots assigned to cluster measure, it means that the hash slots mapped to some nodes are in the FAIL or PFAIL state. You may want to look up the values of the Hash slots in PFAIL state and Hash slots in FAIL state measures to confirm this.
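The comparison described above can be reproduced by hand from the output of the Redis CLUSTER INFO command, which reports fields such as cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail and cluster_slots_fail. The Python sketch below parses a sample of that output; the sample values are made up, and the assumption that this test derives its measures from these fields is ours, not something stated in this document.

```python
# Illustrative sample in the 'field:value' format returned by CLUSTER INFO.
SAMPLE_CLUSTER_INFO = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
"""


def parse_cluster_info(text: str) -> dict[str, str]:
    """Split 'field:value' lines into a dictionary."""
    info = {}
    for line in text.splitlines():
        field, sep, value = line.strip().partition(":")
        if sep:
            info[field] = value
    return info


def unhealthy_slots(info: dict[str, str]) -> int:
    """Assigned hash slots that are not in the OK state."""
    return int(info["cluster_slots_assigned"]) - int(info["cluster_slots_ok"])


print(unhealthy_slots(parse_cluster_info(SAMPLE_CLUSTER_INFO)))  # 0
```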

Hash slots in PFAIL state

Indicates the number of hash slots that are mapped to a node in PFAIL state.

Number

Ideally, the value of this measure should be very low or 0.

A node flags another node with the PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time. Both master and slave nodes can flag another node as PFAIL, regardless of its type.

Note that those hash slots still work correctly, as long as the PFAIL state is not promoted to FAIL by the failure detection algorithm. PFAIL only means that we are currently not able to talk with the node, but this may be just a transient error.
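A simplified Python sketch of the PFAIL rule described above follows. It treats a peer as PFAIL once its last reply is older than NODE_TIMEOUT and sums the hash slots served by such peers. The Peer class, the timestamps and the 15000 ms timeout are illustrative assumptions, and the real failure detector tracks pings and replies in more detail than this.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    slots: int         # number of hash slots the peer serves (0 for a slave)
    last_pong_ms: int  # time at which the peer last answered a ping


def pfail_slots(peers: list[Peer], now_ms: int, node_timeout_ms: int) -> int:
    """Hash slots mapped to peers unreachable for longer than NODE_TIMEOUT."""
    return sum(
        p.slots for p in peers if now_ms - p.last_pong_ms > node_timeout_ms
    )


peers = [
    Peer("A", 5501, last_pong_ms=99_000),
    Peer("B", 5500, last_pong_ms=80_000),  # silent for 20 seconds
    Peer("C", 5383, last_pong_ms=98_500),
]
print(pfail_slots(peers, now_ms=100_000, node_timeout_ms=15_000))  # 5500
```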

Messages sent via the cluster node-to-node

Indicates the number of messages sent via the cluster node-to-node binary bus.

Number

All the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).

Messages received via the cluster node-to-node

Indicates the number of messages received via the cluster node-to-node binary bus.

Number

Cluster state

Indicates the current state of the cluster.

This measure can report any of the following values:

OK: If the node is able to receive queries, then this measure will report the value OK.
Fail: If there is at least one hash slot that is unbound (no node associated) or in an error state (the node serving it is flagged with the FAIL flag), or if the majority of masters can't be reached by this node, then this measure will report the value Fail.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Fail	0
OK	1

Note:

This measure reports the Measure Values listed in the table above to indicate the cluster state. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.
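The conditions that decide this measure can be written as a small decision function. The Python sketch below restates the rules and the numeric mapping given above; the function and parameter names are hypothetical, and the inputs are assumed to have been collected elsewhere.

```python
# Measure value to numeric value, as listed in the table above.
NUMERIC_VALUE = {"OK": 1, "Fail": 0}


def cluster_state(unbound_slots: int, slots_on_fail_nodes: int,
                  unreachable_masters: int, total_masters: int) -> str:
    """Report Fail if any of the three conditions described above holds."""
    majority_unreachable = unreachable_masters > total_masters // 2
    if unbound_slots > 0 or slots_on_fail_nodes > 0 or majority_unreachable:
        return "Fail"
    return "OK"


state = cluster_state(unbound_slots=0, slots_on_fail_nodes=0,
                      unreachable_masters=2, total_masters=3)
print(state, NUMERIC_VALUE[state])  # Fail 0
```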

Nodes in the cluster

Indicates the number of nodes in the cluster.

Number

To know the details of the nodes in the cluster, use the detailed diagnosis of this measure.

Hash slots assigned to cluster

Indicates the total number of hash slots assigned to the cluster.

Number

Master nodes in the cluster

Indicates the number of master nodes in the cluster.

Number

Hash slots in FAIL state

Indicates the number of hash slots that are mapped to a node in FAIL state.

Number

Every node sends gossip messages to every other node including the state of a few random known nodes. Every node eventually receives a set of node flags for every other node. This way every node has a mechanism to signal other nodes about failure conditions they have detected.

A PFAIL condition is escalated to a FAIL condition when the following set of conditions is met:
Some node, say node A, has another node B flagged as PFAIL.
Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.
The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time. (The validity factor is set to 2 in the current implementation, so this is just two times the NODE_TIMEOUT time.)

If all the above conditions are true, Node A will:

Mark the node as FAIL.
Send a FAIL message to all the reachable nodes.

Ideally therefore, the value of this measure should be 0.
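The escalation rule described above can be sketched in Python as follows. This is a simplified model, not the Redis implementation: should_mark_fail is a hypothetical name, reports is assumed to map each master to the time at which it last reported node B as PFAIL or FAIL, and node A's own report is assumed to be included in that map if A is a master.

```python
FAIL_REPORT_VALIDITY_MULT = 2  # validity factor quoted in the text above


def should_mark_fail(locally_pfail: bool, reports: dict[str, int],
                     total_masters: int, now_ms: int,
                     node_timeout_ms: int) -> bool:
    """Escalate PFAIL to FAIL when a majority of masters reported it recently."""
    if not locally_pfail:
        return False
    window_ms = node_timeout_ms * FAIL_REPORT_VALIDITY_MULT
    recent = sum(1 for t in reports.values() if now_ms - t <= window_ms)
    return recent >= total_masters // 2 + 1


# Three masters, NODE_TIMEOUT of 15 seconds: reports stay valid for 30 seconds.
print(should_mark_fail(True, {"A": 95_000, "C": 60_000}, 3, 100_000, 15_000))  # False
print(should_mark_fail(True, {"A": 95_000, "C": 90_000}, 3, 100_000, 15_000))  # True
```

In the first call the report from C is 40 seconds old and has expired, so only one valid report remains and the majority of 2 is not reached.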

Nodes added to the cluster

Indicates the number of nodes added to the cluster.

Number

Use the detailed diagnosis of this measure to know which nodes were recently added to the cluster.

Nodes deleted from the cluster

Indicates the number of nodes deleted from the cluster.

Number

Use the detailed diagnosis of this measure to know which nodes were recently deleted from the cluster.
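One plausible way to derive the added and deleted counts is to compare the set of node addresses seen in two consecutive measurement periods. The Python sketch below shows that comparison. It is an assumption about the approach rather than a description of how the test is implemented, and the addresses are illustrative.

```python
def diff_nodes(previous: set[str], current: set[str]) -> tuple[list[str], list[str]]:
    """Return (added, deleted) node addresses between two polls."""
    added = sorted(current - previous)
    deleted = sorted(previous - current)
    return added, deleted


previous = {"172.16.8.81:30071", "172.16.8.81:30072"}
current = {"172.16.8.81:30072", "172.16.8.81:30073"}
added, deleted = diff_nodes(previous, current)
print(len(added), added)      # 1 ['172.16.8.81:30073']
print(len(deleted), deleted)  # 1 ['172.16.8.81:30071']
```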

Slave nodes in the cluster

Indicates the number of slave nodes in the cluster.

Number
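The node counts reported by this test can be cross-checked against the output of the Redis CLUSTER NODES command, in which the third field of each line lists the node's flags (for example master or slave). The Python sketch below counts roles from a sample of that output. The node IDs are shortened placeholders (real IDs are 40-character strings), the sample is made up, and whether the test counts nodes in exactly this way is an assumption.

```python
# Illustrative sample: id, address, flags, master id, ping, pong, epoch,
# link state, then the slot ranges served (masters only).
SAMPLE_CLUSTER_NODES = """\
a1 172.16.8.81:30071@40071 myself,master - 0 0 1 connected 0-5500
b1 172.16.8.81:30072@40072 master - 0 0 2 connected 5501-11000
c1 172.16.8.81:30073@40073 master - 0 0 3 connected 11001-16383
d1 172.16.8.81:30074@40074 slave a1 0 0 1 connected
e1 172.16.8.81:30075@40075 slave b1 0 0 2 connected
f1 172.16.8.81:30076@40076 slave c1 0 0 3 connected
"""


def count_roles(cluster_nodes_output: str) -> dict[str, int]:
    """Count total, master and slave nodes from CLUSTER NODES output."""
    counts = {"nodes": 0, "masters": 0, "slaves": 0}
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        flags = fields[2].split(",")
        counts["nodes"] += 1
        if "master" in flags:
            counts["masters"] += 1
        elif "slave" in flags:
            counts["slaves"] += 1
    return counts


print(count_roles(SAMPLE_CLUSTER_NODES))  # {'nodes': 6, 'masters': 3, 'slaves': 3}
```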