Redis Cluster Replication Test

At the base of Redis replication there is a very simple to use and configure leader follower (master-slave) replication: it allows replica Redis instances to be exact copies of master instances. The replica will automatically reconnect to the master every time the link breaks, and will attempt to be an exact copy of it regardless of what happens to the master.

This system works using three main mechanisms:

When a master and a replica instances are well-connected, the master keeps the replica updated by sending a stream of commands to the replica, in order to replicate the effects on the dataset happening in the master side due to: client writes, keys expired or evicted, any other action changing the master dataset.
When the link between the master and the replica breaks, for network issues or because a timeout is sensed in the master or the replica, the replica reconnects and attempts to proceed with a partial resynchronization: it means that it will try to just obtain the part of the stream of commands it missed during the disconnection.
When a partial resynchronization is not possible, the replica will ask for a full resynchronization. This will involve a more complex process in which the master needs to create a snapshot of all its data, send it to the replica, and then continue sending the stream of commands as the dataset changes.

Every Redis master has a replication ID: it is a large pseudo random string that marks a given story of the dataset. Each master also takes an offset that increments for every byte of replication stream that is produced to be sent to replicas, in order to update the state of the replicas with the new changes modifying the dataset. The replication offset is incremented even if no replica is actually connected, so basically every given pair of Replication ID, offset, identifies an exact version of the dataset of a master.

When the master-replica link goes down for some reason, replicas automatically reconnect with the masters and typically, continue the replication process without requiring a full resynchronization. This works by creating an in-memory backlog of the replication stream on the master side. When replicas connect to masters, they use the PSYNC command in order to send their old master replication ID and the offsets they processed so far. If the master run ID is still the same, and the offset specified is in the replication backlog on the master side, then replication will resume from the point where it left off. However if there is not enough backlog in the master buffers, or if the replica is referring to an history (replication ID) which is no longer known, then a full resynchronization happens: in this case the replica will get a full copy of the dataset, from scratch.

In a full master-slave synchronization, the master starts a background saving process in order to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the replica, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the replica. This is done as a stream of commands.

The success of any replication system rests on how quickly slaves reconnect with master when the link goes down, and how rapidly data synchronization occurs between the masters and the slaves. If slaves take too long to reconnect with the master after losing contact, or if the replication backlog on the master side is not sized with enough memory to hold the replication streams, then the replication process will be sluggish. To avoid this, administrators should continuously monitor the steps in the replication process, proactively identify pain points, and promptly initiate measures to eliminate them, so that the dataset on the master and slave sides are in-sync at all times. This is where the Redis Cluster Replication test helps!

This test first determines whether the target server is a master or slave in the replication process. For a master, the test reports the number of slaves connected to that master and the master's replication offset. Additionally, the test also monitors the usage of the master's replication backlog, and alerts administrators if the backlog is not sized commensurate to its usage. If the target server is a slave, then the test reports the details of the master to which the slave connects. The health of the master-slave link is periodically checked, and link failures (if any) are brought to the immediate attention of administrators. Alerts are also sent out if the slave has not reconnected with the master long after the loss of communication with the master. With the help of these metrics, administrators can quickly spot anomalies in the replication process and initiate measures to resolve them.

Target of the test :A Redis Cluster

Agent deploying the test : An external agent

Outputs of the test : One set of results for each master/slave node grouped under each prefix/group name in the target Redis cluster being monitored.

First-Level Descriptor: Group prefix

Second-Level Descriptor: Node name

Configurable parameters for the test
Parameters	Description
Test period	How often should the test be executed
Host	The host for which the test is to be configured.
Port	The port at which the specified HOST listens.
Username	This parameter is applicable only if the target server is Redis v6 or above. The eG agent has to be configured with the credentials of a user who has Read-only privileges on the monitored Redis Cluster Server. To create a user named eguser,run the following command. acl setuser <username> on ><passowrd> allcommands allkeys For example, to create a eguser with read only privilege run the following command:acl setuser eguser allkeys -@all +client\|list +cluster\|info +cluster\|nodes +config\|get +info +memory\|usage +ping +scan +slowlog\|get +time +ttl +xinfo\|groups +xinfo\|stream resetchannels on >password
Redis Password and Confirm Password	If the target server is Redis v6 or above, then specify the password that corresponds to the above-mentioned Username in this text box. In some high security environments, a password may have been set for the Redis server(before v6), so as to protect it from unauthorized accesses/abuse. If such a password has been set for the monitored Redis server, then specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD. If the Redis server is not password protected, then do not disturb the default setting of this parameter. To determine whether/not the target Redis server is password-protected, do the following: Login to the system hosting the Redis server. Open the redis.conf file in the <REDIS_INSTALL_DIR>. Look for the requirepass parameter in the file. If this parameter exists, and is not preceded by a # (hash) symbol, it means that password protection is enabled for the Redis server. In this case, the string that follows the requirepass parameter is the password of the Redis server. For instance, say that the requirepass specification reads as follows: requirepass red1spr0 According to this specification, the Redis server is protected using the password red1spr0. In this case therefore, you need to specify red1spr0 against REDIS PASSWORD. On the other hand, if the requirepass parameter is prefixed by the # (hash) symbol as shown below, it means password protection is disabled. # requirepass red1spr0 In this case, leave the REDIS PASSWORD parameter with its default setting.
SSL	By default, the SSL flag is set to No, indicating that the target Redis cluster server is not SSL-enabled by default. To enable the test to connect to an SSL-enabled Redis cluster server, set the SSL flag to Yes.
Cluster Nodes	By default, the Cluster Nodes parameter is set to auto-discover, indicating that, by default, this test will auto-discover the nodes available in the cluster and report metrics for all the discovered nodes. However, in case of node failures the eG agent needs to connect to any of the available nodes and collect metrics. To this effect, provide a comma-separated list of nodes in this text box. You should specify the nodes in the following format: HOSTNAME1#PORT1,HOSTNAME2#PORT2,... . For example, 172.16.8.81#30071,172.16.8.81#30072,....
Expected Redis Role	By default, Expected Redis Role parameter is set to any indicating that this test will not capture the failover of the nodes from master to replica and vice versa. However, when a failover happens, the role of a master node changes to slave and vice versa. If you wish to capture when exactly such failover happened on one/more nodes, then you can provide a comma-separated list of nodes and their original roles (master/slave) in the Expected Redis Role parameter. The format for this configuration is: nodename1#port1:rolename, nodename2#port2:rolename,.... For example, your specification can be: 172.16.8.81#30071:master, 172.16.8.81#30072:slave, 172.16.8.81#30073:slave,....
Master Slave Group	Often, administrators find it difficult to identify the problematic nodes at a single glance in a cluster, in environments where hundreds of nodes are available. To easily identify the problematic nodes, administrators are allowed to group the nodes into master and its respective replicas under a group name (prefix). To this effect, the Master Slave Group flag is set to Yes, by default indicating that this test will report metrics for each group name (prefix):node name combination i.e., the descriptor of this test will be group name (prefix):node name. If you do not wish to group the nodes, then, set this flag to No in which case, the descriptor of the test will be the nodes.
Master Slave Group Prefix	The Master Slave Group Prefix parameter is applicable only if the Master Slave Group flag is set to YES. In this text box, specify the name of the prefix under which the nodes should be grouped. For example, if the nodes are to be grouped with a prefix Shard, then, the descriptors of this test will be displayed in the following format: Shard1:<comma-separated list of node names>, Shard2: <comma-separated list of node names>, ...
Allowed Replication Time Lag Limit	By default, Allowed Replication Time Lag Limit is set as 30 seconds. This means that, by default this test will report the Replication lag time measure for the slaves that are lagging behind the master by 30 seconds and more. You can override this default setting by specified a different cut-off value here.
DD Frequency	Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.
Detailed Diagnosis	To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: The eG manager license should allow the detailed diagnosis capability Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Redis Role

Indicates the role of this node.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Sentinel	0
Master	1
Slave	2

Note:

This measure reports the Measure Values listed in the table above to indicate the role of the target node. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Is Redis Role of this node different from the configuration?

Indicates whether/not the role of this node changed from the configured role in the EXPECTED REDIS ROLE parameter.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Yes	1
No	0

Note:

This measure will not be reported if Expected redis role parameter is set to any.

This measure reports the Measure Values listed in the table above to indicate whether/not the role of this node changed from the role specified in the test configuration. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Use the detailed diagnosis to identify the details of the role change of this node.

Number of connected slaves

Indicates the number of slaves connected to this node.

Number

Use the detailed diagnosis of this measure to know which slaves are connected to the master.

Master link status

Indicates whether/not the this node is able to connect to the master.

Number

This measure will report a value only if the Role measure reports the value Slave.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Down	0
Up	1

Note:

This measure reports the Measure Values listed in the table above to indicate whether the master-slave link is up or down. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Number of seconds since the master link is down

Indicates how long it has been since the master link went down.

Seconds

This measure will report a value only if the link between the master and slave is down.

Ideally, the value of this measure should be low.

Number of seconds since the last interaction with master

Indicates how long it has been since this node last contacted the master.

Seconds

This measure will report a value only if the Role measure reports the value Slave.

Ideally, the value of this measure should be lower than the value of the Number of seconds since master link is down measure.

Is master syncing with slave?

Indicates whether/not this node is syncing with the master.

This measure will report a value only if the Role measure reports the value Slave.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Yes	1
No	0

Note:

This measure reports the Measure Values listed in the table above to indicate whether the slave is syncing with the master. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Number of MB left for syncing

Indicates the amount of data that is yet to be sychronized on this node.

Lower the value, better will be replication performance.

This measure will be reported only if a SYNC operation is in progress.

Master replication offset

Indicates this node's master replication offset.

Number

This measure will report a value only if the Role measure reports the value Master.

Slave replication offset

Indicates this node's slave replication offset.

Number

This measure will report a value only if the Role measure reports the value Slave. The value of this measure should be lower than the value of the Master replication offset measure for partial synchronization to occur.

Is replication backlog active?

Indicates whether/not the replication backlog is active on this node.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Yes	1
No	0

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the replication backlog is active. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Replication backlog first byte offset

Indicates the master offset of the replication backlog buffer on this node.

Number

Replication backlog size limit

Indicates the size of the replication backlog buffer on this node.

Number

Replication backlog history length

Indicates the size in bytes of the data in the replication backlog buffer on this node.

Number

If the value of this measure is close to the value of the Replication backlog size limit, it implies that the backlog is fast running out of space to accommodate the replication streams. You may want to increase the size of the replication backlog to avoid this.

Replication offset lag

Indicates the difference between master replication offset and slave replication offset.

Number

This measure will report a value only if the Role measure reports the value Slave. The value of this measure helps to identify the replication delay between the master and this slave. The value of this measure should not be greater than 0. By default, this measure will report if the number is greater than zero. You can change this default setting by modifying the thresholds of the Replication offset lag measure using the eG admin interface.

The value of this measure helps to identify the replication delay between the master and this slave. If the number is big, partial data loss can happen in case of abnormal master shutdown. But in production environments, occurrence of occasional spikes in lag due to load is normal and are tolerable. However, regular spikes in replication lag is a cause for concern. In such cases, setting static threshold values can be extremely laborious and nearly impossible. Hence, it is advisable to configure an automatic threshold for this measure where eG make use of historical data to compute an auto baseline and trigger alert based on that.

Replication lag time

Indicates time difference of execution between the primary/master against the standby/slave node, ie. the time that the replica's state is lagging behind the state of the primary instance.

Seconds

This measure will report a value only if the Role measure reports the value Slave.This measure is reported if the replication lag time is higher than the time configured in Allowed Replication Time Lag Limit.

A high replication lag can be due to bad queries being replicated such as lack of primary keys or bad indexes, a poor network hardware or malfunctioning network card, a distant location between different regions or zones, or some processes such as physical backups running can cause your Redis database to delay applying the current replicated transaction.

Slaves with replication problem

Indicates the number of slaves experiencing replication problems on this node.

Number

This measure will report a value only if the Role measure reports the value Master. This measure indicates the number of slaves of the target master node that experiences any replication lag or any other issues. Compare the value of this measure across all the slaves to find out the slave with maximum replication lag.

The detailed diagnosis of this measure lists Name, IP Address, Port Number, State, Offset(number), and Lag(secs) of the connected slaves.