Redis Replication Test

Redis replication is built on a leader-follower (master-slave) model that is simple to use and configure: it allows replica Redis instances to be exact copies of master instances. The replica automatically reconnects to the master every time the link breaks, and attempts to be an exact copy of it regardless of what happens to the master.

This system works using three main mechanisms:

When a master and a replica instance are well connected, the master keeps the replica updated by sending it a stream of commands. The stream replicates every effect on the master's dataset, whether caused by client writes, key expiry or eviction, or any other action that changes the master dataset.
When the link between the master and the replica breaks, because of network issues or because the master or the replica detects a timeout, the replica reconnects and attempts a partial resynchronization: it tries to obtain only the part of the command stream that it missed during the disconnection.
When a partial resynchronization is not possible, the replica asks for a full resynchronization. This is a more complex process in which the master creates a snapshot of all its data, sends it to the replica, and then continues sending the stream of commands as the dataset changes.

Every Redis master has a replication ID: a large pseudo-random string that marks a given history of the dataset. Each master also maintains an offset that increments for every byte of replication stream that it produces to be sent to replicas, so that the replicas can be updated with the new changes modifying the dataset. The replication offset is incremented even if no replica is actually connected, so every (replication ID, offset) pair identifies an exact version of the master's dataset.
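
The (replication ID, offset) pair described above can be read from the output of the Redis INFO replication command, which exposes it in the master_replid and master_repl_offset fields. The following is a minimal sketch only: the helper names parse_info and dataset_version are hypothetical, and the sample text is made up to resemble a real reply.

```python
def parse_info(text):
    """Turn the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def dataset_version(fields):
    """Return the (replication ID, offset) pair that identifies a dataset version."""
    return fields["master_replid"], int(fields["master_repl_offset"])


# Made-up sample in the shape of an INFO replication reply from a master.
SAMPLE_INFO = """# Replication
role:master
connected_slaves:0
master_replid:8371b4fb1155b71f4a04d3e1bc3e18c4a990aeeb
master_repl_offset:1400
"""

replid, offset = dataset_version(parse_info(SAMPLE_INFO))
print(replid, offset)
```

Two servers that report the same pair hold the same version of the dataset; a replica whose offset is lower than the master's under the same replication ID is simply behind in the same history.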

When the master-replica link goes down for some reason, replicas automatically reconnect with the master and typically continue the replication process without requiring a full resynchronization. This works because the master keeps an in-memory backlog of the replication stream. When replicas connect to the master, they use the PSYNC command to send their old master replication ID and the offset they have processed so far. If the master's replication ID is still the same, and the specified offset is still available in the master's replication backlog, then replication resumes from the point where it left off. However, if there is not enough backlog in the master's buffers, or if the replica refers to a history (replication ID) that is no longer known, then a full resynchronization happens: in this case the replica gets a full copy of the dataset, from scratch.
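
The decision between a partial and a full resynchronization can be sketched as a small function. This is a simplified model written for illustration, not Redis source code: the function name and parameters are hypothetical, the offset passed in is assumed to be the position from which the replica asks to resume, and the sketch ignores the secondary replication ID that a promoted replica may also accept.

```python
def can_partial_resync(master_replid, backlog_first_byte_offset, backlog_histlen,
                       replica_replid, requested_offset):
    """Return True if the master could serve a partial resynchronization.

    requested_offset is the stream position the replica wants to resume from.
    """
    # A different replication ID means the replica followed another history.
    if replica_replid != master_replid:
        return False
    # The backlog holds the stream bytes numbered from backlog_first_byte_offset
    # up to backlog_first_byte_offset + backlog_histlen - 1. A request equal to
    # the upper bound below means the replica missed nothing.
    upper_bound = backlog_first_byte_offset + backlog_histlen
    return backlog_first_byte_offset <= requested_offset <= upper_bound


# Hypothetical values: the backlog covers offsets 1001 through 1500.
print(can_partial_resync("abc", 1001, 500, "abc", 1200))  # inside the backlog
print(can_partial_resync("abc", 1001, 500, "abc", 900))   # already overwritten
print(can_partial_resync("abc", 1001, 500, "xyz", 1200))  # unknown history
```

Whenever the function returns False, the model corresponds to the full resynchronization case described above.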

In a full master-slave synchronization, the master starts a background saving process in order to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the replica, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the replica. This is done as a stream of commands.
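
The sequence above (snapshot, buffer, transfer, replay) can be illustrated with a toy model. This is only a sketch under simplifying assumptions: plain dictionaries stand in for the dataset and the RDB file, every write is a simple key assignment, and the function name full_resync is hypothetical.

```python
def full_resync(master_data, writes_during_transfer):
    """Simulate a full resynchronization and return the replica's dataset."""
    # 1. The background save produces a point-in-time snapshot (the RDB file).
    snapshot = dict(master_data)

    # 2. While the snapshot is produced and transferred, the master keeps
    #    serving writes and buffers each of them for the replica.
    buffered = []
    for key, value in writes_during_transfer:
        master_data[key] = value
        buffered.append((key, value))

    # 3. The replica loads the snapshot into memory.
    replica_data = dict(snapshot)

    # 4. The master streams the buffered commands, which the replica applies
    #    in the same order.
    for key, value in buffered:
        replica_data[key] = value
    return replica_data


master = {"a": "1"}
replica = full_resync(master, [("b", "2"), ("a", "3")])
print(replica == master)
```

Without step 4, the replica would be missing every write that arrived after the snapshot was taken.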

The success of any replication system rests on how quickly slaves reconnect with the master when the link goes down, and how rapidly data is synchronized between the master and the slaves. If slaves take too long to reconnect with the master after losing contact, or if the replication backlog on the master side is too small to hold the replication stream, then the replication process will be sluggish. To avoid this, administrators should continuously monitor the steps in the replication process, proactively identify pain points, and promptly initiate measures to eliminate them, so that the datasets on the master and the slaves are in sync at all times. This is where the Redis Replication test helps!

This test first determines whether the target server is a master or a slave in the replication process. For a master, the test reports the number of slaves connected to that master and the master's replication offset. The test also monitors the usage of the master's replication backlog, and alerts administrators if the backlog is not sized commensurate with its usage. If the target server is a slave, then the test reports the details of the master to which the slave connects. The health of the master-slave link is periodically checked, and link failures (if any) are brought to the immediate attention of administrators. Alerts are also sent out if the slave has not reconnected with the master long after losing communication with it. With the help of these metrics, administrators can quickly spot anomalies in the replication process and initiate measures to resolve them.
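
Most of the details mentioned above are available in the reply to the Redis INFO replication command. The sketch below shows one way such a reply could be summarized by role. It is not the eG agent's implementation: the function name replication_summary and the keys of the returned dictionary are hypothetical, and both sample replies are made up. The INFO field names themselves (role, connected_slaves, master_host, master_port, master_link_status, master_last_io_seconds_ago, master_link_down_since_seconds) are the ones Redis reports.

```python
def replication_summary(info_text):
    """Summarize an INFO replication reply according to the server's role."""
    fields = {}
    for raw in info_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key] = value

    role = fields.get("role")
    if role == "master":
        return {
            "role": role,
            "connected_slaves": int(fields["connected_slaves"]),
            "master_repl_offset": int(fields["master_repl_offset"]),
        }
    if role == "slave":
        summary = {
            "role": role,
            "master_host": fields["master_host"],
            "master_port": int(fields["master_port"]),
            "link_up": fields["master_link_status"] == "up",
            "seconds_since_last_io": int(fields["master_last_io_seconds_ago"]),
        }
        # Redis reports this field only while the link is down.
        if "master_link_down_since_seconds" in fields:
            summary["link_down_seconds"] = int(fields["master_link_down_since_seconds"])
        return summary
    return {"role": role}


# Made-up replies for illustration.
MASTER_INFO = "# Replication\nrole:master\nconnected_slaves:2\nmaster_repl_offset:5000\n"
SLAVE_INFO = (
    "# Replication\nrole:slave\nmaster_host:172.16.8.81\nmaster_port:30071\n"
    "master_link_status:down\nmaster_last_io_seconds_ago:-1\n"
    "master_link_down_since_seconds:42\n"
)

print(replication_summary(MASTER_INFO))
print(replication_summary(SLAVE_INFO))
```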

Target of the test : A Redis server

Agent deploying the test : An internal agent (recommended)

Outputs of the test : One set of results for the target Redis server

Configurable parameters for the test
Parameters	Description
Test period	How often the test should be executed.
Host	The host for which the test is to be configured.
Port	The port at which the specified HOST listens.
Redis Password and Confirm Password	In some high-security environments, a password may have been set for the Redis server to protect it from unauthorized access or abuse. If such a password has been set for the monitored Redis server, specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD. If the Redis server is not password-protected, do not disturb the default setting of this parameter. To determine whether the target Redis server is password-protected, do the following: Log in to the system hosting the Redis server. Open the redis.conf file in the <REDIS_INSTALL_DIR>. Look for the requirepass parameter in the file. If this parameter exists and is not preceded by a # (hash) symbol, password protection is enabled for the Redis server, and the string that follows the requirepass parameter is the password of the Redis server. For instance, say that the requirepass specification reads as follows: requirepass red1spr0 According to this specification, the Redis server is protected using the password red1spr0. In this case, therefore, you need to specify red1spr0 against REDIS PASSWORD. On the other hand, if the requirepass parameter is prefixed by the # (hash) symbol as shown below, password protection is disabled. # requirepass red1spr0 In this case, leave the REDIS PASSWORD parameter at its default setting.
Expected Redis Role	By default, the Expected Redis Role parameter is set to any, indicating that this test will not capture the failover of nodes from master to replica and vice versa. However, when a failover happens, the role of a master node changes to slave and vice versa. If you wish to capture exactly when such a failover happened on one or more nodes, provide a comma-separated list of nodes and their original roles (master/slave) in the Expected Redis Role parameter. The format for this configuration is: nodename1#port1:rolename, nodename2#port2:rolename,.... For example, your specification can be: 172.16.8.81#30071:master, 172.16.8.81#30072:slave, 172.16.8.81#30073:slave,....
Detailed Diagnosis	To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: the eG manager license should allow the detailed diagnosis capability, and both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
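
The format of the Expected Redis Role parameter (nodename#port:rolename entries separated by commas) can be checked before it is entered in the test configuration. The following is a minimal sketch: the function names parse_expected_roles and role_changed are hypothetical, and the validation rules are an assumption based only on the format described above.

```python
def parse_expected_roles(spec):
    """Parse 'node#port:role, node#port:role' into {(node, port): role}."""
    if spec.strip().lower() == "any":
        return {}  # default setting: no role expectations configured
    expected = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        node_and_port, colon, role = entry.rpartition(":")
        node, hash_sign, port = node_and_port.partition("#")
        role = role.strip().lower()
        if not colon or not hash_sign or not node or not port.isdigit():
            raise ValueError(f"malformed entry: {entry!r}")
        if role not in ("master", "slave"):
            raise ValueError(f"unknown role in entry: {entry!r}")
        expected[(node, int(port))] = role
    return expected


def role_changed(expected, node, port, current_role):
    """Return True if the node's current role differs from its configured role."""
    configured = expected.get((node, port))
    return configured is not None and configured != current_role


spec = "172.16.8.81#30071:master, 172.16.8.81#30072:slave, 172.16.8.81#30073:slave"
expected = parse_expected_roles(spec)
print(role_changed(expected, "172.16.8.81", 30071, "slave"))
```

In this sketch, a node that is absent from the list is never reported as changed, which mirrors the behavior of leaving the parameter at any.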

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Redis role

Indicates the role of this server.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value	Numeric Value
Sentinel	0
Master	1
Slave	2

Note:

This measure reports the Measure Values listed in the table above to indicate the role of the target server. The graph of this measure however, indicates the same using the numeric equivalents only.

Is Redis Role of this node different from the configuration?

Indicates whether/not the role of this node changed from the configured role in the EXPECTED REDIS ROLE parameter.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value	Numeric Value
Yes	1
No	0

Note:

This measure will not be reported if the Expected Redis Role parameter is set to any.

This measure reports the Measure Values listed in the table above to indicate whether/not the role of this node changed from the role specified in the test configuration. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Use the detailed diagnosis to identify the details of the role change of this node.

Connected slaves

Indicates the number of slaves connected.

Use the detailed diagnosis of this measure to know which slaves are connected to the master.

Master host

Indicates the IP address/host name of the master.

Number

This measure will report a value only if the Role measure reports the value Slave.

Master port

Indicates the port number of the master.

Number

This measure will report a value only if the Role measure reports the value Slave.

Master link status

Indicates whether/not the slave is able to connect to the master.

This measure will report a value only if the Role measure reports the value Slave.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value	Numeric Value
Down	0
Up	1

Note:

This measure reports the Measure Values listed in the table above to indicate whether the master-slave link is up or down. The graph of this measure however, indicates the same using the numeric equivalents only.

Number of seconds since master link is down

Indicates how long it has been since the master link went down.

Seconds

This measure will report a value only if the link between the master and slave is down.

Ideally, the value of this measure should be low.

Number of seconds since last interaction with master

Indicates how long it has been since this slave last contacted the master.

Seconds

This measure will report a value only if the Role measure reports the value Slave.

Ideally, the value of this measure should be lower than the value of the Number of seconds since master link is down measure.

Is master syncing with slave?

Indicates whether/not this slave is syncing with the master.

This measure will report a value only if the Role measure reports the value Slave.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value	Numeric Value
Yes	0
No	1

Note:

This measure reports the Measure Values listed in the table above to indicate whether the slave is syncing with the master. The graph of this measure however, indicates the same using the numeric equivalents only.

Number of MB left for syncing

Indicates the amount of data that is yet to be synchronized.

The lower the value, the better the replication performance.

This measure will be reported only if a SYNC operation is in progress.

Master replication offset

Indicates the master's replication offset.

Number

This measure will report a value only if the Role measure reports the value Master.

Slave replication offset

Indicates the slave's replication offset.

Number

This measure will report a value only if the Role measure reports the value Slave.

The value of this measure should be lower than the value of the Master replication offset measure for partial synchronization to occur.

Is replication backlog active due to slave down?

Indicates whether/not the replication backlog is active.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value	Numeric Value
Yes	0
No	1

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the replication backlog is active. The graph of this measure however, indicates the same using the numeric equivalents only.

Replication backlog first byte offset

Indicates the master offset of the replication backlog buffer.

Number

Replication backlog size limit

Indicates the size of the replication backlog buffer.

Replication backlog history length

Indicates the size in bytes of the data in the replication backlog buffer.

Number

If the value of this measure is close to the value of the Replication backlog size limit, it implies that the backlog is fast running out of space to accommodate the replication streams. You may want to increase the size of the replication backlog to avoid this.
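
The comparison described above can be expressed as a simple utilization figure. The sketch below assumes that both values are in bytes, as reported by the repl_backlog_histlen and repl_backlog_size fields of the Redis INFO replication command. The function names and the 90 percent threshold are hypothetical choices made for illustration, not values used by the test.

```python
def backlog_usage_percent(history_length, size_limit):
    """Return how full the replication backlog is, as a percentage."""
    if size_limit <= 0:
        raise ValueError("size_limit must be positive")
    return 100.0 * history_length / size_limit


def backlog_nearly_full(history_length, size_limit, threshold_percent=90.0):
    """Return True when backlog usage reaches the (hypothetical) threshold."""
    return backlog_usage_percent(history_length, size_limit) >= threshold_percent


# Hypothetical readings for a backlog with a 1 MB (1048576 byte) size limit.
print(backlog_usage_percent(524288, 1048576))
print(backlog_nearly_full(1048000, 1048576))
```

A backlog that has been in use for a while normally fills up to its size limit, so the figure is most useful when read together with how far slaves fall behind during a disconnection. In Redis, the size of the backlog is controlled by the repl-backlog-size configuration directive.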