Redis Replication Test
Redis replication is built on a leader-follower (master-slave) model that is simple to use and configure: it allows replica Redis instances to be exact copies of master instances. The replica automatically reconnects to the master every time the link breaks, and attempts to be an exact copy of it regardless of what happens to the master.
This system works using three main mechanisms:
- When a master and a replica instance are well connected, the master keeps the replica updated by sending it a stream of commands. The stream replicates every effect on the master's dataset: client writes, keys expired or evicted, and any other action that changes the dataset.
- When the link between the master and the replica breaks, either because of network issues or because the master or the replica detects a timeout, the replica reconnects and attempts a partial resynchronization: it tries to obtain only the part of the command stream that it missed during the disconnection.
- When a partial resynchronization is not possible, the replica asks for a full resynchronization. This is a more complex process in which the master creates a snapshot of all its data, sends it to the replica, and then continues sending the stream of commands as the dataset changes.
Every Redis master has a replication ID: a large pseudo-random string that marks a given history of the dataset. Each master also keeps an offset that increments for every byte of the replication stream that it produces to be sent to replicas, so that the replicas can be updated with the changes that modify the dataset. The replication offset is incremented even if no replica is actually connected, so every pair of replication ID and offset identifies an exact version of the master's dataset.
When the master-replica link goes down for some reason, replicas automatically reconnect with their masters and typically continue the replication process without requiring a full resynchronization. This works because the master keeps an in-memory backlog of the replication stream. When replicas connect to masters, they use the PSYNC command to send their old master replication ID and the offset they have processed so far. If the master's replication ID is still the same, and the specified offset is still available in the replication backlog on the master side, then replication resumes from the point where it left off. However, if there is not enough backlog in the master buffers, or if the replica refers to a history (replication ID) that is no longer known, then a full resynchronization happens: in this case the replica gets a full copy of the dataset, from scratch.
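The decision described above can be sketched as a small check. This is a simplified illustration, not Redis's actual implementation: the `MasterBacklog` class, its field names, and `can_partial_resync` are hypothetical, and the sketch assumes the replica requests the offset of the first byte it still needs. It also ignores details such as the secondary replication ID a master may retain after a failover.

```python
from dataclasses import dataclass


@dataclass
class MasterBacklog:
    """Hypothetical view of the master state that matters for PSYNC."""
    replid: str             # current replication ID of the master
    first_byte_offset: int  # replication offset of the oldest byte in the backlog
    histlen: int            # number of bytes currently held in the backlog


def can_partial_resync(master: MasterBacklog, replica_replid: str,
                       requested_offset: int) -> bool:
    """Return True if replication can resume, False if a full resync is needed.

    requested_offset is the offset of the first byte the replica still needs.
    """
    if replica_replid != master.replid:
        # The replica refers to a history this master does not know.
        return False
    oldest = master.first_byte_offset
    next_to_produce = master.first_byte_offset + master.histlen
    # The requested byte must still be in the backlog, or be the very next
    # byte to be produced (meaning the replica missed nothing).
    return oldest <= requested_offset <= next_to_produce


master = MasterBacklog(replid="abc123", first_byte_offset=501, histlen=500)
print(can_partial_resync(master, "abc123", 700))  # offset still in the backlog
print(can_partial_resync(master, "abc123", 400))  # offset already overwritten
print(can_partial_resync(master, "zzz999", 700))  # unknown replication ID
```

A larger backlog widens the range of offsets that pass this check, which is why backlog sizing determines how long a replica can stay disconnected before a full resynchronization becomes necessary.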
In a full master-slave synchronization, the master starts a background saving process in order to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the replica, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the replica. This is done as a stream of commands.
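The order of events in a full synchronization can be illustrated with a toy model. The sketch below uses plain Python dictionaries to stand in for the dataset and the RDB file; `full_resync` and its parameters are hypothetical names chosen for illustration, and nothing here reflects how Redis implements the transfer internally.

```python
def full_resync(master_data: dict, writes_during_save: list) -> dict:
    """Toy model of a full resynchronization; returns the replica's dataset."""
    # 1. Background save: freeze a point-in-time snapshot (stands in for the RDB file).
    snapshot = dict(master_data)

    # 2. Writes arriving during the save are applied on the master and buffered.
    buffered = []
    for key, value in writes_during_save:
        master_data[key] = value
        buffered.append((key, value))

    # 3. The replica receives the snapshot and loads it into memory.
    replica_data = dict(snapshot)

    # 4. The replica then replays the buffered commands in order.
    for key, value in buffered:
        replica_data[key] = value
    return replica_data


master_data = {"a": "1", "b": "2"}
replica_data = full_resync(master_data, [("b", "20"), ("c", "3")])
print(replica_data == master_data)
```

The point of the model is the ordering: because writes are buffered from the moment the snapshot is taken, replaying them after the snapshot is loaded brings the replica to the same state as the master.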
The success of any replication system rests on how quickly slaves reconnect with the master when the link goes down, and how rapidly data is synchronized between the masters and the slaves. If slaves take too long to reconnect with the master after losing contact, or if the replication backlog on the master side is not sized with enough memory to hold the replication stream, then the replication process will be sluggish. To avoid this, administrators should continuously monitor the steps in the replication process, proactively identify pain points, and promptly initiate measures to eliminate them, so that the datasets on the master and slave sides are in sync at all times. This is where the Redis Replication test helps!
This test first determines whether the target server is a master or a slave in the replication process. For a master, the test reports the number of slaves connected to that master and the master's replication offset. The test also monitors the usage of the master's replication backlog, and alerts administrators if the backlog is not sized commensurate with its usage. If the target server is a slave, the test reports the details of the master to which the slave connects. The health of the master-slave link is checked periodically, and link failures (if any) are brought to the immediate attention of administrators. Alerts are also sent out if the slave has not reconnected with the master long after losing contact with it. With the help of these metrics, administrators can quickly spot anomalies in the replication process and initiate measures to resolve them.
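For readers who want to inspect the same kind of data by hand, the replication section of the Redis `INFO` command exposes the server's role and link state as `key:value` lines. The sketch below parses a sample of such output. The field names are standard `INFO replication` fields, but the sample values, the helper names `parse_info` and `summarize`, and the assumption that this test derives its measures from these fields are illustrative only.

```python
SAMPLE_INFO = """\
# Replication
role:slave
master_host:172.16.8.81
master_port:30071
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:4200
master_link_down_since_seconds:37
"""


def parse_info(text: str) -> dict:
    """Parse 'key:value' lines of INFO output, skipping blanks and '#' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def summarize(fields: dict) -> dict:
    """Pick the role-specific fields that matter for a quick health check."""
    if fields.get("role") == "master":
        return {
            "role": "master",
            "connected_slaves": int(fields.get("connected_slaves", 0)),
            "master_repl_offset": int(fields.get("master_repl_offset", 0)),
        }
    summary = {
        "role": fields.get("role"),
        "master": f"{fields.get('master_host')}:{fields.get('master_port')}",
        "link_up": fields.get("master_link_status") == "up",
    }
    # This field is only present while the link to the master is down.
    if "master_link_down_since_seconds" in fields:
        summary["link_down_seconds"] = int(fields["master_link_down_since_seconds"])
    return summary


print(summarize(parse_info(SAMPLE_INFO)))
```

In practice the text would come from running `INFO replication` against the server, for example through `redis-cli`.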
Target of the test : A Redis server
Agent deploying the test : An internal agent (recommended)
Outputs of the test : One set of results for the target Redis server
Parameters | Description |
---|---|
Test period | How often should the test be executed. |
Host | The host for which the test is to be configured. |
Port | The port at which the specified HOST listens. |
Redis Password and Confirm Password | In some high-security environments, a password may have been set for the Redis server to protect it from unauthorized access or abuse. If such a password has been set for the monitored Redis server, then specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD. If the Redis server is not password-protected, then do not disturb the default setting of this parameter. To determine whether the target Redis server is password-protected, do the following: |
Expected Redis Role | By default, the Expected Redis Role parameter is set to any, indicating that this test will not capture the failover of nodes from master to replica and vice versa. When a failover happens, however, the role of a master node changes to slave and vice versa. If you wish to capture exactly when such a failover happened on one or more nodes, provide a comma-separated list of nodes and their original roles (master/slave) in the Expected Redis Role parameter. The format for this configuration is: nodename1#port1:rolename, nodename2#port2:rolename,.... For example, your specification can be: 172.16.8.81#30071:master, 172.16.8.81#30072:slave, 172.16.8.81#30073:slave,.... |
Detailed Diagnosis | To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: |
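The Expected Redis Role format described in the parameter table (`nodename#port:rolename`, comma-separated) can be checked with a short parser before it is entered. This is only a sketch of how such a specification could be interpreted: the function names `parse_expected_roles` and `role_changed` are hypothetical, and the sketch makes no claim about how the eG agent itself parses the value.

```python
def parse_expected_roles(spec: str) -> dict:
    """Map (host, port) to the expected role; 'any' yields an empty mapping."""
    if spec.strip().lower() == "any":
        return {}
    roles = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last ':' so the role is separated from 'host#port'.
        node, _, role = entry.rpartition(":")
        host, _, port = node.partition("#")
        role = role.strip().lower()
        if role not in ("master", "slave"):
            raise ValueError(f"unknown role in entry: {entry!r}")
        roles[(host.strip(), int(port))] = role
    return roles


def role_changed(expected: dict, host: str, port: int, current_role: str) -> bool:
    """True if the node has an expected role and its current role differs."""
    want = expected.get((host, port))
    return want is not None and want != current_role


expected = parse_expected_roles(
    "172.16.8.81#30071:master, 172.16.8.81#30072:slave, 172.16.8.81#30073:slave"
)
print(role_changed(expected, "172.16.8.81", 30071, "slave"))
print(role_changed(expected, "172.16.8.81", 30072, "slave"))
```

A node that is absent from the specification, or a specification of `any`, never reports a role change in this sketch, which mirrors the default behavior described for the parameter.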
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Redis role | Indicates the role of this server. | | The values that this measure can report and their corresponding numeric values are listed in the table below: Note: This measure reports the Measure Values listed in the table above to indicate the role of the target server. The graph of this measure, however, indicates the same using the numeric equivalents only. |
Is Redis Role of this node different from the configuration? | Indicates whether the role of this node has changed from the role configured in the EXPECTED REDIS ROLE parameter. | | The numeric values that correspond to these measure values are discussed in the table below: Note: This measure will not be reported if the Expected Redis Role parameter is set to any. This measure reports the Measure Values listed in the table above to indicate whether the role of this node changed from the role specified in the test configuration. However, in the graph, this measure is indicated using the Numeric Values listed in the table above. Use the detailed diagnosis to identify the details of the role change of this node. |
Connected slaves | Indicates the number of slaves connected. | | Use the detailed diagnosis of this measure to know which slaves are connected to the master. |
Master host | Indicates the IP address/host name of the master. | Number | This measure will report a value only if the Redis role measure reports the value Slave. |
Master port | Indicates the port number of the master. | Number | This measure will report a value only if the Redis role measure reports the value Slave. |
Master link status | Indicates whether the slave is able to connect to the master. | | This measure will report a value only if the Redis role measure reports the value Slave. The values that this measure can report and their corresponding numeric values are listed in the table below: Note: This measure reports the Measure Values listed in the table above to indicate whether the master-slave link is up or down. The graph of this measure, however, indicates the same using the numeric equivalents only. |
Number of seconds since master link is down | Indicates how long it has been since the master link went down. | Seconds | This measure will report a value only if the link between the master and slave is down. Ideally, the value of this measure should be low. |
Number of seconds since last interaction with master | Indicates how long it has been since this slave last contacted the master. | Seconds | This measure will report a value only if the Redis role measure reports the value Slave. Ideally, the value of this measure should be lower than the value of the Number of seconds since master link is down measure. |
Is master syncing with slave? | Indicates whether this slave is syncing with the master. | | This measure will report a value only if the Redis role measure reports the value Slave. The values that this measure can report and their corresponding numeric values are listed in the table below: Note: This measure reports the Measure Values listed in the table above to indicate whether the slave is syncing with the master. The graph of this measure, however, indicates the same using the numeric equivalents only. |
Number of MB left for syncing | Indicates the amount of data that is yet to be synchronized. | MB | The lower the value, the better the replication performance. This measure will be reported only if a SYNC operation is in progress. |
Master replication offset | Indicates the master's replication offset. | Number | This measure will report a value only if the Redis role measure reports the value Master. |
Slave replication offset | Indicates the slave's replication offset. | Number | This measure will report a value only if the Redis role measure reports the value Slave. The value of this measure should be lower than the value of the Master replication offset measure for partial synchronization to occur. |
Is replication backlog active due to slave down? | Indicates whether the replication backlog is active. | | The values that this measure can report and their corresponding numeric values are listed in the table below: Note: This measure reports the Measure Values listed in the table above to indicate whether the replication backlog is active. The graph of this measure, however, indicates the same using the numeric equivalents only. |
Replication backlog first byte offset | Indicates the master offset of the replication backlog buffer. | Number | |
Replication backlog size limit | Indicates the size of the replication backlog buffer. | MB | |
Replication backlog history length | Indicates the size in bytes of the data in the replication backlog buffer. | Number | If the value of this measure is close to the value of the Replication backlog size limit, it implies that the backlog is fast running out of space to accommodate the replication stream. You may want to increase the size of the replication backlog to avoid this. |
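The comparison between backlog history length and backlog size limit, and the gap between master and slave offsets, can be worked out with simple arithmetic. The sketch below is an illustration under stated assumptions: the history length is taken in bytes and the size limit in MB, as in the table above, 1 MB is taken as 1024 * 1024 bytes, and the 90% threshold and all function names are hypothetical choices rather than settings of this test.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def backlog_usage_percent(histlen_bytes: int, size_limit_mb: float) -> float:
    """Share of the backlog buffer currently occupied, as a percentage."""
    return 100.0 * histlen_bytes / (size_limit_mb * BYTES_PER_MB)


def backlog_nearly_full(histlen_bytes: int, size_limit_mb: float,
                        threshold_percent: float = 90.0) -> bool:
    """True when usage reaches the (hypothetical) alert threshold."""
    return backlog_usage_percent(histlen_bytes, size_limit_mb) >= threshold_percent


def replica_lag_bytes(master_offset: int, slave_offset: int) -> int:
    """Bytes of replication stream the slave has not yet processed."""
    return max(0, master_offset - slave_offset)


print(backlog_usage_percent(524288, 1))   # half of a 1 MB backlog
print(backlog_nearly_full(996148, 1))     # roughly 95% of a 1 MB backlog
print(replica_lag_bytes(1000, 900))
```

If the lag of a disconnected slave grows beyond what the backlog holds, the slave's offset falls out of the backlog and a full resynchronization follows on reconnect, which is the situation a larger backlog is meant to prevent.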