Mongo Cluster Members Test

In MongoDB, the term cluster can refer to two different architectures: a sharded cluster or a replica set. A sharded cluster is a way to scale horizontally by distributing your data across multiple replica sets. When a read or write operation is performed on a collection, the client sends the request to a router (mongos). The router then uses the metadata held on the config servers to determine which shard stores the data, and forwards the request to that shard. Each shard is deployed as its own replica set.

A replica set is a group of mongod instances that maintain the same data set. A replica set contains several data-bearing nodes and, optionally, one arbiter node. Of the data-bearing nodes, one and only one member is deemed the primary node, while the other nodes are deemed secondary nodes. The primary node receives all write operations. A replica set can have only one primary capable of confirming writes with { w: "majority" } write concern. The primary records all changes to its data sets in its operation log (oplog). Secondary members replicate this log and apply the operations to their data sets.

When a primary does not communicate with the other members of the set for longer than the configured election timeout (10 seconds by default), an eligible secondary calls an election to nominate itself as the new primary. The first secondary to call an election and receive a majority of the members' votes becomes primary. When such a switch happens, it is only natural that administrators want to be notified of it, as they may then need to troubleshoot the failure of the former primary and bring it back up.

Replication lag is the delay between an operation on the primary and the application of that operation from the oplog to a secondary. Replication lag can be a significant issue and can seriously affect MongoDB replica set deployments: excessive replication lag makes "lagged" members ineligible to quickly become primary and increases the possibility that distributed read operations will be inconsistent.
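Replication lag can be observed by comparing the last oplog entry that each member has applied. The following is a minimal sketch, assuming member records shaped like the members array returned by MongoDB's replSetGetStatus command (only the name, stateStr and optimeDate fields are used). The host names and timestamps are hypothetical sample values, and the sketch is not necessarily how the eG agent computes any metric.

```python
from datetime import datetime

# Hypothetical sample data, shaped like the members array that the
# replSetGetStatus command returns (only the fields used here are shown).
members = [
    {"name": "node1:27017", "stateStr": "PRIMARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 30)},
    {"name": "node2:27017", "stateStr": "SECONDARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 28)},
    {"name": "node3:27017", "stateStr": "SECONDARY",
     "optimeDate": datetime(2024, 1, 1, 12, 0, 5)},
]


def replication_lag(members):
    """Return {secondary name: lag in seconds} relative to the primary."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


print(replication_lag(members))  # {'node2:27017': 2.0, 'node3:27017': 25.0}
```

In this sample, node3:27017 trails the primary by 25 seconds; the larger this gap, the less suitable that member is to take over quickly as primary.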

Also, a primary can remain primary only as long as it can reach a majority of the voting members in the replica set. If it cannot, it steps down and becomes a secondary, and if no other member can gather a majority of votes, the replica set is unable to accept any further writes. To avoid this, administrators should be able to instantly detect the inaccessibility or unavailability of any member in the replica set, and quickly restore it to normalcy. The Mongo Cluster Members Test enables administrators to achieve these goals.
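The majority rule can be illustrated with a short calculation. This is a simplified sketch that assumes every member holds one vote and that the primary counts itself among the members it can reach; the function names are hypothetical.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return voting_members // 2 + 1


def primary_can_stay(voting_members: int, reachable_including_self: int) -> bool:
    """A primary keeps its role only while it can reach a majority of voters."""
    return reachable_including_self >= majority(voting_members)


print([majority(n) for n in (3, 4, 5)])  # [2, 3, 3]
print(primary_can_stay(3, 1))            # False: an isolated primary steps down
print(primary_can_stay(5, 3))            # True
```

This also shows why a 4-member set tolerates no more failures than a 3-member set: both can lose only one voting member before the majority is lost.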

This test auto-discovers the nodes in each shard of the target cluster and reports the status of each member node, so that administrators are promptly alerted if a node stops running or switches to an abnormal state. The test also reports whether the target node is down and whether it is the primary node. In addition, this test reveals the number of primary, secondary, and unavailable nodes, so that inaccessible or unavailable nodes can be identified. Using this test, administrators can also figure out whether only a few nodes in the cluster are available to service client requests, which signals a potential overload.

Target of the test : A MongoDB Cluster

Agent deploying the test : An external agent

Outputs of the test : One set of results for each node in the shard on the target cluster being monitored, and one set of results for the Summary descriptor.

Configurable parameters for the test
Parameter

Description

Test period

How often the test should be executed.

Host

The host for which the test is to be configured.

Port

The port number at which the specified host listens.

Cluster Node IPs

The eG agent connects to any of the available nodes in the target MongoDB cluster to collect the required metrics. To enable this, provide a comma-separated list of all the nodes in the target cluster in the Cluster Node IPs text box. By default, this parameter is set to none. Specify this parameter in the following format: HOSTNAME1:PORT1,HOSTNAME2:PORT2,.... For example, you can configure this parameter as: ac-zcguoqj-shard-00-00.e5ahzxy.mongodb.net:27017,ac-zcguoqj-shard-00-01.e5ahzxy.mongodb.net:27017.
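A value can be checked against the expected format before it is entered. The following is a minimal sketch that assumes only the HOSTNAME:PORT,... format described above; parse_node_list is a hypothetical helper, not part of the eG agent.

```python
def parse_node_list(value: str) -> list[tuple[str, int]]:
    """Split 'HOSTNAME1:PORT1,HOSTNAME2:PORT2,...' into (host, port) pairs.

    The default value 'none' yields an empty list.
    """
    if value.strip().lower() == "none":
        return []
    nodes = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray or trailing commas
        # Split on the last colon so that the port is always the final field.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected HOSTNAME:PORT, got {entry!r}")
        nodes.append((host, int(port)))
    return nodes


example = ("ac-zcguoqj-shard-00-00.e5ahzxy.mongodb.net:27017,"
           "ac-zcguoqj-shard-00-01.e5ahzxy.mongodb.net:27017")
for host, port in parse_node_list(example):
    print(host, port)
```

Splitting on the last colon keeps the check simple; an entry without a numeric port is rejected rather than silently skipped.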

Database Name

The test connects to a specific database on the target MongoDB cluster to run API commands and pull metrics of interest. Specify the name of a database on the target MongoDB Cluster database server being monitored. The default value of this parameter is admin.

Username and Password

The eG agent has to be configured with the credentials of a user who has the required privileges to monitor the target MongoDB cluster, if the MongoDB instance has access control enabled. To know how to create such a user, refer to How to monitor access control enabled MongoDB database? If the target MongoDB instance does not have access control enabled, then specify none against the Username and Password parameters.

Confirm Password

Confirm the password by retyping it here.

Authentication Mechanism

MongoDB supports multiple authentication mechanisms that users can use to verify their identity. In environments where multiple authentication mechanisms are used, this test enables users to select the authentication mechanism of their choice from the Authentication Mechanism list box. By default, this is set to none. However, you can modify this setting as required.

SSL

By default, the SSL flag is set to No, indicating that the target MongoDB Cluster server is not SSL-enabled. To enable the test to connect to an SSL-enabled MongoDB Cluster server, set the SSL flag to Yes.

CA PEM File

A certificate authority (CA) file contains root and intermediate certificates that are electronically signed to affirm that a public key belongs to the owner named in the certificate. A .pem file is a container that may include just the public certificate or the entire certificate chain (private key, public key and root certificates). If the connection requires server authentication and the server certificate is in .pem format, the target instance presents its server certificate to clients to establish its identity, and the eG agent verifies that certificate using the CA PEM file. Therefore, specify the full path to the CA PEM file of the target MongoDB cluster in the CA PEM File text box. For example, the location of this file may be: C:\app\openSSL\SSLcert\test-ca.pem.

Client PEM File

If the target instance requires a certificate key file in .pem format from the client to verify the client's identity, the eG agent must be able to access the client certificate to establish a connection with the target cluster. For this, specify the full path to the Client PEM file in the Client PEM File text box. For example, the location of this file may be: C:\app\openSSL\SSLcert\test-client.pem.

CA Cert File

This parameter is applicable only if the target MongoDB Cluster is SSL-enabled and the CA PEM File parameter is set to none. The certificate file is a public-key certificate following the X.509 standard. It contains information about the identity of the server, such as its name, geolocation, and public key. Each node of the target cluster can have its own certificate file, or a single certificate can be used to access all the nodes in the cluster. Essentially, it is a certificate that the server presents to connecting clients to prove that it is what it claims to be. Therefore, specify the full path to the server root certificate, or to the CA-signed certificate file in .crt format, for each node (or for all nodes) in the CA Cert File text box. For example, the location of this file may be: C:\app\eGurkha\JRE\lib\security\mongodb-test-ca.crt. By default, this parameter is set to none.

This parameter specification differs according to the type of cluster and configuration:

  • If a certificate file is available for each node of the MongoDB Cluster, provide a comma-separated list of the full paths to the certificates in the CA Cert File text box.

    For example: C:\app\eGurkha\JRE\lib\security\mongodb-test-ca.crt,C:\app\eGurkha\JRE\lib\security\mongodb-test-ca2.crt,C:\app\eGurkha\JRE\lib\security\mongodb-test-ca3.crt

  • Specify the full path to the certificate file of the target MongoDB cluster if a single certificate is used to access all nodes.

    For example: C:\app\eGurkha\JRE\lib\security\mongodb-test-ca.crt

Client Cert File

This parameter is applicable only if the target MongoDB Cluster is SSL-enabled and the Client PEM File parameter is set to none. To collect metrics from the target MongoDB cluster, the eG agent requires a client certificate in .p12 format. Hence, specify the full path to the client certificate file in .p12 format in the Client Cert File text box. For example, the location of this file may be: C:\app\eGurkha\JRE\lib\security\test-client.p12. To know how to generate a .p12 file from a Client PEM file, refer to How to import a Certificate that is in the PEM Format? By default, this parameter is set to none.

Client Cert Password

Provide the password for .p12 Client certificate file in the Client Cert Password text box.

RS Node Cert Info

By default, this parameter is set to none, which indicates that the test will report metrics for all the nodes in the cluster. However, if a separate certificate file is required to access each node of the MongoDB cluster, specify a comma-separated list of the nodes in the cluster. The format of this configuration is: HOSTNAME1:PORT1,HOSTNAME2:PORT2,...

For example: 192.168.0.1:27011,192.168.0.2:27012,192.168.0.2:27013

Note:

The nodes should be listed in the same order corresponding to the list of certificates specified against CA Cert File.
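Because the two lists are matched by position, a mismatch in length or order pairs a node with the wrong certificate. The following minimal sketch shows the positional pairing, assuming both parameters hold plain comma-separated strings; pair_nodes_with_certs is a hypothetical helper, not the eG agent's implementation.

```python
def pair_nodes_with_certs(node_info: str, cert_files: str) -> dict[str, str]:
    """Match RS Node Cert Info entries to CA Cert File entries by position."""
    nodes = [n.strip() for n in node_info.split(",") if n.strip()]
    certs = [c.strip() for c in cert_files.split(",") if c.strip()]
    if len(nodes) != len(certs):
        raise ValueError(
            f"{len(nodes)} nodes but {len(certs)} certificate files; "
            "the two lists must have the same length and order")
    # zip() pairs the first node with the first certificate, and so on.
    return dict(zip(nodes, certs))


nodes = "192.168.0.1:27011,192.168.0.2:27012,192.168.0.2:27013"
certs = (r"C:\app\eGurkha\JRE\lib\security\mongodb-test-ca.crt,"
         r"C:\app\eGurkha\JRE\lib\security\mongodb-test-ca2.crt,"
         r"C:\app\eGurkha\JRE\lib\security\mongodb-test-ca3.crt")
for node, cert in pair_nodes_with_certs(nodes, certs).items():
    print(node, "->", cert)
```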

AWS Key ID, AWS Secret Key, Confirm Password

If you are monitoring MongoDB clusters hosted on the AWS cloud, the eG agent has to be configured with the AWS Access Key ID and Secret Key to connect to the AWS cloud and collect the required metrics. Therefore, specify the AWS Key ID and AWS Secret Key, and confirm the secret key by retyping it in the Confirm Password text box. To obtain the AWS Access key and secret key, refer to Obtaining AWS Access Key and Obtaining AWS Secret Key.

Atlas URI

MongoDB Atlas is a NoSQL Database-as-a-Service offering in the public cloud. If the target MongoDB Cluster is deployed and managed in MongoDB Atlas, specify the MongoDB Atlas connection URI (a unique identifier for connecting to a MongoDB replica set or sharded cluster) in the Atlas URI text box. The eG agent uses this URI to access the target MongoDB Cluster hosted on Atlas and collect the required metrics.

Report Manager Time

By default, this flag is set to Yes, indicating that, by default, the detailed diagnosis of this test, if enabled, will report the shutdown and reboot times of the device in the manager's time zone.

If this flag is set to No, then the shutdown and reboot times are shown in the time zone of the system where the agent is running (i.e., the system being managed for agent-based monitoring, and the system on which the remote agent is running for agentless monitoring).

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Is this Primary node?

Indicates whether/not this node is the primary node of the replica set.

 

This measure is not reported for the Summary descriptor.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value Numeric Value
Yes 1
No 0

Note:

By default, the measure reports only the Measure Values provided in the table above to indicate whether/not a node is the primary node. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Node health

Indicates the health of this node.

 

This measure is not reported for the Summary descriptor.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value Numeric Value
Down 0
Up 1

Note:

By default, the measure reports only the Measure Values provided in the table above to indicate whether/not a node is down. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Node state

Indicates the current status of this node.

 

This measure is not reported for the Summary descriptor.

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value Description Numeric Value
STARTUP Not yet an active member of any set. All members start up in this state. The mongod parses the replica set configuration document while in STARTUP. 0
PRIMARY The member in state primary is the only member that can accept write operations. Eligible to vote. 1
SECONDARY A member in state secondary is replicating the data store. Eligible to vote. 2
RECOVERING Members either perform startup self-checks, or transition from completing a rollback or resync. Eligible to vote. 3
STARTUP2 The member has joined the set and is running an initial sync. 5
UNKNOWN The member’s state, as seen from another member of the set, is not yet known. 6
ARBITER Arbiters do not replicate data and exist solely to participate in elections. An arbiter does not have a copy of the data set and cannot become a primary. Replica sets may have arbiters to add a vote in elections for primary. Arbiters always have exactly 1 election vote, and thus allow replica sets to have an uneven number of voting members without the overhead of an additional member that replicates data. 7
DOWN The member, as seen from another member of the set, is unreachable. 8
ROLLBACK This member is actively performing a rollback. A rollback reverts write operations on a former primary when the member rejoins its replica set after a failover. Data is not available for reads. 9
REMOVE This member was once in a replica set but was subsequently removed. 10

Note:

By default, the measure reports only the Measure Values provided in the table above to indicate the current status of a node. In the graph of this measure however, the same is indicated using the numeric equivalents only.
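The numeric values in the Node state table line up with the member state codes that MongoDB itself reports, so translating and tallying them is a simple lookup. The following minimal sketch translates codes into Measure Values and counts nodes per state, mirroring the per-state Summary measures described later. The helper names and the sample set of codes are hypothetical, and codes not listed in the table are flagged rather than guessed.

```python
from collections import Counter

# Numeric values taken from the Node state table (value 4 is not listed).
NODE_STATES = {
    0: "STARTUP", 1: "PRIMARY", 2: "SECONDARY", 3: "RECOVERING",
    5: "STARTUP2", 6: "UNKNOWN", 7: "ARBITER", 8: "DOWN",
    9: "ROLLBACK", 10: "REMOVE",
}


def state_name(code: int) -> str:
    """Translate a numeric state into its Measure Value."""
    return NODE_STATES.get(code, f"UNLISTED({code})")


def count_states(codes: list[int]) -> Counter:
    """Count nodes per state; states with no nodes read back as 0."""
    return Counter(state_name(c) for c in codes)


# Hypothetical five-member replica set: one primary, two secondaries,
# one arbiter and one unreachable member.
counts = count_states([1, 2, 2, 7, 8])
print(counts["SECONDARY"], counts["DOWN"], counts["ROLLBACK"])  # 2 1 0
```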

Total Number of Nodes

Indicates the total number of nodes in this cluster.

Number

This measure is reported only for the Summary descriptor.

Number of Primary Nodes

Indicates the number of primary nodes connected to the cluster.

Number

This measure is reported only for the Summary descriptor.

Number of Secondary Nodes

Indicates the number of secondary nodes connected to the cluster.

Number

This measure is reported only for the Summary descriptor.

Number of Available Nodes

Indicates the number of nodes in the cluster that are currently accessible.

Number

This measure is reported only for the Summary descriptor.

Number of Unavailable Nodes

Indicates the number of nodes in the cluster that are currently unavailable.

Number

This measure is reported only for the Summary descriptor.

A low value is desired for this measure.

Health state

Indicates the current health state of this cluster.

 

This measure is reported only for the Summary descriptor.

The values reported by this measure and their numeric equivalents are available in the table below:

Measure Value Description Numeric Value
Good This state indicates that all the nodes are available and at least one of them is a read-write node. 2
Degraded This state indicates that at least one read-write node is available, but one or more nodes are down. 1
Critical This state indicates that no read-write node is available, or that all the nodes are down. 0

Note:

By default, this measure reports the current health state of the cluster using the Measure Values listed in the table above. In the graph of this measure, however, the health state is indicated using the numeric equivalents only (0 to 2).
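The three health states can be read as rules applied in order: check for Critical first, then Degraded, otherwise Good. The following minimal sketch encodes those rules. It assumes that a read-write node is one that accepts writes (that is, a PRIMARY); the Node class and cluster_health function are hypothetical and are not the eG agent's implementation.

```python
from dataclasses import dataclass

# Numeric equivalents from the Health state table.
HEALTH_VALUES = {"Critical": 0, "Degraded": 1, "Good": 2}


@dataclass
class Node:
    name: str
    available: bool   # the node could be reached during this test run
    read_write: bool  # assumption: True for a node that accepts writes (PRIMARY)


def cluster_health(nodes: list[Node]) -> str:
    """Apply the Good/Degraded/Critical rules to one set of node results."""
    read_write_up = sum(1 for n in nodes if n.available and n.read_write)
    unavailable = sum(1 for n in nodes if not n.available)
    if read_write_up == 0 or unavailable == len(nodes):
        return "Critical"  # no read-write node, or every node is down
    if unavailable > 0:
        return "Degraded"  # writes still possible, but some nodes are down
    return "Good"          # all nodes up and at least one read-write node


nodes = [Node("node1:27017", True, True),
         Node("node2:27017", True, False),
         Node("node3:27017", False, False)]
state = cluster_health(nodes)
print(state, HEALTH_VALUES[state])  # Degraded 1
```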

Has the role of this primary node changed?

Indicates whether/not the role of this node has switched, i.e., whether a primary node has switched over to a secondary node or vice versa.

 

The values reported by this measure and their numeric equivalents are available in the table below:

Measure Value Numeric Value
No 0
Yes 1

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the role of the node has switched. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

For the Summary descriptor this measure will report whether/not the role of any node in the cluster has switched.
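Detecting a role switch requires remembering each node's role from the previous test run. The following minimal sketch assumes that such a per-node record is kept between runs; the helper names and sample data are hypothetical, and eG's actual implementation may differ. A node counts as switched only when it moves between PRIMARY and SECONDARY, and the Summary result is Yes if any node switched.

```python
from typing import Dict, Optional

ROLES = {"PRIMARY", "SECONDARY"}


def role_switched(before: Optional[str], after: str) -> bool:
    """True when a node moved from primary to secondary or vice versa."""
    return before in ROLES and after in ROLES and before != after


def role_changes(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Per-node Yes/No result; nodes absent from the previous run report No."""
    return {
        node: "Yes" if role_switched(previous.get(node), role) else "No"
        for node, role in current.items()
    }


# Hypothetical roles recorded in two consecutive test runs (a failover).
previous = {"node1:27017": "PRIMARY", "node2:27017": "SECONDARY",
            "node3:27017": "ARBITER"}
current = {"node1:27017": "SECONDARY", "node2:27017": "PRIMARY",
           "node3:27017": "ARBITER"}

changes = role_changes(previous, current)
summary = "Yes" if "Yes" in changes.values() else "No"
print(changes)  # {'node1:27017': 'Yes', 'node2:27017': 'Yes', 'node3:27017': 'No'}
print(summary)  # Yes
```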

Is this node in read-write mode?

Indicates whether/not this node is in read-write mode.

 

This measure is not reported for the Summary descriptor.

The values reported by this measure and their numeric equivalents are available in the table below:

Measure Value Numeric Value
No 0
Yes 1

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the node is in read-write mode. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.

Number of startup nodes

Indicates the number of nodes in the cluster that are currently in STARTUP state.

Number

This measure is reported only for the Summary descriptor.

Number of recovering nodes

Indicates the number of nodes in the cluster that are currently in RECOVERING state.

Number

This measure is reported only for the Summary descriptor.

Number of startup2 nodes

Indicates the number of nodes in the cluster that are currently in STARTUP2 state.

Number

This measure is reported only for the Summary descriptor.

Number of unknown nodes

Indicates the number of nodes in the cluster that are currently in UNKNOWN state.

Number

This measure is reported only for the Summary descriptor.

Number of arbiter nodes

Indicates the number of arbiter nodes in the cluster.

Number

This measure is reported only for the Summary descriptor.

Number of down nodes

Indicates the number of nodes in the cluster that are currently in DOWN state.

Number

This measure is reported only for the Summary descriptor.

Number of rollback nodes

Indicates the number of nodes in the cluster that are currently in ROLLBACK state.

Number

This measure is reported only for the Summary descriptor.

Number of remove nodes

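Indicates the number of nodes in the cluster that are in the REMOVE state, i.e., nodes that were once members of the replica set but were subsequently removed.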

Number

This measure is reported only for the Summary descriptor.

Number of read-write nodes

Indicates the number of read-write nodes in the cluster.

Number

This measure is reported only for the Summary descriptor.

If the value of this measure is less than 1, it is a cause for concern, as it implies that no node in the cluster is available in read-write mode to accept write operations.

Number of read-only nodes

Indicates the number of read-only nodes in the cluster.

Number

This measure is reported only for the Summary descriptor.

Read-only nodes are replicas of a cluster’s primary node located in additional geographical regions. Using read-only nodes reduces latency for users connecting from those regions.