Troubleshooting DRBD
This article presents common DRBD problems and solutions.
DRBD (Distributed Replicated Block Device) runs on the master and slave nodes only, and is responsible for mirroring the contents of a partition between master and slave.
Typical problems in DRBD include:
- A lack of Primary-Secondary connectivity
- The Secondary operating in standalone mode
- Both nodes reporting connectivity but neither one in the role of master
- Both nodes reporting themselves in the role of master
Verify the DRBD status
Use the following command to verify that DRBD is operating normally on the master and slave nodes:
drbd-overview
When run on the master node, the output should look like the following:
1:r0/0 Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt/drbd ...
When run on the slave node, the output should look like the following:
1:r0/0 Connected Secondary/Primary UpToDate/UpToDate C r-----
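When checking more than one node, it can help to split a status line into its fields before interpreting it. The following is a minimal sketch, assuming the column order shown in the examples above (resource, connection state, roles, disk states). The function name parse_drbd_line and the field labels are hypothetical and are not part of DRBD.

```shell
# Split one line of drbd-overview output into labeled fields.
# Assumed layout: <minor>:<resource>/<volume> <connection> <roles> <disk states> ...
# The body runs in a subshell ( ... ) so its variables do not leak into the caller.
parse_drbd_line() (
    read -r resource cstate roles disks _rest <<EOF
$1
EOF
    printf 'resource=%s\n'   "$resource"
    printf 'connection=%s\n' "$cstate"
    printf 'local_role=%s\n' "${roles%%/*}"   # text before the slash
    printf 'peer_role=%s\n'  "${roles#*/}"    # text after the slash
    printf 'local_disk=%s\n' "${disks%%/*}"
    printf 'peer_disk=%s\n'  "${disks#*/}"
)
```

For example, passing the master node's line from above to parse_drbd_line would report connection=Connected, local_role=Primary and peer_role=Secondary.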
The following sections are examples of issues found with DRBD and how to resolve them.
Master Node: WFConnection
1:r0/0 WFConnection Primary/Unknown UpToDate/Unknown C r----- /mnt/drbd ...
Summary: The DRBD master node cannot connect to the DRBD slave node:
WFConnection | The master node is waiting for a connection from the slave node (i.e. the slave node cannot be found on the network). |
Primary/Unknown | This node is the master, but the slave node cannot be reached. |
UpToDate/Unknown | The database on the master is up to date, but the state of the database on the slave node is not known. |
Action Required: Make the connection manually. Refer to the instructions in “Manually Connecting the DRBD Slave to the Master”.
Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See “Correcting a DRBD Split Brain”.
Slave Node: StandAlone
1:r0/0 StandAlone Secondary/Unknown UpToDate/Unknown r-----
Summary: The slave cannot connect to the master.
StandAlone | The slave node is operating on its own. |
Secondary/Unknown | The slave node is the secondary, but the primary cannot be found. |
UpToDate/Unknown | The database on the slave node is up to date, but the state of the one on the master is unknown. |
Action Required: Make the connection manually. Refer to the instructions in “Manually Connecting the DRBD Slave to the Master”.
Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See “Correcting a DRBD Split Brain”.
Both Nodes: Secondary/Secondary
1:r0/0 Connected Secondary/Secondary UpToDate/UpToDate C r-----
Summary: The nodes are connected, but neither is master.
Connected | A connection is established. |
Secondary/Secondary | Both nodes are operating as the slave node. That is, each is acting as the peer that receives updates. |
UpToDate/UpToDate | The databases on both the local node and the remote node are up to date. |
Action Required: This usually indicates a failure within the Pacemaker PostgreSQL resource group. For example, if Pacemaker cannot mount the DRBD device as a file system, DRBD will start successfully, but writing data to disk and database replication cannot take place.
To investigate the issue further:
- Use the Pacemaker Cluster Resource Monitor to verify if all services are running.
crm_mon -f
- Reset fail counts.
- Restart failed Pacemaker resources or the underlying Linux services.
Note: Solving this issue can be complex. If the above suggestions do not resolve the problem, consult your Avid representative for further troubleshooting.
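The article does not name a command for resetting fail counts. One possibility, assuming the standard Pacemaker crm_resource tool is installed on the node, is sketched below. The helper cleanup_resources and the DRY_RUN variable are hypothetical. By default the helper only prints the commands so they can be reviewed before anything is changed.

```shell
# Print, or run, a Pacemaker cleanup for each named resource.
# A cleanup clears the resource's failure history, including its fail count.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to execute for real.
cleanup_resources() {
    for rsc in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "crm_resource --cleanup --resource $rsc"
        else
            crm_resource --cleanup --resource "$rsc"
        fi
    done
}
```

The resource names to pass are the ones reported with a non-zero fail count by crm_mon -f on your own cluster.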
Both Nodes: StandAlone and Primary
1:r0/0 StandAlone Primary/Unknown UpToDate/Unknown C r----- /mnt/drbd ...
1:r0/0 StandAlone Primary/Unknown UpToDate/Unknown C r-----
Summary: A DRBD “split brain” has occurred. Both nodes are operating independently, reporting themselves as the master node, and claiming their database is up to date.
StandAlone | Each node is running without a connection to its peer. Each will operate independently until an administrator manually intervenes. |
Primary/Unknown | The local node believes it is the DRBD master. The remote node is not connected, which results in the Unknown status. Note: The key indicator of this type of DRBD split brain is both nodes reporting themselves as the Primary. |
UpToDate/Unknown | The database on the local node is up to date, but the state of the database on the remote node is not known. |
Action Required: Discard the data on the slave node and reconnect it to the DRBD resource on the master node. Refer to the instructions in “Correcting a DRBD Split Brain”.
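The cases described in the preceding sections can be summarized as a small lookup from the connection state and roles reported by one node to the suggested next step. This is only a sketch: diagnose_drbd is a hypothetical helper, it looks at a single node, and a split brain can only be confirmed by comparing the output from both nodes.

```shell
# Map one node's connection state ($1) and roles ($2, local/peer)
# to the matching case from this article.
diagnose_drbd() {
    case "$1 $2" in
        "Connected Primary/Secondary"|"Connected Secondary/Primary")
            echo "healthy" ;;
        "Connected Secondary/Secondary")
            echo "no master: check the Pacemaker PostgreSQL resource group" ;;
        "WFConnection Primary/Unknown")
            echo "master waiting: connect the slave manually (split brain if the slave is StandAlone)" ;;
        "StandAlone Secondary/Unknown")
            echo "slave standalone: connect the slave manually (split brain if the master is WFConnection)" ;;
        "StandAlone Primary/Unknown")
            echo "split brain if the other node reports the same: correct the split brain" ;;
        *)
            echo "unrecognized state: $1 $2" ;;
    esac
}
```

For instance, diagnose_drbd Connected Secondary/Secondary points to the Pacemaker checks rather than to a DRBD reconnect.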
Solution A - Manually Connecting the DRBD Slave to the Master
When the master and slave nodes are not connecting automatically, you will have to make the connection manually. You do so by telling the slave node to connect to the resource owned by the master. The process below is only valid if the DRBD master node is in a WFConnection state.
To manually connect the DRBD slave to the master:
- Log in to any node in the cluster as root and start the Cluster Resource Monitor utility:
crm_mon
- To identify the slave, look for the line containing “Master/Slave Set”. For example:
Master/Slave Set: ms_drbd_postgres [drbd_postgres]
Masters: [ mcs-1 ]
Slaves: [ mcs-2 ]
Note: In this situation, the DRBD master may not be the same as the Pacemaker cluster master. Use the tools detailed in this document to identify the DRBD master node.
- On the slave node, run the following command:
drbdadm connect r0
- Verify the reconnection was successful:
drbd-overview
- The output on the master node should resemble the following:
1:r0/0 Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt/drbd ...
- The output on the slave node should resemble the following:
1:r0/0 Connected Secondary/Primary UpToDate/UpToDate C r-----
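The connect and verify steps above could be combined on the slave node roughly as follows. This is a sketch under assumptions, not a tested procedure: connect_and_wait, DRBDADM and POLL_INTERVAL are hypothetical names, and the check uses drbdadm cstate (which prints only the connection state) in place of reading the drbd-overview output by eye. If the nodes need to resynchronize, the state may pass through other values before reaching Connected, so a timeout here does not necessarily mean the connection failed.

```shell
# Command used to talk to DRBD; overridable so the function can be tried without DRBD.
DRBDADM="${DRBDADM:-drbdadm}"

# Connect resource $1 and poll until it reports Connected.
# $2 = maximum number of polls (default 10).
# POLL_INTERVAL = seconds to sleep between polls (default 2).
connect_and_wait() {
    res="$1"
    tries="${2:-10}"
    $DRBDADM connect "$res" || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$($DRBDADM cstate "$res")" = "Connected" ] && return 0
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-2}"
    done
    return 1   # still not Connected after the allotted polls
}
```

On the slave node this would be called as connect_and_wait r0, followed by drbd-overview on both nodes to confirm the roles and disk states.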
Solution B - Correcting a DRBD Split Brain
A DRBD split brain describes the situation in which both DRBD nodes are operating completely independently. There is no connection between them, so data replication is not taking place. A DRBD split brain must be remedied as soon as possible, as data can easily be lost due to the lack of replication between the nodes.
To recover from a split brain, you must force the MCS cluster master node to take on the role of DRBD master. You then discard the database associated with the DRBD slave node, and reconnect it to the established master.
Note: Discarding the database on the slave node does not result in a full re-synchronization from master to slave. The slave node has its local modifications rolled back, and modifications made to the master are propagated to the slave.
To recover from a DRBD split brain:
- Log in to any node in the cluster as root and start the Cluster Resource Monitor utility:
crm_mon
- Identify the master node.
To identify the master, look for the line containing “Master/Slave Set”. For example:
Master/Slave Set: ms_drbd_postgres [drbd_postgres]
Masters: [ wavd-mcs01 ]
Slaves: [ wavd-mcs02 ]
Note: You may not be able to identify the master node through the Cluster Resource Monitor when DRBD is running in a split brain state. In this event, you must determine the master node using your best judgment.
- On the master node, run the following command:
drbdadm connect r0
This ensures the master node is connected to the r0 resource. This DRBD resource holds the databases. It was given the name r0 during the initial DRBD creation process.
- On the slave node, run the following command:
drbdadm connect --discard-my-data r0
If the slave node is already in a "WFConnection" state, you will see the following message:
Failure: (102) Local address (port) already in use.
If you encounter this message, explicitly disconnect the slave node from the resource using the following command and then repeat the connect command:
drbdadm disconnect r0
- Verify the recovery was successful:
drbd-overview
- The output on the master node should resemble the following:
1:r0/0 Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt/drbd ...
- The output on the slave node should resemble the following:
1:r0/0 Connected Secondary/Primary UpToDate/UpToDate C r-----
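The slave-side part of this procedure, including the disconnect-and-retry step for the "(102) Local address (port) already in use" message, might be scripted along these lines. This is a sketch only: discard_and_reconnect and DRBDADM are hypothetical names, and it assumes drbdadm exits with a non-zero status when the connect attempt fails. Because the command discards the local data on the node where it runs, it should only ever be run on the node you have decided is the slave.

```shell
# Command used to talk to DRBD; overridable so the function can be tried without DRBD.
DRBDADM="${DRBDADM:-drbdadm}"

# Run on the SLAVE only: discard local changes on resource $1 and reconnect.
discard_and_reconnect() {
    res="$1"
    # First attempt; succeeds when the node is not already waiting for a connection.
    if $DRBDADM connect --discard-my-data "$res"; then
        return 0
    fi
    # The attempt can fail if the node is already in WFConnection.
    # Disconnect explicitly, then repeat the connect command once.
    $DRBDADM disconnect "$res" || return 1
    $DRBDADM connect --discard-my-data "$res"
}
```

After discard_and_reconnect r0 returns successfully on the slave, run drbd-overview on both nodes to confirm the result, as in the verification step above.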