Cluster Status
Introduction
FNZ Studio provides the following functionality to maintain the stability and resilience of FNZ Studio clusters:
- REST endpoints can be used to query the status of the cluster or the status of the cluster nodes.
- The Cluster Node Status Health Sensor reports the status of the cluster nodes.
- The Cluster Supervisor Service functionality improves the stability of the cluster in case of network partitioning (also known as 'split-brain' syndrome).
REST Endpoints
Cluster Status Endpoint
Endpoint: rest-api/cluster-status
The Cluster Status Endpoint reports the status of the cluster overall and returns a 200 HTTP response code if the cluster is acting normally and a 503 HTTP response code otherwise.
The following statuses can be reported by this endpoint:
- STARTING - The cluster nodes are starting and not yet ready
- RUNNING - The cluster is running successfully
- RESIZING - The membership of the cluster is currently changing
- UNSTABLE - The cluster is not stable due to some changes in the Hazelcast state
- STOPPING - The cluster is stopping
The 200 HTTP response code is returned with the RUNNING and RESIZING statuses, while the 503 HTTP response is returned for all the other statuses.
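The mapping between cluster statuses and HTTP response codes described above can be summarized in a small lookup. This is only an illustrative sketch: the names `CLUSTER_STATUS_HTTP_CODES` and `expected_http_code` are hypothetical and are not part of FNZ Studio.

```python
# Statuses and response codes exactly as listed in the text above.
# The dictionary and function names are hypothetical, for illustration only.
CLUSTER_STATUS_HTTP_CODES = {
    "STARTING": 503,
    "RUNNING": 200,
    "RESIZING": 200,
    "UNSTABLE": 503,
    "STOPPING": 503,
}


def expected_http_code(status: str) -> int:
    """Return the HTTP code the documentation associates with a cluster status."""
    return CLUSTER_STATUS_HTTP_CODES[status]


if __name__ == "__main__":
    for name, code in CLUSTER_STATUS_HTTP_CODES.items():
        print(f"{name}: {code}")
```

Note that RESIZING counts as healthy for the cluster as a whole, so a monitor that only checks the response code treats a membership change as normal operation.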
Local Cluster Node Endpoint
Endpoint: rest-api/cluster-status/local-node
The Local Cluster Node Endpoint reports the status of the local cluster node and returns a 200 HTTP response code if the cluster node is acting normally and a 503 HTTP response code otherwise.
FNZ Studio load balancers should be configured so that they query this endpoint and only send traffic to a node if it is healthy (200 HTTP response code). The endpoint has to be queried on each cluster node separately.
The following statuses can be reported by this endpoint:
- STARTING - The cluster node is starting and it is not yet ready to receive traffic
- RUNNING - The cluster node is running and can receive traffic successfully
- UNSTABLE - The cluster node is not stable due to some changes in the Hazelcast state
- LOST_DATA - Not all data is available on this cluster node at the moment
- STOPPING - The cluster node is stopping and it should not receive any traffic
The 200 HTTP response code is returned with the RUNNING status, while the 503 HTTP response is returned for all the other statuses.
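As a rough sketch of the per-node check a load balancer performs, the following Python snippet queries the local node endpoint on each node and keeps only the nodes that answer with 200. The base URLs, the timeout, and the function names are assumptions made for this example; only the endpoint path comes from the text above. In practice this check is normally configured in the load balancer itself rather than scripted.

```python
import urllib.error
import urllib.request

# Endpoint path from the documentation; everything else here is hypothetical.
LOCAL_NODE_PATH = "rest-api/cluster-status/local-node"


def fetch_status_code(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for non-2xx responses such as 503.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def healthy_nodes(node_base_urls, fetch=fetch_status_code):
    """Query each node separately and return those that report 200."""
    healthy = []
    for base in node_base_urls:
        url = base.rstrip("/") + "/" + LOCAL_NODE_PATH
        if fetch(url) == 200:
            healthy.append(base)
    return healthy
```

The `fetch` parameter exists only so the logic can be exercised without a running cluster, for example by passing a function that returns canned status codes.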
REST Endpoints Authorization
These REST endpoints are accessible if a user is logged in or if the user (or agent) IP is whitelisted.
Access to these endpoints can be customized by specifying which IP addresses are allowed to call them. To do this, change the property nm.cluster.status.endpoints.allowed.ips.
The default value of this property is 127.0.0.1,0:0:0:0:0:0:0:1, so only local requests are allowed by default.
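To illustrate what the default value permits, the sketch below parses a comma-separated list of IP addresses and checks a client address against it. It assumes exact address matching; how FNZ Studio actually evaluates the property (for example, whether ranges are supported) is not stated here, and the function names are hypothetical.

```python
import ipaddress

# Default value of nm.cluster.status.endpoints.allowed.ips, as given above.
DEFAULT_ALLOWED_IPS = "127.0.0.1,0:0:0:0:0:0:0:1"


def parse_allowed_ips(value: str):
    """Parse a comma-separated list of IP addresses into a set."""
    return {
        ipaddress.ip_address(part.strip())
        for part in value.split(",")
        if part.strip()
    }


def is_allowed(client_ip: str, allowed_value: str = DEFAULT_ALLOWED_IPS) -> bool:
    """Assumption: access is granted only on an exact address match."""
    return ipaddress.ip_address(client_ip) in parse_allowed_ips(allowed_value)
```

With the default value, only the IPv4 and IPv6 loopback addresses match, which is consistent with only local requests being allowed.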
Cluster Node Status Sensor
The Cluster Node Status Sensor is a System Health Sensor that reports the status of the local cluster node.
| Error code | Message |
|---|---|
| OK | The status of the local cluster node is RUNNING. |
| ERROR | The status of the local cluster node is other than RUNNING. |
Cluster Supervisor Service
During network partitioning (split brain), the activities that run on cluster nodes might return incorrect results or corrupt data (suspended Process Instances). Examples of such activities are the Process Engine, Threads (Node State Collector), and Scheduled Jobs.
Network partitioning is generally caused by network issues. However, during large Garbage Collections, communication between cluster nodes can also be suspended long enough (more than 1 minute) that the node running the Garbage Collection is removed from the cluster.
To prevent this faulty behavior during network partitioning, the Cluster Supervisor Service functionality suspends the following activities on a cluster node that reaches the LOST_DATA status:
- Process Engine Threads
- Scheduled Jobs (including custom jobs)
- Node State Collector thread
When the status of the cluster node changes to RUNNING, the activity on that node is resumed.
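The suspend and resume behavior described above can be pictured as a small state machine. The following is a conceptual sketch only, not the actual Cluster Supervisor Service: the class and method names are hypothetical, and it assumes that statuses other than LOST_DATA and RUNNING leave the current state unchanged, which the text does not specify.

```python
class SupervisorSketch:
    """Conceptual model of suspending activities on LOST_DATA."""

    # Activities listed in the text above.
    ACTIVITIES = (
        "Process Engine Threads",
        "Scheduled Jobs",
        "Node State Collector thread",
    )

    def __init__(self):
        self.suspended = False

    def on_status_change(self, status: str) -> bool:
        """Update the suspended flag for a new node status and return it."""
        if status == "LOST_DATA":
            self.suspended = True
        elif status == "RUNNING":
            self.suspended = False
        # Assumption: any other status keeps the current state.
        return self.suspended

    def active_activities(self):
        """Return the activities that are currently allowed to run."""
        return () if self.suspended else self.ACTIVITIES
```

For example, feeding the statuses RUNNING, LOST_DATA, RUNNING into `on_status_change` yields a node that runs all three activities, then none, then all three again.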