Cluster Status

Introduction

FNZ Studio provides the following features to help maintain the stability and resilience of FNZ Studio clusters:

  1. REST endpoints can be used to query the status of the cluster or the status of the cluster nodes.
  2. The Cluster Node Status Health Sensor reports the status of the cluster nodes.
  3. The Cluster Supervisor Service improves the stability of the cluster in case of network partitioning (also known as 'split-brain' syndrome).

REST Endpoints

Cluster Status Endpoint

Endpoint: rest-api/cluster-status

The Cluster Status Endpoint reports the status of the cluster overall and returns a 200 HTTP response code if the cluster is acting normally and a 503 HTTP response code otherwise.

The following statuses can be reported by this endpoint:

  • STARTING - The cluster nodes are starting and not yet ready
  • RUNNING - The cluster is running successfully
  • RESIZING - The membership of the cluster is currently changing
  • UNSTABLE - The cluster is not stable due to some changes in the Hazelcast state
  • STOPPING - The cluster is stopping

The 200 HTTP response code is returned for the RUNNING and RESIZING statuses, while the 503 HTTP response code is returned for all other statuses.
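
The mapping between cluster status and HTTP response code described above can be summarized in a few lines. This is only an illustrative sketch in Python: the names CLUSTER_STATUSES, HEALTHY_CLUSTER_STATUSES, and expected_http_code are hypothetical and are not part of FNZ Studio.

```python
# Statuses documented for the rest-api/cluster-status endpoint.
CLUSTER_STATUSES = ("STARTING", "RUNNING", "RESIZING", "UNSTABLE", "STOPPING")

# Per the description above, only these two statuses are served with HTTP 200.
HEALTHY_CLUSTER_STATUSES = frozenset({"RUNNING", "RESIZING"})


def expected_http_code(status: str) -> int:
    """Return the HTTP response code documented for a cluster status."""
    if status not in CLUSTER_STATUSES:
        raise ValueError(f"unknown cluster status: {status!r}")
    return 200 if status in HEALTHY_CLUSTER_STATUSES else 503
```

A monitoring script could use such a table to check that the response code and the reported status agree.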

Local Cluster Node Endpoint

Endpoint: rest-api/cluster-status/local-node

The Local Cluster Node Endpoint reports the status of the local cluster node and returns a 200 HTTP response code if the cluster node is acting normally and a 503 HTTP response code otherwise.

Load balancers for FNZ Studio should be configured to query this endpoint and to send traffic only to nodes that are healthy (200 HTTP response code). The endpoint must be queried on each cluster node separately.

The following statuses can be reported by this endpoint:

  • STARTING - The cluster node is starting and it is not yet ready to receive traffic
  • RUNNING - The cluster node is running and can receive traffic successfully
  • UNSTABLE - The cluster node is not stable due to some changes in the Hazelcast state
  • LOST_DATA - Not all data is available on this cluster node at the moment
  • STOPPING - The cluster node is stopping and it should not receive any traffic

The 200 HTTP response code is returned for the RUNNING status, while the 503 HTTP response code is returned for all other statuses.
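
As a rough illustration of the per-node check that a load balancer or monitoring script performs, the following Python sketch queries the local-node endpoint on each node and keeps only the nodes that answer with 200. The base URLs, the helper names http_status and healthy_nodes, and the 2-second timeout are assumptions made for this example. It also assumes the caller is allowed to reach the endpoint (see REST Endpoints Authorization).

```python
import urllib.error
import urllib.request

LOCAL_NODE_PATH = "rest-api/cluster-status/local-node"


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET on url, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code (e.g. 503) is what we need.
        return error.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar problems.
        return 0


def healthy_nodes(base_urls, probe=http_status):
    """Return the base URLs whose local-node endpoint answered with 200."""
    healthy = []
    for base_url in base_urls:
        url = f"{base_url.rstrip('/')}/{LOCAL_NODE_PATH}"
        if probe(url) == 200:
            healthy.append(base_url)
    return healthy
```

For example, healthy_nodes(["http://node1:8080/studio", "http://node2:8080/studio"]) would return only the nodes currently in the RUNNING status; the host names and context path here are placeholders.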

REST Endpoints Authorization

These REST endpoints are accessible if a user is logged in or if the IP address of the user (or agent) is whitelisted.

Access to the endpoints can be customized by specifying which IP addresses are allowed. This is done by changing the property nm.cluster.status.endpoints.allowed.ips.

The default value of this property is 127.0.0.1,0:0:0:0:0:0:0:1 (the IPv4 and IPv6 loopback addresses), so only local requests are allowed by default.
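
To make the format of the property value concrete, the sketch below parses a comma-separated list of addresses and checks a client address against it. It is written in Python for illustration and does not show how FNZ Studio evaluates the property internally; in particular, it assumes exact address matching, and whether the product accepts ranges or wildcards is not covered here. The function names are hypothetical.

```python
import ipaddress

# Default value of nm.cluster.status.endpoints.allowed.ips.
DEFAULT_ALLOWED_IPS = "127.0.0.1,0:0:0:0:0:0:0:1"


def parse_allowed_ips(property_value: str):
    """Parse a comma-separated list of IP addresses into address objects."""
    return {
        ipaddress.ip_address(item.strip())
        for item in property_value.split(",")
        if item.strip()
    }


def is_allowed(client_ip: str, property_value: str = DEFAULT_ALLOWED_IPS) -> bool:
    """Return True if client_ip is one of the listed addresses (exact match)."""
    return ipaddress.ip_address(client_ip) in parse_allowed_ips(property_value)
```

Because the addresses are compared as parsed objects, the long form 0:0:0:0:0:0:0:1 and the short form ::1 are treated as the same address in this sketch.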

Cluster Node Status Sensor

The Cluster Node Status Sensor is a System Health Sensor that reports the status of the local cluster node.

Error code    Message
OK            The status of the local cluster node is RUNNING.
ERROR         The status of the local cluster node is other than RUNNING.

Cluster Supervisor Service

During network partitioning (split brain), the activities that run on cluster nodes might return incorrect results or corrupt data (suspended Process Instances). Examples of such activities are the Process Engine, threads (such as the Node State Collector), and Scheduled Jobs.

Network partitioning is generally caused by network issues. However, a long Garbage Collection can also suspend communication between cluster nodes for long enough (more than 1 minute) that the node performing the Garbage Collection is removed from the cluster.

To prevent this faulty behavior during network partitioning, the Cluster Supervisor Service suspends the following activities on a cluster node that reaches the LOST_DATA status:

  • Process Engine Threads
  • Scheduled Jobs (including custom jobs)
  • Node State Collector thread

When the status of the cluster node changes to RUNNING, the activity on that node is resumed.
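
The suspend and resume behavior described above can be pictured as a small state machine. The following Python sketch is a simplified model, not the actual Cluster Supervisor Service: the class name SupervisorSketch and its log attribute are hypothetical, and the handling of statuses other than LOST_DATA and RUNNING (left unchanged here) is an assumption.

```python
# Activities that the Cluster Supervisor Service suspends, as listed above.
SUPERVISED_ACTIVITIES = (
    "Process Engine Threads",
    "Scheduled Jobs",
    "Node State Collector thread",
)


class SupervisorSketch:
    """Tracks whether the supervised activities are suspended on one node."""

    def __init__(self):
        self.suspended = False
        self.log = []  # (action, activity) pairs, recorded for illustration

    def on_status_change(self, status: str) -> None:
        if status == "LOST_DATA" and not self.suspended:
            self.suspended = True
            self.log.extend(("suspend", a) for a in SUPERVISED_ACTIVITIES)
        elif status == "RUNNING" and self.suspended:
            self.suspended = False
            self.log.extend(("resume", a) for a in SUPERVISED_ACTIVITIES)
        # Any other status leaves the current state unchanged in this sketch.
```

In this model, a node that goes from RUNNING to LOST_DATA and back to RUNNING records three suspend entries followed by three resume entries.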