System Startup And Shutdown

Introduction

FNZ Studio is a distributed system, which makes startup and shutdown complex: the nodes of the cluster have to coordinate with each other. Both startup and shutdown consist of three phases:

  1. Before init/stop of cluster service
  2. Init/Stop of cluster service
  3. After init/stop of cluster service

Furthermore, we distinguish four cases of startup and shutdown:

  1. Full system startup
  2. Full system stop
  3. Adding a node
  4. Removing a node

The four cases are described individually in the following chapters. Cheat sheets providing a summary are available for download: startup-and-shutdown-cheat-sheets.pdf

Important! For a step-by-step description of how to safely trigger a shutdown of the full cluster, see Shutting down a large cluster.

Full System Startup

During a full system startup, the following steps happen:

  1. Each application server is started.

  2. FNZ Studio starts on each node.

  3. FNZ Studio nodes connect to each other: If a minimum cluster size is configured, the nodes will wait for each other while creating the Hazelcast instance.

  4. One node goes first and performs a system-wide initialization.

  5. All other nodes follow.

Figure 1: Full system startup

Figure 1 shows a full system startup of three FNZ Studio nodes, where the three vertical threads represent the timelines of the different nodes, and time flows from top to bottom. The horizontal lines separate the three phases (before/during/after cluster service startup).
In the following sections, the three phases are described in more detail.

Before Init of Cluster Service

This phase corresponds to the topmost part of Figure 1. Before the Cluster Service is started, no communication between the different nodes is possible and therefore every node starts individually. The application server and FNZ Studio are started on every node. FNZ Studio's context listener is called and the following steps are executed:

  1. Remember startup time (see Studio > Overview: Uptime)

  2. Find and validate data home (see the Data Home step in Figure 1)

    • Configure the data home based on the following sources, in order of precedence (see the sketch at the end of this section):
      • Servlet context attribute nm.data.home
      • Servlet init parameter nm.data.home
      • Java system property nm.data.home
      • Environment variable NM_DATA_HOME
      • Classpath default property nm.data.home
    • Required directory structure
      • Required: conf directory
      • Required to not exist: several subdirectories which existed in Appway 5.3
    • If not found or invalid, startup (of this node) fails
  3. Configure Log4j

    • Skip initialization if nm.log4j.initialize is false
    • Load configuration from the following locations in this order:
      • {nm.data.home}/conf/log4j.properties
      • classloader://com/nm/conf/log4j.properties
    • After these properties have been loaded successfully, additional properties are loaded (if present) from
      • {nm.data.home}/conf/log4j-additional.properties
  4. Check system, JAAS, and Hazelcast config

  5. Prepare data home directory structure

  6. Validate context class loader

  7. Start application service

  8. Register bean utils converters for XML digesters

  9. Initialize UID generator

    • Initialize with current system time
    • Load previous state from
      • {nm.data.home}/conf/uid.properties[.tmp]
      • Uses nm.uid.prefix
    • Save current state
      • Thread which runs every 5 minutes
      • Thread runs every 1 minute if many UIDs are requested
      • On shutdown
      • State is stored to uid.properties.tmp and then renamed to uid.properties (see the atomic-rename sketch at the end of this section)
  10. Load and validate configuration

    • Prepare configuration schema and property types
      • classloader://com/nm/conf/default.types.properties
    • Prepare certificates
      • classloader://com/nm/conf/numcom[2].crt
    • Load default properties
      • classloader://com/nm/conf/default.properties
    • Load unique and no-sync property names
      • classloader://com/nm/conf/unique.properties
        • nm.uid.prefix, nm.cluster.local.nodename
      • classloader://com/nm/conf/nosync.properties
        • nm.data.home
    • Load installation properties
    • Ensure license
    • Load content properties
    • Load java (aka system) properties
    • Load server properties
    • Load memory properties
    • Restrictions:
      • No nm.license.*
      • nm.* properties only if known
    • Overwrite order (all relative to {nm.data.home}/)
      • conf/installation.properties
      • conf/license.properties
      • conf/content.properties
      • conf/conf.properties
    • Validate
      • nm.uid.prefix: blank or valid
      • nm.cluster.local.nodename: not blank
    • Warn
      • If nm.uid.prefix is blank
    • If the configuration cannot be activated or is invalid, startup (of this node) fails [1]
  11. Initialize OWASP Enterprise Security API (ESAPI)

  12. Initialize BeanShell framework

  13. Start cluster map config service

  14. Start adapter service (aka extension service)

  15. Start execution statistics service

  16. Start web request service

  17. Start plugins

    • If startup of any plugin fails, the startup fails

      [1] Note that the configuration property nm.cluster.shutdown.jvm.stop.onStartupFailure (default = true) causes the JVM to shut down in case of a forced cluster shutdown due to a startup failure. See also the Full System Shutdown section.
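
The precedence described in step 2 can be illustrated with a small Java sketch. This is not FNZ Studio code: the class and method names are hypothetical, the javax.servlet API is assumed, and loading of the classpath default property is omitted.

    // Hypothetical helper, not FNZ Studio code; javax.servlet API assumed.
    import javax.servlet.ServletContext;

    public final class DataHomeResolverSketch {

        private static final String KEY = "nm.data.home";

        public static String resolve(ServletContext ctx, String classpathDefault) {
            // 1. Servlet context attribute nm.data.home
            Object attribute = ctx.getAttribute(KEY);
            if (attribute instanceof String && !((String) attribute).isBlank()) {
                return (String) attribute;
            }
            // 2. Servlet init parameter nm.data.home
            String initParam = ctx.getInitParameter(KEY);
            if (initParam != null && !initParam.isBlank()) {
                return initParam;
            }
            // 3. Java system property nm.data.home
            String sysProp = System.getProperty(KEY);
            if (sysProp != null && !sysProp.isBlank()) {
                return sysProp;
            }
            // 4. Environment variable NM_DATA_HOME
            String env = System.getenv("NM_DATA_HOME");
            if (env != null && !env.isBlank()) {
                return env;
            }
            // 5. Classpath default property (loading that properties file is omitted here)
            return classpathDefault;
        }
    }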
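
The uid.properties save in step 9 uses a write-then-rename pattern. The sketch below illustrates that pattern with standard java.nio calls only; the class name is made up and the actual implementation may differ.

    // Illustration of the write-then-rename pattern only; class name is made up.
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.Properties;

    public final class UidStateSaverSketch {

        public static void save(Path confDir, Properties uidState) throws IOException {
            Path tmp = confDir.resolve("uid.properties.tmp");
            Path target = confDir.resolve("uid.properties");

            // Write the complete state to the temporary file first...
            try (OutputStream out = Files.newOutputStream(tmp)) {
                uidState.store(out, "UID generator state");
            }
            // ...then rename it into place so readers never see a half-written file.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }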

Init of Cluster Service

This phase corresponds to the middle part of Figure 1. Upon initialization of the Cluster Service, the Hazelcast instance is created and from then on, communication between the nodes is available. When the Cluster Service init is called, the following steps happen:

  1. The Hazelcast instance is created on every node (see Hazelcast step in Figure 1)

    • Initial config is loaded from {nm.data.home}/conf/hazelcast.xml or classloader://com/nm/conf/hazelcast.xml
    • Default map configs are added to config
    • All map configs are collected and added to config (Appway 6.2)
    • Map configs are updated (eviction and near cache)
    • Map store implementations are created and added to config (default is filesystem)
    • Hazelcast is started and connects to configured cluster nodes
      • Wait for the minimum cluster size if configured (see the sketch at the end of this section)
  2. Log message: "Hazelcast instance "appway" created."

  3. Hazelcast listeners are created

    • Membership listener
    • Client listener
    • Migration listener
  4. Test state of other nodes

    • If any other node is RUNNING, set the joining flag
    • If any other node is STOPPING or DONE, stop as well
  5. Subscribe to key topic

    • Needed to support master password functionality
  6. If not joining a running cluster (joining flag is not set)

    • Try to set the init latch
    • If successful, continue
      • Log message: "Init latch set."
  7. If not successful, wait until the init latch is released

    • Log message: "Waiting for init latch..."
  8. The init latch ensures that a single node performs the system-wide initialization without interfering with the other nodes

  9. Test state of other nodes again

    • Ensure joining flag is correct
    • Shut down if a node stopped while waiting for the init latch
  10. Assert max cluster size is respected

    • Always init first node
    • Get max cluster size from license file
    • If cluster size below or equal to max cluster size, continue
    • Else stop first non-running node and test again
  11. Connect to cluster storage (see Cluster Storage step in Figure 1)

    • Touch and eagerly load data in all persistent maps
    • If a persistent map has eviction enabled, no data is loaded
    • Touch and create non-persistent maps
    • Print sizes of all persistent maps
  12. Log message: "Connected to cluster data."

  13. Create lock pool for process instances

    • nm.cluster.lockpool.processinstances.size
      • defines size (default is 2048 for Appway 10 and lower, -1 for Appway 11 and higher)
    • nm.cluster.lockpool.processinstances.timeout
      • defines lock timeout (default = 120 seconds)
  14. Create entry listeners

    • Notified upon any change in a given map
  15. Subscribe to topics

    • Message listeners listen for messages published on a given topic
  16. Log message: "Cluster service ready".
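
To make the steps above more concrete, the following sketch shows the same kind of setup using the public Hazelcast API: load the XML configuration, wait for a minimum cluster size, attach a map store, create the instance, touch a persistent map, and subscribe to a topic. This is not FNZ Studio's bootstrap code: the map name, topic name, map-store class name, and the cluster size of 3 are placeholders, and the Hazelcast 4+/5 API is assumed.

    // A hedged sketch, not FNZ Studio's actual bootstrap code.
    import com.hazelcast.config.Config;
    import com.hazelcast.config.FileSystemXmlConfig;
    import com.hazelcast.config.MapStoreConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class ClusterBootstrapSketch {

        public static HazelcastInstance start(String dataHome) throws Exception {
            // Initial config from {nm.data.home}/conf/hazelcast.xml
            // (the classpath fallback described above is omitted here).
            Config config = new FileSystemXmlConfig(dataHome + "/conf/hazelcast.xml");

            // Wait for a minimum cluster size before the instance becomes operational.
            config.setProperty("hazelcast.initial.min.cluster.size", "3");

            // Attach a map store implementation to a map config
            // ("com.example.FileSystemMapStore" is a made-up class name).
            MapStoreConfig storeConfig = new MapStoreConfig()
                    .setEnabled(true)
                    .setClassName("com.example.FileSystemMapStore");
            config.getMapConfig("example-persistent-map").setMapStoreConfig(storeConfig);

            // Creating the instance connects this node to the configured cluster nodes.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

            // Touch a persistent map (triggers loading via the map store) and
            // subscribe to a topic, comparable to the key topic subscription above.
            hz.getMap("example-persistent-map").size();
            hz.<String>getTopic("example-key-topic").addMessageListener(
                    message -> System.out.println("Key topic: " + message.getMessageObject()));

            return hz;
        }
    }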

After Init of Cluster Service

This phase corresponds to the bottom part of Figure 1. After the Cluster Service is initialized, regular FNZ Studio start-up continues. Towards the end of the startup phase, the application state is set to RUNNING and the init latch is released, which triggers the other nodes to continue and finish their own startup. When the context listener is called, the following steps happen:

  1. Start user service and key service
  2. Start repository
  3. Start cluster log service
    • Compact if first node
      • Left-over 10- blocks
      • Left-over 1d blocks
    • Replace in-memory cluster log appender
  4. Log message: "start-up phase 1 done."
  5. Find master password (see Master Password step in Figure 1)
    • Initialize key service if found

See the Master Password section for more details on the usage of this functionality.

  1. Start all remaining services (see Services step in Figure 1)

  2. Fire services started event (Appway 6.2)

  3. Start extensions (see Extensions step in Figure 1)

  4. Start data source registry

  5. Start process engine (see Process Engine step in Figure 1)

  6. Register JMX beans (a generic registration sketch follows at the end of this section)

  7. Fire application started event

    • Job scheduler starts
    • Dependency analysis starts
  8. Commit and clear any remaining thread-local variables

  9. Set application state to RUNNING

  10. Release init latch (see Release init latch step in Figure 1)

    • Now all other nodes continue and finish their startup
  11. Log message: "start-up phase 2 done."
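
Registering JMX beans (step 6) follows the standard platform MBean server pattern shown below. The bean, its attribute, and the ObjectName are made up for illustration and are not the MBeans actually exposed by FNZ Studio.

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Standard MBean convention: the management interface is the class name plus "MBean".
    interface ClusterInfoMBean {
        int getNodeCount();
    }

    class ClusterInfo implements ClusterInfoMBean {
        @Override
        public int getNodeCount() {
            return 1; // placeholder value for this example
        }
    }

    public class JmxRegistrationSketch {

        public static void register() throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // The domain and key properties below are made up for this example.
            ObjectName name = new ObjectName("com.example.sketch:type=ClusterInfo");
            server.registerMBean(new ClusterInfo(), name);
        }
    }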

Full System Shutdown

During a full system shutdown, the following steps happen:

  1. Publish cluster shutdown

  2. Perform system shutdown

  3. Stop cluster service

  4. Stop application server

    Important! For a step-by-step description of how to safely trigger a shutdown of the full cluster, see Shutting down a large cluster.

Figure 2: Full system shutdown

Figure 2 shows a full system shutdown of three FNZ Studio nodes, where the three vertical threads represent the timelines of the different nodes, and time flows from top to bottom. The horizontal lines separate the three phases (before/during/after cluster service shutdown).

In the following, the three phases are described in more detail.

Before Stop of Cluster Service

This phase corresponds to the top part of Figure 2. After the cluster shutdown command has been published to all nodes, system shutdown is triggered on each node. Services and extensions are stopped while the cluster service is still fully functional.

  1. Trigger on one node

    • Publish cluster shutdown (see Publish step in Figure 2)
      • REST: {URL}/rest/cluster/shutdown (see the sketch at the end of this section)
      • JMX: ClusterServiceInfo
      • ClusterService: publishShutdown()
  2. Shutdown listener is called on each node (see Shutdown step in Figure 2)

    • Set cluster shutdown flag
    • Set application state to STOPPING
    • System shutdown is called
  3. Release breakpoints

  4. Stop process data service

    • Disable logout listener
    • Delete non-persistent process instances
    • Delete orphaned value stores
  5. Unregister JMX beans

  6. Fire application shutdown event

    • Stop services
    • Stop UID generator
    • Stop job scheduler
  7. Stop repository

  8. Fire services stopped event

  9. Stop extensions (see Extensions step in Figure 2)

  10. Dispose job scheduler

  11. Clear reflection caches

  12. Commit thread-local changes

  13. Stop cluster log service
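
As an illustration of the REST trigger mentioned in step 1, the sketch below calls the {URL}/rest/cluster/shutdown endpoint with Java's built-in HTTP client. The HTTP method (POST here) and the authentication handling are assumptions that depend on your installation; consult the FNZ Studio REST API documentation for the exact contract.

    // HTTP method and authentication are assumptions; adapt to your installation.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ClusterShutdownClientSketch {

        public static void main(String[] args) throws Exception {
            String baseUrl = "https://studio.example.com"; // placeholder for {URL}
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "/rest/cluster/shutdown"))
                    // .header("Authorization", "...") // credentials depend on your setup
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
        }
    }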

Stop of Cluster Service

This phase corresponds to the middle part of Figure 2. After the cluster service is stopped, no more communication among the cluster nodes is possible. The last step of stopping the cluster service is to shut down the Hazelcast node. The Cluster Service shutdown is called and the following steps happen:

  1. Print dirty map entry counts

  2. Flush all dirty map entries (see Dirty Entries step in Figure 2 and the sketch at the end of this section)

  3. Wait until everything is saved

    • Warn and continue after two minutes
  4. Sync on cluster (see Synchronize 1 step in Figure 2)

    • Wait for at most three minutes
  5. Disable persistent storage functionality (see No Persistence step in Figure 2)

    • Nothing can be modified after this point
  6. Log Message: "Hazelcast map stores disabled."

  7. Sync on cluster (see Synchronize 2 step in Figure 2)

    • Wait for at most three minutes
  8. Shutdown Hazelcast node

    • After this point no communication between the nodes is possible
  9. Log Message: "Hazelcast shutdown done (node detached)."
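
The flush in step 2 can be illustrated with the public Hazelcast API: IMap.flush() writes locally owned dirty entries through to the configured map store. The sketch below demonstrates the concept only (Hazelcast 4+/5 package layout assumed) and is not the internal shutdown code.

    // Concept illustration only; Hazelcast 4+/5 package layout assumed.
    import com.hazelcast.core.DistributedObject;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    public final class DirtyEntryFlushSketch {

        public static void flushAllMaps(HazelcastInstance hz) {
            for (DistributedObject object : hz.getDistributedObjects()) {
                if (object instanceof IMap) {
                    // Writes locally owned dirty entries through to the configured map store.
                    ((IMap<?, ?>) object).flush();
                }
            }
        }
    }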

After Stop of Cluster Service

This phase corresponds to the bottom part of Figure 2. After the cluster service is stopped, no communication among the nodes is possible anymore. The plugins are stopped and finally the application server is stopped on each node.
System shutdown is called and the following steps are executed:

  1. Dispose lock pool
  2. Stop plugins
  3. Print last Log4j message
  4. Log Message: "Appway will soon be stopped, but Application Server might continue running..."
  5. Stop Log4j
  6. Clear thread-local variables
  7. Set application state to DONE
  8. JVM is shut down [2]

Finally, you have to stop the application server yourself, since the FNZ Studio internal cluster shutdown cannot stop the application server.

[2] Note the following JVM-related configuration properties (System Configuration > Configuration Properties in FNZ Studio Composition):
  • nm.cluster.shutdown.jvm.stop (default = true) causes the JVM to shut down after a regular cluster shutdown.
  • nm.cluster.shutdown.jvm.stop.onStartupFailure (default = true) causes the JVM to shut down in case of a forced cluster shutdown due to a startup failure.

Adding a Node

  1. Before init of cluster service
    • Same as on full system startup
  2. Init of cluster service
    • Connect during creation of Hazelcast instance
      • Partition migration will be started immediately
    • Joining flag will be set; therefore, no init latch is needed
    • The rest is the same as for a non-first node during full system startup
  3. After init of cluster service
    • The same as for a non-first node during full system startup

Removing a Node

  1. Before stop of cluster service
    • Ensure only one node stops at a time
    • Set application state STOPPING
    • Call system shutdown, as for a full system shutdown
  2. Stop of cluster service
    • Try flush at most two times
    • No sync on cluster
    • Shutdown Hazelcast node
  3. After stop of cluster service
    • The same as for a full system shutdown