High Availability

Overview

High Availability (HA) mode allows the Intelligence Hub to run across multiple nodes, ensuring continuous operation and failover. Only one node is active at a time; standby nodes are ready to take over if the primary node fails. This is managed through heartbeats and file synchronization.

All HA coordination is managed through a PostgreSQL database. Nodes do not communicate directly with each other; instead, each node interacts independently with the PostgreSQL server for heartbeat management, leadership election, and data synchronization.

  • Heartbeats: Each node periodically signals its status to the PostgreSQL database. If the primary node stops sending heartbeats, a standby node automatically takes over.
  • File Synchronization: Configuration files in appData, which includes Connection and Pipeline State Extension caches, are regularly synced through the PostgreSQL database to ensure a standby node can seamlessly assume the primary role.

Setup

To set up a primary and secondary hub, follow the steps below. More information on the command line arguments is provided in the sections that follow.

  1. Install or have available a PostgreSQL server
  2. Install the primary hub
  3. Run the create command on the primary hub to initialize the database tables used for synchronization. Replace node1Primary with a unique name for the primary, and replace the JDBC URI with your PostgreSQL connection settings.
java -cp "intelligencehub-runtime.jar;lib/*" com.highbyte.intelligencehub.runtime.HAMain create -n node1Primary -j "jdbc:postgresql://localhost:5432/dbName?user=username&password=password"
  4. Install the secondary hub
  5. Copy the intelligencehub-certificatestore.pkcs12 file from the appData of the primary hub to the appData of the secondary hub. While the certificates and private keys stored in this file are synchronized across HA instances, the encryption mechanisms used for encrypting and decrypting secrets remain on the local instance. Copying the pkcs12 file ensures the secondary hub can properly decrypt secrets needed for configuration.
  6. Launch the primary hub with the start command, using the same node ID that was used in create. This starts the hub using the database configuration.
java -cp "intelligencehub-runtime.jar;lib/*" com.highbyte.intelligencehub.runtime.HAMain start -n node1Primary -j "jdbc:postgresql://localhost:5432/dbName?user=username&password=password"
  7. Launch the secondary hub with the start command, using a new unique node ID for the secondary
java -cp "intelligencehub-runtime.jar;lib/*" com.highbyte.intelligencehub.runtime.HAMain start -n node2Secondary -j "jdbc:postgresql://localhost:5432/dbName?user=username&password=password"

In the above setup the primary hub takes precedence, meaning if it’s online and running it takes control. The secondary only takes control when the primary is down.

Command Line Usage

Below are details on the command line arguments used to configure and start HA. The primary commands are as follows:

  • create: Initializes the failover service, syncs initial configuration and state files, and sets the node ID as the preferred primary node.
    • The preferred primary node is always selected as primary when available.
  • start: Starts a node in HA mode, joining the failover group.
  • help: Displays usage information.

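The basic command syntax is shown below, with the arguments detailed after it. It’s best to run these commands from the runtime directory. The examples use the Windows classpath separator (;); on Linux and macOS, Java expects a colon (:) instead.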

java -cp "intelligencehub-runtime.jar;lib/*" com.highbyte.intelligencehub.runtime.HAMain [options] <create|start|help>

Options:

  • -j, --jdbcURI: JDBC URI for the node (can use environment variables). Example values: jdbc:postgresql://localhost:5432/dbName?user=username&password=password or env:URI
  • -n, --nodeId: Unique node ID for the HA instance

Note that environment variables can be used for the database configuration commands to avoid passing sensitive information on the command line. Below is an example of using an environment variable.

java -cp "intelligencehub-runtime.jar;lib/*" com.highbyte.intelligencehub.runtime.HAMain start -n node1 -j env:URI
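
As a minimal sketch of the same idea from a launcher script, the Python snippet below builds the start command and supplies the connection string only through the child process environment. It assumes that env:URI makes HAMain read the environment variable named URI. The helper names build_start_command and launch are hypothetical and not part of the product.

```python
import os
import subprocess

# Classpath and main class copied from the commands above (';' is the Windows separator).
CLASSPATH = "intelligencehub-runtime.jar;lib/*"
MAIN_CLASS = "com.highbyte.intelligencehub.runtime.HAMain"


def build_start_command(node_id: str, env_var: str = "URI") -> list[str]:
    """Return the argv list for 'start', referencing the JDBC URI by variable name only."""
    return ["java", "-cp", CLASSPATH, MAIN_CLASS,
            "start", "-n", node_id, "-j", f"env:{env_var}"]


def launch(node_id: str, jdbc_uri: str, env_var: str = "URI") -> subprocess.Popen:
    """Start the node; the JDBC URI travels in the environment, not on the command line."""
    env = dict(os.environ)
    env[env_var] = jdbc_uri  # assumption: HAMain resolves env:<NAME> from this variable
    return subprocess.Popen(build_start_command(node_id, env_var), env=env)


if __name__ == "__main__":
    # Only prints the command; call launch(...) from the runtime directory to start a node.
    print(" ".join(build_start_command("node1")))
```

Because the argument list contains only env:URI, the credentials do not appear in process listings or shell history.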

Failover Behavior

Heartbeat and Lease Management

The HA system uses a heartbeat mechanism to monitor node health and manage leadership:

  • Heartbeat interval: Every 5 seconds
  • Heartbeat expiration period: 10 seconds
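  • Typical failover window: 5-15 seconds
    • Best case: ~5 seconds (the primary fails just before its next renewal, and a standby heartbeat arrives as soon as the primary heartbeat expires)
    • Worst case: ~15 seconds (the primary fails right after renewing, so the full 10-second expiration elapses and the standby's next heartbeat arrives up to 5 seconds later)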

The primary node renews its heartbeat in the PostgreSQL database every 5 seconds. If a primary node fails to check in, the heartbeat expires after 10 seconds. Standby nodes also send heartbeats every 5 seconds, and the first node to send a heartbeat after the primary heartbeat expiration will be elected to be the new primary by the database.
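
The 5-15 second window follows from the two timing values above. The short Python sketch below works through the arithmetic, assuming the expiration period is measured from the primary's last successful renewal. It covers only the time to acquire leadership, not the pre-promotion sync or runtime startup, and the function name failover_window is illustrative.

```python
HEARTBEAT_INTERVAL_S = 5     # how often every node sends a heartbeat
HEARTBEAT_EXPIRATION_S = 10  # how long a heartbeat stays valid after renewal


def failover_window(interval_s: float, expiration_s: float) -> tuple[float, float]:
    """Return (best, worst) seconds between primary failure and a standby acquiring leadership."""
    # Best case: the primary dies just before its next renewal, so only
    # (expiration - interval) seconds remain, and a standby heartbeat lands right at expiry.
    best = expiration_s - interval_s
    # Worst case: the primary dies right after renewing, and the standby's
    # heartbeat arrives a full interval after the heartbeat expires.
    worst = expiration_s + interval_s
    return best, worst


best, worst = failover_window(HEARTBEAT_INTERVAL_S, HEARTBEAT_EXPIRATION_S)
print(f"failover window: {best}-{worst} seconds")  # prints "failover window: 5-15 seconds"
```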

File Synchronization

The following files are maintained and synchronized across all nodes in the HA cluster:

Configuration Files:

  • intelligencehub-configuration.json - Project configuration
  • intelligencehub-settings.json - System settings
  • intelligencehub-systemvariables.json - System variables
  • intelligencehub-remoteconfig.json - Remote configuration settings
  • intelligencehub-users.json - User accounts, roles, and API keys
  • intelligencehub-secrets.json - Encrypted secrets
  • intelligencehub-certificatestore.pkcs12 - Only certificates and private keys are maintained. The mechanisms used for encrypting and decrypting secrets will remain on the local instance. To decrypt keys across multiple HA instances, copy the pkcs12 file of the primary to each instance.
  • intelligencehub-identityproviders.json - Identity provider configurations
  • intelligencehub.license - License file

Data Stores:

  • intelligencehub-cache.db - Cache store
  • intelligencehub-state.db - State store

Configuration files and data stores are synchronized between nodes to ensure a standby can seamlessly take over:

  • Primary mode sync interval: Every 5 seconds (pushes changes to database)
  • Secondary mode sync interval: Every 60 seconds (pulls changes from database)
  • Sync operations:
    • Deployment files are compared by hash before transfer and are synced only when they have changed.
    • Cache and state stores push only the most recent 10,000 values per sync attempt.
  • Data retention:
    • Data from the Data Stores is retained in the PostgreSQL database for up to 24 hours
    • Each transaction row is tracked relative to its upload time
    • Transactions that haven’t been updated and are older than 24 hours are automatically deleted
    • Data would be missed only if more than 10,000 unique transactions occurred in the cache or state stores within a single 5-second sync interval
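
The sync rules above can be sketched in a few lines of Python. This is an illustration only: the hash algorithm is not stated here, so SHA-256 is an assumption, and the function names and in-memory dictionaries are hypothetical stand-ins for the files in appData and the rows in PostgreSQL.

```python
import hashlib

MAX_VALUES_PER_SYNC = 10_000  # cap on cache/state values pushed per sync attempt


def file_digest(content: bytes) -> str:
    """Content hash used to detect changes (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def files_to_push(local: dict[str, bytes], remote_hashes: dict[str, str]) -> list[str]:
    """Names of local files whose hash differs from the hash last stored in the database."""
    return sorted(name for name, content in local.items()
                  if remote_hashes.get(name) != file_digest(content))


def next_batch(pending: list, limit: int = MAX_VALUES_PER_SYNC) -> list:
    """Most recent `limit` values; `pending` is ordered oldest to newest."""
    return pending[-limit:]


local_files = {
    "intelligencehub-settings.json": b'{"a": 1}',
    "intelligencehub-users.json": b'{"users": ["new"]}',
}
stored_hashes = {
    "intelligencehub-settings.json": file_digest(b'{"a": 1}'),    # unchanged
    "intelligencehub-users.json": file_digest(b'{"users": []}'),  # changed locally
}
print(files_to_push(local_files, stored_hashes))  # only the changed file is listed
print(len(next_batch(list(range(12_000)))))       # capped at 10,000 values
```

With 12,000 pending values in one interval, the oldest 2,000 fall outside the batch, which is the situation the last bullet describes.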

Promotion Process

When a Secondary Becomes Primary

  1. Pre-promotion sync: Secondary performs one final pull to ensure data consistency
  2. Runtime startup: Starts the main runtime with synchronized data
  3. Heartbeat confirmation: New primary renews heartbeat every 5 seconds

When a Primary Node Fails

  1. Heartbeat expiration: After 10 seconds without renewal, the heartbeat expires
  2. Leadership acquisition: Next heartbeat from a standby (within 5 seconds) acquires leadership
  3. Automatic promotion: Standby node detects acquisition and promotes to primary
  4. Service continuity: Total downtime is typically 5-15 seconds
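
The sequence above can be illustrated with a small in-memory model of the heartbeat record. This is a sketch under stated assumptions, not the actual implementation: in the real system the check-and-update happens in the PostgreSQL database, and the Lease class and heartbeat function here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

EXPIRATION_S = 10  # heartbeat expiration period


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


def heartbeat(lease: Lease, node: str, now: float) -> bool:
    """Record a heartbeat; return True if `node` holds leadership afterwards."""
    if lease.holder == node or now >= lease.expires_at:
        # Renew our own heartbeat, or take over one that has expired.
        lease.holder = node
        lease.expires_at = now + EXPIRATION_S
        return True
    return False


lease = Lease()
heartbeat(lease, "node1Primary", now=0)
heartbeat(lease, "node1Primary", now=5)  # last renewal; the primary fails at t=6

# The standby keeps sending heartbeats every 5 seconds.
for t in (7, 12, 17):
    promoted = heartbeat(lease, "node2Secondary", now=t)
    print(f"t={t}s standby is primary: {promoted}")
```

In this timeline the heartbeat renewed at t=5 expires at t=15, so the standby's heartbeat at t=17 is the first to acquire leadership, 11 seconds after the failure.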

When the Preferred Primary Comes Back Online

The node ID passed to the create command is set as the preferred primary node:

  • When the preferred primary node comes online, it will take over from the current primary
  • This happens during the next heartbeat cycle (within 5 seconds)
  • The transition follows the same promotion process (sync → promote → start)
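
The precedence rule can be expressed as a small decision function. The sketch below is hypothetical: choose_primary is not a product API, and picking the alphabetically first online node is only a stand-in for "the first standby to send a heartbeat after expiration".

```python
from typing import Optional


def choose_primary(current: Optional[str], online: set[str], preferred: str) -> Optional[str]:
    """Pick the node that should be primary on this heartbeat cycle."""
    if preferred in online:
        return preferred  # the preferred primary always wins when it is available
    if current in online:
        return current    # otherwise a healthy current primary keeps its role
    # Stand-in for "first standby to heartbeat after expiration" (assumption).
    return min(online) if online else None


nodes = {"node1Primary", "node2Secondary"}
# The preferred primary is back online, so it takes over from the secondary.
print(choose_primary("node2Secondary", nodes, preferred="node1Primary"))
# The preferred primary is down, so the secondary takes over.
print(choose_primary("node1Primary", {"node2Secondary"}, preferred="node1Primary"))
```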

Graceful Shutdown

Primary Node Shutdown

  1. Leadership release: Immediately expires the heartbeat (sets expiration to 10 seconds in the past)
  2. Runtime stop: Cleanly shuts down the active runtime
  3. Connection cleanup: Closes HA database connection
  4. Fast failover: By expiring the heartbeat immediately, standby nodes can acquire leadership on their next heartbeat (~5 seconds)
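
To see why expiring the heartbeat speeds up failover, the sketch below compares a graceful shutdown with a crash at the same moment. The timeline values and function names are illustrative assumptions; only the 10-second expiration and the 5-second standby cadence come from the description above.

```python
HEARTBEAT_EXPIRATION_S = 10


def released_expiry(now: float) -> float:
    """Expiry written on graceful shutdown: 10 seconds in the past, so already expired."""
    return now - HEARTBEAT_EXPIRATION_S


def takeover_time(expiry: float, down_at: float, standby_heartbeats: list[float]) -> float:
    """First standby heartbeat after the primary went down that finds the heartbeat expired."""
    return next(t for t in standby_heartbeats if t > down_at and t >= expiry)


standby_heartbeats = [2, 7, 12, 17, 22]  # standby sends a heartbeat every 5 seconds
last_renewal, down_at = 5, 6             # primary renewed at t=5 and stops at t=6

graceful = takeover_time(released_expiry(down_at), down_at, standby_heartbeats)
crash = takeover_time(last_renewal + HEARTBEAT_EXPIRATION_S, down_at, standby_heartbeats)
print(graceful - down_at, crash - down_at)  # prints "1 11"
```

After a graceful shutdown the standby acquires leadership on its very next heartbeat, whereas after a crash it must first wait for the 10-second expiration.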

Database

Currently, only PostgreSQL is supported for the database.