Seq is inherently a single-node system. For disaster recovery purposes, a combination of nightly backups and event forwarding can be used to keep a second instance on stand-by.
Master key required
Nightly backups can only be restored using the master encryption key from the original server. See the instructions in the backup documentation to ensure the master key is stored appropriately.
Event forwarding is performed by the Seq.App.Replication app. The app is loaded into Seq and either all, or a selected subset of events, are sent to it. The app then uses the regular HTTP API to send events on to the second server.
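To make the forwarding step concrete, the following is a minimal sketch of the kind of HTTP request that delivers events to a Seq server. It is not the replication app's own code. It assumes the newline-delimited CLEF ingestion endpoint (`/api/events/raw?clef`) and the `X-Seq-ApiKey` header; the host name and token are hypothetical placeholders.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_ingest_request(server_url, api_key, events):
    """Build (but do not send) a POST that carries events in CLEF format."""
    # CLEF payloads are one JSON document per line.
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/api/events/raw?clef",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/vnd.serilog.clef",
            "X-Seq-ApiKey": api_key,
        },
    )


event = {
    "@t": datetime.now(timezone.utc).isoformat(),  # timestamp
    "@l": "Information",                           # level
    "@mt": "Replicated event for {App}",           # message template
    "App": "Example",
}

# Hypothetical DR address and token, for illustration only.
request = build_ingest_request("https://seq-dr.example.com", "EXAMPLE-TOKEN", [event])
# urllib.request.urlopen(request) would actually send the batch.
```

Building the request separately from sending it keeps the sketch safe to run without a server.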
Replication of events to the second server by this method is effective, but is not transactional. Events may be dropped under different conditions, depending on how the replication app is configured. The second instance maintained by this method is therefore useful for disaster recovery, but recovering the primary server remains the preferred outcome whenever that is possible.
This configuration will forward events to a second disaster recovery (DR) Seq instance. In the case of an outage, it anticipates clients will buffer (or drop) events until the primary live server is recovered.
These instructions assume both the live and DR Seq instances are already installed and running on the target hardware.
Access to the DR instance
So that signals, API keys and so on are not inadvertently saved to the DR instance by users, it should be locked down as much as possible, with only minimal admin logins configured. User accounts will be created during recovery by restoring a nightly backup.
To control which events are replicated to the second server, a signal with a name such as "Replicated" is created on the live server. Click the '+' next to the heading "signals" in the right bar to do this.
If significant event volumes are expected (more than a few thousand per minute), to keep the performance impact of replication minimal it is advised that only events that would be retained more than a week are replicated. For most installations this means excluding debug-level events.
To add a filter to the signal, type the following into the filter bar and click >> to add it to the active signal:
@Level <> 'Debug' and @Level <> 'Verbose'
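As a rough illustration of what this filter selects, the sketch below expresses the same rule as a Python predicate over CLEF-style event dictionaries. The function name is hypothetical, and treating an event without a level as Information is an assumption borrowed from the CLEF convention; Seq evaluates the real filter itself.

```python
EXCLUDED_LEVELS = {"Debug", "Verbose"}


def is_replicated(event):
    """Mirror of: @Level <> 'Debug' and @Level <> 'Verbose'."""
    # Assumption: an event with no "@l" field is treated as Information.
    level = event.get("@l", "Information")
    return level not in EXCLUDED_LEVELS


sample = [
    {"@l": "Debug", "@mt": "Cache probe"},
    {"@l": "Warning", "@mt": "Disk nearly full"},
    {"@mt": "No explicit level"},
]
kept = [e["@mt"] for e in sample if is_replicated(e)]
```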
After completing this step, the signal should appear as:
If the server is shared with non-administrative users, click the drop-down arrow beside the signal save/close buttons and select Restricted so that the signal is editable by administrators only.
Click the save icon to save the signal.
The disaster recovery instance will receive events from the primary server. To track events from the primary server, it is recommended that an API key is configured for the purpose on the secondary server.
Even if API keys are not required by the DR server's security configuration, an API key is still useful because it will allow incoming events from the live server to be tagged and filtered if this becomes necessary later on.
When creating the API key, here called "From Live", associate a property with incoming events. This enables any events received directly (deliberately or in error) to be identified.
The source property name is called $Source so that it's unlikely to conflict with property names generated by Serilog and other libraries, though the $ does mean that when filtering, escaped @Properties['$Source'] syntax is required.
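The value of the tag can be sketched as follows: events that arrived through the API key carry $Source, while events sent straight to the DR server do not. This is an illustrative model over plain dictionaries, not Seq code, and the tag value "Live" is a hypothetical choice.

```python
from collections import Counter

SOURCE_PROPERTY = "$Source"


def origin(event):
    """Return the $Source tag, or 'direct' when the event carries none."""
    return event.get(SOURCE_PROPERTY, "direct")


events = [
    {"@mt": "Forwarded by the replication app", "$Source": "Live"},  # hypothetical value
    {"@mt": "Sent straight to the DR server"},
]
counts = Counter(origin(e) for e in events)
```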
Note the API key token that is generated when the key is saved; this is required for the next step.
The replication plugin is a Seq app that is installed on the live server.
Under Settings > Apps, choose Install from NuGet and install Seq.App.Replication.
Back in the Apps screen, choose Start new instance to configure an instance of the replication app.
- Title - set a title for the app that reflects its purpose, e.g. "Replicate to DR"
- Manual input only - un-tick this setting so that a signal can be chosen
- Signal - choose the Replicated signal created in the previous step
- Server URL - this is the root URL of the DR instance
- API key - paste the token generated for the API key in the previous step
- Run the app on existing events - select this if there is existing data that needs to be replicated
The configured app with all applicable settings should look like:
Check that events are now being replicated from the live to the DR server. Only events matching the "Replicated" filter should appear (i.e. Information and up), and these should be tagged with the $Source property.
Retention policies applied to the live server won't be applied automatically on the DR machine. For this reason, at least one retention policy must be configured on the DR machine.
Don't forget to monitor free disk space on the DR machine. If disk space is exhausted, the DR instance will stop accepting events.
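One way to watch for this is a small scheduled check. The sketch below uses only the Python standard library; the 15% threshold and the function name are arbitrary assumptions, and the path should point at the volume that holds the DR instance's storage.

```python
import shutil


def free_space_ok(path, min_free_fraction=0.15):
    """Return True when the volume containing `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


if __name__ == "__main__":
    # "." is a placeholder; substitute the DR instance's storage path.
    if not free_space_ok("."):
        print("WARNING: free disk space on the DR machine is below the threshold")
```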
After loss of the live server, the DR machine can become the live server simply by restoring a nightly backup from the live server onto it. This will overwrite all configuration on the machine, so that from a configuration standpoint it will appear like the server it is replacing: users, signals, API keys and so on from the live server will be restored onto the DR server.
When configuring a new DR instance to work alongside the now-live server, choosing a different value for the $Source property is a good idea.
The event forwarding provided by Seq.App.Replication is best-effort, with some intrinsic limitations to be aware of:
- Event ids (the @Id property associated with an event) are reassigned at the second server, so the same event will carry a different id on each server
- During shut-down of the live server, if the queue of events to be replicated cannot be flushed to the DR instance within 60 seconds, for example because the DR instance is down or slow, any remaining events will be dropped
- Events enqueued for replication will be lost in a hard process crash, OS crash or power event
Exposure to the risk of lost events can be reduced by enabling durable log shipping in the configuration of the replication app. However, the performance overhead and disk usage requirements of this mode can be an issue for loaded servers.
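The shut-down behavior described in the limitations above can be modeled as a bounded flush. This is a simplified sketch of the described behavior, not the replication app's implementation; the function, its parameters, and the injectable clock are hypothetical.

```python
import time
from collections import deque


def flush_on_shutdown(queue, send, timeout_seconds=60.0, clock=time.monotonic):
    """Send queued events until the queue empties or the deadline passes.

    `send` returns True on success. Returns the number of events dropped.
    """
    deadline = clock() + timeout_seconds
    while queue and clock() < deadline:
        if send(queue[0]):
            queue.popleft()  # delivered, so remove it from the queue
        # On failure the same event is retried until the deadline.
    dropped = len(queue)
    queue.clear()  # whatever is left after the deadline is lost
    return dropped


pending = deque(["e1", "e2", "e3"])
delivered = []
lost = flush_on_shutdown(pending, lambda e: delivered.append(e) or True)
```

With a destination that never responds, every queued event is dropped once the timeout elapses, which is the exposure that durable log shipping is intended to reduce.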
This configuration is similar in normal operation to the basic setup, but takes additional steps so that the DR server can accept and store events during a brief outage of the live server, and forward these back to the live server upon recovery.
Only ingestion traffic should "fail over" to the DR instance; the UI and API must remain pointing to the live server unless a full recovery is performed. Using a separate ingestion port can help with this - see JSON Configuration for instructions regarding ingestion port setup.
To configure reciprocal forwarding:
- Create an API key "From DR" on the live server, and set the property value $Source = DR as an applied property
- Edit the "Replicated" signal on the live server, adding the filter
- Create a "Replicated" signal on the DR server, with the filter not Has(@Properties['$Source']) and a filter excluding debug-level events
- Install the replication app on the DR server, configuring it to forward to the live server using the "From DR" API key token created in the first step, with Use durable log shipping enabled
Reciprocal forwarding via this technique will work when the live server is down for less than 48 hours. Beyond this time limit, durable log shipping will start dropping buffered events. If the outage cannot be resolved within 24 hours, the recovery process should be used to promote the DR machine to become the new live machine.
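The two time limits can be captured in a small decision helper for a runbook or monitoring script. This is a sketch under the figures quoted above; the function name and return labels are hypothetical.

```python
from datetime import timedelta

BUFFER_LIMIT = timedelta(hours=48)       # durable log shipping starts dropping events
PROMOTE_THRESHOLD = timedelta(hours=24)  # point at which recovery should begin


def outage_action(elapsed):
    """Suggest an action for a live-server outage of the given duration."""
    if elapsed >= BUFFER_LIMIT:
        return "promote-dr-buffer-exceeded"
    if elapsed >= PROMOTE_THRESHOLD:
        return "promote-dr"
    return "wait-for-live"
```

Promoting at 24 hours rather than 48 leaves a margin before buffered events start to be dropped.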
API keys and security
In order to fail over successfully to the DR instance, applications will need to be able to write logs to it. This generally necessitates allowing events to be ingested on the DR machine without valid API keys: un-check the "require an API key when writing events" option under Settings > API keys on the DR machine.
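How ingestion traffic is redirected is left to the deployment (DNS, a load balancer, or client logic). As one hypothetical client-side approach, the sketch below tries the live ingestion endpoint first and falls back to the DR endpoint. The function name, the `post` callable, and the URLs are all assumptions for illustration.

```python
def send_with_failover(batch, endpoints, post):
    """Try each ingestion endpoint in order.

    `post(url, batch)` should raise OSError (which includes connection
    errors and urllib's URLError) on failure. Returns the URL that
    accepted the batch, or None if every endpoint failed.
    """
    for url in endpoints:
        try:
            post(url, batch)
            return url
        except OSError:
            continue  # endpoint unreachable, so try the next one
    return None


# Hypothetical ingestion URLs: live first, DR second.
ENDPOINTS = ["https://seq-live.example.com:5342", "https://seq-dr.example.com:5342"]
```

Only the ingestion endpoints appear in the list, in keeping with the requirement that the UI and API stay pointed at the live server.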