Maintenance/Outage Alerts/OR01



On 12-Apr-2020, between 7:00 PM and 7:30 PM CDT, the Xertion IRC Network experienced a cascading network-wide malfunction caused by a configuration error. The malfunction lasted approximately 10 minutes in total.

What happened

Around 6:40 PM CDT on 12-Apr, root administrator IkarosBD began working on Xertion's webchat application so that users could identify to NickServ directly from the webchat itself. This had not previously been possible, and it was decided that it would be more convenient for those who prefer the webchat to be able to identify to their accounts more readily. The process was two-fold. First, changes had to be made in the background to the webchat application's form layout to support password entry. This was carried out without issue, but the field itself was left disabled until the second part of the process was completed. The second part involved loading and configuring a module known as 'passforward' on all servers on the Xertion IRC network. This module allows each IRCd to take the password a client supplies with the PASS command when connecting, which is normally used for connecting to password-protected servers (Xertion does not use this functionality; no servers are password-protected, as Xertion is a fully open network), and pass it along to NickServ as the user connects to the network.
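The forwarding step described above can be pictured with a minimal Python sketch. It is not the passforward module's actual code: the function name, the `SERVICE_NICK` constant, and the exact `IDENTIFY` message format are assumptions made for illustration only.

```python
# Hypothetical sketch of a "passforward"-style hook; not the real module.

SERVICE_NICK = "NickServ"  # assumed nickname of the services bot


def forward_password(password):
    """Build the line an IRCd could send to services on the user's behalf.

    Returns None when the client supplied no PASS value, in which case
    nothing is forwarded and the user simply connects unidentified.
    """
    if not password:
        return None
    if any(ch in password for ch in "\r\n"):
        # A line break would let a second command be smuggled in.
        raise ValueError("password must not contain line breaks")
    # Assumed message format; the real one depends on the services package.
    return f"PRIVMSG {SERVICE_NICK} :IDENTIFY {password}"
```

In this sketch, a webchat user who typed a password into the new form field would have it sent as the PASS value at connection time, and the server would then deliver the resulting line to NickServ without the user typing anything further.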

To make configuration file updates as easy as possible to roll out, the necessary changes were made to copies of the IRCd's core configuration files and pushed out automatically to all servers at once via a Linux shell script. This script has no functionality other than pushing files and remotely triggering a network-wide configuration reload. It does not do syntax checking, nor does it do 'sanity' checks on the module configuration section (that is, it does not check that the correct modules are specified in the file). It relies on the operator knowing to check the relevant files before running the script.

Unfortunately, a critical configuration error involving our SSL/TLS module configuration was missed during this process: the operator failed to verify that the file loaded the correct SSL module. Xertion IRCds are capable of using either the GnuTLS or OpenSSL encryption libraries to provide SSL/TLS support. By default, GnuTLS is used both for server-to-server links and for user connections over TLS. In both cases, the TLS-enabled ports are bound explicitly to this library, so it is critical that the proper encryption module remain loaded at all times. If the encryption module were inadvertently switched (that is, one unloaded and the other loaded) without the corresponding TLS configuration parameters being changed to match, encryption support across the entire network would break. In this case, the operator missed the fact that the modules configuration file he was editing contained an older configuration set to load the OpenSSL module instead of the GnuTLS module. This failure set the stage for the upcoming malfunction.

Soon after 7:00 PM CDT, the configuration change to load the passforward module was made to the master modules configuration file. Once this was done, the push script was automatically triggered to send the updated file to each server. When all file transfers were complete, the script remotely sent an automatic reload signal to all servers to force them to re-read their configuration files. Because of the missed error, when this signal was received and servers began reloading their configurations, they all UNLOADED the GnuTLS module and LOADED the OpenSSL module. This immediately unbound all SSL-bound ports, causing all servers to disconnect from each other at once and dropping all users connected over an encrypted connection. Normally, servers recover from this automatically so long as the correct corresponding TLS configuration is also in place. But since Xertion does not make active use of OpenSSL, this configuration did not exist. This not only made it impossible to relink the network and re-enable the SSL ports, but also caused fatal errors in several servers, which subsequently crashed, dumping even more user connections in the process. After the malfunction, every server and hub on the network that was still running ended up completely isolated from the others, effectively fracturing the entire network topology.

The error was realized and rectified very soon after the malfunction occurred. A corrected modules configuration file was pushed out by hand to all servers within minutes, and the servers were then made to reload their configs once more. The handful of crashed IRCds were then started back up without error. Finally, a network-wide reload was carried out once all servers were operational, forcing them to reload SSL credentials and ensuring full operation of all SSL/TLS ports.


This incident brought to light a few flaws in how we validate our configuration files after making changes that affect the network as a whole. First, the push script does no checking whatsoever to ensure that critical modules, such as our encryption module, are set to load in the configuration file itself. Second, it does not validate the TLS configuration against whichever module happens to be loaded. Each encryption module has configuration parameters unique to it, and these can easily be used to verify that the correct module is paired with the correct corresponding configuration settings.
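The two missing checks could be sketched in Python along the following lines. This is only an illustration under stated assumptions, not the planned fix: the tag syntax (`<module name="...">`, `<sslprofile ... provider="...">`), the module names `ssl_gnutls` and `ssl_openssl`, and the `validate` function are all hypothetical, and the sketch assumes the module list and the TLS settings are available as a single block of text.

```python
import re

# Assumed tag syntax and module names; placeholders, not Xertion's real config.
MODULE_RE = re.compile(r'<module\s+name="([^"]+)"')
PROVIDER_RE = re.compile(r'<sslprofile\b[^>]*\bprovider="([^"]+)"')

# Maps each (assumed) TLS module name to the provider name it supplies.
TLS_MODULES = {"ssl_gnutls": "gnutls", "ssl_openssl": "openssl"}


def validate(config_text, required=("ssl_gnutls",)):
    """Return a list of problems; an empty list means the file may be pushed."""
    problems = []
    loaded = set(MODULE_RE.findall(config_text))

    # Check 1: every critical module is set to load.
    for name in required:
        if name not in loaded:
            problems.append(f"required module not loaded: {name}")

    # Check 2: every TLS settings block names a provider whose module is loaded.
    loaded_providers = {TLS_MODULES[m] for m in loaded if m in TLS_MODULES}
    for provider in PROVIDER_RE.findall(config_text):
        if provider not in loaded_providers:
            problems.append(
                f"TLS settings expect '{provider}' but its module is not loaded"
            )

    return problems
```

A push script could call a function like this before transferring any files and refuse to continue if the returned list is non-empty. Run against a file that loads the OpenSSL module while the TLS settings still name GnuTLS, this sketch reports both problems. It does not handle commented-out lines, which a real check would need to skip.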

Changes to rectify these problems will be made in the coming days. In the meantime, manual validation will be done each and every time a change is made to any of the servers' configuration files.
