In June 2020, we upgraded the capacity of our DNS cluster to improve reliability and response times for our customers located in Asia. As part of this work, we set all nodes to the same capacity, allowing us to maintain a more consistently load-balanced set of servers.
On Saturday, at around 2 PM PST, the master node of our DNS cluster began experiencing higher load than anticipated, as most resolution queries favored this node. The node started sending delayed responses, some of which timed out from the requesters' point of view.
Many of these requesters are our own devices, calling home for management-related tasks [1].
As our devices started seeing timeouts, they would immediately send another resolution request, retrying until they were successful.
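This retry behavior can be sketched as the loop below. It is purely illustrative: `resolver`, `hostname`, and the exception handling are hypothetical stand-ins, not the actual device firmware.

```python
import socket

def resolve_with_retry(resolver, hostname, max_attempts=None):
    """Illustrative sketch of the retry behavior described above: on a
    timeout, immediately send another query, with no backoff or jitter.
    `resolver` and `hostname` are hypothetical parameters, not the
    actual device firmware API."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return resolver(hostname), attempts
        except (socket.timeout, OSError):
            # No delay before the next attempt: every timeout immediately
            # produces another query, so an already slow server sees its
            # load multiply rather than shrink.
            if max_attempts is not None and attempts >= max_attempts:
                raise
```

Because there is no pause between attempts, each slow or failed response translates directly into additional load on the struggling server.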
As a result, our master DNS server began generating a dramatically higher volume of log messages, overwhelming its own log rotation until it ran out of disk space. It then became unavailable, causing the retry attempts from our devices to snowball into a self-inflicted denial of service.
With the master server down, resolution requests started targeting the other nodes exclusively. Unfortunately, due to the now extremely high number of simultaneous requests, these servers also became overwhelmed and suffered the same fate as the master node.
At around 3:30 PM PST, our monitoring systems started sending critical notifications that all DNS resolutions were failing. The DevOps team quickly assessed the issue and, as per our process, assembled a recovery team including DevOps personnel, Technical Support representatives, our corporate Security Team, and main stakeholders.
During the outage, our customers could not create new configurations or modify their managed devices' existing configurations.
Operational integrity was maintained, in that our customers' switches and access points continued passing traffic, enforcing security rules, etc. as configured.
However, customers using OV Cirrus as a proxy for their 802.1X or captive portal-based authentication could not authorize clients.
Features such as IoT device classification were delayed.
We realized that simply restarting our DNS cluster would not allow it to absorb the now extremely high DNS traffic generated by our devices.
As a temporary measure, we re-provisioned the most impacted nodes to higher specifications, and rolled out hot standby backups.
We blocked all DNS traffic at the edge of our infrastructure in order to allow our DNS cluster to boot up and build consensus. We then re-authorized DNS traffic and spent the next few hours monitoring our infrastructure's response.
By 7:30 PM PST, DNS traffic, now resolving successfully, had settled back to its normal levels, and our infrastructure and customer-facing services were fully operational.
We will rebuild our DNS cluster so that every single node can handle the full load of all resolution requests, allowing the cluster to withstand even a scenario in which all other nodes become unavailable.
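The sizing change above amounts to provisioning each node for the whole peak rather than a fraction of it. The figures below are entirely hypothetical (actual traffic numbers are internal); they only illustrate the arithmetic.

```python
def per_node_qps(peak_qps, headroom=1.5):
    """Under the new sizing rule, each node is provisioned for the entire
    peak load plus headroom, not for peak/N. All numbers used with this
    function are hypothetical; real traffic figures are internal."""
    return peak_qps * headroom

# Hypothetical example: for an assumed 40,000 queries/second peak,
# a 4-node cluster sized by the old rule gives each node ~10,000 QPS
# of capacity; the new rule provisions every node for the full peak.
old_rule = 40_000 / 4            # capacity per node, shared-load sizing
new_rule = per_node_qps(40_000)  # capacity per node, full-load sizing
```

The trade-off is deliberate overprovisioning in exchange for surviving the loss of every other node.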
We will add rules to our orchestration that, in addition to our current disk usage monitoring, perform automated cleanup of less critical resources under heightened storage space pressure.
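A minimal sketch of such a cleanup rule is shown below, assuming a 90% usage threshold and compressed log archives as the "less critical resources". Both are assumptions for illustration; the real orchestration rules and thresholds are internal.

```python
import shutil
from pathlib import Path

def disk_usage_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def cleanup_if_pressured(archive_dir, path="/", threshold=0.90):
    """Delete the oldest archived files (here: *.gz, an assumption)
    until disk usage drops below `threshold`.
    Returns the names of the files removed."""
    removed = []
    for f in sorted(Path(archive_dir).glob("*.gz"),
                    key=lambda p: p.stat().st_mtime):
        if disk_usage_fraction(path) < threshold:
            break  # pressure relieved; keep the remaining archives
        f.unlink()
        removed.append(f.name)
    return removed
```

Freeing space automatically at a threshold below 100% is what prevents the log-rotation failure mode described earlier from taking a node fully offline.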
Finally, we will add DNS failure scenarios to our existing disaster simulation exercises.
[1] ALE networking equipment calls home when unconfigured: if our customers register their devices with OV Cirrus, a zero-touch configuration is applied automatically. Devices already configured to work with OV Cirrus also call home on a regular basis to maintain configuration consistency, including software updates.