Mozilla explains January 2022 Firefox outage that blocked connections
On January 13, 2022, Firefox users around the world started reporting connection issues. The browser failed to connect to any site and users were reporting freezes and crashes.
Mozilla published a detailed technical explanation of the incident on the Mozilla Hacks website on February 2, 2022.
The organization received reports of Firefox hanging during connection attempts on January 13, 2022. At the time, it saw that crash reports were increasing but did not have much information about the cause of the problem.
Mozilla engineers discovered that a network request was causing the hangs for Firefox users. They reviewed recent changes and updates but found none that could have caused the issue users were experiencing.
Mozilla suspected that the problem could have been caused by a recent “invisible” configuration change at one of the cloud providers it uses for load balancing. The organization uses the infrastructure of several providers for services such as crash reporting, telemetry, updates and certificate management.
Inspection showed that the settings had not been changed, but engineers noticed that the telemetry service was serving HTTP/3 connections, which it had not done before. The cloud provider’s HTTP/3 setting was configured with the value automatic. Mozilla disabled HTTP/3, and users could finally use Firefox again to connect to sites.
Mozilla investigated the issue further after fixing the most pressing issue. All HTTP/3 connections go through the Necko network stack, but Rust components use a library called Viaduct to call Necko.
Necko checks whether a Content-Length header is present and, if not, adds it. The HTTP/3 code relies on this header to determine the request size. Necko’s checks are case insensitive, however, and Viaduct automatically lowercases the headers of requests that pass through it. As a result, any Viaduct request that added a content-length header passed Necko’s checks but ran into problems in the HTTP/3 code.
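The mismatch can be illustrated with a minimal Python sketch. It is not Mozilla's code: the function names are hypothetical, and it assumes, for illustration, that the check which adds a missing header ignores case while the later size lookup requires an exact-case match.

```python
def viaduct_lowercase(headers):
    """Lowercase every header name, as the library is described as doing."""
    return {name.lower(): value for name, value in headers.items()}


def necko_ensure_content_length(headers, body):
    """Add Content-Length only if no header of that name exists (case ignored)."""
    if not any(name.lower() == "content-length" for name in headers):
        headers = {**headers, "Content-Length": str(len(body))}
    return headers


def http3_body_size(headers):
    """Exact-case lookup (an assumption): misses a lowercased header name."""
    value = headers.get("Content-Length")
    return int(value) if value is not None else None


body = b"ping"

# A request that arrives without the header gets one added, so the size is known.
direct = necko_ensure_content_length({}, body)

# A request that set the header itself and was lowercased on the way in:
# the first check is satisfied, but the exact-case lookup finds nothing.
via_viaduct = necko_ensure_content_length(
    viaduct_lowercase({"Content-Length": str(len(body))}), body
)

print(http3_body_size(direct))       # 4
print(http3_body_size(via_viaduct))  # None
```

In this sketch the second request carries a body whose size the final lookup cannot determine, which is the kind of inconsistency the article describes.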
The only Rust component that uses the network stack and adds a content-length header is the Firefox web browser’s Telemetry component. Mozilla notes that this is why disabling telemetry in Firefox solved the problem on the user side. Disabling HTTP/3 also solved it.
The problem caused an infinite loop, which blocked all further network communication because “all network requests go through a socket thread”, according to Mozilla.
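Why one stuck request stalls everything else can be shown with a small Python sketch of a single worker thread. This is a simplified model, not Firefox's socket thread: the names `socket_thread`, `stuck_request` and `run_demo` are hypothetical, and the "infinite" loop is given a release switch so that the demo can terminate.

```python
import queue
import threading


def socket_thread(tasks, done):
    """Run queued tasks strictly one at a time, like a single network thread."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: stop the worker
            break
        done.append(task())


def run_demo():
    tasks = queue.Queue()
    done = []
    release = threading.Event()

    def stuck_request():
        # Stand-in for the infinite loop: spins until the demo releases it.
        while not release.is_set():
            release.wait(0.01)
        return "telemetry"

    tasks.put(stuck_request)
    tasks.put(lambda: "page load")  # queued behind the stuck request

    worker = threading.Thread(target=socket_thread, args=(tasks, done))
    worker.start()

    # While the first task spins, nothing behind it can run.
    worker.join(timeout=0.2)
    blocked_snapshot = list(done)

    # Releasing the loop lets the queue drain in order.
    release.set()
    tasks.put(None)
    worker.join()
    return blocked_snapshot, done
```

Calling `run_demo()` returns an empty snapshot taken while the first task is spinning, followed by both results once it is released. In the real incident there was no release, so every later request stayed queued.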
Mozilla says it has learned several lessons from the incident. It is investigating all of its load balancers and reviewing their configurations to prevent similar issues in the future. The deployment of HTTP/3 at Google, which was the cloud provider in question, had not been announced. Finally, Mozilla plans to run more system tests in the future with “different HTTP versions”.
Mozilla reacted quickly to the emergency and resolved it. Even so, the incident may have damaged its reputation, and some users may have switched to another browser in the process. Mozilla should consider whether it is a good idea to rely on cloud infrastructure operated by its biggest rival in the browser space. Some Firefox users may also suggest that the organization examine how the browser handles requests, to ensure that non-essential requests, such as telemetry or crash reports, can never block connections that the user is trying to establish.
Now you: what is your opinion of the incident?