On Monday morning we (Mozilla) detected a very large crash spike affecting #Firefox users on Linux, specifically on an older version of a Debian-based distribution.

It turned out to be an interesting bug involving the #Linux kernel and #Google JavaScript code, so let me tell you about it.

A thread 🧵

bugzilla.mozilla.org/show_bug. 1/6

The crash started apparently out of the blue, hitting thousands of Argentinian users of a Debian-based distro called Huayra, specifically version 5, which is based on Debian 10.

bugzilla.mozilla.org/show_bug.

Everybody seemed to crash while searching for images on Google. All versions of Firefox - even very old ones - were affected, suggesting that the change didn't happen on our side, but on Google's. 2/6

A colleague analyzed Firefox's behavior at the point of the crash and realized that it happened during stack probing. The JIT touched the area that would hold the variables for the next JavaScript call and somehow hit an overflow.

bugzilla.mozilla.org/show_bug.

This is where things got weird: Google's code was allocating 20000 variables in a single frame. Ouch, that's probably some machine-generated code that got out of hand. Think twice before using ChatGPT to write code. 3/6
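
To get a feel for the size involved, here is a rough back-of-the-envelope sketch. The thread does not say how much stack each variable takes; one 8-byte slot per variable is an assumption here, and `frame_bytes` and `probe_address` are hypothetical helper names, not Firefox internals.

```python
SLOT_BYTES = 8  # assumption: one 8-byte stack slot per variable


def frame_bytes(num_locals, slot_bytes=SLOT_BYTES):
    """Approximate stack space needed for a frame's local variables."""
    return num_locals * slot_bytes


def probe_address(stack_pointer, num_locals, slot_bytes=SLOT_BYTES):
    """Lowest address a probe of that frame would touch.

    The stack grows toward lower addresses, so the far end of the
    new frame sits below the current stack pointer.
    """
    return stack_pointer - frame_bytes(num_locals, slot_bytes)


if __name__ == "__main__":
    size = frame_bytes(20000)
    print(size, "bytes, about", size / 1024, "KiB")
```

Under that assumption, a 20000-variable frame needs roughly 156 KiB, so the probe lands that far below the stack pointer.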

But why was it crashing? Linux automatically extends the stack and we had reserved more than enough space, something that we confirmed by looking at the memory map of the affected processes.

Well, it turns out that the Linux kernel used to have a check that prevented stack accesses that were too far from the stack pointer. Specifically, accesses more than 64KiB + 256 bytes below the stack pointer would crash instead of extending the stack.

github.com/torvalds/linux/blob 4/6
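
The old check boils down to a single comparison. Below is a simplified model of it, not the kernel code itself: `old_kernel_rejects` is a hypothetical name, the 256 bytes are taken to be 32 machine words of 8 bytes each on x86-64, and the other conditions the fault handler looks at (for example, whether the address falls below the mapped stack area at all) are left out.

```python
WORD_BYTES = 8                      # size of a machine word on x86-64
CUSHION = 65536 + 32 * WORD_BYTES   # 64KiB + 256 bytes


def old_kernel_rejects(address, stack_pointer):
    """Model of the removed heuristic.

    Returns True when the faulting address is more than CUSHION bytes
    below the stack pointer, in which case the old kernel would signal
    a fault instead of growing the stack.
    """
    return address + CUSHION < stack_pointer


if __name__ == "__main__":
    sp = 0x7FFD00000000  # arbitrary example stack pointer
    for distance in (4096, CUSHION, CUSHION + 1, 160000):
        verdict = "rejected" if old_kernel_rejects(sp - distance, sp) else "allowed"
        print(distance, "bytes below sp:", verdict)
```

An access exactly 64KiB + 256 bytes below the stack pointer is still allowed in this model; one byte further is rejected, and so is anything the size of a frame holding tens of thousands of 8-byte slots.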

This was fixed in kernel 4.20 so users of more recent distros are unaffected, and we'll see if we can deploy a workaround to help users of older systems.

github.com/torvalds/linux/comm

It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control. 5/6

And since we're at it let's shame Google for putting 20 thousand variables in a single function. Bad Google, no cookie.

Well no cookie anyway since Firefox has total cookie protection!

blog.mozilla.org/security/2021 6/6

@gabrielesvelto

And since we’re at it let’s shame Google for putting 20 thousand variables in a single function. Bad Google, no cookie.

I once worked on a game engine that used ODE as its physics layer. At the core of ODE collision detection and handling was a function that built a Jacobian matrix on the stack (using alloca) to compute the forces needed to separate colliding objects. We crashed on touching the stack redzone in Windows when our engine ran as a plugin in Internet Explorer. That was not something we could fix easily on our end, since the size of a thread redzone is decided at compile time by the application configs (and, again, the application was Internet Explorer).

Filed a ticket against ODE maintainers and their response was basically “We don’t consider that application domain to be a meaningful one to fix bugs in.” So we fixed it on our end by #define-ing alloca away to a heap allocation in a tiny buffer.

Point of this story is: no shame on Google. Google doesn’t consider the Firefox browser on old Linux configurations a meaningful application domain to fix bugs in. And if you can’t point to where in the JavaScript language spec it says 20,000 variables is disallowed… Shame on Mozilla for having a noncompliant JS implementation. ;)

At least it was easy to fix.

It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control.

Oh yeah… That’s the nature of Internet software. It is interesting every time. :) I’ve had to get up from the keyboard and take a walk twice in my career, and the first time was when I realized if I’m going to be writing web software, that’s going to be, like, my whole career: stuff breaking because someone changed something somewhere that I was relying on for their own reasons. Internet software is like 1/3 technology and 2/3 social network effects.

@mtomczak this is sadly a very common occurrence for us. Just in the past two months we dealt with a couple of CPU bugs and an issue in a Rust crate that would only affect people running Windows 7 installations without SP1 installed on AVX-ready CPUs (yes, in 2023).

As for Google, they reverted the change before we contacted them, so chances are that it was either wildly inefficient or it also messed Chrome up.
