My former employer, @OCCRP, just went live with their new website, and it is pretty slick!
occrp.org/en

This is bitter-sweet for me.

On one hand, so glad to see them have a new site, finally! The old site was a mess.

On the other: I had designed and built the infra that hosted their site through the Panama Papers (arguably OCCRP's big break). Infra that did not rely on external CDNs or "DDoS-protection" providers.

That infra is no longer in use as of today. Replaced by Google. 🥲

For those curious, the infra was mostly:

- a pair of back-end servers (the main site was an ancient Joomla install…), in a production / warm standby configuration;

- a couple dozen very thin VPSes acting as (micro-) caching reverse proxies; we called them "fasadas" (from the Bosnian word for a façade);

- a bunch of scripts that tied it all together.

The stripped down and simplified nginx config for the fasadas lives as a FLOSS project here:
0xacab.org/rysiek/fasada

The production / warm standby back-end servers were automagically synced every hour. Yes, including the database.

This meant that:

1. we had a close-to-production testing server always available;

2. we had a way of quickly switching to an almost completely up-to-date backup back-end server in case anything went wrong with production.

The set-up on these back-ends included *two* nginx instances running in parallel on different ports but with the same config, serving the same content.

Yes, on each.

Each fasada (i.e. reverse proxy on the edge) was configured to use *both* of these nginx instances on the currently-production back-end server.

Because everything was in docker containers, we could upgrade each nginx instance separately.

Whenever we were deploying nginx config changes or were upgrading nginx itself, we would do that one instance at a time. If it got b0rked, fasadas would just stop using the b0rked back-end nginx instance and switch to the other one.

No downtime. No stress.
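
A minimal sketch of that failover logic, as a toy Python model rather than the actual fasada config (nginx `upstream` blocks do this natively via `max_fails` / `fail_timeout`). The class name, addresses, ports, and the 10-second timeout are all made up for illustration:

```python
class Fasada:
    """Toy model of one edge proxy that knows two back-end nginx instances."""

    def __init__(self, backends, fail_timeout=10.0):
        self.backends = list(backends)
        self.fail_timeout = fail_timeout  # seconds to avoid a failed instance (assumed value)
        self.failed_at = {}               # backend -> time of its last failure

    def healthy(self, now):
        """Backends that have not failed within the last fail_timeout seconds."""
        return [
            b for b in self.backends
            if now - self.failed_at.get(b, float("-inf")) >= self.fail_timeout
        ]

    def fetch(self, path, send, now):
        """Try healthy backends in order; fall back to all of them if none look healthy."""
        for backend in self.healthy(now) or self.backends:
            try:
                return send(backend, path)
            except ConnectionError:
                # Remember the failure so later requests skip this instance for a while.
                self.failed_at[backend] = now
        raise ConnectionError("both back-end instances are down")
```

With one instance b0rked mid-upgrade, requests just land on the other one, and the broken instance gets retried once the timeout has passed.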

IP addresses of active fasadas (that is, ones that were supposed to handle production traffic) were simply added as A records for `occrp.org`.

This was Good Enough™, as browsers were already smart about selecting an endpoint IP address and sticking to it across requests related to the same domain.

This also meant that if an active fasada went under for whatever reason, visitors would mostly not notice – their browsers would retry against one of the remaining IPs.
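
Roughly what a client does with several A records, sketched in Python. The function name and the injected `connect` callable are hypothetical, and real browsers are more sophisticated than this:

```python
def connect_to_first_reachable(addresses, connect):
    """Return (address, connection) for the first address that accepts a connection."""
    last_error = None
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as error:
            # This IP is dead or unreachable; move on to the next A record.
            last_error = error
    raise OSError(f"no address reachable: {last_error}")
```

Python's own `socket.create_connection()` behaves similarly: it walks through every address returned for a hostname until one connects.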

We had about 2 dozen fasadas configured, deployed, and ready to serve production traffic at any given time.

But we only kept 4-6 actually active for `occrp.org` (and some others for other sites we hosted).

The other ones were an "emergency stash".

If an active fasada did go under, we'd swap its IP address out of occrp.org A records, and add one of the currently healthy standbys instead.

If we started getting way more traffic than the current active fasada set could handle, we'd add more.
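
The bookkeeping behind those two moves fits in a few lines. This is a sketch with hypothetical function names; actually pushing the resulting A record set to DNS is left out:

```python
def replace_fasada(active, standby, failed):
    """Return new (active, standby) lists with `failed` swapped for the first standby."""
    if failed not in active:
        raise ValueError(f"{failed} is not in the active set")
    if not standby:
        raise RuntimeError("emergency stash is empty")
    replacement, *remaining = standby
    new_active = [replacement if ip == failed else ip for ip in active]
    return new_active, remaining


def add_fasadas(active, standby, count):
    """Return new (active, standby) lists with up to `count` standbys promoted."""
    count = min(count, len(standby))
    return active + standby[:count], standby[count:]
```

The `active` list is what would end up as the A records for the site; the `standby` list is the emergency stash.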

In my experience, what brings a site down is very rarely an *actual* DDoS. Most of the time it is an organic traffic spike hitting a slow back-end.

Hence:
1. microcaching
2. my exasperation with CloudFlare calling everything a DDoS 🙄
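
For the curious, microcaching boils down to something like this toy Python model (not the nginx config the fasadas ran; the class name and the 1-second lifetime are assumptions). Whatever the back-end returns is cached very briefly, error responses included, so a spike of identical requests hits the slow back-end about once per second per URL:

```python
class Microcache:
    """Cache every response, whatever its status, for a very short time."""

    def __init__(self, fetch, ttl=1.0):
        self.fetch = fetch    # callable: url -> (status, body); the slow back-end
        self.ttl = ttl        # cache lifetime in seconds (assumed value)
        self.entries = {}     # url -> (expires_at, response)

    def get(self, url, now):
        entry = self.entries.get(url)
        if entry is not None and now < entry[0]:
            return entry[1]   # still fresh: the back-end is not touched
        response = self.fetch(url)
        self.entries[url] = (now + self.ttl, response)
        return response
```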

But I digress!

We did get honest-to-Dog DDoSes, some pretty substantial. When that happened we just… swapped out *all* active fasadas.

The DDoS would happily continue against the 4 to 6 old IP addresses… While new visitors would get served from other nodes. 😸

See, when you're DDoSing someone, you don't want to waste your bandwidth on checking DNS records, now do you? You want to put everything you've got into these malicious packets.

And when you do, and the target just moves on to a different set of IP addresses, you're DDoSing something that does not matter. Have at it! :blobcatcoffee:
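
As a sketch, with made-up names and documentation-range IPs, swapping the whole active set looks like this. The attacker in this model resolved the domain once and never looks again, which is an assumption about the attack, not a law of nature:

```python
def swap_all(active, standby):
    """Replace every active IP with a standby; return (new_active, new_standby, retired)."""
    if len(standby) < len(active):
        raise RuntimeError("not enough standbys to replace the whole active set")
    n = len(active)
    return standby[:n], standby[n:], list(active)


active = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
standby = [f"198.51.100.{i}" for i in range(1, 21)]

attack_targets = set(active)   # what the attacker resolved before the swap
active, standby, retired = swap_all(active, standby)

# None of the IPs now in DNS are being attacked.
still_under_fire = attack_targets & set(active)
```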

Now, I am not saying *all* DDoSes work this way.

I *am* saying that all the DDoSes I have seen against OCCRP's infra when I was there worked this way.

The time we really went down hard was when our dedi provider (which was otherwise great!) overeagerly blackholed DDoS traffic…

…blackholing also our production back-end server.

Took us 45min to deal with this, mainly because I was out at lunch and for *once* I did not take my phone with me. While a certain @smari happened to be on vacation literally on the other side of the globe.

Dealing with this meant pushing a quick config change to the fasadas to switch to the warm spare back-end.

:oof: What a blast from the past!

I should probably write this all up in a blogpost, with some more lessons-learned (for example: remember to microcache your 4xx/5xx errors as well).

Thanks for joining me for this ride down memory lane!

I will now take your questions. :blobcatcoffee:

@rysiek I worked my way back up the thread because I read "microcache" as "microfiche" ... ahem!

Interesting stuff even if you stuck to current-century tech, thank you!

@rysiek Q: what TTL did you typically have on your A records?

@rysiek @DamonHD

Does this mean that your way of mitigating ddoses would kick in on that timescale?

@robryk @DamonHD for some visitors it would be instantaneous, if their recursive resolvers had not cached the occrp.org A records yet.

For those whose resolvers had, the worst-case scenario is roughly 2×TTL if the request happens *just* before we push DNS changes.

There are nuances and caveats, but that's an effective enough way of thinking about it.

@rysiek @DamonHD wouldn't it be ~1xTTL+epsilon? (The recursive resolver would reask once TTL has expired, would promptly get the new answer, with only requests it was serving before it got the new answer getting old responses, no?)

@robryk @DamonHD there are all sorts of small random delays that can push it over the edge and mean that a recursive resolver still serves the cached response even though technically the TTL should have *just* expired.

Or, a recursive resolver gets a request from user A *just* before DNS changes are pushed, caches that. Then user B issues a request *just* before the TTL expires and gets the cached response from the recursive resolver.


@rysiek @DamonHD ah, I just realized that I don't know how DNS TTL works in _clients_. Thanks, I will need to look it up.
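
One way to line up the ~1×TTL and ~2×TTL estimates from this exchange is a back-of-the-envelope model. It assumes a standards-following recursive resolver hands out the *remaining* TTL, and that whether the bound doubles depends on a downstream cache (stub resolver, OS, browser) restarting the clock with a full TTL. The function name and the 300-second TTL are hypothetical; the actual TTL is not stated in the thread:

```python
def worst_case_staleness(ttl, downstream_restarts_clock):
    """Upper bound, in seconds after a DNS change, on clients still using old A records."""
    # The recursive resolver cached the old records just before the change,
    # so it keeps serving them for up to one full TTL.
    resolver_window = ttl
    # A client asks just before that entry expires. A downstream cache then holds
    # the old answer for the remaining TTL (about zero) or for a fresh full TTL.
    downstream_window = ttl if downstream_restarts_clock else 0
    return resolver_window + downstream_window


ttl = 300  # seconds; hypothetical value
bound_if_remaining_ttl_honoured = worst_case_staleness(ttl, downstream_restarts_clock=False)
bound_if_clock_restarted = worst_case_staleness(ttl, downstream_restarts_clock=True)
```

Under these assumptions the first bound is about 1×TTL and the second about 2×TTL; the small random delays mentioned above come on top of either.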
