Seems our load balancer is a little underpowered and it's stalling momentarily from time to time. Most of you probably didn't even notice it, but some of you may have.

I am going to upgrade that today to the next level. It may result in up to 5 minutes of downtime (the new architecture will have multiple load balancers).

Either some of the fixes I did yesterday wound up addressing the latency problem we had, or it was intermittent. Either way, everything has been operating fine for the last 12 hours or so. I will continue to monitor the situation, but for now everything is in the green on QOTO.

Assuming no more problems crop up, the team and I are continuing to build our beta environment, which will enable smooth update rollouts in the near future. We have about 5 updates developed in house that we want to apply soon.

One problem fixed, another crops up...

So while that last fix did make things better, there seems to be a bit of lingering slowness in a few places, but things are mostly working.

I worked all day (and it's midnight now) to bring up some very extensive monitoring tools to help the team diagnose the issue.

I will work through the weekend to help improve things further, though I may need to get some sleep before I can fully resolve this. At least things are working aside from a slight bit of lag. I will keep everyone on QOTO updated.

Today we had some problems on the server and it was slower than usual. It had to do with a mismatch in Docker Engine versions. I managed to keep the server running while I diagnosed the problem, but it was noticeably slow, with the occasional 404.
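
For the curious, this kind of mismatch usually shows up as soon as you compare what the engine reports on each host. The sketch below uses the Docker SDK for Python, which is my assumption here rather than the exact check I ran; plain `docker version` on the command line reports the same information.

```python
#!/usr/bin/env python3
"""Rough sketch: comparing Docker Engine versions across hosts.

Assumes the Docker SDK for Python (pip install docker); run it on each host
and compare the output.
"""
import docker

client = docker.from_env()
info = client.version()  # version data for the daemon this client talks to

# The fields that matter when hunting a client/daemon mismatch.
print("Engine version :", info.get("Version"))
print("API version    :", info.get("ApiVersion"))
print("Min API version:", info.get("MinAPIVersion"))
```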

Now that I've found the problem I am applying the fix, but you should already notice things are mostly back to normal. In a few minutes everything should be fully responsive again.

Sorry to everyone on QOTO for the recent difficulties. This was a very big migration for us and we are doing a LOT of work, so it breaks a few things up front, but in the long term it will mean more updates, faster updates, and a more stable, scalable system.

We still have about 5 updates we are planning to apply soon; we are just perfecting the environment first and setting up a beta environment, so stay tuned.

QOTO is back up after a short downtime. As far as I can tell the fix went smoothly. Hopefully that will address the last of the problems from the migration.

In about 10 minutes QOTO will be going down briefly in an attempt to fix a 16 GB table that may be at the root of one small lingering problem post-migration. Luckily we have good backups, and the table can always be recreated from scratch.

So QOTO should be back up shortly, hopefully with the last needed fix in place, and we can start the upgrades soon.

Health checks and internal networks are now in place. It went smoothly!

This should ensure that if our system crashes **for any reason** it will automatically restore itself. Should give us better uptime in the future.


QT: qoto.org/@QOTO/111988351143316
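
For anyone curious what a healthcheck actually is: it's just a small command Docker runs on a schedule, and a failing exit code marks the container unhealthy so it can be restarted automatically. Here's a minimal sketch of the kind of check involved; the port and endpoint are assumptions for illustration, not our exact config.

```python
#!/usr/bin/env python3
"""Minimal container healthcheck sketch (illustrative, not our exact config).

Docker runs this periodically; a non-zero exit marks the container unhealthy,
and the restart/replace logic can then bring it back automatically.
"""
import sys
import urllib.request

# Assumption: the web container exposes a /health endpoint on port 3000.
HEALTH_URL = "http://localhost:3000/health"


def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            # Any 2xx response counts as healthy.
            return 0 if 200 <= resp.status < 300 else 1
    except Exception:
        # Connection refused, timeout, 5xx, etc. -> unhealthy.
        return 1


if __name__ == "__main__":
    sys.exit(main())
```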

QOTO Announcements & Polls  
We are going to try to add some healthchecks to our containers to help protect them in case they become unresponsive. No one on #QOTO should notice...

We are going to try to add some healthchecks to our containers to help protect them in case they become unresponsive. No one on #QOTO should notice any real downtime unless I break something. If I do, downtime should be mere seconds as I bring it back up.

Just letting people know in case anything goes wrong; it shouldn't be noticeable if things go right.

After another short downtime everything is fixed! I had to recreate an index that got lost in the migration, which was slowing down the DB and everything else. No amount of resources was going to help.

But it is fixed now and all queues are empty or very close to it. We will now downgrade the DB to a more sane level now that it is fixed (we upgraded it recently just to keep the system running). But it will still be a pretty hefty system for us, and we can always scale back up when needed.

TL;DR: everything is working fine now.

PS: we are now going to work on a staging environment to test upgrades so we can safely start moving the main server through the upgrade cycle. Stay tuned.

So we found the real problem haunting us. Turns out we didn't even need the bigger database; there was just an index that got dropped during the migration. We are working now to put it back in place, at which point things should be back up to their normal speed.
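
For the technically curious, putting a dropped index back is essentially one statement. The sketch below uses psycopg2 with a made-up table/index name (I'm not pasting our actual schema here), and `CREATE INDEX CONCURRENTLY` so the rebuild doesn't lock the table while the site stays up.

```python
#!/usr/bin/env python3
"""Sketch: recreating a dropped Postgres index (names here are made up)."""
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production user=mastodon host=localhost")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Hypothetical example of the kind of index that went missing.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS index_statuses_on_account_id "
        "ON statuses (account_id)"
    )

conn.close()
```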

I just upgraded the DB server to 2x the CPU. This seems to have fixed the underlying issue. The pull queue (not related to most things) is now recovering as well.

So, reviewing everything the next day, it seems almost everything on QOTO is back and working, with one exception that won't affect things too visibly.

One of our low-priority Sidekiq queues, the pull queue, is backlogging now. All other queues are staying ahead of the curve.

This queue largely deals with pulling in remote media, so you may see the occasional dead image. It is partly working, though. We think the problem is with ulimits and are working on it; it shouldn't affect main operations and hopefully will be fixed soon. I will keep everyone updated.
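
For those following along at home, Sidekiq keeps each queue as a plain Redis list, so watching the backlog (and the file-descriptor limit we currently suspect) is easy to script. This is only a rough sketch; the queue names and Redis URL are assumptions for a fairly standard Mastodon setup.

```python
#!/usr/bin/env python3
"""Sketch: watch Sidekiq queue backlogs and the process file-descriptor limit.

Queue names and the Redis URL are assumptions for a typical Mastodon setup.
"""
import resource

import redis  # pip install redis

r = redis.Redis.from_url("redis://localhost:6379/0")

# Sidekiq stores each queue as a Redis list named "queue:<name>".
for queue in ("default", "push", "pull", "ingress", "mailers"):
    print(f"{queue:>8}: {r.llen(f'queue:{queue}')} jobs waiting")

# The pull queue fetches remote media, so it opens a lot of sockets/files;
# a low RLIMIT_NOFILE (ulimit -n) can quietly choke it.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")
```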

It appears that, now that the backlog is resolved, pages load quickly and images can be uploaded with little delay.

There seems to be one minor issue behind the scenes that I need to tweak, but overall it does appear to be working.

If you find any lingering issues please report them to one of the admins.
QT: qoto.org/@freemo/1119813978644

One last update before I go to bed and disappear for 12 hours.

The backlog has gone from 1.2 million at its peak earlier today to 0.4 million now, after we reconfigured things. It is steadily going down and everything should be back in working order before I get up.

One or two people were able to get images loaded after a VERY long wait. So while images still aren't working, it seems very likely related to the backlog. In a few hours, when the backlog clears, I expect image uploads will work again. If not, I will check what the problem is in the morning.

Other than that most things appear to be working and everything should be functional soon.

The backlog is about 2/3rds complete. This afternoon it peaked at 1.2 million and now it is ~0.4 million. I just moved to PgBouncer to speed that up a bit. Looks like it more than doubled the processing rate. Almost there.
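
For context, PgBouncer just sits between the app and Postgres as a connection pooler, so many workers share a small number of real database connections instead of each holding their own, which is typically where the speedup comes from. A minimal sketch of the difference, assuming the default ports and made-up credentials:

```python
#!/usr/bin/env python3
"""Sketch: pointing an app at PgBouncer instead of Postgres directly.

Ports and credentials are the usual defaults, not necessarily our exact config.
"""
import psycopg2

# Direct connection: every worker holds its own Postgres backend.
direct = psycopg2.connect(host="localhost", port=5432,
                          dbname="mastodon_production", user="mastodon")

# Via PgBouncer (default port 6432): many clients share a small pool of
# real backends managed by the pooler.
pooled = psycopg2.connect(host="localhost", port=6432,
                          dbname="mastodon_production", user="mastodon")

for conn in (direct, pooled):
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
    conn.close()
```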

@QOTO Recognizing all your hard work. I'm sure it isn't easy running a large mastodon instance on what's probably a mostly thankless job. So, on that note... thank you. It is certainly appreciated.

Woot! QOTO is now 50% of the way through catching up on the Sidekiq backlog!

So the Sidekiq backlog on QOTO is still being worked through. As of a few hours ago, at the peak of the problem, our backlog was 1.2 million jobs. As of right now it is down to 0.7 million jobs. It is steadily decreasing; there is just a lot to get through from the downtime. We are expecting everything to be back in working order when it's done, which should be around the end of the day. In the meantime things are still usable, but you may experience very long lag on some actions.

Images still can't be uploaded; we hope this is the same problem.

As spam comes in we will block the offending servers. Please be patient; this is happening across the whole fedi and we are working on better ways to address it.
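
For anyone wondering what "blocking a server" means in practice: it's a domain block at the instance level, which moderators can apply from the admin UI or the admin API. A rough sketch of the API route follows, with a placeholder token and domain; it assumes Mastodon's /api/v1/admin/domain_blocks endpoint and a token with the admin:write:domain_blocks scope.

```python
#!/usr/bin/env python3
"""Sketch: suspending a spam domain via Mastodon's admin API.

The token and domain are placeholders; the same action is available
in the moderation UI.
"""
import requests  # pip install requests

INSTANCE = "https://qoto.org"
TOKEN = "REPLACE_WITH_ADMIN_TOKEN"  # placeholder, never commit real tokens
SPAM_DOMAIN = "spam.example.com"    # placeholder domain

resp = requests.post(
    f"{INSTANCE}/api/v1/admin/domain_blocks",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"domain": SPAM_DOMAIN, "severity": "suspend"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```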

update.

So I went to bed last night and woke up to find the Sidekiq workers were backlogging and we had 800K backlogged jobs. It was due to a misconfiguration that I have now fixed, and it appears the backlog is quickly resolving itself.

If you noticed any weirdness, it should be resolved in the next few hours as the backlog clears.
