One problem fixed, another crops up...
While that last fix did make things better, a few things are still a bit slow, but everything is mostly working.
I worked all day (and it's midnight now) to bring up some very extensive monitoring tools to help the team diagnose the issue.
I will work through the weekend to help improve things further, though I may need to get some sleep before I can fully resolve this. At least things are working other than a slight bit of lag. I will keep everyone on #QOTO updated.
Today we had some problems on the server and it was slower than usual. It had to do with a version mismatch in Docker Engine. I managed to keep the server running while I diagnosed the problem, but it was noticeably slow with the occasional 404.
Now that I have found the problem I am applying the fix, but you should already notice things are mostly back to normal. In a few minutes everything should be fully responsive again.
Sorry to everyone on #QOTO for the recent difficulties. This was a very big migration for us and we are doing a LOT of work, so it breaks a few things up front, but in the long term it will mean more updates, faster updates, and a more stable, scalable system.
We still have about 5 updates we are planning to apply soon. We are just perfecting the environment first and setting up a beta environment, so stay tuned.
#QOTO is back up after a short downtime. As far as I can tell the fix went smoothly. Hopefully that will address the last of the problems from migration.
In about 10 minutes #QOTO will be going down briefly in an attempt to fix a 16G table that may be at the root of one small lingering post-migration problem. Luckily we have good backups and the table can always be recreated from scratch.
So we should be back up shortly, hopefully with the last needed fix in place, and then we can start the upgrades soon.
Health checks and internal networks are now in place. It went smoothly!
This should ensure that if our system crashes **for any reason** it will automatically restore itself. It should give us better uptime in the future.
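For anyone curious how a health check leads to automatic recovery, here is a minimal sketch of the general idea in Python. It is not our actual container configuration: `supervise`, `check`, and `restart` are hypothetical names for illustration, and a real supervisor would also wait between probes.

```python
from typing import Callable


def supervise(check: Callable[[], bool],
              restart: Callable[[], None],
              rounds: int,
              max_failures: int = 3) -> int:
    """Probe a service `rounds` times and restart it after
    `max_failures` consecutive failed probes.

    Returns how many restarts were triggered.
    """
    failures = 0
    restarts = 0
    for _ in range(rounds):
        if check():
            failures = 0  # a healthy probe resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # start counting again after the restart
    return restarts
```

Requiring several failures in a row before restarting keeps a single slow response from bouncing a service that is otherwise fine.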
We are going to try to add some health checks to our containers to help protect them in case they become unresponsive. No one on #QOTO should notice any real downtime unless I break something. If I do, downtime should be mere seconds as I bring it back up.
Just letting people know in case anything goes wrong. It shouldn't be noticed if things go right.
After another short downtime everything is fixed! #QOTO had to recreate an index that got lost in the migration, which was slowing down the DB and everything else. No amount of resources was going to help.
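To illustrate why a missing index hurts no matter how much hardware you add, here is a small sketch. It uses Python's built-in SQLite as a stand-in (not our production database), and the `statuses` table, its columns, and the index name are made up for the example.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the plan text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statuses (id INTEGER PRIMARY KEY, account_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO statuses (account_id, body) VALUES (?, ?)",
    [(i % 100, "post") for i in range(1000)],
)

sql = "SELECT body FROM statuses WHERE account_id = ?"

# Without an index the database has to read every row in the table.
plan_without_index = query_plan(conn, sql, (42,))

# With the index it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_statuses_account ON statuses (account_id)")
plan_with_index = query_plan(conn, sql, (42,))

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

The first plan is a full table scan and the second uses the index. On a very large table that difference is far bigger than anything extra CPU can make up for.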
But it is fixed now, and all queues are empty or very close to it. We will downgrade the DB to a more sane level now that it is fixed (we upgraded it for a short time to keep the system running). It will still be a pretty hefty system for us, and we can always scale back up when needed.
TL;DR everything is working fine now.
PS: we are now going to work on a staging environment to test upgrades so we can safely start moving the main server through the upgrade cycle. Stay tuned.
I just upgraded the #QOTO DB server to 2x the CPU. This seems to have fixed the underlying issue. The pull queue (not related to most things) is now recovering as well.
Reviewing everything the next day at #QOTO, it seems almost everything is back and working, with one exception that won't affect things too visibly.
One of our low-priority Sidekiq queues, the pull queue, is backlogging now. All other queues are staying ahead of the curve.
This queue largely deals with pulling in remote media, so you may see the occasional dead image. It is partly working though. We think the problem is with ulimits and are working on it. It shouldn't affect main operations and hopefully will be fixed soon. I will keep everyone updated.
It appears now that the #QOTO backlog is resolved: pages load quickly and images can be uploaded with little delay.
There seems to be one minor issue behind the scenes I need to tweak, but overall it does appear to be working.
If you find any lingering issues please report them to one of the admins.
One last update before I go to bed and disappear for 12 hours.
The backlog has gone from 1.2 mil at its peak earlier today to 0.4 mil now after we reconfigured things. It is steadily going down and everything should be in working order before I get up.
One or two people were able to get images loaded after a VERY long wait. So while images still aren't working, it seems very likely to be related to the backlog. In a few hours when the backlog clears I expect image uploads should work again. If not I will check what the problem is in the morning.
Other than that most things appear to be working and everything should be functional soon.
@QOTO Recognizing all your hard work. I'm sure it isn't easy running a large mastodon instance on what's probably a mostly thankless job. So, on that note... thank you. It is certainly appreciated.
Woot #QOTO is now 50% through catching up on the sidekiq backlog!
So the Sidekiq backlog on #QOTO is still progressing. A few hours ago, at the peak of the problem, our backlog was 1.2 million jobs. As of right now it's down to 0.7 million jobs. It is steadily decreasing; there is just a lot to get through from the downtime. We are expecting everything to be back in working order when it's done, which should be around end of day. In the meantime things are still usable but you may experience very long lag on some actions.
Images still can't be uploaded. We hope this is the same problem.
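For the curious, here is a rough sketch of how an "end of day" estimate can be worked out from the backlog figures above. It assumes the backlog keeps shrinking at the same net rate. The 5 hours elapsed in the example is an assumed stand-in for "a few hours", and `hours_to_drain` is just an illustrative helper, not one of our tools.

```python
def hours_to_drain(peak_jobs: float, current_jobs: float, hours_elapsed: float) -> float:
    """Estimate hours until the backlog is empty, assuming a constant net drain rate."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    drained = peak_jobs - current_jobs
    if drained <= 0:
        raise ValueError("backlog is not shrinking, so no estimate is possible")
    rate_per_hour = drained / hours_elapsed  # net jobs cleared per hour
    return current_jobs / rate_per_hour


# 1.2 million at the peak, 0.7 million now, assuming 5 hours in between.
estimate = hours_to_drain(1_200_000, 700_000, 5)
print(f"about {estimate:.1f} more hours at the current rate")
```

The real number depends on how many new jobs keep arriving, so treat it as a ballpark only.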
As spam comes in we will block the servers on the #Fediverse. Please be patient; this is happening across the whole fedi and we are working on better ways to address it.
#QOTO update.
So I went to bed last night and woke up to find the Sidekiq workers were backlogging and we had 800K backlogged jobs. It was due to a misconfiguration that I have now fixed, and it appears the backlog is quickly resolving itself.
If you noticed any weirdness, this should be resolved in the next few hours as the backlog clears.
#QOTO is back up.
Please keep in mind #QOTO will need quite a few hours to handle the backlog from the downtime. Tomorrow we are going to split out the workers so they can begin using the scaling, so this should be fixed tomorrow. For the next 24 hours expect things to be a bit slow, and uploading pictures probably won't work until we fix that.
Give it 24 hours and hopefully things will be back to normal.
We only post announcements here and occasionally use polls to help us make decisions that affect our user base.
We don't usually respond quickly to direct messages to this account. If you need help with anything related to the QOTO servers, including moderation, then please contact one of our administrators. They are listed on our about page: