Website performance - Apr. 12 (PDT) update

Apr 12, 2010 16:15

Hello everyone, here is the update from the weekend. I will cover the 2 points from the previous lj_maintenance posts so we can close the loop on them.

Speed up the way our user information is pulled out of our databases.
There were 2 more emergency patches released this weekend that have been helping the situation. We will continue to look for additional ways to do this; it is an ongoing process.

Out of the 10 servers where we store our data, the first one is really struggling to keep up with the demands on it. We plan to move some of the users on that first server around so that the requests for information are spread around better.
Please keep in mind that this only directly impacted 1 community (ohnotheydidnt) and was not planned for any other journal. We attempted this on Friday night/Saturday morning but the move bombed out after 3-4 hours, even though we made a slight modification the 2nd time around. We are redoing the 'move' program to be able to handle moving such a large community.

I'd like to clarify some things about ONTD that I saw pop up in the comments on the last post. The performance problems are not due to ONTD and it's not just the database that they are on that is having problems. Getting rid of a particular community is NOT going to solve our problems. And even though this move did not work this past weekend, the move itself is *not* a pre-requisite to us having a fully functioning, well-performing site.



  1. Last week we noticed that the SQL select that populates a journal calendar with how many posts you have per day was not going through Memcached. This select was almost 50% of selects on the database that ONTD was on, and 20-40% on all other databases. Because of the sheer number of posts-per-day that ONTD has, this was really really hurting that database. We are now going through Memcached for this select, but we've pushed the problem around a bit as our Memcached tier is not as built out as we would like. The Memcached expansion was approved for later this year but we will try to push the timeline up. Performance should still be a LOT better than it was before this change.
  2. 2 more emergency patches Saturday: Subscription notification info is now being pulled from the slave side of the database cluster (changelog entry) and we are no longer loading the entire inbox just to delete a few entries (changelog entry).
  3. Changed innodb_flush_log_at_trx_commit from default of "1" to "2". This alone dropped our SQL connections to 2/10 of what we were seeing before we changed it. (Yeah... those of you that are DBAs and sysadmins, I see what gesture you're making right now and that's pretty rude!) We had upgraded our MySQL servers to 5.0 from (ahem) something much much earlier than 5.0 and did not get around to all the new tunables. No, the upgrade was not an initial factor as we started the upgrade *after* we started having problems.
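
For anyone curious what "going through Memcached" means for the calendar select in item 1, here is a minimal sketch of the general cache-aside pattern. It is not our actual code: `DictCache`, `FakeDatabase`, the `daycounts:` key format and every function name below are made up for illustration, and an in-process dictionary stands in for a real Memcached client.

```python
from collections import Counter
from datetime import date


class DictCache:
    """In-process stand-in for a Memcached client (hypothetical, get/set/delete only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class FakeDatabase:
    """Stand-in for a journal database that counts how many selects it runs."""

    def __init__(self, post_dates):
        self.post_dates = post_dates  # {journal_id: [date, ...]}
        self.selects = 0

    def count_posts_per_day(self, journal_id):
        # Plays the role of the expensive posts-per-day select.
        self.selects += 1
        days = (d.isoformat() for d in self.post_dates.get(journal_id, []))
        return dict(Counter(days))


def calendar_day_counts(journal_id, cache, db):
    """Return posts-per-day for a journal, asking the cache before the database."""
    key = f"daycounts:{journal_id}"  # hypothetical key format
    counts = cache.get(key)
    if counts is None:  # cache miss: run the select once and remember the result
        counts = db.count_posts_per_day(journal_id)
        cache.set(key, counts)
    return counts


def invalidate_day_counts(journal_id, cache):
    """Drop the cached counts, e.g. after a new post, so the next read refreshes them."""
    cache.delete(f"daycounts:{journal_id}")
```

With this shape, repeated calendar views for the same journal cost one select instead of one per view, which is why a busy community benefits the most. The trade-off is that the load moves to the cache tier, and cached counts have to be invalidated or expired when posts are added or removed.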

TO DO
-----
  1. After the 'move' script has additional error handling put in, we will attempt to move ONTD to a new cluster as this is part of our capacity plan.
  2. Continue to look at MySQL schema changes.
  3. See what other things we can do to lessen, or spread (which still lessens), load on the master side of the database cluster.
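
As a rough illustration of what spreading load off the master can look like, here is a minimal sketch of read/write routing: plain reads go to a slave, everything else goes to the master. This is an assumption-laden example, not our implementation; `ReadWriteRouter`, `connection_for` and the `needs_fresh_data` flag are hypothetical names, and strings stand in for real database connections.

```python
import itertools


class ReadWriteRouter:
    """Pick a database connection for a statement (hypothetical sketch)."""

    def __init__(self, master, replicas):
        self.master = master
        # Rotate through the slaves so reads are spread evenly across them.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def connection_for(self, sql, needs_fresh_data=False):
        """Return a slave for plain SELECTs, the master for everything else."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and not needs_fresh_data and self._replicas is not None:
            return next(self._replicas)
        # Writes, and reads that must see just-written rows, stay on the master
        # because a slave can lag behind it.
        return self.master


router = ReadWriteRouter("master", ["slave-a", "slave-b"])
read_target = router.connection_for("SELECT * FROM subs WHERE userid = 1")
write_target = router.connection_for("DELETE FROM inbox WHERE qid = 2")
```

The `needs_fresh_data` escape hatch matters: anything that has to read its own recent write should still be sent to the master, otherwise replication lag can show stale results.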


Also, status.livejournal.org will be updated soon after I post this. I've always wanted to keep "status" short and sweet, and since we were having slow load times, I thought linking to lj_maintenance would allow us to elaborate (in lj_maintenance) while still keeping the actual status notification on point. That wasn't well thought through on my part; instead, this understandably added to the frustration. In the future, there will be no more links back to the maintenance post. At the most, we'll have an abbreviated outline of what's going on and ETAs (if they're known at the time). You have spoken and we are listening.

If you're still wondering why there are links back to general support communities, it's because it's under the "If you're experiencing other issues" sub-heading where "other" is presumably a non-outage; it's designed to help those of us having specific problems to easily get help rather than trying to dig through the site to find Support.