Last Tuesday, we changed the filesystem we use to store our users' files from XFS to ext4. This required a much longer maintenance outage than normal -- two hours instead of our usual 20-30 minutes.
This post explains why we made the change, and how we did it.
tl;dr for PythonAnywhere users:
We discovered that the quota system we were using with XFS didn't survive hard fileserver reboots in our configuration. After much experimentation, we determined that ext4 handles our particular use case better. So we moved over to using ext4, which was hard, but worthwhile for many reasons.
tl;dr for sysadmins:
- Don't use XFS with quotas on current Ubuntu LTS (or any kernel between 3.11 and 3.17), or you're likely to have long downtime after any unclean shutdown
- If you're doing big rsyncs, it's worth investigating parallelising them with xargs.
A bit of architecture
In order to understand what we changed and why, you'll need a bit of background about how we store our users' files. This is relatively complex, in part because we need to give our users a consistent view of their data regardless of which server their code is running on -- for example so they see the same files from their consoles as they do from their web apps, and so all of the worker processes that make up their web apps can see all of their files -- and in part because we need to keep everything properly backed up to allow for hardware failures and human error.
The PythonAnywhere cluster is made up of a number of different server types. The most important for this post are execution servers, file servers, and backup servers.
Execution servers are the servers where users' code runs. There are three kinds: web servers, console servers, and (scheduled) task servers. From the perspective of file storage, they're all the same -- they run our users' code in containers, with each user's files mounted into the containers. They access the users' files from file servers.
File servers are just what you'd expect. All of a given user's files are on the same file server. They're high-capacity servers with large RAID0 SSD arrays (connected using Amazon's EBS). They run NFS to provide the files to the execution servers, and also run a couple of simple services that allow us to manage quotas and the like.
Backup servers are simpler versions of file servers. Each file server has its own backup server, and they have identical amounts of storage. Data that is written to a file server is asynchronously synchronised over to its associated backup server using a service called drbd.
Here's a diagram of what we were doing prior to the recent update:
This architecture has a number of benefits:
- If a file server or one of its disks fails, we have an almost-up-to-date (normally within milliseconds) copy on its associated backup server.
- At the cost of a short window when disks aren't being kept in sync by drbd, we can do point-in-time snapshots of all of the data without adding load to the file server. We just log on to the backup server, use drbd to disconnect it from the file server, then snapshot the disks. Once that's done, we reconnect it. Prior to using a separate backup server for this, our daily backups visibly impacted filesystem performance, which was unacceptable. They were also "smeared" -- that is, because files were being written to while they were being backed up, the files that were backed up first would be from a point in time earlier than the ones that were backed up later.
- If we want to grow the disk capacity on a file server, we can add a new set of disks to it and to its backup server, RAID0 them together for speed, and then add that to the LVM volumes on each side.
- It's even possible to move all of PythonAnywhere from one Amazon datacenter to another, though the procedure for that is complicated enough to be worthy of a separate blog post of its own...
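The snapshot step described above can be sketched as a short runbook. This is illustrative only -- the drbd resource name and EBS volume ID are made up, not our real configuration:

```shell
# On the backup server: pause replication, snapshot, resume.
# "r0" is an illustrative drbd resource name; the volume ID is made up.
drbdadm disconnect r0                        # stop receiving writes from the file server
aws ec2 create-snapshot --volume-id vol-0abc123 \
    --description "daily backup $(date +%F)" # snapshot each EBS volume in the array
drbdadm connect r0                           # reconnect; drbd resyncs the missed writes
```

Because the disks aren't receiving writes between the disconnect and the reconnect, the snapshots are a consistent point-in-time image rather than a "smeared" one.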
As you can see from the diagram, the filesystem we used to use to store user data was
XFS. XFS is a tried-and-tested journaling filesystem,
created by Silicon Graphics in 1993, and is
well-suited to high-capacity storage. We actually started
using it because of a historical accident. In an early prototype of PythonAnywhere,
all users actually mapped to the same Unix user. When we introduced disk quotas
(yes, it was early enough that we didn't even have disk quotas) this was a problem.
At that time, we couldn't see any easy way to change
the situation with Unix users (that changed later) so we needed some kind of quota
system that allowed us to enforce quotas on a per-directory basis, so that (e.g.)
/home/someuser had a quota of 512MB and
/home/otheruser had a quota of 1GB.
But most filesystems that provide quotas only support it on a per-user basis.
XFS, however, has a concept of "project quotas". A project is a set of directories, and each project can have its own independent quota. This was perfect for us, so of the tried-and-tested filesystems, XFS was a great choice.
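For reference, setting up an XFS project quota looks roughly like this. The project name, ID, path, and mount point here are made-up examples, not our actual configuration:

```shell
# Define a project: ID 42, rooted at /home/someuser
# (the filesystem must be mounted with the prjquota option)
echo "42:/home/someuser" >> /etc/projects
echo "someuser:42" >> /etc/projid

# Initialise the project's directory tree, then set a 512MB hard limit
xfs_quota -x -c 'project -s someuser' /home
xfs_quota -x -c 'limit -p bhard=512m someuser' /home
```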
Later on, of course, we worked out how to map each user to a separate Unix user -- so the project quota concept was less useful. But XFS is solid, reliable, and just as fast as, if not faster than, other filesystems, so there was no reason to change.
How things went wrong
A few weeks back, we had an unexpected outage on a core database instance that supports PythonAnywhere. This caused a number of servers to crash (coincidentally due to the code we use to map PythonAnywhere users to Unix users), and we instituted a rolling reboot. This has happened a couple of times before, and has only required execution server reboots. But this time we needed to reboot the file servers as well.
Our normal process for rebooting an execution server is to run sync to synchronise the filesystem (being old Unix hands we run it three times "just to be safe", despite the fact that that hasn't been necessary since sometime in the early '90s) and then to do a rapid reboot by echoing "b" to /proc/sysrq-trigger.
File servers, however, require a more gentle reboot procedure: they have critical data stored on them, and are writing so much to disk that things can change between the last sync and the reboot, so a normal slow reboot command is what's called for.
This time, however, we made a mistake -- we used the execution-server-style hard reboot on the file servers.
There were no obvious ill effects; when everything came back, all filesystems were up and running as normal. No data was lost, and the site was back up and running. So we wiped the sweat from our respective brows, and carried on as normal.
We first noticed that something was going wrong an hour or so later. Some of our users started reporting that instead of seeing their own disk usage and quotas on the "Files" tab in the PythonAnywhere web interface, they were seeing things like "1.1TB used of 1.6TB quota". Basically, they were seeing the disk usage across the storage volumes they were linked to instead of the quota details specific to their accounts.
This had happened in the past: setting up a new project quota on XFS can take some time, especially when a volume has a lot of them (our volumes had tens of thousands), and it was done by a service running on the volume's file server that listened to a beanstalk queue and processed updates one at a time. So sometimes, when there was a backlog, people would not see the correct quota information for a while.
But this time, when we investigated, we discovered tons of errors in the "quota queue listener" service's logs.
It appeared that while XFS had managed to store files correctly across the hard reboots, the project quotas had gone wrong. Essentially, all users now had unquota'd disk space. This was obviously a big problem. We immediately set up some alerts so that we could spot anyone going over quota.
We also disabled quota reporting on the PythonAnywhere "Files" interface, so that people wouldn't be confused -- and, indeed, to make sure that people didn't guess what was up and try to take advantage by using tons of storage, causing problems for other users. We did not make any announcement about what was going on, as the risks were too high. (Indeed, this blog post is the announcement of what happened :-)
So, how to fix it?
Getting the backups back up
In order to get quotas working again, we'd need to run an XFS quota check on the affected filesystems. We'd done this in the past, and we'd found it to be extremely slow. This is odd, because XFS gurus had advised us that it should be pretty quick -- a few minutes at most. But the last time we'd run one it had taken 20 minutes, and that had been with significantly smaller storage volumes. If it scaled linearly, we'd be looking at at least a couple of hours' downtime. And if it was non-linear, it could be even longer.
We needed to get some kind of idea of how long it would take with our current data size. So, we picked a recent backup of 1.6TB worth of RAID0 disks, created fresh volumes for them, attached them to a fresh server, mounted it all, and kicked off the quota check.
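The test setup looked roughly like this; the device names and mount point below are illustrative, and on XFS the quota check itself runs automatically at mount time when the quota information is inconsistent:

```shell
# Reassemble the RAID0 array from the restored snapshot volumes
# (illustrative device names)
mdadm --assemble /dev/md0 /dev/xvdf /dev/xvdg

# Mounting with project quotas enabled triggers the quotacheck
mount -o prjquota /dev/md0 /mnt/restored_data
```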
24 hours later, it still hadn't completed. Additionally, in the machine's syslog there were a bunch of errors and warnings about blocked processes -- the kind of errors and warnings that made us suspect that the process was never going to complete.
This was obviously not a good sign. The backup we were working from pre-dated the erroneous file server reboots. But the process by which we'd originally created it -- remember, we logged on to a backup server, used drbd to disconnect from its file server, did the backup snapshots, then reconnected drbd -- was actually quite similar to what would have happened during the server's hard reboot. Essentially, we had a filesystem where XFS might have been half-way through doing something when it was interrupted by the backup.
This shouldn't have mattered. XFS is a journaling filesystem, which means that it can be (although it generally shouldn't be) interrupted when it's half-way through something, and can pick up the pieces afterwards. This applies both to file storage and to quotas. But perhaps, we wondered, project quotas are different? Or maybe something else was going wrong?
We got in touch with the XFS mailing list, but unfortunately we were unable to explain the problem with the level of detail people needed in order to help us. The important thing we came away with was that what we were doing was not all that unusual, and it should all have been working -- the quotacheck should have completed in a few minutes.
And now for something completely different
At this point, we had multiple parallel streams of investigations ongoing. While one group worked on getting the quotacheck to pass, another was seeing whether another filesystem would work better for us. This team had come to the conclusion that ext4 -- a more widely-used filesystem than XFS -- might be worth a look. XFS is an immensely powerful tool, and (according to Wikipedia) is used by NASA for 300+ terabyte volumes. But, we thought, perhaps the problem is that we're just not expert enough to use it properly. After all, organisations of NASA's size have filesystem experts who can spend lots of time keeping that scale of system up and running. We're a small team, with smaller requirements, and need a simpler filesystem that "just works". On this theory, we thought that perhaps due to our lack of knowledge, we'd been misusing XFS in some subtle way, and that was the cause of our woes. ext4, being the standard filesystem for most current Linux distros, seemed to be more idiot-proof. And, perhaps importantly, now that we no longer needed XFS's project quotas (because PythonAnywhere users were now separate Unix users), it could also support enough quota management for our needs.
So we created a server with 1.6TB of ext4 storage, and kicked off an rsync to copy the data from another copy of the 1.6TB XFS backup the quotacheck team were using over to it, so that we could run some tests. We left that rsync running overnight.
When we came in the next morning, we saw something scary. The rsync had failed halfway through with IO errors. The backup we were working from was broken. Most of the files were OK, but some of them simply could not be read.
This was definitely something we didn't want to see. With further investigation, we discovered that our backups were generally usable, but in each one, some files were corrupted. Clearly our past backup tests (because, of course, we do test our backups regularly :-) had not been sufficient.
And clearly the combination of our XFS setup and drbd wasn't working the way we thought it did.
We immediately went back to the live system and changed the backup procedure.
We started rolling "eternal rsync" processes -- we attached extra
(ext4) storage to each file server, matching the existing capacity, and ran looped scripts
that used rsync (at the lowest-priority
ionice level) to make sure that all user
data was backed up there.
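The "eternal rsync" loop itself was simple; something along these lines, where the paths are illustrative rather than our real mount points:

```shell
#!/bin/bash
# Loop forever, re-syncing user data to the ext4 backup disks
# at idle IO priority (ionice class 3) so live traffic is unaffected
while true; do
    ionice -c 3 rsync -raXAS --delete /mnt/user_data/ /mnt/ext4_backup/
    sleep 60
done
```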
We made sure that we weren't adversely affecting
filesystem performance by checking out an enormous git repo into one of our own
PythonAnywhere home directories, and running
git status (which reads a bunch of files)
regularly, and timing it.
Once the first eternal rsyncs had completed, we were 100% confident that we really did have everyone's data safe. We then changed the backup process to be:
- Interrupt the rsync.
- Make sure the ext4 disks were not being accessed.
- Back up the ext4 disks.
- Kick off the rsync again.
This meant that we could be sure that the backups were recoverable, as they came from a filesystem that was not being written to while they happened. This time we tested them with an rsync from disk to disk, just to be sure that every file was OK.
We then copied the data from one of the new-style backups, that had come from an ext4 filesystem, over to a new XFS filesystem. We attached the XFS filesystem to a test server, set up the quotas, set some processes to reading from and writing to it, then did a hard reboot on the server. When it came back, it mounted the XFS filesystem, but quotas were disabled. Running a quotacheck on the filesystem crashed.
Further experiments showed that this was a general problem with pretty much any project-quota'ed XFS filesystem we could create; in our tests, a hard reboot caused a quotacheck when the filesystem was remounted, and this would frequently take a very long time, or even crash -- leaving the disk only mountable with no quotas.
We tried running a similar experiment using ext4; when the server came back after a hard reboot, it took a couple of minutes checking quotas, and there were a few harmless-seeming warnings in syslog. But the volumes mounted OK, and quotas were active.
Over to ext4
By this time we'd persuaded ourselves that moving to ext4 was the way forward for dimwits like us. So the question was, how to do it?
The first step was obviously to change our quota-management and system configuration code so that it used ext4's commands instead of XFS's. One benefit of doing this was that we were able to remove a bunch of database dependencies from the file server code. This meant that:
- A future database outage like the one that triggered all of this work wouldn't cause file server outages, so we'd be less likely to make the mistake of hard-rebooting one of them.
- Our file server-database dependency was one of the main blockers that had been stopping us from moving to a model where we can deploy new versions of PythonAnywhere without downtime. (We're currently actively working on eliminating the remaining blockers.)
It's worth saying that the database dependency wasn't due to XFS; we were just able to eliminate it at this point because we were changing all of that code anyway.
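ext4's per-user quotas use the standard Linux quota toolchain, so the switch-over looked roughly like this. The device name, mount point, username, and limits below are illustrative, not our exact setup:

```shell
# /etc/fstab: mount the ext4 volume with user quotas enabled, e.g.
#   /dev/md0  /mnt/user_data  ext4  defaults,usrquota  0  2
mount -o remount /mnt/user_data

quotacheck -cu /mnt/user_data   # build the aquota.user quota file
quotaon /mnt/user_data          # start enforcing quotas

# 512MB soft/hard block limits for one user (setquota takes limits in 1KB blocks)
setquota -u someuser 524288 524288 0 0 /mnt/user_data
```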
Once we'd made the changes and run it through our continuous integration environment a few times to work out the kinks, we needed to deploy it. This was trickier.
What we needed to do was:
- Start a new PythonAnywhere cluster, with no file storage attached to the file servers.
- Shut down all filesystem access on the old PythonAnywhere to make sure that the files were stable.
- Copy all of the data from all XFS filesystems to matching ext4 filesystems
- Move the ext4 filesystems over to the new cluster.
- Activate the new cluster.
Parallelise rsync for great good
The "copy" phase was the problem. The initial run of our eternal rsync processes made it clear that copying all of the data from a 1.6TB XFS volume (our standard volume size) to an equivalent ext4 one took 26 hours. A 26-hour outage would be completely unacceptable.
However, the fact that we were already running eternal rsync processes opened up some other options. The first sync took 26 hours, but each additional one took 6 hours -- that is, it took 26 hours to copy all of the data, then after that it took 6 hours to check for any changes on the XFS volume that had happened while the original copy was running, and to copy those changes across to the ext4 one. And then it took 6 hours to do that again.
We could use our eternal-rsync target ext4 disks as the new disks for the new cluster, and just sync across the changes.
But that would still leave us with a 6+ hour outage -- 6 hours for the copy, and then extra time for moving disks around and so on. Better, but still not good enough.
Now, the eternal rsync processes were running at the lowest-priority ionice setting, so as not to disrupt filesystem access on the live system. So we tested how long it would take to run the rsync with the opposite, resource-gobbling niceness settings. To our surprise, it didn't change things much; an rsync of 6 hours' worth of changes from an XFS volume to an ext4 one took about five and a half hours.
We obviously needed to think outside the box. We looked at what was happening while we ran one
of these rsyncs, in
iotop, and noticed that we were nowhere near maxing out our CPU
or our disk IO... which made us think, what happens if we do things in parallel?
At this point, it might be worth sharing some (slightly simplified) code:
```shell
#!/bin/bash
# Parameter $1 is the number of rsyncs to run in parallel
cd /mnt/old_xfs_volume/
ls -d * | xargs -n 1 -P $1 ~/rsync-one.sh
```
```shell
#!/bin/bash
mkdir -p /mnt/new_ext4_volume/"$1"
rsync -raXAS --delete /mnt/old_xfs_volume/"$1" /mnt/new_ext4_volume/
```
For some reason our notes don't capture, on our first test we went a bit crazy and ran a huge number of parallel rsyncs, for a total of about 2,000 processes.
It was way better. The copy completed in about 90 minutes. So we experimented. After many, many tests, we found that the sweet spot was about 30 parallel rsyncs, which took on average about an hour and ten minutes.
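The effect of xargs -P is easy to demonstrate in miniature. This self-contained toy uses cp in place of rsync and temporary directories in place of the real volumes:

```shell
#!/bin/bash
# Toy demonstration of parallel per-directory copies with xargs -P
set -e
src=$(mktemp -d)
dst=$(mktemp -d)

# Create five directories, each containing one file
mkdir -p "$src"/dir{1..5}
for d in "$src"/dir*; do echo hello > "$d/file.txt"; done

# Copy each top-level directory in a separate process, 4 at a time
ls "$src" | xargs -n 1 -P 4 -I{} cp -r "$src/{}" "$dst/"

ls "$dst" | sort   # all five directories arrive at the destination
```

Each name emitted by ls becomes one cp invocation, and -P 4 keeps up to four of them running concurrently -- the same shape as our 30-way rsync, just scaled down.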
We believed that the copy would take about 70 minutes. Given that this deployment was going to require significantly more manual running of scripts and so on than a normal one, we figured that we'd need 50 minutes for the other tasks, so we were up from our normal 20-30 minutes of downtime for a release to two hours. Which was high, but just about acceptable.
The slowest time of day across all of the sites we host is between 4am and 8am UTC, so we decided to go live at 5am, giving us 3 hours just in case things went wrong. On 17 March, we had an all-hands-on deck go-live with the new code. And while there were a couple of scary moments, everything went pretty smoothly -- in particular, the big copy took 75 minutes, almost exactly what we'd expected.
So as of 17 March, we've been running on ext4.
Since we went live, we've run two main tests.
First, and most importantly, we've tested our backups much more thoroughly than before. We've gone back to the old backup technique -- on the backup server, shut down the drbd connection, snapshot the disks, and restart drbd -- but now we're using ext4 as the filesystem. And we've confirmed that our new backups can be re-mounted, they have working quotas, and we can rsync all of their data over to fresh disks without errors. So that's reassuring.
Secondly, we've taken the old XFS volumes and tried to recover the quotas. It doesn't work. The data is all there, and can be rsynced to fresh volumes without IO errors (which means that at no time was anyone's data at risk). But the project quotas are irrecoverable.
We've also (before we went live with ext4, but after we'd committed to it) discovered that there was a bug in XFS -- fixed in Linux kernels since 3.17, but we're on Ubuntu Trusty, which uses 3.13. It is probably related to the problem we're seeing, but certainly doesn't explain it totally -- it explains why a quotacheck ran when we re-mounted the volumes, but doesn't explain why it never completed, or why we were never able to re-mount the volumes with quotas enabled.
Either way, we're on ext4 now. Naturally, we're 100% sure it won't have any problems whatsoever and everything will be just fine from now on ;-)