Detecting leaked Activities in Android

So War Worlds has, for quite some time, been plagued with various memory issues. Part of the problem is that Android in general is pretty stingy with the heap memory it gives to programs, but the other part is that it's actually very easy to "leak" large chunks of memory by accidentally leaving references to Activity objects in static variables.

There's not much that can be done about the former, but the latter can be fairly easily debugged using a couple of tools provided by Android and Java.

Detecting leaked Activities

The first step is figuring out when Activities are actually leaked. In API Level 11 (Honeycomb), Android added the detectActivityLeaks() method to the StrictMode.VmPolicy.Builder class. All this method does is cause Android to keep an internal count of the number of instances of each Activity class that exist on the heap, and let you know whenever it hits 2.

When you get a StrictMode violation, there are a few things you can do: you can have Android kill your process, you can log a message, or you can do both. To be honest, none of those is particularly helpful, because when you've got a leak what you really want is a dump of the heap so that you can analyse it later.

Luckily for us, the StrictMode class writes a message to System.err just before it kills the process, so we can use this little hack to hook in right before that happens and get an HPROF dump just in time:

private static void enableStrictMode() {
  try {
    StrictMode.setVmPolicy(new StrictMode.VmPolicy.Builder()
        .detectActivityLeaks().detectLeakedClosableObjects()
        .detectLeakedRegistrationObjects().detectLeakedSqlLiteObjects()
        .penaltyLog().penaltyDeath().build());

    // Replace System.err with one that'll monitor for StrictMode
    // killing us and perform a hprof heap dump just before it does.
    System.setErr(new PrintStreamThatDumpsHprofWhenStrictModeKillsUs(
        System.err));
  } catch (Exception e) {
    // ignore errors
  }
}

private static class PrintStreamThatDumpsHprofWhenStrictModeKillsUs
    extends PrintStream {
  public PrintStreamThatDumpsHprofWhenStrictModeKillsUs(OutputStream outs) {
    super(outs);
  }

  @Override
  public synchronized void println(String str) {
    super.println(str);
    if (str.startsWith("StrictMode VmPolicy violation with POLICY_DEATH;")) {
      // StrictMode is about to terminate us... do a heap dump!
      try {
        File dir = Environment.getExternalStorageDirectory();
        File file = new File(dir, "strictmode-violation.hprof");
        super.println("Dumping HPROF to: " + file);
        Debug.dumpHprofData(file.getAbsolutePath());
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}

Essentially, what we do is set up StrictMode to kill us, then replace System.err with a hacky subclass of PrintStream that checks for the message StrictMode prints and generates a heap dump right then and there.

It's ugly, but we only have to do it in debug mode anyway, so I think it's OK.

Analysing the dump

Next, we have to copy the .hprof file off the device (you could use adb pull for that, but I like ES File Explorer). One important step is converting the .hprof file from Dalvik format to normal Java format. There's a handy-dandy tool called hprof-conv in the Android SDK which does the conversion for you.
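
If you'd rather do the whole thing from the command line, something like this works (the paths are just examples; adjust them to wherever your device wrote the dump and wherever your SDK lives):

# Copy the dump off the device...
adb pull /sdcard/strictmode-violation.hprof .

# ...then convert it from the Dalvik format to the standard Java format.
hprof-conv strictmode-violation.hprof strictmode-violation-converted.hprof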

Once you've got an HPROF file in a format that can be read by the standard Java tools, the next step is actually analysing it. I was using the Eclipse Memory Analyzer tool, but anything will do, really. The first thing we need to do is find out which objects are leaking. Luckily for us, StrictMode prints out which activity is leaking, so we can start there:

01-15 17:24:23.248: E/StrictMode(13867): class au.com.codeka.warworlds.game.EmpireActivity; instances=2; limit=1
01-15 17:24:23.248: E/StrictMode(13867): android.os.StrictMode$InstanceCountViolation: class au.com.codeka.warworlds.game.EmpireActivity; instances=2; limit=1

So it's detected two copies of the EmpireActivity class. We fire up the Memory Analysis tool in Eclipse, and go to "List objects ... with incoming references" and enter our offending class, "au.com.codeka.warworlds.game.EmpireActivity". And just as StrictMode said, there's our two instances:

Now, how do we determine where they're being held? Right-click one of them and select "Path to GC Roots ... with all references", which will bring up something like this:

If we go with the assumption that the leak happens in our code and not in the framework (usually a pretty good assumption), the only object holding a reference to this activity is our FleetList class. If we expand that out a few times, we can follow the references all the way back to the mSpriteGeneratedListeners member of StarImageManager.

The StarImageManager class generates (or loads from disk) the bitmaps used to display stars. It keeps a list of objects that are interested in knowing when the star images are updated (generating them can take a while). What was happening here is that the FleetListAdapter class was adding itself to the list of classes interested in receiving updates from StarImageManager, but never removing itself from that list.
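
The fix itself is the usual matched pair of add/remove calls. The listener and method names below are made up for illustration (I'm not reproducing the real StarImageManager API), but the shape of it is something like this:

// Sketch only: the listener type and method names are illustrative, not the real API.
public class FleetListAdapter {
  private final StarImageManager.SpriteGeneratedListener mSpriteListener =
      new StarImageManager.SpriteGeneratedListener() {
        @Override
        public void onSpriteGenerated() {
          notifyDataSetChanged(); // refresh the fleet list with the new image
        }
      };

  public void onStart() {
    // Register for star image updates while we're on screen...
    StarImageManager.getInstance().addSpriteGeneratedListener(mSpriteListener);
  }

  public void onStop() {
    // ...and, crucially, deregister again, so the manager's list doesn't hold
    // a reference chain back to the Activity forever.
    StarImageManager.getInstance().removeSpriteGeneratedListener(mSpriteListener);
  }
}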

By fixing that bug, I managed to squash one of the biggest memory leaks in the game.

Further work

One thing this has highlighted to me is that my system of highly manual event pub/sub is probably not a great idea. Scattered all throughout the code are lots of these "lists of classes who want to subscribe to my event". Management of those lists is all manual, firing the events is manual, and it's actually been a bit of a source of problems in the past (e.g. with events firing on random threads and whatnot).

So the next big chunk of work will be replacing all of these with some kind of unified EventBus pattern. I quite like the implementation of code.google.com/p/simpleeventbus; in particular, I like how it uses WeakReferences to hold the event subscribers (which would make the above memory leak non-existent).
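
To give a feel for the idea, here's a minimal sketch of a weakly-referenced subscriber list (my own sketch, not that library's actual API):

import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Subscribers are held weakly, so a forgotten listener can't pin an Activity. */
public class SimpleEventBus<E> {
  public interface EventHandler<T> {
    void onEvent(T event);
  }

  private final List<WeakReference<EventHandler<E>>> handlers =
      new ArrayList<WeakReference<EventHandler<E>>>();

  public synchronized void register(EventHandler<E> handler) {
    handlers.add(new WeakReference<EventHandler<E>>(handler));
  }

  public synchronized void publish(E event) {
    Iterator<WeakReference<EventHandler<E>>> it = handlers.iterator();
    while (it.hasNext()) {
      EventHandler<E> handler = it.next().get();
      if (handler == null) {
        it.remove(); // the subscriber has been garbage collected; drop the stale entry
      } else {
        handler.onEvent(event);
      }
    }
  }
}

The one gotcha with weakly-held subscribers is that something still has to keep a strong reference to the handler for as long as you care about the events; otherwise it can be collected, and silently stop receiving events, earlier than you'd like.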

Unicode support in MySQL is ... 👎

For the last few days, I've been getting some strange error reports from the War Worlds server. Messages like this:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xB8. ...' for column 'message' at row 1
   at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4120)
   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4052)
   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2503)
   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2664)
   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2815)
   at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
   at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2458)
   at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2375)
   at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2359)
   at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:203)
   at au.com.codeka.warworlds.server.data.SqlStmt.update(SqlStmt.java:117)
   at au.com.codeka.warworlds.server.ctrl.ChatController.postMessage(ChatController.java:120)
   . . .

Now, that string which MySQL complains is "incorrect" is actually the Unicode codepoint U+1F638 GRINNING CAT FACE WITH SMILING EYES, aka 😸 -- a perfectly valid Emoji character. Why was MySQL rejecting it? All my columns are defined to accept UTF-8, so there should not be a problem, right?

When is UTF-8 not UTF-8?

When it's used in MySQL, apparently.

For reasons that completely escape me, MySQL 5.x limits "UTF-8" strings to U+FFFF and smaller -- that is, to the Basic Multilingual Plane (BMP). Why they call this encoding "UTF-8" is beyond me; it most definitely is not UTF-8.
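
That cat emoji sits above U+FFFF, so in real UTF-8 it takes four bytes -- which is exactly what the \xF0\x9F\x98\xB8 in the error message is. A few lines of Java will confirm it:

import java.nio.charset.StandardCharsets;

public class EmojiBytes {
  public static void main(String[] args) {
    // U+1F638 GRINNING CAT FACE WITH SMILING EYES, as a UTF-16 surrogate pair.
    String cat = "\uD83D\uDE38";
    for (byte b : cat.getBytes(StandardCharsets.UTF_8)) {
      System.out.printf("%02X ", b & 0xFF);
    }
    // Prints: F0 9F 98 B8 -- four bytes, one more than MySQL's "utf8" will store.
    System.out.println();
  }
}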

The trick, apparently, is to use a slightly different encoding which MySQL calls "utf8mb4" which supports up to 4-byte UTF-8 characters.

So the "fix" was simple, just run:

ALTER TABLE chat_messages
   MODIFY message TEXT CHARACTER SET utf8mb4
                       COLLATE utf8mb4_unicode_ci NOT NULL;

And so on, on basically every column in the database which could possibly include characters outside the BMP. But that's not enough! You also need to tell the server to use "utf8mb4" internally as well, by including the following line in your my.cnf:

[mysqld]
character-set-server = utf8mb4
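
After restarting the server it's worth sanity-checking that the change took; the character_set_server variable (among others) should now report utf8mb4:

mysql> SHOW VARIABLES LIKE 'character_set_%';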

Now, presumably there is some drawback to doing this, otherwise "utf8mb4" would be the default (right?), but I'll be damned if I can figure out what the drawback is. I guess we'll just monitor things and see where it takes us. But as of now, War Worlds supports Emoji emoticons in chat messages, yay!

Addendum

If you're just seeing squares for the emoji characters in this post, you'll need a font that supports the various Unicode emoji blocks. I've used the Symbola font (which you can get here) with good results.

Why you should use a reputable DNS registrar

I've had a bit of a crazy weekend this weekend. It all started while I was checking out some of my websites and I discovered their DNS was not resolving! That's always scary, and after a few minutes I realised that the DNS servers were unreachable from some (but not all) networks. For example, this was a traceroute to one of the DNS servers from my home computer:

$ traceroute 103.30.213.5
traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets
 1  home.gateway.home.gateway (192.168.1.254)  0.745 ms  1.041 ms  1.334 ms
 2  lns20.syd7.on.ii.net (150.101.199.219)  18.276 ms  19.705 ms  20.402 ms
 3  te3-1-120.cor2.syd6.on.ii.net (150.101.199.241)  22.902 ms  23.762 ms  24.736 ms
 4  ae5.br1.syd7.on.ii.net (150.101.33.50)  138.154 ms  139.785 ms ae0.cr1.syd4.on.ii.net (150.101.33.16)  29.454 ms
 5  ae5.br1.syd4.on.ii.net (150.101.33.48)  30.446 ms ae0.br1.syd4.on.ii.net (150.101.33.14)  36.895 ms ae5.br1.syd4.on.ii.net (150.101.33.48)  37.306 ms
 6  te0-2-0.bdr1.hkg2.on.ii.net (150.101.33.199)  145.732 ms  139.403 ms  140.845 ms
 7  hostvirtual-RGE.hkix.net (202.40.160.179)  159.664 ms  132.746 ms  133.402 ms
 8  vhk.vr.org (208.111.42.5)  134.369 ms  135.393 ms  136.574 ms
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
. . .

It was getting stuck at vhk.vr.org, which seems to be some transit provider or other. But from other networks it was OK, here's the traceroute from my Google Compute Engine instance:

$ traceroute 103.30.213.5
traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets
 1  216.239.46.192 (216.239.46.192)  1.168 ms 216.239.43.216 (216.239.43.216)  1.429 ms 216.239.46.192 (216.239.46.192)  1.105 ms
 2  216.239.46.192 (216.239.46.192)  1.383 ms  1.072 ms 216.239.43.218 (216.239.43.218)  1.335 ms
 3  216.239.46.190 (216.239.46.190)  1.477 ms 216.239.43.218 (216.239.43.218)  1.367 ms 216.239.46.192 (216.239.46.192)  1.287 ms
 4  216.239.43.216 (216.239.43.216)  1.656 ms 216.239.46.190 (216.239.46.190)  1.434 ms 216.239.43.218 (216.239.43.218)  1.914 ms
 5  216.239.43.216 (216.239.43.216)  1.910 ms  1.886 ms  1.868 ms
 6  216.239.43.218 (216.239.43.218)  1.841 ms  1.257 ms 216.239.46.190 (216.239.46.190)  1.470 ms
 7  216.239.46.192 (216.239.46.192)  1.196 ms 216.239.43.218 (216.239.43.218)  1.289 ms 216.239.43.216 (216.239.43.216)  1.496 ms
 8  216.239.43.218 (216.239.43.218)  1.643 ms 216.239.43.216 (216.239.43.216)  1.457 ms 216.239.43.218 (216.239.43.218)  1.586 ms
 9  209.85.248.215 (209.85.248.215)  11.936 ms 72.14.232.140 (72.14.232.140)  11.761 ms 209.85.248.229 (209.85.248.229)  14.151 ms
10  209.85.254.239 (209.85.254.239)  15.746 ms 72.14.237.132 (72.14.237.132)  11.702 ms 72.14.237.131 (72.14.237.131)  14.216 ms
11  209.85.255.133 (209.85.255.133)  14.355 ms 209.85.255.27 (209.85.255.27)  14.187 ms  13.759 ms
12  * * *
13  db-transit.Gi9-12.br01.rst01.pccwbtn.net (63.218.125.26)  39.572 ms  37.646 ms  39.849 ms
14  viad-vc.as36236.net (209.177.157.8)  39.499 ms  37.372 ms  39.243 ms
15  pdns1.terrificdns.com (103.30.213.5)  40.454 ms  39.940 ms  41.335 ms

So that seems kind of scary! I quickly composed an email to the support address with what I'd seen. An hour later and no response. At this point I was worried, because I had no idea how many people were unable to contact my websites. I tried logging in to the management console, but that was also a no-go: the DNS for the management console was hosted on the same DNS servers that were not responding! I managed to log in by hard-coding the IP address (which I got from my GCE server that was able to connect to the DNS servers) in my /etc/hosts file.

Just trying to get something up & running again, I exported all my DNS records and imported them into a friend's DNS server. Then I was able to change the nameservers configured in the management console to point to my friend's DNS server. For now, we were back up again. But a few hours later and still no response from my support request.

Who are NameTerrific?

I first heard about NameTerrific in this post on Hacker News. The website was well done and the interface was easy to use, so I decided to give it a go with some of my more throwaway domains. I had a minor issue early on and the support was quite good, so over the next year or so I started to move a couple more domains over.

Then, I stopped thinking about it. It worked well for the next year or so, and DNS tends to be one of those things you don't really think about. Until it stops working.

It was probably my fault. I didn't put much effort into researching NameTerrific's founder, Ryan Zhou. He seems to be a serial entrepreneur who dropped out of school to pursue his dream. That's all well and good, but when you're hosting a service for people, it doesn't do much for your reputation as a reliable investment when you abandon your previous business for who-knows-what-reason.

What do I think happened?

I think he discovered his website had bugs and people's domains were being transferred to the wrong people. Why do I suspect that? Because it's happening to me right now. Check this out. I go into my control panel for one of my domains, war-worlds.com:

I click on "Transfer Away", it prompts me to confirm, I click OK and I receive the following email:

It's an authorization code for a domain I don't even own. What's worse, if I do a whois on mitchortenburg.com, I find myself listed in the contact information! I seem to be the owner of a domain I never purchased (and, to be honest, don't really want) because of some bizarre mixup with the management website.

Even worse still: I have no way to generate an auth code for war-worlds.com (the domain I do care about) and I'm terrified that some other customer of NameTerrific's is somehow able to do what I've managed to do and gain ownership of my domain!

I'm not the only one having problems. Their Facebook page is full of people who have also apparently realised they've been abandoned:

This is not how you run a business. If you find bugs in your management console, you don't abandon your business and leave your customers in the lurch.

Now it seems Ryan has at least learnt from this failure not to let people post on his business's Facebook wall. His "CoinJar" business has disabled wall posts entirely.

What are my options?

So I currently have a ticket open with eNom (NameTerrific were an eNom reseller) and I hope I can get control of my domain back again. And if Mitch Ortenburg is reading this, I'm also quite happy to return your domain to you, if I only knew how to contact you...

And of course, I have already transferred every domain I can out of NameTerrific and into a reputable registrar. Lesson learnt!

I don't think Samsung Mob!lers threatened to strand bloggers in Berlin

So I was reading this post on Hacker News wherein some bloggers were apparently threatened with cancelled flights home after they refused to man Samsung booths at a trade show.

Now, to me, this seemed rather fishy. Why would Samsung do something so obviously malicious? I had a look at the Samsung Mob!lers website, and it's pretty clear how the site works. You write articles on topics Samsung wants you to write about (there's one up there right now asking bloggers "to showcase the extent of its [S-Voice] functionality in a fun, adventurous and creative way.") In return for writing blog posts on behalf of Samsung, you earn "points" which you can then redeem for "rewards".

Here's the page that lists a trip to the IFA consumer show as a "reward" in the Mob!lers program. So let's be clear here: the trip to the IFA consumer show was a reward for writing blog posts praising Samsung products. This is not a case of Samsung inviting bloggers to trade shows to provide coverage, it's a prize.

The TNW article goes to great pains to point out that sending bloggers to trade shows is normal fare for large companies. But that's a red herring -- these are not bloggers invited to the event to provide coverage for Samsung. They were invited to the event as a reward for writing blog posts about Samsung products. The article also goes to pains to point out that programs like Samsung's Mob!lers are common. But it doesn't go into any details about what the program entails (except to say that the Mob!lers program expects you to become a "shill" for the company).

So here's what I think happened: I think these bloggers earned their points on Samsung's Mob!lers website. In return, Samsung rewarded them with a free trip to Berlin to the IFA trade show. My guess is that Samsung made it pretty clear they would be manning Samsung booths at the show (after all, those were the "warning signs" the article talks about: being fitted for uniforms and such). The bloggers probably told Samsung that they wanted to review other products while they were there, and Samsung said that's fine.

Now we get to the "smoking gun" email that TNW posts. In light of the above, all the email says is that they refused to go to the Samsung event and just stayed in their hotel, and now Samsung wants to send them home. It doesn't say they weren't told they would have to man the booths. It doesn't say that flights had been cancelled. Just that they want to send them home early.

And if they're not doing what they were asked to do, then why wouldn't you want to send them home early? All we have on the "cancel flights" and "stranding" is the word of the bloggers who clearly weren't interested in doing their part.

Debugging an issue with high CPU load

For the last few days, one of the websites I work on was having a strange issue. Every now and then, the whole server would grind to a halt and stop serving traffic. This would last for a couple of minutes, then suddenly everything would "magically" pick up and we'd be back to normal. For a while...

Simple website monitoring

The first step in figuring out what was going on was coming up with a way to be notified when the issue was occurring. It's hard to debug these kinds of issues after the fact. So I wanted to set up something that would ping the server and email me whenever it was taking too long to respond. I knocked up the following script (I might've inserted backslashes at invalid locations, sorry; that's just to ease reading on the blog):

#!/bin/bash

TIME=$( { /usr/bin/time -f %e wget -q -O /dev/null \
          http://www.example.com/; } 2>&1 )
TOOSLOW=$(awk "BEGIN{ print ($TIME>2.5) }")

if [ $TOOSLOW -eq 1 ]; then
  echo "The time for this request, $TIME, was greater than 2.5 seconds!" \
    | mail -s "Server ping ($TIME sec)" "me@me.com"
fi

I set this up as a cron job on my media centre PC (high-tech, I know) to run every 5 minutes. It would email me whenever the website took longer than 2.5 seconds to respond (a "normal" response time is < 0.5 seconds, so I figured 5 times longer was enough).
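
The crontab entry for that is about as simple as it gets (the path is just wherever you happened to save the script; this one's made up):

# Ping the website every 5 minutes and email me if it's slow.
*/5 * * * * /home/dean/bin/check-website.sh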

It didn't take long for the emails to start coming through. Then it was a matter of jumping on the server and trying to figure out what the problem was.

First steps

Once the problem was happening, there's a couple of "obvious" first things to try out. The first thing I always do is run top and see what's happening:

top - 08:51:03 up 73 days,  7:45,  1 user,  load average: 69.00, 58.31, 46.89
Tasks: 316 total,   2 running, 314 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.0%us,  1.3%sy,  0.0%ni, 15.2%id, 72.0%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   8299364k total,  7998520k used,   300844k free,    15480k buffers
Swap: 16779884k total,     4788k used, 16775096k free,  6547860k cached

Check out that load! 69.00 in the last minute, that's massive! Also of concern is the 72.0% next to "wa", which means 72% of the CPU time was spent in uninterruptible wait. There aren't many things that run in uninterruptible wait (usually kernel threads), and it's usually some I/O sort of thing. So let's see what iotop (which is like top but for I/O) reports:

Total DISK READ: 77.37 K/s | Total DISK WRITE: 15.81 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
25647 be/4 apache     73.50 K/s    0.00 B/s  0.00 % 99.99 % httpd
24387 be/4 root        0.00 B/s    0.00 B/s 99.99 % 99.99 % [pdflush]
23813 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [pdflush]
25094 be/4 root        0.00 B/s    0.00 B/s 96.72 % 99.99 % [pdflush]
25093 be/4 root        0.00 B/s    0.00 B/s 99.99 % 99.99 % [pdflush]
25095 be/4 root        0.00 B/s    0.00 B/s 99.99 % 99.99 % [pdflush]
25091 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [pdflush]
24389 be/4 root        0.00 B/s    0.00 B/s 99.99 % 99.99 % [pdflush]
24563 be/4 root        0.00 B/s    0.00 B/s 99.99 % 99.99 % [pdflush]
24390 be/4 apache      0.00 B/s   23.21 K/s 96.71 % 99.99 % httpd
24148 be/4 apache      0.00 B/s    0.00 B/s 96.71 % 99.99 % httpd
24699 be/4 apache      0.00 B/s    0.00 B/s 99.99 % 99.99 % httpd
23973 be/4 apache      0.00 B/s    0.00 B/s 99.99 % 99.99 % httpd
24270 be/4 apache      0.00 B/s    0.00 B/s 99.99 % 99.99 % httpd
24298 be/4 apache      0.00 B/s 1918.82 K/s 96.71 % 99.02 % httpd
  628 be/3 root        0.00 B/s    0.00 B/s  0.00 % 97.51 % [kjournald]
25092 be/4 root        0.00 B/s    0.00 B/s  0.00 % 96.72 % [pdflush]
24258 be/4 root        0.00 B/s    0.00 B/s 99.99 % 96.71 % [pdflush]
23814 be/4 root        0.00 B/s    0.00 B/s  0.00 % 96.71 % [pdflush]
24388 be/4 root        0.00 B/s    0.00 B/s 99.02 % 96.71 % [pdflush]
25545 be/4 apache      0.00 B/s    0.00 B/s  0.19 % 92.73 % httpd
25274 be/4 apache      0.00 B/s    0.00 B/s  0.00 % 92.38 % httpd
24801 be/4 apache      0.00 B/s    5.84 M/s 99.99 % 91.63 % httpd
25281 be/4 apache      0.00 B/s    5.75 M/s  0.00 % 91.33 % httpd
26115 be/4 apache      0.00 B/s    0.00 B/s  9.60 % 19.26 % httpd
25561 be/4 apache      0.00 B/s    3.87 K/s  0.00 %  9.66 % httpd
26035 be/4 apache      0.00 B/s    0.00 B/s  0.00 %  9.63 % httpd

So all those pdflush threads looked suspicious to me. pdflush is a kernel thread that's responsible for writing dirty pages of memory out to disk in order to free up memory.

It was at this point that I started suspecting some kind of hardware failure. Checking the output of sar -d 5 0, I could see this:

Linux 2.6.18-308.1.1.el5PAE (XXX)  23/08/12

08:55:45          DEV       tps  ...     await     svctm     %util
08:55:50       dev8-0    877.25  ...    179.28      1.14     99.84
08:55:50       dev8-1      0.00  ...      0.00      0.00      0.00
08:55:50       dev8-2      0.00  ...      0.00      0.00      0.00
08:55:50       dev8-3    877.25  ...    179.28      1.14     99.84

Check out that utilization column! 99.84% is really bad (more than 70% or so is when you'd start to have problems).

I was at a bit of a loss because I'm not too familiar with the hardware that's running this server (it's not mine), but I knew the disks were in a hardware RAID and smartctl wasn't being helpful at all, so I posted a question on Server Fault. At this point I was thinking it was a hardware problem, but I wasn't sure where to go from there.

My first hint

My first hint was a comment by Mark Wagner:

Apache PIDs 24801 and 25281 are doing by far the most I/O: 5.84 M/s and 5.75 M/s, respectively. I use iotop -o to exclude processes not doing I/O

What were those two processes doing? I opened one of them up with strace:

# strace -p 24801
[sudo] password for dean: 
Process 24801 attached - interrupt to quit
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304

It was just spewing those lines out, writing huge amounts of data to file number 26. But what is file number 26? For that, we use the handy-dandy lsof:

# lsof -p 24801
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
httpd   24801 apache  cwd    DIR    8,3     4096         2 /
httpd   24801 apache  rtd    DIR    8,3     4096         2 /
httpd   24801 apache  txt    REG    8,3   319380   6339228 /usr/sbin/httpd
. . .
httpd   24801 apache   26w  0000    8,3        0  41713666 /tmp/magick-XXgvRhzG

So at the end of the output from lsof, we see under the "FD" (for "file descriptor") column, 26w (the "w" means it's open for writing) and the file is actually /tmp/magick-XXgvRhzG.

A quick ls on the /tmp directory, and I'm shocked:

-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXrQahSe
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXTaXatz
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXtf25pe

These files are 1.6 terabytes each! Luckily(?) they're sparse files, so they don't actually take up that much physical space (the disks aren't even that big), but that's definitely not good.
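
You can see the sparseness for yourself with du: the apparent size is the huge number ls reports, while the default output shows the blocks actually allocated on disk:

du -h --apparent-size /tmp/magick-XXrQahSe   # the ~1.6TB "size" of the file
du -h /tmp/magick-XXrQahSe                   # the much smaller space actually allocated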

The last piece of the puzzle was figuring out what images were being worked on to produce those enormous temporary files. I was thinking maybe there was a corrupt .png file or something that was causing ImageMagick to freak out, but after firing up Apache's wonderful mod_status I could immediately see that the problem was my own dumb self: the URL being requested was:

/photos/view/size:320200/4cc41224cae04e52b76041be767f1340-q

Now, if you don't spot it right away, it's the "size" parameter: it says size:320200 when it's supposed to be size:320,200. What my code does if you leave off the "height" is assume you want to display the image at the specified width, but with the same aspect ratio. So it was trying to generate an image that was 320200x200125, rather than 320x200!

The Solution

The solution was, as is often the case, extremely simple once I'd figured out the problem. I just made sure the image resizer never resized an image to be larger than the original (our originals are generally always bigger than what's displayed on the website).
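
The resizing code itself isn't Java, but the logic of the fix looks roughly like this (a sketch with my own naming, not the actual code):

/** Sketch of the fix: never scale an image up beyond its original dimensions. */
public final class ResizeCalculator {
  public static int[] calculateTargetSize(int origWidth, int origHeight,
                                          int requestedWidth, Integer requestedHeight) {
    int width = requestedWidth;
    int height;
    if (requestedHeight == null) {
      // No height supplied: keep the original aspect ratio.
      height = (int) Math.round((double) width * origHeight / origWidth);
    } else {
      height = requestedHeight;
    }

    // The fix: never ask for anything larger than the original, so a bogus
    // "size:320200" request can no longer balloon into a 320200x200125 image.
    width = Math.min(width, origWidth);
    height = Math.min(height, origHeight);
    return new int[] { width, height };
  }
}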

The only remaining question was where was this request coming from? The output of mod_status showed an IP address that belonged to Google, so it must've been the Googlebot crawling the site. But a quick search through the database showed no links to an image with the invalid size:320200 parameter.

At this stage, it's still an open mystery where that link was coming from. The image in question was from an article written in 2010, so it's not something current. In any case, where the link was coming from is of less concern than the fact that it no longer causes the whole server to grind to a halt whenever someone requests it.

So I'm happy to leave that one open for now.