Dean's Blog http://www.codeka.com.au/blog Development blog of Dean Harding. en-us Thu, 27 Feb 2014 12:10:00 GMT Unicode support in MySQL is ... 👎 http://www.codeka.com.au/blog/2014/02/unicode-support-in-mysql-is-- Thu, 27 Feb 2014 12:10:00 GMT http://www.codeka.com.au/blog/2014/02/unicode-support-in-mysql-is-- <p>For the last few days, I&#39;ve been getting some strange error reports from the&nbsp;War Worlds&nbsp;server. Messages like this:</p><pre class="brush: text">java.sql.SQLException: Incorrect string value: &#39;\xF0\x9F\x98\xB8. ...&#39; for column &#39;message&#39; at row 1 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4120) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4052) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2503) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2664) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2815) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2458) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2375) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2359) at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:203) at au.com.codeka.warworlds.server.data.SqlStmt.update(SqlStmt.java:117) at au.com.codeka.warworlds.server.ctrl.ChatController.postMessage(ChatController.java:120) . . .</pre><p>Now, that string which MySQL complains is &quot;incorrect&quot; is actually the Unicode codepoint U+1F638 GRINNING CAT FACE WITH SMILING EYES, aka&nbsp;😸 -- a perfectly valid Emoji character. Why was MySQL rejecting it? 
All my columns are defined to accept UTF-8, so there should not be a problem, right?</p><h2>When is UTF-8 not UTF-8?</h2><p>When it's used in MySQL, apparently.</p><p>For reasons that completely escape me, MySQL 5.x limits UTF-8 strings to U+FFFF and smaller. That is, the "BMP". Why they call this encoding "UTF-8" is beyond me; it most definitely is not UTF-8.</p><p>The trick, apparently, is to use a slightly different encoding, which MySQL calls "utf8mb4", that supports the full four-byte range of UTF-8.</p><p>So the "fix" was simple: just run</p><pre class="brush: sql;">ALTER TABLE chat_messages
  MODIFY message TEXT CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci NOT NULL;</pre><p>and so on, on basically every column in the database which could possibly include characters outside the BMP. But that's not enough! You also need to tell the server to use "utf8mb4" internally as well, by including the following line in your my.cnf:</p><pre class="brush: text;">[mysqld]
character-set-server = utf8mb4</pre><p>Now presumably there is some drawback to doing this, otherwise "utf8mb4" would be the default (right?), but I'll be damned if I can figure out what the drawback is. I guess I'll just monitor things and see where it takes us. But as of now, War Worlds supports Emoji emoticons in chat messages, yay!</p><h2>Addendum</h2><p>If you're just seeing squares for the emoji characters in this post, you'll need a font that supports the various Unicode emoji blocks. 
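</p><p>Incidentally, it's easy to check from code whether a string contains any of these troublesome characters -- anything above U+FFFF needs utf8mb4. Here's a quick Java sketch (my own illustration, not code from the War Worlds server):</p>

```java
// Sketch: detect code points outside the Basic Multilingual Plane -- the
// characters a plain MySQL "utf8" column rejects. Class/method names are
// made up for this example.
public class Utf8Check {
    // True if the string contains any code point above U+FFFF,
    // i.e. it needs a utf8mb4 column to be stored safely.
    public static boolean needsUtf8mb4(String s) {
        return s.codePoints().anyMatch(cp -> cp > 0xFFFF);
    }

    public static void main(String[] args) {
        // U+1F638 GRINNING CAT FACE WITH SMILING EYES as a surrogate pair.
        System.out.println(needsUtf8mb4("hi \uD83D\uDE38")); // true
        System.out.println(needsUtf8mb4("hello"));           // false
    }
}
```

<p>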
I've used the Symbola font (<a href="http://users.teilar.gr/~g1951d/">which you can get here</a>) with good results.</p> Why you should use a reputable DNS registrar http://www.codeka.com.au/blog/2014/02/why-you-should-use-a-reputable-dns-registrar Sun, 16 Feb 2014 10:00:00 GMT http://www.codeka.com.au/blog/2014/02/why-you-should-use-a-reputable-dns-registrar <p>I've had a bit of a crazy weekend. It all started while I was checking out some of my websites and I discovered their DNS was not resolving! That's always scary, and after a few minutes I realised that the DNS servers were unreachable from some (but not all) networks. For example, this was a traceroute to one of the DNS servers from my home computer:</p><pre class="brush: bash;">$ traceroute 103.30.213.5
traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets
 1 home.gateway.home.gateway (192.168.1.254) 0.745 ms 1.041 ms 1.334 ms
 2 lns20.syd7.on.ii.net (150.101.199.219) 18.276 ms 19.705 ms 20.402 ms
 3 te3-1-120.cor2.syd6.on.ii.net (150.101.199.241) 22.902 ms 23.762 ms 24.736 ms
 4 ae5.br1.syd7.on.ii.net (150.101.33.50) 138.154 ms 139.785 ms ae0.cr1.syd4.on.ii.net (150.101.33.16) 29.454 ms
 5 ae5.br1.syd4.on.ii.net (150.101.33.48) 30.446 ms ae0.br1.syd4.on.ii.net (150.101.33.14) 36.895 ms ae5.br1.syd4.on.ii.net (150.101.33.48) 37.306 ms
 6 te0-2-0.bdr1.hkg2.on.ii.net (150.101.33.199) 145.732 ms 139.403 ms 140.845 ms
 7 hostvirtual-RGE.hkix.net (202.40.160.179) 159.664 ms 132.746 ms 133.402 ms
 8 vhk.vr.org (208.111.42.5) 134.369 ms 135.393 ms 136.574 ms
 9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
. . .</pre><p>It was getting stuck at vhk.vr.org, which seems to be some transit provider or other. 
But from other networks it was OK, here&#39;s the traceroute from my Google Compute Engine instance:</p><pre class="brush: bash;">$ traceroute 103.30.213.5 traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets 1 216.239.46.192 (216.239.46.192) 1.168 ms 216.239.43.216 (216.239.43.216) 1.429 ms 216.239.46.192 (216.239.46.192) 1.105 ms 2 216.239.46.192 (216.239.46.192) 1.383 ms 1.072 ms 216.239.43.218 (216.239.43.218) 1.335 ms 3 216.239.46.190 (216.239.46.190) 1.477 ms 216.239.43.218 (216.239.43.218) 1.367 ms 216.239.46.192 (216.239.46.192) 1.287 ms 4 216.239.43.216 (216.239.43.216) 1.656 ms 216.239.46.190 (216.239.46.190) 1.434 ms 216.239.43.218 (216.239.43.218) 1.914 ms 5 216.239.43.216 (216.239.43.216) 1.910 ms 1.886 ms 1.868 ms 6 216.239.43.218 (216.239.43.218) 1.841 ms 1.257 ms 216.239.46.190 (216.239.46.190) 1.470 ms 7 216.239.46.192 (216.239.46.192) 1.196 ms 216.239.43.218 (216.239.43.218) 1.289 ms 216.239.43.216 (216.239.43.216) 1.496 ms 8 216.239.43.218 (216.239.43.218) 1.643 ms 216.239.43.216 (216.239.43.216) 1.457 ms 216.239.43.218 (216.239.43.218) 1.586 ms 9 209.85.248.215 (209.85.248.215) 11.936 ms 72.14.232.140 (72.14.232.140) 11.761 ms 209.85.248.229 (209.85.248.229) 14.151 ms 10 209.85.254.239 (209.85.254.239) 15.746 ms 72.14.237.132 (72.14.237.132) 11.702 ms 72.14.237.131 (72.14.237.131) 14.216 ms 11 209.85.255.133 (209.85.255.133) 14.355 ms 209.85.255.27 (209.85.255.27) 14.187 ms 13.759 ms 12 * * * 13 db-transit.Gi9-12.br01.rst01.pccwbtn.net (63.218.125.26) 39.572 ms 37.646 ms 39.849 ms 14 viad-vc.as36236.net (209.177.157.8) 39.499 ms 37.372 ms 39.243 ms 15 pdns1.terrificdns.com (103.30.213.5) 40.454 ms 39.940 ms 41.335 ms</pre><p>So that seems kind of scary! I quickly composed an email to the support address with what I&#39;d seen. An hour later and no response. At this point I was worried, because I had no idea how many people were unable to contact my websites. 
I tried logging in to the management console, but that was also a no-go: the DNS for the management console was hosted on the same DNS servers that were not responding! I managed to log in by hard-coding the IP address (which I got from my GCE server that was able to connect to the DNS servers) in my /etc/hosts file.</p><p>Just trying to get something up &amp; running again, I exported all my DNS records and imported them into a friend's DNS server. Then I was able to change the nameservers configured in the management console to point to my friend's DNS server. For now, we were back up again. But a few hours later and still no response from my support request.</p><h2>Who are NameTerrific?</h2><p>I first heard about NameTerrific in <a href="https://news.ycombinator.com/item?id=4743455">this post on HackerNews</a>. The website was well done and the interface was easy to use. So I decided to give it a go with some of my more throwaway domains. I had a minor issue early on and the support was quite good, so over the next year or so I moved a couple more domains over.</p><p>Then, I stopped thinking about it. It worked well for the next year or so, and DNS tends to be one of those things you don't really think about. Until it <em>stops</em> working.</p><p>It was probably my fault. I didn't put much effort into researching NameTerrific's founder, <a href="https://www.zhoutong.com/">Ryan Zhou</a>. He seems to be a serial entrepreneur who dropped out of school to pursue his dream. That's all well and good, but when you're hosting a service for people, it doesn't do much for your reputation for reliability when you abandon your previous business for who-knows-what reason.</p><h2>What do I think happened?</h2><p>I think he discovered his website had bugs and people's domains were being transferred to the wrong people. Why do I suspect that? Because it's <em>happening to me right now</em>. 
Check this out. I go into my control panel for one of my domains, war-worlds.com:</p><p style="text-align: center;"><a title="" class="lightbox" href="http://lh6.ggpht.com/irhFZ2RBJcpD-HCIxfyhUSHriCYPQPJOdRAnkfKyBQ-Y2J5I7dC6VqcMk0RIHHN2xpbjiDESigPqNtjJfIDE0gc=s1026"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg2LmdncGh0LmNvbS9pcmhGWjJSQkpjcEQtSENJeGZ5aFVTSHJpQ1lQUVBKT2RSQW5rZkt5QlEtWTJKNUk3ZEM2VnFjTWswUklISE4yeHBiamlERVNpZ1BxTnRqSmZJREUwZ2M9czEwMCIsImJsb2Jfa2V5IjoiQU1JZnY5NHpILV9KMjZLalhfSHB5TVhVdjhHZFpoQnNEMHQ4YjVVZ1BOTC1RdUxhQjhVNFZpSXNUOEpreGpJNlpfSG5BSHp6dUtFZkx1UHdiY1gwWFl4ckxVajJ6SXY3WVZEV2JsZDE4d3RXd1BzSkdRcDVBRXdBaTBHM3dYeVBXaGxDdjBIa2tXajVDZWFuUTNQSWhsX1JQV3lFX3ctc0tBIiwiaGVpZ2h0Ijo2NzEsIndpZHRoIjoxMDI2LCJmaWxlbmFtZSI6IlNjcmVlbnNob3QgZnJvbSAyMDE0LTAyLTE3IDIyOjMzOjI1LnBuZyIsInNpemUiOjYyNzkwfQ==" src="http://lh6.ggpht.com/irhFZ2RBJcpD-HCIxfyhUSHriCYPQPJOdRAnkfKyBQ-Y2J5I7dC6VqcMk0RIHHN2xpbjiDESigPqNtjJfIDE0gc=s600" /></a></p><p>I click on &quot;Transfer Away&quot;, it prompts me to confirm, I click OK and I receive the following email:</p><p style="text-align: center;"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg1LmdncGh0LmNvbS9tUHZxdk1ZVjZQOWoxdVhsaldsOVotM3l6cTN2aUtVeEpaVkFiYW95MWY1bVVHUkF6ckc0QkxYTDN3V2F2U21ZWnNrVXJZeU5EMTM1OEFWN01zalI9czEwMCIsImJsb2Jfa2V5IjoiQU1JZnY5NlNiVERIRkhrMWhhZFBkdVBackMyMExQMHRMQTdUOTRsSFhZOWxrZEdCZm1ZS1NwUER3X0U3bVZMWjdkaWJZWFVydmxaS0NFcDIzdktCOFotU1hUR1JuQ1hsdkhsaFNxZC1aNXJNbkJDYkRhb1FKT1pXY2pmQVp3b0dTZ2drZXVVV3ZXeGRMLTREamxIN0lPV2xEYnNhMzBySUp3IiwiaGVpZ2h0IjozMSwid2lkdGgiOjY0MCwiZmlsZW5hbWUiOiJTY3JlZW5zaG90IGZyb20gMjAxNC0wMi0xNyAyMjo0NToyMi5wbmciLCJzaXplIjo0NjA3fQ==" src="http://lh5.ggpht.com/mPvqvMYV6P9j1uXljWl9Z-3yzq3viKUxJZVAbaoy1f5mUGRAzrG4BLXL3wWavSmYZskUrYyND1358AV7MsjR=s640" /></p><p>It&#39;s an authorization code <em>for a domain I don&#39;t even own</em>. What&#39;s worse, if I do a whois on mitchortenburg.com, I find <em>myself listed in the contact information</em>! 
I seem to be the owner of a domain I never purchased (and, to be honest, don't really want) because of some bizarre mixup with the management website.</p><p>Even worse still: I have no way to generate an auth code for war-worlds.com (the domain I <em>do</em> care about) and I'm terrified that some other customer of NameTerrific's is somehow able to do what I've managed to do and gain ownership of my domain!</p><p>I'm not the only one having problems. Their Facebook page is full of people who have also apparently realized they've been abandoned:</p><p style="text-align: center;"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg0LmdncGh0LmNvbS9Wc0RGVEdCeWpFaFZMckt1NW1nbUVJUEF3bmNlekQ3bVVtZGphazVheGRLREtwY05wa3RsQTltUWxLUXpwcGFoOE9qaV8wRzNnblRSSllWbnBXaFRhdz1zMTAwIiwiYmxvYl9rZXkiOiJBTUlmdjk3aGZFQ2twLTFYWGt2U0tqX052WXRfNWlnYUs5UEtrSTFWVEdORnZPWkJ3V0RrVjFoN3liNFpKRENudmVrSDNzX0RidkhWbkpQaWEwam9WZ3JGRndxMHh4a2E4VEtRWlhHbm9WMWhJZWpONm4tM2ozbEh4a3o5QWxnTGk1WkR3U1hlU2RqbXh6LTYyLU5ncFVxRW9Ld1l1dTFfZlEiLCJoZWlnaHQiOjEwNzQsIndpZHRoIjo1NDcsImZpbGVuYW1lIjoiU2NyZWVuc2hvdCBmcm9tIDIwMTQtMDItMTcgMjI6NTc6NTMucG5nIiwic2l6ZSI6MTMxNTg0fQ==" src="http://lh4.ggpht.com/VsDFTGByjEhVLrKu5mgmEIPAwncezD7mUmdjak5axdKDKpcNpktlA9mQlKQzppah8Oji_0G3gnTRJYVnpWhTaw=s1074" /></p><p>This is not how you run a business. If you find bugs in your management console, you don't abandon the business and leave your customers in the lurch.</p><p>Now it seems Ryan has at least learnt one lesson from this failure: don't let people post on your business's Facebook wall. His "CoinJar" business has disabled wall posts entirely.</p><h2>What are my options?</h2><p>So I currently have a ticket open with eNom (NameTerrific were an eNom reseller) and I hope I can get control of my domain back again. 
And if Mitch Ortenburg is reading this, I&#39;m also quite happy to return your domain to you, if I only knew how to contact you...</p><p>And of course, I have already transferred every domain I can out of NameTerrific and into a reputable registrar. Lesson learnt!</p> I don't think Samsung Mob!lers threatened to strand bloggers at Berlin http://www.codeka.com.au/blog/2012/09/i-don-t-think-samsung Tue, 04 Sep 2012 02:06:10 GMT http://www.codeka.com.au/blog/2012/09/i-don-t-think-samsung <p>So I was reading <a href="http://news.ycombinator.com/item?id=4468037">this post</a> on Hacker News wherein some bloggers were apparently <a href="http://thenextweb.com/insider/2012/09/02/heres-samsung-flew-bloggers-halfway-around-world-threatened-leave/?utm_campaign=social%20media&amp;awesm=tnw.to_a4CW&amp;utm_source=Twitter&amp;utm_medium=Spreadus">threatened with cancelled flights home after they refused to man Samsung booths at a trade show</a>.</p> <p>Now, to me, this seemed rather fishy. Why would Samsung do something so obviously malicious? I had a look at the <a href="http://samsungmobilers.co.uk/">Samsung Mob!lers</a> website, and it's pretty clear how the site works. You write articles on topics Samsung wants you to write about (there's currently <a href="http://samsungmobilers.co.uk/public/missions/view/88">one up there now</a> asking bloggers "to showcase the extent of its [S-Voice] functionality in a fun, adventurous and creative way.") In return for writing blog posts on behalf of Samsung, you earn "points" which you can then redeem for "rewards".</p> <p><a href="http://samsungmobilers.co.uk/public/rewards/view/7">Here's the page that lists a trip to the IFA consumer show as a "reward" in the Mob!lers program</a>. So let's be clear here: the trip to the IFA consumer show was a <em>reward</em> for writing blog posts praising Samsung products. 
This is not a case of Samsung inviting bloggers to trade shows to provide coverage; it's a <em>prize</em>.</p> <p>The TNW article goes to great pains to point out that sending bloggers to trade shows is standard fare for large companies. But that's a red herring -- these were not bloggers invited to the event to provide coverage for Samsung. They were invited to the event as a reward for writing blog posts about Samsung products. The article also takes pains to point out that programs like Samsung's Mob!lers are common. But it doesn't go into any details about what the program entails (except to say that the Mob!lers program expects you to become a "shill" for the company).</p> <p>So here's what I think happened: these bloggers earned their points on Samsung's Mob!lers website. In return, Samsung rewarded them with a free trip to the IFA trade show in Berlin. My guess is that Samsung made it pretty clear they would be manning Samsung booths at the show (after all, those were the "warning signs" the article talks about: fittings for uniforms and such). The bloggers probably told Samsung that they wanted to review other products while they were there and Samsung said that's fine.</p> <p>Now we get to the "smoking gun" email that TNW posts. In light of the above, all the email says is that they refused to go to the Samsung event and just stayed in their hotel, and now Samsung wants to send them home. It doesn't say they weren't told they would have to man the booths. It doesn't say that flights had been cancelled. Just that Samsung wants to send them home early.</p> <p>And if they're not doing what they were asked to do, then why wouldn't you want to send them home early? 
All we have on the "cancel flights" and "stranding" is the word of the bloggers, who clearly weren't interested in doing their part.</p> Debugging an issue with high CPU load http://www.codeka.com.au/blog/2012/08/debugging-an-issue-with-high Fri, 24 Aug 2012 13:07:13 GMT http://www.codeka.com.au/blog/2012/08/debugging-an-issue-with-high <p>For the last few days, one of the websites I work on was having a strange issue. Every now and then, the whole server would grind to a halt and stop serving traffic. This would last for a couple of minutes, then suddenly everything would "magically" pick up and we'd be back to normal. For a while...</p> <h2>Simple website monitoring</h2> <p>The first part of figuring out what was going on was coming up with a way to be notified when the issue was occurring. It's hard to debug these kinds of issues after the fact. So I wanted to set up something that would ping the server and email me whenever it was taking too long to respond. I knocked up the following script (I might've inserted backslashes at invalid locations, sorry -- that's just to ease reading on the blog):</p> <pre class="brush: bash">#!/bin/bash
TIME=$( { /usr/bin/time -f %e wget -q -O /dev/null \
    http://www.example.com/; } 2&gt;&amp;1 )
TOOSLOW=$(awk "BEGIN{ print ($TIME&gt;2.5) }")
if [ $TOOSLOW -eq 1 ]; then
    echo "The time for this request, $TIME, was greater than 2.5 seconds!" \
        | mail -s "Server ping ($TIME sec)" "me@me.com"
fi</pre> <p>I set this up as a cron job on my media centre PC (high-tech, I know) to run every 5 minutes. It would email me whenever the website took longer than 2.5 seconds to respond (a "normal" response time is &lt; 0.5 seconds, so I figured 5 times longer was enough).</p> <p>It didn't take long for the emails to start coming through. Then it was a matter of jumping on the server and trying to figure out what the problem was.</p> <h2>First steps</h2> <p>Once the problem was happening, there were a couple of "obvious" first things to try. 
The first thing I always do is run <code>top</code> and see what's happening:</p> <pre class="brush:text">top - 08:51:03 up 73 days, 7:45, 1 user, load average: 69.00, 58.31, 46.89
Tasks: 316 total, 2 running, 314 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.0%us, 1.3%sy, 0.0%ni, 15.2%id, 72.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 8299364k total, 7998520k used, 300844k free, 15480k buffers
Swap: 16779884k total, 4788k used, 16775096k free, 6547860k cached</pre> <p>Check out that load! 69.00 in the last minute -- that's massive! Also of concern is the 72.0% next to "wa", which means 72% of the CPU time was spent in uninterruptible wait. There aren't many things that run in uninterruptible wait (usually kernel threads), and it's usually some I/O sort of thing. So let's see what <code>iotop</code> (which is like <code>top</code> but for I/O) reports:</p> <pre class="brush: text;">Total DISK READ: 77.37 K/s | Total DISK WRITE: 15.81 M/s
  TID PRIO USER DISK READ DISK WRITE SWAPIN IO&gt; COMMAND
25647 be/4 apache 73.50 K/s 0.00 B/s 0.00 % 99.99 % httpd
24387 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
23813 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
25094 be/4 root 0.00 B/s 0.00 B/s 96.72 % 99.99 % [pdflush]
25093 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25095 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25091 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
24389 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24563 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24390 be/4 apache 0.00 B/s 23.21 K/s 96.71 % 99.99 % httpd
24148 be/4 apache 0.00 B/s 0.00 B/s 96.71 % 99.99 % httpd
24699 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
23973 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24270 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24298 be/4 apache 0.00 B/s 1918.82 K/s 96.71 % 99.02 % httpd
  628 be/3 root 0.00 B/s 0.00 B/s 0.00 % 97.51 % [kjournald]
25092 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.72 % [pdflush]
24258 be/4 root 0.00 B/s 0.00 B/s 99.99 % 96.71 % [pdflush]
23814 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.71 % [pdflush]
24388 be/4 root 0.00 B/s 0.00 B/s 99.02 % 96.71 % [pdflush]
25545 be/4 apache 0.00 B/s 0.00 B/s 0.19 % 92.73 % httpd
25274 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 92.38 % httpd
24801 be/4 apache 0.00 B/s 5.84 M/s 99.99 % 91.63 % httpd
25281 be/4 apache 0.00 B/s 5.75 M/s 0.00 % 91.33 % httpd
26115 be/4 apache 0.00 B/s 0.00 B/s 9.60 % 19.26 % httpd
25561 be/4 apache 0.00 B/s 3.87 K/s 0.00 % 9.66 % httpd
26035 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 9.63 % httpd</pre> <p>So all those <code>pdflush</code> commands looked suspicious to me. <code>pdflush</code> is a kernel thread that's responsible for writing out dirty pages of memory to disk in order to free up memory.</p> <p>It was at this point that I started to suspect some kind of hardware failure. Checking the output of <code>sar -d 5 0</code>, I could see this:</p> <pre class="brush: text;">Linux 2.6.18-308.1.1.el5PAE (XXX) 23/08/12

08:55:45 DEV tps ... await svctm %util
08:55:50 dev8-0 877.25 ... 179.28 1.14 99.84
08:55:50 dev8-1 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-2 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-3 877.25 ... 179.28 1.14 99.84</pre> <p>Check out that utilization column! 99.84% is really bad (more than 70% or so is when you'd start to have problems).</p> <p>I was at a bit of a loss, because I'm not too familiar with the hardware that's running this server (it's not mine), but I knew the disks were in hardware RAID and <code>smartctl</code> wasn't being helpful at all, so I posted <a href="http://serverfault.com/questions/420233/very-high-load-apparently-caused-by-pdflush">a question on Server Fault</a>. At this point, I was thinking it was a hardware problem, but I wasn't sure where to go from there.</p> <h2>My first hint</h2> <p>My first hint was a comment by Mark Wagner:</p> <blockquote><span>Apache PIDs 24801 and 25281 are doing by far the most I/O: 5.84 M/s and 5.75 M/s, respectively. 
I use </span><code>iotop -o</code><span> to exclude processes not doing I/O</span></blockquote> <p>What <em>were</em> those two processes doing? I opened one of them up with strace:</p> <pre class="brush: text;"># strace -p 24801
[sudo] password for dean:
Process 24801 attached - interrupt to quit
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304</pre> <p>It was just spewing those lines out, writing huge amounts of data to file number "26". But what <em>is</em> file number 26? For that, we use the handy-dandy <code>lsof</code>:</p> <pre class="brush: text;"># lsof -p 24801
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 24801 apache cwd DIR 8,3 4096 2 /
httpd 24801 apache rtd DIR 8,3 4096 2 /
httpd 24801 apache txt REG 8,3 319380 6339228 /usr/sbin/httpd
. . .
httpd 24801 apache 26w 0000 8,3 0 41713666 /tmp/magick-XXgvRhzG</pre> <p>So at the end of the output from <code>lsof</code>, we see under the "FD" (for "file descriptor") column, 26w (the "w" means it's open for writing) and the file is actually <code>/tmp/magick-XXgvRhzG</code>.</p> <p>A quick <code>ls</code> on the <code>/tmp</code> directory, and I'm shocked:</p> <pre class="brush: text;">-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXrQahSe
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXTaXatz
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXtf25pe</pre> <p>These files are 1.6 <b>terabytes</b>! Luckily(?) 
they're sparse files, so they don't actually take up that much physical space (the disks aren't even that big), but that's definitely not good.</p> <p>The last piece of the puzzle was figuring out what images were being worked on to produce those enormous temporary files. I was thinking maybe there was a corrupt .png file or something that was causing ImageMagick to freak out, but after firing up Apache's wonderful <a href="http://httpd.apache.org/docs/2.4/mod/mod_status.html">mod_status</a> I could immediately see that the problem was my own dumb self: the URL it was requesting was:</p> <p><b>/photos/view/size:320200/4cc41224cae04e52b76041be767f1340-q</b></p> <p>In case you don't spot it right away, it's the "size" parameter: <b>size:320200</b> is <em>supposed</em> to be <b>size:320,200</b>. If you leave off the height, my code assumes you want the image displayed at the specified width, but with the original aspect ratio. So it was trying to generate an image that was 320200x200125, rather than 320x200!</p> <h2>The Solution</h2> <p>The solution was, as is often the case, extremely simple once I'd figured out the problem. I just made sure the image resizer never resized an image to be <em>larger</em> than the original (our originals are generally bigger than what's displayed on the website).</p> <p>The only remaining question was where this request was coming from. The output of <code>mod_status</code> showed an IP address that belonged to Google, so it must've been the Googlebot crawling the site. But a quick search through the database showed no links to an image with the invalid <b>size:320200</b> parameter.</p> <p>At this stage, it's still an open mystery where that link was coming from. The image in question was from an article written in 2010, so it's not something current. 
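</p><p>For the record, the guard amounts to just a few lines. Here's a hypothetical sketch in Java (the site itself isn't written in Java, and these names are invented for illustration):</p>

```java
// Sketch of the resize guard: fill in a missing height from the original
// aspect ratio, then clamp so we never generate an image larger than the
// original. Hypothetical names; not the site's actual code.
public class ResizeGuard {
    // reqH may be null, meaning "keep the original aspect ratio".
    public static int[] targetSize(int origW, int origH, int reqW, Integer reqH) {
        int w = reqW;
        int h = (reqH != null) ? reqH
                               : (int) Math.round((double) reqW * origH / origW);
        // Never upscale past the original dimensions.
        if (w > origW || h > origH) {
            double scale = Math.min((double) origW / w, (double) origH / h);
            w = (int) Math.round(w * scale);
            h = (int) Math.round(h * scale);
        }
        return new int[] { w, h };
    }

    public static void main(String[] args) {
        // A runaway "size:320200" request against a 1600x1000 original
        // now clamps to the original size instead of 320200x200125.
        int[] size = targetSize(1600, 1000, 320200, null);
        System.out.println(size[0] + "x" + size[1]); // 1600x1000
    }
}
```

<p>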
In any case, where the link was coming from is of less concern than the fact that it no longer causes the whole server to grind to a halt whenever someone requests it.</p> <p>So I'm happy to leave that one open for now.</p> App Engine task failed for no apparent reason (on the development server) http://www.codeka.com.au/blog/2012/05/app-engine-task-failed-for Tue, 08 May 2012 14:18:42 GMT http://www.codeka.com.au/blog/2012/05/app-engine-task-failed-for <p>So I ran into an interesting issue with App Engine task queues today. They would fail for no apparent reason, without even executing any of my code (but on the development server only). In the log, I would see the following:</p> <pre class="brush: plain;">WARNING  2012-05-07 11:56:25,644 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.100 seconds
WARNING  2012-05-07 11:56:25,746 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.200 seconds
WARNING  2012-05-07 11:56:25,947 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.400 seconds</pre> <p>I was a little confused, so I had a look in taskqueue_stub.py around line 1936. I could see that basically all it was doing was connecting to my URL and issuing a GET, just as I had configured it to. On a whim, I added an extra line of logging so that it now output the following:</p> <pre class="brush: plain;">WARNING 2012-05-07 11:56:25,643 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,644 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.100 seconds
WARNING 2012-05-07 11:56:25,745 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,746 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.200 seconds
WARNING 2012-05-07 11:56:25,946 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,947 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.400 seconds</pre> <p>Here, the "default host" is what I specified on the command line for the development server to listen on (all IPs, port 8271). However, 192.168.1.4 is my machine's IP address -- but note there's no port specified. It would try to connect to port 80!</p> <p>I couldn't figure out why until I looked at the code that was enqueueing the task to begin with. It's an Android app, and it uses HttpCore to manage the HTTP requests it makes to my App Engine app (I'm thinking of switching to the built-in <a href="http://developer.android.com/reference/java/net/URLConnection.html">URLConnection</a> after all, but we'll see...) and the code has the following lines:</p> <pre class="brush: java;">// This is paraphrasing slightly
String requestUrl = uri.getPath();
if (uri.getQuery() != null &amp;&amp; !uri.getQuery().isEmpty()) {
    requestUrl += "?" + uri.getQuery();
}
BasicHttpRequest request = new BasicHttpRequest(method, requestUrl);
request.addHeader("Host", uri.getHost());
// ...</pre> <p>You might see the problem already...</p> <p>The problem is, we were setting the "Host" header to 192.168.1.4 (no port number!) 
and the task library was then using the Host header we supplied to know which host to connect to in order to execute the task.</p> <p>The solution is actually simple:</p> <pre class="brush: java;">String host = uri.getHost();
int port = uri.getPort(); // -1 if the URI doesn't specify a port explicitly
if (port != -1
        &amp;&amp; !((uri.getScheme().equals("http") &amp;&amp; port == 80)
             || (uri.getScheme().equals("https") &amp;&amp; port == 443))) {
    host += ":" + port;
}
request.addHeader("Host", host);</pre> <p>Basically, if the port is unspecified or is the default for the scheme (80 for HTTP, 443 for HTTPS), don't include it in the Host field. Otherwise, make sure the port number is included.</p> <p>This would never be a problem if you were executing tasks from a browser (or using a more high-level library that auto-populated the Host header for me!), so it's understandable that I couldn't really find any other people having this problem. Maybe I <em>should</em> just switch to using <code class="codespan">URLConnection</code>.</p>
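<p>To make the rule concrete, here's a self-contained illustration of the Host-header logic (a hypothetical helper class, not the app's actual code):</p>

```java
// Hypothetical helper illustrating the Host-header rule: include the port
// only when the URI names one explicitly and it isn't the scheme's default.
import java.net.URI;

public class HostHeader {
    public static String format(URI uri) {
        int port = uri.getPort(); // -1 when the URI has no explicit port
        boolean isDefault =
                ("http".equals(uri.getScheme()) && port == 80)
             || ("https".equals(uri.getScheme()) && port == 443);
        return (port == -1 || isDefault)
                ? uri.getHost()
                : uri.getHost() + ":" + port;
    }

    public static void main(String[] args) throws Exception {
        // The dev-server case: a non-default port must be kept.
        System.out.println(format(new URI("http://192.168.1.4:8271/tasks/task1"))); // 192.168.1.4:8271
        // Default or unspecified ports are dropped.
        System.out.println(format(new URI("http://example.com/tasks/task1")));      // example.com
        System.out.println(format(new URI("https://example.com:443/")));            // example.com
    }
}
```

<p>The interesting corner case is that <code>URI.getPort()</code> returns -1 when the URL has no explicit port, so a naive "port != 80" check would wrongly append ":-1" for plain http://host/ URLs.</p>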