Dean's Blog http://www.codeka.com.au/blog Development blog of Dean Harding. en-us Thu, 27 Feb 2014 12:10:00 GMT Unicode support in MySQL is ... 👎 http://www.codeka.com.au/blog/2014/02/unicode-support-in-mysql-is-- Thu, 27 Feb 2014 12:10:00 GMT http://www.codeka.com.au/blog/2014/02/unicode-support-in-mysql-is-- <p>For the last few days, I&#39;ve been getting some strange error reports from the&nbsp;War Worlds&nbsp;server. Messages like this:</p><pre class="brush: text">java.sql.SQLException: Incorrect string value: &#39;\xF0\x9F\x98\xB8. ...&#39; for column &#39;message&#39; at row 1 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4120) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4052) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2503) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2664) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2815) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2458) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2375) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2359) at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:203) at au.com.codeka.warworlds.server.data.SqlStmt.update(SqlStmt.java:117) at au.com.codeka.warworlds.server.ctrl.ChatController.postMessage(ChatController.java:120) . . .</pre><p>Now, that string which MySQL complains is &quot;incorrect&quot; is actually the Unicode codepoint U+1F638 GRINNING CAT FACE WITH SMILING EYES, aka&nbsp;😸 -- a perfectly valid Emoji character. Why was MySQL rejecting it? 
All my columns are defined to accept UTF-8, so there should not be a problem, right?</p><h2>When is UTF-8 not UTF-8?</h2><p>When it's used in MySQL, apparently.</p><p>For reasons that completely escape me, MySQL 5.x limits UTF-8 strings to U+FFFF and smaller. That is, the "BMP". Why they call this encoding "UTF-8" is beyond me; it most definitely is not UTF-8.</p><p>The trick, apparently, is to use a slightly different encoding, which MySQL calls "utf8mb4", that supports the full four-byte range of UTF-8.</p><p>So the "fix" was simple: just run</p><pre class="brush: sql;">ALTER TABLE chat_messages
  MODIFY message TEXT CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci NOT NULL;</pre><p>and so on, on basically every column in the database which could possibly include characters outside the BMP. But that's not enough! You also need to tell the server to use "utf8mb4" internally as well, by including the following line in your my.cnf:</p><pre class="brush: text;">[mysqld]
character-set-server = utf8mb4</pre><p>Now presumably there is some drawback to doing this, otherwise "utf8mb4" would be the default (right?), but I'll be damned if I can figure out what the drawback is. I guess I'll just monitor things and see where it takes us. But as of now, War Worlds supports Emoji emoticons in chat messages, yay!</p><h2>Addendum</h2><p>If you're just seeing squares for the emoji characters in this post, you'll need a font that supports the various Unicode emoji blocks. 
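</p><p>Incidentally, it's easy to check from code whether a string contains any of these troublesome characters -- anything above U+FFFF needs utf8mb4. Here's a quick Java sketch (my own illustration, not code from the War Worlds server):</p>

```java
// Sketch: detect code points outside the Basic Multilingual Plane -- the
// characters a plain MySQL "utf8" column rejects. Class/method names are
// made up for this example.
public class Utf8Check {
    // True if the string contains any code point above U+FFFF,
    // i.e. it needs a utf8mb4 column to be stored safely.
    public static boolean needsUtf8mb4(String s) {
        return s.codePoints().anyMatch(cp -> cp > 0xFFFF);
    }

    public static void main(String[] args) {
        // U+1F638 GRINNING CAT FACE WITH SMILING EYES as a surrogate pair.
        System.out.println(needsUtf8mb4("hi \uD83D\uDE38")); // true
        System.out.println(needsUtf8mb4("hello"));           // false
    }
}
```

<p>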
I've used the Symbola font (<a href="http://users.teilar.gr/~g1951d/">which you can get here</a>) with good results.</p> Why you should use a reputable DNS registrar http://www.codeka.com.au/blog/2014/02/why-you-should-use-a-reputable-dns-registrar Sun, 16 Feb 2014 10:00:00 GMT http://www.codeka.com.au/blog/2014/02/why-you-should-use-a-reputable-dns-registrar <p>I've had a bit of a crazy weekend. It all started while I was checking out some of my websites and I discovered their DNS was not resolving! That's always scary, and after a few minutes I realised that the DNS servers were unreachable from some (but not all) networks. For example, this was a traceroute to one of the DNS servers from my home computer:</p><pre class="brush: bash;">$ traceroute 103.30.213.5
traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets
 1 home.gateway.home.gateway (192.168.1.254) 0.745 ms 1.041 ms 1.334 ms
 2 lns20.syd7.on.ii.net (150.101.199.219) 18.276 ms 19.705 ms 20.402 ms
 3 te3-1-120.cor2.syd6.on.ii.net (150.101.199.241) 22.902 ms 23.762 ms 24.736 ms
 4 ae5.br1.syd7.on.ii.net (150.101.33.50) 138.154 ms 139.785 ms ae0.cr1.syd4.on.ii.net (150.101.33.16) 29.454 ms
 5 ae5.br1.syd4.on.ii.net (150.101.33.48) 30.446 ms ae0.br1.syd4.on.ii.net (150.101.33.14) 36.895 ms ae5.br1.syd4.on.ii.net (150.101.33.48) 37.306 ms
 6 te0-2-0.bdr1.hkg2.on.ii.net (150.101.33.199) 145.732 ms 139.403 ms 140.845 ms
 7 hostvirtual-RGE.hkix.net (202.40.160.179) 159.664 ms 132.746 ms 133.402 ms
 8 vhk.vr.org (208.111.42.5) 134.369 ms 135.393 ms 136.574 ms
 9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
. . .</pre><p>It was getting stuck at vhk.vr.org, which seems to be some transit provider or other. 
But from other networks it was OK, here&#39;s the traceroute from my Google Compute Engine instance:</p><pre class="brush: bash;">$ traceroute 103.30.213.5 traceroute to 103.30.213.5 (103.30.213.5), 30 hops max, 60 byte packets 1 216.239.46.192 (216.239.46.192) 1.168 ms 216.239.43.216 (216.239.43.216) 1.429 ms 216.239.46.192 (216.239.46.192) 1.105 ms 2 216.239.46.192 (216.239.46.192) 1.383 ms 1.072 ms 216.239.43.218 (216.239.43.218) 1.335 ms 3 216.239.46.190 (216.239.46.190) 1.477 ms 216.239.43.218 (216.239.43.218) 1.367 ms 216.239.46.192 (216.239.46.192) 1.287 ms 4 216.239.43.216 (216.239.43.216) 1.656 ms 216.239.46.190 (216.239.46.190) 1.434 ms 216.239.43.218 (216.239.43.218) 1.914 ms 5 216.239.43.216 (216.239.43.216) 1.910 ms 1.886 ms 1.868 ms 6 216.239.43.218 (216.239.43.218) 1.841 ms 1.257 ms 216.239.46.190 (216.239.46.190) 1.470 ms 7 216.239.46.192 (216.239.46.192) 1.196 ms 216.239.43.218 (216.239.43.218) 1.289 ms 216.239.43.216 (216.239.43.216) 1.496 ms 8 216.239.43.218 (216.239.43.218) 1.643 ms 216.239.43.216 (216.239.43.216) 1.457 ms 216.239.43.218 (216.239.43.218) 1.586 ms 9 209.85.248.215 (209.85.248.215) 11.936 ms 72.14.232.140 (72.14.232.140) 11.761 ms 209.85.248.229 (209.85.248.229) 14.151 ms 10 209.85.254.239 (209.85.254.239) 15.746 ms 72.14.237.132 (72.14.237.132) 11.702 ms 72.14.237.131 (72.14.237.131) 14.216 ms 11 209.85.255.133 (209.85.255.133) 14.355 ms 209.85.255.27 (209.85.255.27) 14.187 ms 13.759 ms 12 * * * 13 db-transit.Gi9-12.br01.rst01.pccwbtn.net (63.218.125.26) 39.572 ms 37.646 ms 39.849 ms 14 viad-vc.as36236.net (209.177.157.8) 39.499 ms 37.372 ms 39.243 ms 15 pdns1.terrificdns.com (103.30.213.5) 40.454 ms 39.940 ms 41.335 ms</pre><p>So that seems kind of scary! I quickly composed an email to the support address with what I&#39;d seen. An hour later and no response. At this point I was worried, because I had no idea how many people were unable to contact my websites. 
I tried logging in to the management console, but that was also a no-go: the DNS for the management console was hosted on the same DNS servers that were not responding! I managed to log in by hard-coding the IP address (which I got from my GCE server that was able to connect to the DNS servers) in my /etc/hosts file.</p><p>Just trying to get something up &amp; running again, I exported all my DNS records and imported them into a friend's DNS server. Then I was able to change the nameservers configured in the management console to point to my friend's DNS server. For now, we were back up again. But a few hours later and still no response from my support request.</p><h2>Who are NameTerrific?</h2><p>I first heard about NameTerrific in <a href="https://news.ycombinator.com/item?id=4743455">this post on HackerNews</a>. The website was well done and the interface was easy to use. So I decided to give it a go with some of my more throwaway domains. I had a minor issue early on and the support was quite good, so over the next year or so I moved a couple more domains over.</p><p>Then, I stopped thinking about it. It worked well for the next year or so, and DNS tends to be one of those things you don't really think about. Until it <em>stops</em> working.</p><p>It was probably my fault. I didn't put much effort into researching NameTerrific's founder, <a href="https://www.zhoutong.com/">Ryan Zhou</a>. He seems to be a serial entrepreneur who dropped out of school to pursue his dream. That's all well and good, but when you're hosting a service for people, it doesn't do much for your reputation for reliability when you abandon your previous business for who-knows-what reason.</p><h2>What do I think happened?</h2><p>I think he discovered his website had bugs and people's domains were being transferred to the wrong people. Why do I suspect that? Because it's <em>happening to me right now</em>. 
Check this out. I go into my control panel for one of my domains, war-worlds.com:</p><p style="text-align: center;"><a title="" class="lightbox" href="http://lh6.ggpht.com/irhFZ2RBJcpD-HCIxfyhUSHriCYPQPJOdRAnkfKyBQ-Y2J5I7dC6VqcMk0RIHHN2xpbjiDESigPqNtjJfIDE0gc=s1026"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg2LmdncGh0LmNvbS9pcmhGWjJSQkpjcEQtSENJeGZ5aFVTSHJpQ1lQUVBKT2RSQW5rZkt5QlEtWTJKNUk3ZEM2VnFjTWswUklISE4yeHBiamlERVNpZ1BxTnRqSmZJREUwZ2M9czEwMCIsImJsb2Jfa2V5IjoiQU1JZnY5NHpILV9KMjZLalhfSHB5TVhVdjhHZFpoQnNEMHQ4YjVVZ1BOTC1RdUxhQjhVNFZpSXNUOEpreGpJNlpfSG5BSHp6dUtFZkx1UHdiY1gwWFl4ckxVajJ6SXY3WVZEV2JsZDE4d3RXd1BzSkdRcDVBRXdBaTBHM3dYeVBXaGxDdjBIa2tXajVDZWFuUTNQSWhsX1JQV3lFX3ctc0tBIiwiaGVpZ2h0Ijo2NzEsIndpZHRoIjoxMDI2LCJmaWxlbmFtZSI6IlNjcmVlbnNob3QgZnJvbSAyMDE0LTAyLTE3IDIyOjMzOjI1LnBuZyIsInNpemUiOjYyNzkwfQ==" src="http://lh6.ggpht.com/irhFZ2RBJcpD-HCIxfyhUSHriCYPQPJOdRAnkfKyBQ-Y2J5I7dC6VqcMk0RIHHN2xpbjiDESigPqNtjJfIDE0gc=s600" /></a></p><p>I click on &quot;Transfer Away&quot;, it prompts me to confirm, I click OK and I receive the following email:</p><p style="text-align: center;"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg1LmdncGh0LmNvbS9tUHZxdk1ZVjZQOWoxdVhsaldsOVotM3l6cTN2aUtVeEpaVkFiYW95MWY1bVVHUkF6ckc0QkxYTDN3V2F2U21ZWnNrVXJZeU5EMTM1OEFWN01zalI9czEwMCIsImJsb2Jfa2V5IjoiQU1JZnY5NlNiVERIRkhrMWhhZFBkdVBackMyMExQMHRMQTdUOTRsSFhZOWxrZEdCZm1ZS1NwUER3X0U3bVZMWjdkaWJZWFVydmxaS0NFcDIzdktCOFotU1hUR1JuQ1hsdkhsaFNxZC1aNXJNbkJDYkRhb1FKT1pXY2pmQVp3b0dTZ2drZXVVV3ZXeGRMLTREamxIN0lPV2xEYnNhMzBySUp3IiwiaGVpZ2h0IjozMSwid2lkdGgiOjY0MCwiZmlsZW5hbWUiOiJTY3JlZW5zaG90IGZyb20gMjAxNC0wMi0xNyAyMjo0NToyMi5wbmciLCJzaXplIjo0NjA3fQ==" src="http://lh5.ggpht.com/mPvqvMYV6P9j1uXljWl9Z-3yzq3viKUxJZVAbaoy1f5mUGRAzrG4BLXL3wWavSmYZskUrYyND1358AV7MsjR=s640" /></p><p>It&#39;s an authorization code <em>for a domain I don&#39;t even own</em>. What&#39;s worse, if I do a whois on mitchortenburg.com, I find <em>myself listed in the contact information</em>! 
I seem to be the owner of a domain I never purchased (and, to be honest, don't really want) because of some bizarre mixup with the management website.</p><p>Even worse still: I have no way to generate an auth code for war-worlds.com (the domain I <em>do</em> care about) and I'm terrified that some other customer of NameTerrific's is somehow able to do what I've managed to do and gain ownership of my domain!</p><p>I'm not the only one having problems. Their Facebook page is full of people who have also apparently realized they've been abandoned:</p><p style="text-align: center;"><img alt="" data-resp="eyJzdWNjZXNzIjp0cnVlLCJ1cmwiOiJodHRwOi8vbGg0LmdncGh0LmNvbS9Wc0RGVEdCeWpFaFZMckt1NW1nbUVJUEF3bmNlekQ3bVVtZGphazVheGRLREtwY05wa3RsQTltUWxLUXpwcGFoOE9qaV8wRzNnblRSSllWbnBXaFRhdz1zMTAwIiwiYmxvYl9rZXkiOiJBTUlmdjk3aGZFQ2twLTFYWGt2U0tqX052WXRfNWlnYUs5UEtrSTFWVEdORnZPWkJ3V0RrVjFoN3liNFpKRENudmVrSDNzX0RidkhWbkpQaWEwam9WZ3JGRndxMHh4a2E4VEtRWlhHbm9WMWhJZWpONm4tM2ozbEh4a3o5QWxnTGk1WkR3U1hlU2RqbXh6LTYyLU5ncFVxRW9Ld1l1dTFfZlEiLCJoZWlnaHQiOjEwNzQsIndpZHRoIjo1NDcsImZpbGVuYW1lIjoiU2NyZWVuc2hvdCBmcm9tIDIwMTQtMDItMTcgMjI6NTc6NTMucG5nIiwic2l6ZSI6MTMxNTg0fQ==" src="http://lh4.ggpht.com/VsDFTGByjEhVLrKu5mgmEIPAwncezD7mUmdjak5axdKDKpcNpktlA9mQlKQzppah8Oji_0G3gnTRJYVnpWhTaw=s1074" /></p><p>This is not how you run a business. If you find bugs in your management console, you don't abandon the business and leave your customers in the lurch.</p><p>Now it seems Ryan has at least learnt one lesson from this failure: don't let people post on your business's Facebook wall. His "CoinJar" business has disabled wall posts entirely.</p><h2>What are my options?</h2><p>So I currently have a ticket open with eNom (NameTerrific were an eNom reseller) and I hope I can get control of my domain back again. 
And if Mitch Ortenburg is reading this, I&#39;m also quite happy to return your domain to you, if I only knew how to contact you...</p><p>And of course, I have already transferred every domain I can out of NameTerrific and into a reputable registrar. Lesson learnt!</p> I don't think Samsung Mob!lers threatened to strand bloggers at Berlin http://www.codeka.com.au/blog/2012/09/i-don-t-think-samsung Tue, 04 Sep 2012 02:06:10 GMT http://www.codeka.com.au/blog/2012/09/i-don-t-think-samsung <p>So I was reading <a href="http://news.ycombinator.com/item?id=4468037">this post</a> on Hacker News wherein some bloggers were apparently <a href="http://thenextweb.com/insider/2012/09/02/heres-samsung-flew-bloggers-halfway-around-world-threatened-leave/?utm_campaign=social%20media&amp;awesm=tnw.to_a4CW&amp;utm_source=Twitter&amp;utm_medium=Spreadus">threatened with cancelled flights home after they refused to man Samsung booths at a trade show</a>.</p> <p>Now, to me, this seemed rather fishy. Why would Samsung do something so obviously malicious? I had a look at the <a href="http://samsungmobilers.co.uk/">Samsung Mob!lers</a> website, and it's pretty clear how the site works. You write articles on topics Samsung wants you to write about (there's currently <a href="http://samsungmobilers.co.uk/public/missions/view/88">one up there now</a> asking bloggers "to showcase the extent of its [S-Voice] functionality in a fun, adventurous and creative way.") In return for writing blog posts on behalf of Samsung, you earn "points" which you can then redeem for "rewards".</p> <p><a href="http://samsungmobilers.co.uk/public/rewards/view/7">Here's the page that lists a trip to the IFA consumer show as a "reward" in the Mob!lers program</a>. So let's be clear here: the trip to the IFA consumer show was a <em>reward</em> for writing blog posts praising Samsung products. 
This is not a case of Samsung inviting bloggers to trade shows to provide coverage; it's a <em>prize</em>.</p> <p>The TNW article goes to great pains to point out that sending bloggers to trade shows is standard fare for large companies. But that's a red herring -- these were not bloggers invited to the event to provide coverage for Samsung. They were invited to the event as a reward for writing blog posts about Samsung products. The article also takes pains to point out that programs like Samsung's Mob!lers are common. But it doesn't go into any details about what the program entails (except to say that the Mob!lers program expects you to become a "shill" for the company).</p> <p>So here's what I think happened: these bloggers earned their points on Samsung's Mob!lers website. In return, Samsung rewarded them with a free trip to the IFA trade show in Berlin. My guess is that Samsung made it pretty clear they would be manning Samsung booths at the show (after all, those were the "warning signs" the article talks about: fittings for uniforms and such). The bloggers probably told Samsung that they wanted to review other products while they were there and Samsung said that's fine.</p> <p>Now we get to the "smoking gun" email that TNW posts. In light of the above, all the email says is that they refused to go to the Samsung event and just stayed in their hotel, and now Samsung wants to send them home. It doesn't say they weren't told they would have to man the booths. It doesn't say that flights had been cancelled. Just that Samsung wants to send them home early.</p> <p>And if they're not doing what they were asked to do, then why wouldn't you want to send them home early? 
All we have on the "cancel flights" and "stranding" is the word of the bloggers, who clearly weren't interested in doing their part.</p> Debugging an issue with high CPU load http://www.codeka.com.au/blog/2012/08/debugging-an-issue-with-high Fri, 24 Aug 2012 13:07:13 GMT http://www.codeka.com.au/blog/2012/08/debugging-an-issue-with-high <p>For the last few days, one of the websites I work on was having a strange issue. Every now and then, the whole server would grind to a halt and stop serving traffic. This would last for a couple of minutes, then suddenly everything would "magically" pick up and we'd be back to normal. For a while...</p> <h2>Simple website monitoring</h2> <p>The first part of figuring out what was going on was coming up with a way to be notified when the issue was occurring. It's hard to debug these kinds of issues after the fact. So I wanted to set up something that would ping the server and email me whenever it was taking too long to respond. I knocked up the following script (I might've inserted backslashes at invalid locations, sorry -- that's just to ease reading on the blog):</p> <pre class="brush: bash">#!/bin/bash
TIME=$( { /usr/bin/time -f %e wget -q -O /dev/null \
    http://www.example.com/; } 2&gt;&amp;1 )
TOOSLOW=$(awk "BEGIN{ print ($TIME&gt;2.5) }")
if [ $TOOSLOW -eq 1 ]; then
    echo "The time for this request, $TIME, was greater than 2.5 seconds!" \
        | mail -s "Server ping ($TIME sec)" "me@me.com"
fi</pre> <p>I set this up as a cron job on my media centre PC (high-tech, I know) to run every 5 minutes. It would email me whenever the website took longer than 2.5 seconds to respond (a "normal" response time is &lt; 0.5 seconds, so I figured 5 times longer was enough).</p> <p>It didn't take long for the emails to start coming through. Then it was a matter of jumping on the server and trying to figure out what the problem was.</p> <h2>First steps</h2> <p>Once the problem was happening, there were a couple of "obvious" first things to try. 
The first thing I always do is run <code>top</code> and see what's happening:</p> <pre class="brush:text">top - 08:51:03 up 73 days, 7:45, 1 user, load average: 69.00, 58.31, 46.89
Tasks: 316 total, 2 running, 314 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.0%us, 1.3%sy, 0.0%ni, 15.2%id, 72.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 8299364k total, 7998520k used, 300844k free, 15480k buffers
Swap: 16779884k total, 4788k used, 16775096k free, 6547860k cached</pre> <p>Check out that load! 69.00 in the last minute -- that's massive! Also of concern is the 72.0% next to "wa", which means 72% of the CPU time was spent in uninterruptible wait. There aren't many things that run in uninterruptible wait (usually kernel threads), and it's usually some I/O sort of thing. So let's see what <code>iotop</code> (which is like <code>top</code> but for I/O) reports:</p> <pre class="brush: text;">Total DISK READ: 77.37 K/s | Total DISK WRITE: 15.81 M/s
  TID PRIO USER DISK READ DISK WRITE SWAPIN IO&gt; COMMAND
25647 be/4 apache 73.50 K/s 0.00 B/s 0.00 % 99.99 % httpd
24387 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
23813 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
25094 be/4 root 0.00 B/s 0.00 B/s 96.72 % 99.99 % [pdflush]
25093 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25095 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25091 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
24389 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24563 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24390 be/4 apache 0.00 B/s 23.21 K/s 96.71 % 99.99 % httpd
24148 be/4 apache 0.00 B/s 0.00 B/s 96.71 % 99.99 % httpd
24699 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
23973 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24270 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24298 be/4 apache 0.00 B/s 1918.82 K/s 96.71 % 99.02 % httpd
  628 be/3 root 0.00 B/s 0.00 B/s 0.00 % 97.51 % [kjournald]
25092 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.72 % [pdflush]
24258 be/4 root 0.00 B/s 0.00 B/s 99.99 % 96.71 % [pdflush]
23814 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.71 % [pdflush]
24388 be/4 root 0.00 B/s 0.00 B/s 99.02 % 96.71 % [pdflush]
25545 be/4 apache 0.00 B/s 0.00 B/s 0.19 % 92.73 % httpd
25274 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 92.38 % httpd
24801 be/4 apache 0.00 B/s 5.84 M/s 99.99 % 91.63 % httpd
25281 be/4 apache 0.00 B/s 5.75 M/s 0.00 % 91.33 % httpd
26115 be/4 apache 0.00 B/s 0.00 B/s 9.60 % 19.26 % httpd
25561 be/4 apache 0.00 B/s 3.87 K/s 0.00 % 9.66 % httpd
26035 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 9.63 % httpd</pre> <p>So all those <code>pdflush</code> commands looked suspicious to me. <code>pdflush</code> is a kernel thread that's responsible for writing out dirty pages of memory to disk in order to free up memory.</p> <p>It was at this point that I started to suspect some kind of hardware failure. Checking the output of <code>sar -d 5 0</code>, I could see this:</p> <pre class="brush: text;">Linux 2.6.18-308.1.1.el5PAE (XXX) 23/08/12

08:55:45 DEV tps ... await svctm %util
08:55:50 dev8-0 877.25 ... 179.28 1.14 99.84
08:55:50 dev8-1 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-2 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-3 877.25 ... 179.28 1.14 99.84</pre> <p>Check out that utilization column! 99.84% is really bad (more than 70% or so is when you'd start to have problems).</p> <p>I was at a bit of a loss, because I'm not too familiar with the hardware that's running this server (it's not mine), but I knew the disks were in hardware RAID and <code>smartctl</code> wasn't being helpful at all, so I posted <a href="http://serverfault.com/questions/420233/very-high-load-apparently-caused-by-pdflush">a question on Server Fault</a>. At this point, I was thinking it was a hardware problem, but I wasn't sure where to go from there.</p> <h2>My first hint</h2> <p>My first hint was a comment by Mark Wagner:</p> <blockquote><span>Apache PIDs 24801 and 25281 are doing by far the most I/O: 5.84 M/s and 5.75 M/s, respectively. 
I use </span><code>iotop -o</code><span> to exclude processes not doing I/O</span></blockquote> <p>What <em>were</em> those two processes doing? I opened one of them up with strace:</p> <pre class="brush: text;"># strace -p 24801
[sudo] password for dean:
Process 24801 attached - interrupt to quit
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304</pre> <p>It was just spewing those lines out, writing huge amounts of data to file number "26". But what <em>is</em> file number 26? For that, we use the handy-dandy <code>lsof</code>:</p> <pre class="brush: text;"># lsof -p 24801
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 24801 apache cwd DIR 8,3 4096 2 /
httpd 24801 apache rtd DIR 8,3 4096 2 /
httpd 24801 apache txt REG 8,3 319380 6339228 /usr/sbin/httpd
. . .
httpd 24801 apache 26w 0000 8,3 0 41713666 /tmp/magick-XXgvRhzG</pre> <p>So at the end of the output from <code>lsof</code>, we see under the "FD" (for "file descriptor") column, 26w (the "w" means it's open for writing) and the file is actually <code>/tmp/magick-XXgvRhzG</code>.</p> <p>A quick <code>ls</code> on the <code>/tmp</code> directory, and I'm shocked:</p> <pre class="brush: text;">-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXrQahSe
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXTaXatz
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXtf25pe</pre> <p>These files are 1.6 <b>terabytes</b>! Luckily(?) 
they're sparse files, so they don't actually take up that much physical space (the disks aren't even that big), but that's definitely not good.</p> <p>The last piece of the puzzle was figuring out what images were being worked on to produce those enormous temporary files. I was thinking maybe there was a corrupt .png file or something that was causing ImageMagick to freak out, but after firing up Apache's wonderful <a href="http://httpd.apache.org/docs/2.4/mod/mod_status.html">mod_status</a> I could immediately see that the problem was my own dumb self: the URL it was requesting was:</p> <p><b>/photos/view/size:320200/4cc41224cae04e52b76041be767f1340-q</b></p> <p>In case you don't spot it right away, it's the "size" parameter: <b>size:320200</b> is <em>supposed</em> to be <b>size:320,200</b>. If you leave off the height, my code assumes you want the image displayed at the specified width, but with the original aspect ratio. So it was trying to generate an image that was 320200x200125, rather than 320x200!</p> <h2>The Solution</h2> <p>The solution was, as is often the case, extremely simple once I'd figured out the problem. I just made sure the image resizer never resized an image to be <em>larger</em> than the original (our originals are generally bigger than what's displayed on the website).</p> <p>The only remaining question was where this request was coming from. The output of <code>mod_status</code> showed an IP address that belonged to Google, so it must've been the Googlebot crawling the site. But a quick search through the database showed no links to an image with the invalid <b>size:320200</b> parameter.</p> <p>At this stage, it's still an open mystery where that link was coming from. The image in question was from an article written in 2010, so it's not something current. 
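</p><p>For the record, the guard amounts to just a few lines. Here's a hypothetical sketch in Java (the site itself isn't written in Java, and these names are invented for illustration):</p>

```java
// Sketch of the resize guard: fill in a missing height from the original
// aspect ratio, then clamp so we never generate an image larger than the
// original. Hypothetical names; not the site's actual code.
public class ResizeGuard {
    // reqH may be null, meaning "keep the original aspect ratio".
    public static int[] targetSize(int origW, int origH, int reqW, Integer reqH) {
        int w = reqW;
        int h = (reqH != null) ? reqH
                               : (int) Math.round((double) reqW * origH / origW);
        // Never upscale past the original dimensions.
        if (w > origW || h > origH) {
            double scale = Math.min((double) origW / w, (double) origH / h);
            w = (int) Math.round(w * scale);
            h = (int) Math.round(h * scale);
        }
        return new int[] { w, h };
    }

    public static void main(String[] args) {
        // A runaway "size:320200" request against a 1600x1000 original
        // now clamps to the original size instead of 320200x200125.
        int[] size = targetSize(1600, 1000, 320200, null);
        System.out.println(size[0] + "x" + size[1]); // 1600x1000
    }
}
```

<p>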
In any case, where the link was coming from is of less concern than the fact that it no longer causes the whole server to grind to a halt whenever someone requests it.</p> <p>So I'm happy to leave that one open for now.</p> App Engine task failed for no apparent reason (on the development server) http://www.codeka.com.au/blog/2012/05/app-engine-task-failed-for Tue, 08 May 2012 14:18:42 GMT http://www.codeka.com.au/blog/2012/05/app-engine-task-failed-for <p>So I ran into an interesting issue with App Engine task queues today. They would fail for no apparent reason, without even executing any of my code (but on the development server only). In the log, I would see the following:</p> <pre class="brush: plain;">WARNING  2012-05-07 11:56:25,644 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.100 seconds
WARNING  2012-05-07 11:56:25,746 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.200 seconds
WARNING  2012-05-07 11:56:25,947 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.400 seconds</pre> <p>I was a little confused, so I had a look in taskqueue_stub.py around line 1936. I could see that basically all it was doing was connecting to my URL and issuing a GET, just as I had configured it to. On a whim, I added an extra line of logging so that it now output the following:</p> <pre class="brush: plain;">WARNING 2012-05-07 11:56:25,643 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,644 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.100 seconds
WARNING 2012-05-07 11:56:25,745 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,746 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.200 seconds
WARNING 2012-05-07 11:56:25,946 taskqueue_stub.py:1828] Connecting to: 192.168.1.4 (default host is 0.0.0.0:8271)
WARNING 2012-05-07 11:56:25,947 taskqueue_stub.py:1936] Task task1 failed to execute. This task will retry in 0.400 seconds</pre> <p>Here, the "default host" is what I specified on the command line for the development server to listen on (all IPs, port 8271). However, 192.168.1.4 is my machine's IP address -- but note there's no port specified. It would try to connect to port 80!</p> <p>I couldn't figure out why until I looked at the code that was enqueueing the task to begin with. It's an Android app, and it uses HttpCore to manage the HTTP requests it makes to my App Engine app (I'm thinking of switching to the built-in <a href="http://developer.android.com/reference/java/net/URLConnection.html">URLConnection</a> after all, but we'll see...) and the code has the following lines:</p> <pre class="brush: java;">// This is paraphrasing slightly
String requestUrl = uri.getPath();
if (uri.getQuery() != null &amp;&amp; !uri.getQuery().isEmpty()) {
    requestUrl += "?" + uri.getQuery();
}
BasicHttpRequest request = new BasicHttpRequest(method, requestUrl);
request.addHeader("Host", uri.getHost());
// ...</pre> <p>You might see the problem already...</p> <p>The problem is, we were setting the "Host" header to 192.168.1.4 (no port number!) 
and the task library was then using the Host header we supplied to know which host to connect to in order to execute the task.</p> <p>The solution is actually simple:</p> <pre class="brush: java;">String host = uri.getHost();
int port = uri.getPort(); // -1 if the URI doesn't specify a port explicitly
if (port != -1
        &amp;&amp; !((uri.getScheme().equals("http") &amp;&amp; port == 80)
             || (uri.getScheme().equals("https") &amp;&amp; port == 443))) {
    host += ":" + port;
}
request.addHeader("Host", host);</pre> <p>Basically, if the port is unspecified or is the default for the scheme (80 for HTTP, 443 for HTTPS), don't include it in the Host field. Otherwise, make sure the port number is included.</p> <p>This would never be a problem if you were executing tasks from a browser (or using a more high-level library that auto-populated the Host header for me!), so it's understandable that I couldn't really find any other people having this problem. Maybe I <em>should</em> just switch to using <code class="codespan">URLConnection</code>.</p>
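<p>To make the rule concrete, here's a self-contained illustration of the Host-header logic (a hypothetical helper class, not the app's actual code):</p>

```java
// Hypothetical helper illustrating the Host-header rule: include the port
// only when the URI names one explicitly and it isn't the scheme's default.
import java.net.URI;

public class HostHeader {
    public static String format(URI uri) {
        int port = uri.getPort(); // -1 when the URI has no explicit port
        boolean isDefault =
                ("http".equals(uri.getScheme()) && port == 80)
             || ("https".equals(uri.getScheme()) && port == 443);
        return (port == -1 || isDefault)
                ? uri.getHost()
                : uri.getHost() + ":" + port;
    }

    public static void main(String[] args) throws Exception {
        // The dev-server case: a non-default port must be kept.
        System.out.println(format(new URI("http://192.168.1.4:8271/tasks/task1"))); // 192.168.1.4:8271
        // Default or unspecified ports are dropped.
        System.out.println(format(new URI("http://example.com/tasks/task1")));      // example.com
        System.out.println(format(new URI("https://example.com:443/")));            // example.com
    }
}
```

<p>The interesting corner case is that <code>URI.getPort()</code> returns -1 when the URL has no explicit port, so a naive "port != 80" check would wrongly append ":-1" for plain http://host/ URLs.</p>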