<?xml version="1.0" encoding="utf-8"?>
<!-- If you are running a bot please visit this policy page outlining rules you must respect. http://www.livejournal.com/bots/ -->
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:lj="http://www.livejournal.com">
  <id>urn:lj:livejournal.com:atom1:unperson_</id>
  <title>Not what it could have been</title>
  <subtitle>Jon's life and eventual death.</subtitle>
  <author>
    <name>Jon O</name>
  </author>
  <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/"/>
  <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom"/>
  <updated>2008-08-09T15:34:50Z</updated>
  <lj:journal username="unperson_" type="personal"/>
  <link rel="service.feed" type="application/x.atom+xml" href="http://users.livejournal.com/unperson_/data/atom" title="Not what it could have been"/>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:4434</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/4434.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=4434"/>
    <title>BT</title>
    <published>2008-08-09T15:32:48Z</published>
    <updated>2008-08-09T15:32:48Z</updated>
    <content type="html">It has been quite a while since I bothered updating this. Mostly my life is insanely dull, and I have little to nothing that I feel I need to tell the world about. Today I feel that I need to complain about BT.&lt;br /&gt;&lt;br /&gt;Last month I moved flats, and the new place needs a phone line. For starters, I was quite irritated to discover that because it's a new build I need to pay BT £130 to activate the line that the developers have already paid them to install. After accepting that there was no way I could avoid this, I placed the order for my line. In an attempt to make BT slightly less impossible to compete with, the company has been split into several different bits, and my fun and exciting point of contact as an end-user is BT Retail. BT OpenReach are responsible for actually installing and activating the line.&lt;br /&gt;&lt;br /&gt;BT OpenReach decided they needed to send someone round to identify the pairs for my line, and BT Retail scheduled the visit for two weeks' time. Two days later I had a card from OpenReach through the door with a local number to call on it. I was out of the country for a week, but on my return I called that number. Ten minutes later I had an OpenReach engineer on my doorstep! So far, well above my expectations.&lt;br /&gt;&lt;br /&gt;The engineer prodded about for a while, put a tone on the line to allow them to identify it, and then cleared off saying "and if it's not live by the day after tomorrow make sure you call customer services."&lt;br /&gt;&lt;br /&gt;Two days later the line's still dead, so I call BT Retail. The "to check the status of an order" IVR menu asks for my number, but claims that the order on my line is complete and hangs up on me. I open a fault and am told that an engineer will be round some time in the next week. 
Fortunately their web site now has access to the fault tracking system, so I can see that 15 minutes later my fault gets closed.&lt;br /&gt;&lt;br /&gt;A week later I have no engineer, so I phone back. I'm told that because the fault was closed no one was sent out. I'm told an engineer will be with me the next day.&lt;br /&gt;&lt;br /&gt;The following day the fault tracking status claims that I wasn't in when the engineer called. I call back and I'm told that no engineer was sent out. I'm told I'll have one the following day (yesterday).&lt;br /&gt;&lt;br /&gt;No one shows up; the on-line system claims I wasn't in when the engineer arrived. I use the on-line fault tracker to reschedule for today.&lt;br /&gt;&lt;br /&gt;No engineer shows up; the on-line system claims I wasn't in. I call the faults line and eventually speak to someone who says "oh, OpenReach have the order still open pending something from us, but we've closed the order." Great. OpenReach haven't been sending engineers because as far as they're concerned the order isn't finished. Retail reckon it is, and have already started billing me. I'm told that as the order isn't complete the BT Retail Faults desk can't help me; I have to speak to customer services.&lt;br /&gt;&lt;br /&gt;Unfortunately the attempt to put me through to customer services causes my call to immediately disconnect. A quick Google shows me that this is typical for calls to CS (08000223089). Phoning them myself got me put through to someone who refused to do anything useful, and insisted on putting me back through to faults, who told me there was nothing they could do and put me on hold for an eternity. 
Allegedly someone from Customer Services will call me "in the week", so I hold out no hope of ever getting a working phone line.&lt;br /&gt;&lt;br /&gt;Not happy.&lt;br /&gt;&lt;br /&gt;Meanwhile my Windows box seems to have developed a problem with the USB controller where devices get disconnected if they start doing more than 1-2Mbit/s throughput. This is not good when you have a Vodafone 3G card and live in London, where you can easily get a good 5Mbit/s.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:4134</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/4134.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=4134"/>
    <title>Perhaps.</title>
    <published>2007-04-06T00:18:36Z</published>
    <updated>2007-04-06T00:18:36Z</updated>
    <category term="hate"/>
    <content type="html">Perhaps.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:3897</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/3897.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=3897"/>
    <title>No.</title>
    <published>2006-12-30T13:30:29Z</published>
    <updated>2006-12-30T13:30:29Z</updated>
    <content type="html">No.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:3730</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/3730.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=3730"/>
    <title>Please fix your laptop.</title>
    <published>2006-05-03T19:19:41Z</published>
    <updated>2008-08-09T15:34:50Z</updated>
    <content type="html">The MD has a laptop on which he has managed to install the largest collection of system tray applications I have ever seen. He regularly complains that it's slow, but won't allow us near it to try and clear all the junk off. This led to an irritation a few weeks ago when he declared that he needed 2GB of RAM in it. Despite us insisting that he didn't need it, we were forced to order 2x 1GB SODIMMs at a stupidly high price. Unfortunately we only discovered after receiving said parts that his laptop only has one easily accessible SODIMM slot. The other is internal, and requires dismantling the laptop. He won't let us take it apart, so he now only has 1.5GB of RAM (it already has 512MB in the internal slot). This is, apparently, not good enough. He wants us to order a 2GB SODIMM, and won't accept that (a) his laptop will NOT take a SODIMM larger than 1GB and (b) SODIMMs that large aren't available.&lt;br /&gt;&lt;br /&gt;Due to the huge number of random bits of badly written software he has running constantly in his systray, his laptop gets slower over time and requires a daily reboot. I have two Windows desktops and a Linux desktop. The Windows desktops get rebooted monthly when the MS patch farm comes out, and the Linux box gets rebooted when (and if) there's a critical kernel bug that affects me, or some new feature I need in a newer kernel than I'm running. I don't have strange slowdown problems because I don't run hordes of daft apps constantly. Unfortunately, today he's decided that what's true for his laptop must be true of our servers. He wasted a considerable chunk of my time today arguing about this, and is insisting that we reboot everything weekly! I'm hoping he forgets about it. If he doesn't, then we could have quite a lot of pain. Suggestions for convincing him otherwise, without having to demonstrate it, would be much appreciated.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:3364</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/3364.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=3364"/>
    <title>Of fire suppression systems and datacentres</title>
    <published>2006-03-17T22:18:59Z</published>
    <updated>2006-03-17T22:21:48Z</updated>
    <content type="html">It has been a while since I last updated this beastie. I've had a few busy weeks, and a week of feeling down and not doing much. I've also been trying to sort out various bits of paperwork left over from Woaf Tech, and help the parents start to move house... all a bit of a pain. Last week I had what felt like a week from hell. Everything seemed to be breaking in irritating ways, and I was on silly hours (0730 start) as my co-sysadmin (who normally covers the morning) was away. The events of the Thursday, however, topped off what was already a bad week:&lt;br /&gt;&lt;br /&gt;Shortly after lunch one of our developers (A) received a panicked phone call from one of our data feed suppliers (C). It went something like this:&lt;br /&gt;(C) I'm afraid you're about to lose all service from us.&lt;br /&gt;(A) Oh, really? Why?&lt;br /&gt;(C) The datacentre's on fire.&lt;br /&gt;(A) Oh, OK then.&lt;br /&gt;&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;The message is promptly relayed to me, and it's time for me to have a good panic. The datacentre in question is Level3's Braham Street facility. We have kit there to land the feeds from (C), and it's also our offsite backup and DR site. Great. I've just spent the previous week implementing a new backup system using the quite acceptable Bacula (&lt;a href="http://www.bacula.org/"&gt;http://www.bacula.org/&lt;/a&gt;) and all my work seems about to go up in smoke. Losing feeds from this supplier would also be quite a bad thing. There are alternatives, and we set about implementing them just in case we should lose the primary.&lt;br /&gt;&lt;br /&gt;Fortunately none of our kit vanished, and we never lost the feeds from (C).&lt;br /&gt;&lt;br /&gt;As the afternoon progressed we discovered that the real problem was the fire suppression system. The sprinkler system discharged on the 2nd floor and the water flooded down to the 1st floor and ground floor. 
Unfortunately power to a small area of the datafloor on the 1st was lost as a result, which is why we had the panicked phone call from (C), who had some equipment affected and were sure they were going to lose the rest. I wandered to Braham Street that evening to take a look, to find the reception area was full of wheelie bins collecting water dripping from the ceiling, and bumped into a couple of very pissed-off-looking Level3 engineers who confirmed what'd happened. There hadn't been a fire, but something had tripped the fire suppression system and caused it to charge the dry risers with water. A pipe ruptured under the pressure on the 2nd floor, resulting in the flood.&lt;br /&gt;&lt;br /&gt;And now the bitching and moaning:&lt;br /&gt;We don't deal with Level3 directly; instead we buy from mNet. mNet are therefore responsible for keeping us informed about potential service-affecting problems. mNet completely failed to inform us about the events at Braham Street; instead we found out via (C). When I phoned mNet to find out how serious the damage was and if we were likely to be affected, I was told very little (although they confirmed that they were aware of a possible fire, and flooding of the 1st floor), but promised updates as soon as they knew what was going on. They never called me back or emailed me. I was left to find out from TheRegister (&lt;a href="http://www.theregister.co.uk/"&gt;http://www.theregister.co.uk/&lt;/a&gt;) the following day what the official Level3 statement contained. They still haven't contacted me, and I'm tempted to phone them to chase the ticket.&lt;br /&gt;&lt;br /&gt;This isn't the first time that mNet support have sucked in this way.&lt;br /&gt;&lt;br /&gt;AAAaand onto a related note:&lt;br /&gt;We're looking to move one of our daughter sites off our main network and into another datacentre, because we're fed up with dealing with Colt. I took my boss around Redbus HEx, Sov and Telehouse. 
I've spent too much time in all three in a past life, but my boss has never been to any of them. Telehouse managed an extremely poor showing, despite the initial impression based on the size of the facility and the level of security. They've still not fixed the aircon system, so it was excessively hot in TFM40. They also have normal water sprinklers for fire suppression, which was a major negative point after our fun at Braham Street.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I was going to tack something onto the bottom of this about my experiences with Bacula, but I think this has got a bit too long already. Maybe next time.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:3288</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/3288.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=3288"/>
    <title>eAccelerator duplicate keys</title>
    <published>2006-02-21T08:29:19Z</published>
    <updated>2006-02-21T08:29:19Z</updated>
    <content type="html">Yay. A patch to fix the dup keys problem has been committed to eAccel CVS. If you don't want to use the CVS version you can grab the patch from &lt;a href="http://growler.woaf.net/eaccel/cachedups.diff"&gt;http://growler.woaf.net/eaccel/cachedups.diff&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;LAME HACK ALERT:&lt;br /&gt;&lt;br /&gt;I also have a patch for eAccel to fix another minor irritation. eAccel stores cached scripts and keys in two different hash tables. The hash tables are made up of 256 buckets, each containing a linked list of the elements that hash to that bucket. We store somewhere in the region of 2000 scripts and 20,000 keys in eAccel. If you know anything about hash tables you'll immediately spot the obvious problem: even if they spread evenly, that's roughly 8 scripts and 78 keys per bucket, so lookups (especially for keys) have to walk long linked lists. &lt;a href="http://growler.woaf.net/eaccel/confighashsize.diff"&gt;http://growler.woaf.net/eaccel/confighashsize.diff&lt;/a&gt; will allow you to configure those hash sizes using the eaccelerator.hash_size and eaccelerator.user_hash_size config options.&lt;br /&gt;&lt;br /&gt;The bucket number (slot) is calculated by taking the output of mm_hash() and bitwise ANDing it with (user_)hash_max-1. This means that user_hash_max and hash_max MUST be a power of two, or you'll end up not using some of your buckets.&lt;br /&gt;&lt;br /&gt;Please note the second patch is a very lame hack, and cannot be considered a real fix. I might implement a real solution later today.</content>
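A minimal Python sketch of the slot calculation described in this post, assuming only what the post states (slot = hash AND (size - 1)). Consecutive integers stand in for real mm_hash() output, which is an assumption; the conclusion holds for any hash, because bits that are zero in size - 1 are always masked off. The function names are hypothetical, not part of eAccelerator.

```python
import operator


def slot(hash_value, table_size):
    """Bucket index: hash value bitwise-ANDed with (table_size - 1).

    operator.and_ is the functional form of Python's bitwise AND operator.
    """
    return operator.and_(hash_value, table_size - 1)


def buckets_used(table_size, n_hashes=100_000):
    """Count the distinct buckets that are ever reached.

    Hypothetical stand-in: the integers 0..n_hashes-1 play the role of
    hash values, so every combination of low bits is exercised.
    """
    return len({slot(h, table_size) for h in range(n_hashes)})


def average_chain_length(n_items, table_size):
    """Mean linked-list length per bucket if items spread evenly."""
    return n_items / table_size


if __name__ == "__main__":
    for size in (256, 300, 4096):
        print(size, "buckets configured,", buckets_used(size), "actually used")
    print("20,000 keys in 256 buckets:", average_chain_length(20_000, 256), "per chain")
```

With 300 buckets the mask is 299, binary 100101011, which has five bits set, so only 2^5 = 32 of the 300 buckets can ever be reached; 256 and 4096 use every bucket.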
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:3067</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/3067.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=3067"/>
    <title>mmcache and eAccelerator with duplicate keys</title>
    <published>2006-02-18T16:49:32Z</published>
    <updated>2006-02-18T17:00:34Z</updated>
    <content type="html">We've been having interesting performance issues over the last couple of weeks. The average page generation time for our site increased proportionally with the cumulative number of hits (i.e. over time the page generation time went up, and it went up more rapidly during peak times). After about 4 hours during the day, or about 8 hours overnight, the average page time was up to a second and required a restart of Apache on our server farm in order to correct it.&lt;br /&gt;&lt;br /&gt;This baffled me for a week or so. I tried various things, including switching from mmcache (which is no longer maintained) to eAccelerator (&lt;a href="http://eaccelerator.net/"&gt;http://eaccelerator.net/&lt;/a&gt;). eAccelerator is a fork of mmcache, with the aim of continuing development to support PHP5 and fix various irritating bugs. As we've had Apache crashes linked to mmcache on a regular basis, I'm hopeful the switch will make life a bit easier for me. I'll probably post something on here about my experiences with it. &lt;br /&gt;&lt;br /&gt;Unfortunately that didn't fix my problem.&lt;br /&gt;&lt;br /&gt;After poking at the eAccel diagnostics page for a while I finally realised what was happening. The number of keys we were caching (objects that we'd explicitly stored in cache) was increasing over time, and eventually consuming enough memory to start pushing compiled PHP scripts out of cache. As more pre-compiled scripts are forced out of cache, page load times are bound to increase, as those PHP scripts have to be recompiled for each execution. Eventually enough critical scripts were being pushed out of memory to cause massively high page load times. Restarting Apache clears the key cache and the process starts over. (Note: we don't use disk cache, only shm cache, due to problems with the speed of the ext3 file system (see previous post on the subject).)&lt;br /&gt;&lt;br /&gt;Yay. But what was doing it? 
A quick grep of our source showed a limited number of places where keys were stored in cache, and almost all of those had a sensible time-to-live (TTL) set. My immediate thought was a horrible bug in mmcache/eAccel. Panic! Poking at their mailing lists, I found that recompiling eAccel with the --enable-disassembler configure option makes the eaccelerator() function (which produces a diagnostics page) show the list of cached keys and their contents. Excellent.&lt;br /&gt;&lt;br /&gt;It was immediately apparent that we were rapidly inserting many copies of one key into the cache. Due to the cache structure it's possible to have multiple entries with the same key. Unfortunately, the eaccelerator_put() function doesn't check for key uniqueness. While the entries in question were being inserted with a TTL (300 seconds), the sheer number of puts being done for that key, the comparatively large value it stored, and the long TTL led to the extremely high cache usage.&lt;br /&gt;&lt;br /&gt;The root of the problem was a daft typo:&lt;br /&gt;$foo = eaccelerator_get("key");&lt;br /&gt;if (!$foo)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;$foo = generateFoo();&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;eaccelerator_put("key", $foo);&lt;br /&gt;&lt;br /&gt;Note that this is PHP, not Python. Indenting is no substitute for actually putting the correct braces in the code! Without braces only the assignment is guarded by the if, so eaccelerator_put() runs on every request, cache hit or not.&lt;br /&gt;&lt;br /&gt;Fortunately we don't call eAccelerator functions directly; all caching goes through our own library, which does a bit of error checking and allows easy switching between caching systems. Adding an eaccelerator_rm($key) to the top of the put function has ensured that mistakes like this won't cause us too much pain in the future.</content>
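A toy Python model of this failure, assuming (as the post says) that put() never checks for an existing entry with the same key. AppendOnlyCache, raw_put, safe_put and handle_request are hypothetical names, not eAccelerator's API; safe_put mirrors the remove-then-put wrapper described in the post. Expired entries are never purged in this model, which keeps it simple.

```python
import time


class AppendOnlyCache:
    """Toy cache whose put() appends without checking key uniqueness."""

    def __init__(self, clock=time.monotonic):
        self.entries = []  # (key, value, expires_at) tuples, oldest first
        self.clock = clock

    def put(self, key, value, ttl):
        self.entries.append((key, value, self.clock() + ttl))

    def get(self, key):
        now = self.clock()
        # Newest unexpired match wins; expired entries are simply skipped.
        for k, v, expires_at in reversed(self.entries):
            if k == key and expires_at > now:
                return v
        return None

    def rm(self, key):
        self.entries = [e for e in self.entries if e[0] != key]

    def count(self, key):
        return sum(1 for e in self.entries if e[0] == key)


def raw_put(cache, key, value, ttl):
    cache.put(key, value, ttl)


def safe_put(cache, key, value, ttl):
    # Remove any existing copies first, so at most one entry per key survives.
    cache.rm(key)
    cache.put(key, value, ttl)


def handle_request(cache, put):
    # Same shape as the PHP typo: only the assignment sits under the if,
    # so put() runs on every request, cache hit or not.
    foo = cache.get("key")
    if not foo:
        foo = "generated value"
    put(cache, "key", foo, 300)
    return foo
```

With a frozen clock, 1,000 calls to handle_request leave 1,000 copies of "key" when using raw_put, but only one when using safe_put.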
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:2686</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/2686.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=2686"/>
    <title>Windows</title>
    <published>2006-02-16T22:11:53Z</published>
    <updated>2006-02-16T22:11:53Z</updated>
    <content type="html">No, it's not your typical Windows rant :p&lt;br /&gt;&lt;br /&gt;Some months ago I decided it would be a good idea to get a Windows 2003 Server install for the office and have an AD domain. This was intended to make life a lot simpler for us than the current mess of independent systems on which the users have superuser accounts. I am no fan of Windows, and believe that the best solution would be to migrate the office desktops to Linux. Unfortunately the users think otherwise, and if I want to run an AD domain I have to have a Windows server. Fine, great, I'll do it. Last week we finally got around to ordering Win2k3 and enough CALs for the entire office. On Friday they arrived. But... erm, they don't work.&lt;br /&gt;&lt;br /&gt;It turns out that the Volume Licence Key (VLK) that we've been issued is for Windows 2003 R2 (for R2 read "with service packX.") The media kit (£20 install CD which requires a VLK) is Win2k3 R1. The R2 VLK does not work with the R1 media kit. So, obviously the supplier has made a silly mistake, or I selected the wrong media kit when ordering. I called them earlier this week to enquire about the problem, and later in the day received an interesting reply. It turns out that the media kit for R2 will not be available until April.&lt;br /&gt;&lt;br /&gt;OK, this seems a little daft. Why release the R2 VLKs several months before the media kit? Never mind, I'll just get an R1 VLK and use that with my R1 media kit. Oh, erm, no I won't, MS have stopped issuing R1 VLKs! Until April it's now impossible to buy a working copy of Win2k3 Server under the OPEN licence scheme. Great. Time to warez a copy.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:2376</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/2376.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=2376"/>
    <title>White Van Man</title>
    <published>2006-01-10T21:00:55Z</published>
    <updated>2006-01-10T21:00:55Z</updated>
    <content type="html">Today I had the wonderful task of shifting a large quantity of servers between the datacentre and the office. This required a van! Sadly no rental place seems to want to hire out a 7.5-tonner to a 23-year-old, but I did find somewhere that'd hire me a bog-standard Transit-like van thing. Complete overkill, of course, but it was cheap. So, today I became White Van Man!&lt;br /&gt;&lt;br /&gt;* People like trying to overtake vans, even if the van in question is accelerating hard (empty) and is already exceeding the speed limit.&lt;br /&gt;* Before violently overtaking, the Merc or BMW driver will ensure that they've spent some time tailgating. This ensures that the van driver can't see them, and may well have forgotten that they exist.&lt;br /&gt;&lt;br /&gt;The quantity of damage to the hire van's mirrors ensured that I had little to no idea of what was behind or to the side of me. This made changing lanes a little exciting, and reversing (especially on Throgmorton Street, where there are cast-iron posts along the side of the narrow, single-lane road) hilarious. It also made working out what was alongside me more difficult, a major problem considering BMW and Merc driver behaviour.&lt;br /&gt;&lt;br /&gt;Not knowing the roads too well led to many late lane changes. These are more challenging than they sound if the van is full, as you really want to avoid stopping at all costs (stopping being difficult, and starting again being slow)!&lt;br /&gt;&lt;br /&gt;Bleah. This did have a point, but I've forgotten what it was.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:2108</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/2108.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=2108"/>
    <title>Bored</title>
    <published>2005-12-26T17:31:16Z</published>
    <updated>2005-12-26T17:31:16Z</updated>
    <content type="html">BOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOORED!&lt;br /&gt;&lt;br /&gt;That concludes this entry.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:1796</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/1796.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=1796"/>
    <title>Dell PERC</title>
    <published>2005-12-22T19:28:32Z</published>
    <updated>2006-02-13T12:38:05Z</updated>
    <content type="html">The PowerEdge RAID Controller (PERC) is the name for a variety of RAID controllers used by Dell in their servers. All of the ones I have experience with are LSI MegaRAID based. The current generation of cards, supporting SCSI U320 drives, is the PERC4. The previous generation, U160, was the PERC3. The PERC3 was not a particularly bad card. While it only supports U160 drives, and is therefore horribly outdated, its performance wasn't too awful considering how little it cost. It also had a reasonably complete feature set, supporting RAID0, 1, 5 and 10. You'd imagine that the PERC4 would be no worse, and would also support the same features. Unfortunately that's not the case, despite what the sales froth claims.&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;We have a couple of servers with rather high disk performance requirements, which also happen to require a lot of disk space. In the past we've used RAID0 for these servers and lived with the pain and suffering caused by a drive failure. Recently, however, the time required to rebuild one of these beasts has got a trifle high, and as a result we've looked to reduce the likelihood of a single drive failure taking the whole array down. RAID10 was the obvious choice.&lt;br /&gt;&lt;br /&gt;As an entirely Dell shop, we decided to kit out a couple of 2850s (2U, 6 bays) with six 300GB U320 drives hanging off a PERC4. Stuck into RAID10, we ought to get reasonable performance, redundancy, and about 850GB of space to play with. Just what we want! Unfortunately this was not to be the case. Performance sucked for no apparent reason, despite our best efforts at tweaking it. The array was showing about the same performance as I'd expect from a single drive! Our immediate conclusion was that we'd done something wrong, but we couldn't work out what. Much Googling later, wading through many articles about how awful the PERC4 performance is, we find the explanation.&lt;br /&gt;&lt;br /&gt;The PERC4 does NOT do RAID10! 
What the PERC4 describes as RAID10 is in fact JBOD+1. Yes, that's right, not even RAID01, but a new (and completely useless) setup involving JBOD. Do NOT buy a PERC4 if you expect reasonable performance from RAID! This fact is hidden in the Dell documentation at &lt;a href="http://docs.us.dell.com/support/edocs/software/smarrman/marb32/ch3_stor.htm#1037043"&gt;http://docs.us.dell.com/support/edocs/software/smarrman/marb32/ch3_stor.htm#1037043&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Why does JBOD+1 suck? JBOD is "Just a Bunch Of Disks": a collection of drives combined to produce a large quantity of storage as simply as possible. Data starts filling up the first drive and once that's full it moves onto the second, and so on. Adding RAID1 to this just means you have a pair of JBODs which contain identical data. Performance is good for random reads distributed over the entire array (so when your array is nearly full, and you want data from all over the place), but is only as good as a single drive for writes.&lt;br /&gt;&lt;br /&gt;Our usage of these large arrays is for historical data. All new data is inserted onto the end of the datafiles, and the majority of reads are random but are almost entirely limited to the most recent few weeks of data. This means that all of our writes and almost all of our reads happen on the same drive pair in the array, giving us RAID1 performance at absolute best. That's not fast enough, it sucks, grrrr.&lt;br /&gt;&lt;br /&gt;So, in conclusion, Dell suck and like to redefine well-documented RAID levels to increase their profit margins.&lt;br /&gt;&lt;br /&gt;I'm considering a move to IBM.</content>
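To make the difference concrete, here is a small Python sketch that maps logical blocks onto the three mirrored pairs of a six-drive array, once as JBOD+1 (pairs concatenated, as described in this post) and once as textbook RAID10 (blocks striped across the pairs). The toy sizes (1,000 blocks per pair, a stripe of one block) and the function names are illustrative assumptions, not the PERC4's real layout parameters.

```python
def jbod_plus_1_pair(block, blocks_per_pair):
    """Concatenated mirrored pairs: fill pair 0, then pair 1, and so on."""
    return block // blocks_per_pair


def raid10_pair(block, stripe_blocks, n_pairs):
    """Striped mirrored pairs: consecutive stripes rotate across all pairs."""
    return (block // stripe_blocks) % n_pairs


def pairs_touched(blocks, mapper):
    """Sorted list of the mirrored pairs that serve the given logical blocks."""
    return sorted({mapper(b) for b in blocks})


if __name__ == "__main__":
    N_PAIRS = 3             # six drives as three mirrored pairs
    BLOCKS_PER_PAIR = 1000  # toy scale, not real drive geometry
    # Hot data: the 50 most recently appended blocks, with the array half full.
    recent = range(1450, 1500)
    print("JBOD+1:", pairs_touched(recent, lambda b: jbod_plus_1_pair(b, BLOCKS_PER_PAIR)))
    print("RAID10:", pairs_touched(recent, lambda b: raid10_pair(b, 1, N_PAIRS)))
```

Under JBOD+1 every recent block lands on pair 1, so a single mirrored pair does all the work for the hot data; under RAID10 the same blocks are spread over all three pairs. Reads scattered across the whole array, by contrast, do reach every pair in both layouts, which matches the post's point about random reads on a nearly full array.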
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:1778</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/1778.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=1778"/>
    <title>Driving in London</title>
    <published>2005-12-06T21:39:12Z</published>
    <updated>2005-12-07T17:45:59Z</updated>
    <content type="html">Yesterday I was happily sitting in a taxi journeying from our datacentre (Wapping) to our office (Bank) during the morning rush hour. While we sat in a traffic jam on The Highway I made the interesting observation that the only vehicles participating in our traffic jam were:&lt;br /&gt;a) Vans&lt;br /&gt;b) Taxis&lt;br /&gt;c) Mercs&lt;br /&gt;d) BMWs&lt;br /&gt;Categories c and d were almost entirely populated only by their driver, who was invariably on their mobile phone, and appeared to be determined to drive through as many stationary objects and taxis as possible.&lt;br /&gt;&lt;br /&gt;Why do these idiots bother? What is the point of trying to drive into the City for work every morning? Is it to show off the expensive executive car? Is it to justify having purchased the expensive executive car to yourself/your spouse? Why, dammit, whyyyyyyyyy? It's almost quicker to walk! Public transport (almost without fail) will be faster, and even if you don't live near a station there are plenty with large carparks that you can drive to.&lt;br /&gt;&lt;br /&gt;Grrr.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:1379</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/1379.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=1379"/>
    <title>Public transport</title>
    <published>2005-12-01T19:31:13Z</published>
    <updated>2005-12-01T22:20:26Z</updated>
    <content type="html">Normally I'm extremely happy with public transport in London. Unfortunately today the world seemed to be set against me.&lt;br /&gt;&lt;br /&gt;(I'm Growler)&lt;br /&gt;&lt;br /&gt;18:22 &amp;lt;@Growler&amp;gt; on way home.&lt;br /&gt;&lt;br /&gt;one hour passes.&lt;br /&gt;&lt;br /&gt;19:22 &amp;lt;@Growler&amp;gt; fucking hell&lt;br /&gt;19:22 &amp;lt;@Growler&amp;gt; that was a mission!&lt;br /&gt;19:22 &amp;lt; Spark_&amp;gt; what happen&lt;br /&gt;19:22 &amp;lt;@Growler&amp;gt; Northern line bank branch closed due to signal failure&lt;br /&gt;19:23 &amp;lt;@Growler&amp;gt; Central line severely delayed due to signal failure&lt;br /&gt;19:23 &amp;lt;@Growler&amp;gt; and therefore packed to the brim&lt;br /&gt;19:23 &amp;lt;@Growler&amp;gt; I board and travel to Liverpool street before deciding the sardine thing just isn't for me&lt;br /&gt;19:23 &amp;lt;@Growler&amp;gt; Liverpool street is crammed full of people&lt;br /&gt;19:23 &amp;lt; Spark_&amp;gt; heh&lt;br /&gt;19:23 &amp;lt;@Growler&amp;gt; fight my way up to the surface to find that all trains are canceled due to a fire&lt;br /&gt;19:24 &amp;lt;@Growler&amp;gt; they then evacuate the station before I can get back on the underground&lt;br /&gt;19:24 &amp;lt;@Growler&amp;gt; walk back to Bank.&lt;br /&gt;19:24 &amp;lt; Spark_&amp;gt; haha&lt;br /&gt;19:24 &amp;lt;@Growler&amp;gt; find the number 25 bus is heavily delayed and packed becuase half of their bendy busses are broken down at the roadside at various stops along the 25 route&lt;br /&gt;19:25 &amp;lt;@Growler&amp;gt; decide to take the DLR&lt;br /&gt;19:25 &amp;lt; Spark_&amp;gt; lol&lt;br /&gt;19:25 &amp;lt; Spark_&amp;gt; thats the last one&lt;br /&gt;19:25 &amp;lt;@Drakon&amp;gt; heh&lt;br /&gt;19:25 &amp;lt;@Growler&amp;gt; DLR is closed up to Stratford due to the fire at Pudding Mill Lane&lt;br /&gt;19:25 &amp;lt;@Drakon&amp;gt; I hate days like that&lt;br /&gt;19:25 &amp;lt;@Growler&amp;gt; ahha, Jubilie line from 
Waterloo!&lt;br /&gt;19:25 &amp;lt;@Growler&amp;gt; nope, some fucker promptly lobs himself in front of a Waterloo and City service&lt;br /&gt;19:25 &amp;lt;@Growler&amp;gt; W&amp;C closes.&lt;br /&gt;19:25 &amp;lt; Spark_&amp;gt; lol&lt;br /&gt;19:26 &amp;lt;@Drakon&amp;gt; He was probably depressed about the transport that day&lt;br /&gt;19:26 &amp;lt;@Growler&amp;gt; Bank continues to fill with people who are arriving to find all but the Central line closed&lt;br /&gt;19:26 &amp;lt; Spark_&amp;gt; no northern line&lt;br /&gt;19:26 &amp;lt;@Growler&amp;gt; board the Central line eventually&lt;br /&gt;19:26 &amp;lt;@Growler&amp;gt; they're announcing that LP St is closed, and advising people to take the Central to Stratford for mainline services&lt;br /&gt;19:26 &amp;lt; Spark_&amp;gt; you should have taken the district line to westham and gotten the jubillee&lt;br /&gt;19:27 &amp;lt;@Growler&amp;gt; LP St opens as we get there, but mainline still closed so loads of people board to get to Stratford&lt;br /&gt;19:27 &amp;lt;@Growler&amp;gt; suffer pain till Stratford&lt;br /&gt;19:27 &amp;lt;@Growler&amp;gt; arrive Stratford to find that it's full of people because everyone is trying to get a mainline service from there&lt;br /&gt;19:27 &amp;lt;@Growler&amp;gt; unfortunately the mainline services at Stratford are also suspended.&lt;br /&gt;19:27 &amp;lt;@Growler&amp;gt; fight my way out of the station, through the market, past a broken-down number 25, walk home :P</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:1209</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/1209.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=1209"/>
    <title>ext3, reiserfs, mmcache</title>
    <published>2005-11-30T23:11:05Z</published>
    <updated>2005-11-30T23:11:17Z</updated>
    <content type="html">This week's unfun has been caused by the extremely poor performance of ext3.&lt;br /&gt;&lt;br /&gt;Some weeks ago we started experiencing a few performance problems with all but one of our web servers. The one magic web server (we'll call it 37) had a few minor differences from the rest of the web farm. It'd been re-tasked from another of our web farms to try and improve the performance of the main farm while a load of servers were offline for various reasons (mainly power (thanks Colt (they've still not delivered))), and we'd just whacked the various files and configs we needed on it. The obvious differences were that it was 200MHz per processor faster than the others (dual 3GHz vs dual 2.8GHz Xeons (same family and cache)), and used reiserfs for its partitions instead of ext3. This server was managing to process twice the number of hits that any of the others could manage, and was still barely noticing the load! Most of the web servers were seeing 50% sys usage, but 37 was sitting happily under 5%. Not good.&lt;br /&gt;&lt;br /&gt;The obvious conclusions were that either we'd missed some config option, 200MHz really did make a lot of difference, or ext3 was shit.&lt;br /&gt;&lt;br /&gt;Important things to know at this point are that most of our content is dynamic (PHP), and that to make PHP perform anything like sensibly we use turck-mmcache. As a result the PHP scripts are all byte-code compiled and cached in memory, resulting in almost no disk IO. Logging is done remotely, so that doesn't hit the disk either. iostat quite clearly showed bugger all disk activity, which seemed to rule out the file system. Diffing various config files showed a few minor differences, but fiddling with those had no real effect.&lt;br /&gt;&lt;br /&gt;So, erm, system time. It's time to play with strace!! 
For those that don't know, "strace -c" sits about until either the process terminates or you kill strace, and then prints out a table detailing how many times each syscall was made and the average time spent in each syscall. This is absolutely invaluable when trying to sort out problems like this. The other nice feature of strace is that you can attach it to any process that's currently running using "strace -p &lt;pid&gt;". We attached strace to a few apache processes on various servers and got back some interesting results. All servers were doing a lot of unlinking (deleting files), and the ext3 servers were spending a good 0.1 seconds in each unlink call, compared to a fraction of that on the reiserfs box. Fortunately plain strace (without -c) also shows the name of each file being unlinked, making it very obvious what was triggering the unlinks: mmcache.&lt;br /&gt;&lt;br /&gt;By default mmcache caches the compiled bytecode, and anything else you ask it to store, in both shared memory AND on disk. When it starts to run low on shared memory it can flush older data out of memory, safe in the knowledge that there's still a copy on disk. Items stored in mmcache can also have expiry times, and we make very, very heavy use of mmcache to cache stuff for short periods of time. When the mmcache garbage collector runs it'll purge expired items from memory and from disk. Because of the huge number of items we were caching, and expiring, the number of unlinks for the old cache files was quite high. Oops. Even worse, the number of files in the cache directory was rather extreme. Ext3 sucks at unlinks at the best of times, and generally sucks when it comes to directories with large numbers of small files in them. Reiserfs, on the other hand, is designed to deal with directories containing large numbers of small files, so performs excellently.&lt;br /&gt;&lt;br /&gt;Ah well, all fun. 
The fix was to force mmcache to use only shared memory (our mmcache shm caches are large enough that we never need to flush things, so the disk cache was pointless.) Unfortunately the options you actually need to do this aren't all documented!! Two of them are in the docs and example configs, but the other two require you to dig about in the source :)&lt;br /&gt;&lt;br /&gt;Here are the beasties:&lt;br /&gt;mmcache.shm_only="1" ; documented&lt;br /&gt;mmcache.content="shm_only" ; documented&lt;br /&gt;mmcache.keys="shm_only" ; undocumented, stuff stored using mmcache_put&lt;br /&gt;mmcache.sessions="shm_only" ; undocumented, stuff stored for mmcache sessions (we don't use them, but just in case)</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:785</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/785.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=785"/>
    <title>Colt</title>
    <published>2005-11-25T09:11:39Z</published>
    <updated>2005-11-25T09:11:39Z</updated>
    <content type="html">Colt Telecom are unbelievably awful. In the last year they have failed to deliver anything we've ordered on time. It shouldn't take 7 months to get another rack in one of their datacentres! Not only do they fail to deliver on time, there's absolutely no feedback regarding the status of the order. You can phone daily and get the "oh yes, it'll be done by the promise date" up to the promise date itself, then your account manager will start avoiding talking to you as the order takes another month or so (with no useful status updates, or dates) to complete.&lt;br /&gt;&lt;br /&gt;After a series of problems, including a LES100 install taking 3 months, we had a meeting with them. At this meeting they promised that delays and screwups would be a thing of the past. They've promptly stuffed up the next two orders.&lt;br /&gt;&lt;br /&gt;The latest fun has been a power upgrade for 2 of our racks to dual 16A resilient feeds, which was supposed to be completed last week. I received an email telling me that the order had been completed (on time for once!) and asking me to visit the datacentre and inspect the work. I joyfully showed up thinking that perhaps Colt had got it right at last, only to find that the work had not been carried out at all! According to their engineer the order was passed to him, a work order was raised and then, just before the work was scheduled to be carried out, someone cancelled the order. I'd be considerably less pissed off if they'd noticed their mistake, reactivated the order and told us to wait another few weeks, but instead our account manager tells us that the work is done! Is he trying to upset us!? What sort of ordering system do they have where it's possible to confuse "cancelled" with "done"? Why has our account manager insisted that the work was on target every time I called? Does he actually get any feedback at all about orders once he's managed to get us to sign? 
If not then (a) why not and (b) why does he claim that it's all going swimmingly when he doesn't really know? Of course I haven't been supplied with a new promise date yet; it'll be done when it's done, apparently.&lt;br /&gt;&lt;br /&gt;Failing to meet deadlines and failing to provide even rough estimates of when work will actually be done makes project planning a nightmare. Until some time next year I have an exceptionally busy schedule, and Colt causing reshuffling of projects is a serious headache. I also have other people depending on my projects, and they're getting very pissed off with repeated delays caused by Colt, or caused by their projects getting reshuffled to fit around the Colt stuff I have to do.&lt;br /&gt;&lt;br /&gt;Unfortunately we're stuck in a Colt datacentre. Everything we buy has to be Colt (argh) or BT (very expensive) because no other carrier has any kit there. Moving is going to be exceptionally difficult, and very expensive (there's a load of LES, STM-1s, E1s and things that'd have to be moved.) Argh.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:733</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/733.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=733"/>
    <title>VPNs</title>
    <published>2005-11-22T08:57:14Z</published>
    <updated>2005-11-22T08:57:14Z</updated>
    <content type="html">Once again I find myself updating this for no apparent reason. I've woken up far too early with a serious hangover and can't remember large chunks of last night. Possibly mixing wine and beer is a bad thing. I'm wondering if perhaps I might be drinking far too much.&lt;br /&gt;&lt;br /&gt;In other news the Cisco PIX has possibly the worst configuration interface I have ever had the misfortune to poke at. If I didn't need an office VPN that the users can actually use then I'd lob the damn thing in a skip and run away. Unfortunately there appear to be no good, cheap VPN solutions (although my research was rather limited.) I originally decided that I should just use Linux, but pain and suffering were the result:&lt;br /&gt;&lt;br /&gt;The goals:&lt;br /&gt;a) Have a VPN from the office to the datacentre for certain development and admin traffic.&lt;br /&gt;b) Have a VPN system that the developers can log into from home to carry out development work.&lt;br /&gt;&lt;br /&gt;My initial solution was to add IPSec support to the kernel on the office firewall. This and a Linux box at the datacentre worked perfectly to provide goal a. Unfortunately goal b is far harder than it sounds. The built-in Windows and Mac OS X VPN clients use either PPTP (well known to be insecure) or L2TP over IPSec. There's quite a bit of documentation out there regarding L2TPoIPSec on Linux, so it was reasonably easy to set up. Unfortunately the Linux 2.4 IPSec stack seems to suffer from several major irritations. Firstly, it doesn't work with the Mac OS X VPN client due to a screwup by Apple, who implemented one of the drafts of the IPSec standard instead of the final standard, rendering their product completely incompatible with just about everything else out there. Secondly, setting up L2TPoIPSec under Windows is a pain and relies on the user being able to accurately follow simple instructions (just not gonna happen.) Thirdly, it can't handle more than one system behind the same NAT. 
This third problem also leads to another major problem: when a user is disconnected from the VPN (it just drops occasionally; I suspect a Linux or Windows bug) they can't reconnect until the original SA has timed out on the Linux box (about 20 mins), because the reconnection looks just like a second client from the same IP. I believe there's some limitation in the data structures Linux uses for storing SAs which can't cope.&lt;br /&gt;&lt;br /&gt;Not willing to give up on Linux at that point, I switched to a 2.6 kernel and applied the Stinghorn patch set, which is aimed at getting a sane L2TPoIPSec VPN working. Unfortunately applying those patches breaks goal a (yay), so I had to deploy it on a second server. This worked considerably better. Unfortunately the problems with getting a user to configure their VPN client (especially the Linux users) were still there. Another issue is that under 2.6 you don't have the ipsecX interfaces, so you can't hide l2tpd on an internal interface and set up iptables rules to forward traffic to it that's been authed by IPSec. As l2tpd is unmaintained and known to have security issues, I wasn't best pleased with having to expose it externally.&lt;br /&gt;&lt;br /&gt;Having deployed both of these servers I suddenly realised that I'd just wasted £3k on hardware when I could have bought an off-the-shelf solution for under £500. I rapidly purchased the crime against humanity that is the Cisco PIX 506E. Not only is it cheap and able to do L2TPoIPSec VPNs, but you can also use it with the Cisco VPN client, which even an idiot can correctly configure under Linux, Mac OS X or Windows. Yay.&lt;br /&gt;&lt;br /&gt;From past experience I knew that the PIX is a massive nightmare to configure, but I also knew that v7 was a hell of a lot better on the CLI side of things than 6. Imagine my dismay when I discovered that they haven't released 7 for the 506E yet. Oh yay. 
The best thing is that to discover this irritation you have to have a CCO login to the Cisco site, which of course you can't get until your shiny new PIX arrives (actually I already have one, and ought to have done my research better, shhh.)&lt;br /&gt;&lt;br /&gt;Deciding not to risk my mind by writing a PIX config by hand, I made the fatal mistake of using the evil Java-applet-based PDM GUI. This GUI has a couple of wizards that are supposed to take you through basic configuration of the device and setting up a VPN. The startup wizard works fine, but the VPN wizard has a habit of generating configs that just don't work without manual poking. This is completely unacceptable! How could they not notice that one of the wizards just doesn't work properly? Fortunately the wizard does generate most of the config, and it requires only minor manual poking to make the configs work (once you've spotted the mistakes), so I wasn't forced to write the entire config by hand.&lt;br /&gt;&lt;br /&gt;I still can't get over quite how bad the CLI on the PIX is. What idiot designed it? I know that all the Cisco CLI stuff is awful, but the PIX just takes the piss. Cisco are moving towards more sensible CLIs for their carrier grade kit, mainly by pinching ideas from JunOS, but it'll be a while before normal IOS sees many of those enhancements, and an eternity until the PIX sees them. 
Allegedly the PIX v7 software is a massive improvement over 6, but that's not difficult.&lt;br /&gt;&lt;br /&gt;I've now lost track of where my rant was going, so I'll stop with some conclusions:&lt;br /&gt;The stock Linux 2.4 and 2.6 IPSec implementations are useless for running L2TPoIPSec VPNs with Windows and Mac OS clients;&lt;br /&gt;The Stinghorn patches improve 2.6, but there are security concerns re l2tpd, and a few other assorted bugs;&lt;br /&gt;The Windows and Mac OS X L2TPoIPSec clients suck arse;&lt;br /&gt;The cost of two of our standard servers is considerably more than the cost of something off-the-shelf;&lt;br /&gt;The Cisco PIX has the worst CLI known to man, and a broken GUI;&lt;br /&gt;My spelling is awful and I just don't care.&lt;br /&gt;&lt;br /&gt;The Juniper NetScreen appears to have a sane price now, but is it any better than the PIX? Dunno, and I don't own one so I won't find out for a while.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:unperson_:257</id>
    <link rel="alternate" type="text/html" href="http://users.livejournal.com/unperson_/257.html"/>
    <link rel="self" type="text/xml" href="http://users.livejournal.com/unperson_/data/atom/?itemid=257"/>
    <title>Why?</title>
    <published>2005-10-27T23:14:04Z</published>
    <updated>2005-10-27T23:14:04Z</updated>
    <content type="html">For some reason I've signed up for an LJ account despite spending over a year repeatedly declaring it a waste of time. My decision has been at least partially influenced by the failure of my Movable Type install, which died during the upgrade from Debian Woody to Sarge. I've survived without a blog-type-thing for a while, but suddenly (possibly influenced by alcohol) have decided that it might be a good idea to try again.</content>
  </entry>
</feed>
