Thursday, May 24, 2007

Single points of failure

At what point is a train over-crowded? I think it's when people can't get on it; if you're merely packed in like sardines, it's just crowded.
The new summer timetable on First Capital Connect, introduced on Monday, has obviously brought chaos to the system... each day this week that I've tried to catch a train it's either been late, cancelled or had only half the carriages.

Which makes me wonder: when you're designing an infrastructure for running a website, you try to eliminate single points of failure.
These are places in your network, your hardware or your applications where if there's a fault, you lose the service. This concept must exist in trains too, but how is it applied?

Most of these trains are actually made up of two units, that is, two shorter engines hooked together, each with its own set of carriages. This should give you fail-over: if one engine breaks down you can use the other to pull the train. Except it doesn't seem to work like that. You lose one engine, you get only half the carriages. So the solution doesn't restore the service, it just allows you to run a diminished service instead.
  • How often do trains break down? Hard to say, but this is obviously an important consideration when trying to work out what's a viable solution. If a unit breaks down once a year and delays, say, 40% of passengers, then there's no real issue. But if it happens every week in summer to stock over 10 years old, then you clearly have a problem.

  • How do you solve it? The level of investment is going to depend on how big the problem is, but having engines that are capable of pulling eight carriages rather than just four would be a start. Add to that the infrastructure required to decouple the units so that the working engine can be put at the front. Does this mean that you need an engine that has twice the power and therefore costs (probably more than) twice as much? I doubt it. Hybrid cars only use power from their petrol engine when they need it, so it must be possible to design an uber-efficient engine that runs on 50% of its resources when doing a normal job, but can kick in to do extra work as required.
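
To put rough numbers on the difference between a diminished service and real fail-over, here is a minimal Python sketch. The function name and the capacities (two engines hauling four carriages each today, versus two that could each haul all eight) are my own illustrative assumptions taken from the example above, not real rolling-stock figures.

```python
def carriages_served(total_carriages, engine_capacities, failed_engines=0):
    """Carriages that can still run after some engines have failed.

    engine_capacities is a list of how many carriages each engine can pull
    (hypothetical figures). The first `failed_engines` entries are treated
    as broken down and contribute nothing.
    """
    surviving = engine_capacities[failed_engines:]
    # The train can never be longer than the carriages it actually has.
    return min(total_carriages, sum(surviving))


# Today's set-up: each engine can only pull its own four carriages.
print(carriages_served(8, [4, 4], failed_engines=1))  # 4: half a train

# Proposed set-up: either engine could pull all eight on its own.
print(carriages_served(8, [8, 8], failed_engines=1))  # 8: full service
```

In website terms, the first case is losing half your capacity when one of two servers dies; the second is the spare-capacity arrangement where either server could carry the whole load.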

Of course, I may have got completely the wrong end of the stick and the single points of failure really lie elsewhere. Most obviously on the track; if that gets impeded (by a fire, for example) then you've had it, even if you can share parallel rails. This is one of those areas of risk with high impact and low likelihood. The system with high impact and higher likelihood of failure is -- for all you Northern Liners out there -- the signal system. I can't believe it's not possible to have a fail-over system in this environment that kicks in automatically when there's a failure in the main one. This is what computer networks do the world over, and it has to be worth the investment. It means you can take one system out completely for maintenance or improvement and still run the service, albeit at increased risk. No brainer.
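
For anyone who hasn't seen automatic fail-over in the computing world, the idea can be sketched in a few lines of Python. This is only a toy under my own assumptions: the class names and the simple `healthy` flag are hypothetical stand-ins for real health checks, and it has nothing to do with how railway signalling actually works.

```python
class System:
    """A hypothetical system (main or backup) with a simple health flag."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


class FailoverPair:
    """Use the main system while it is healthy, otherwise the backup."""

    def __init__(self, main, backup):
        self.main = main
        self.backup = backup

    def active(self):
        if self.main.healthy:
            return self.main
        if self.backup.healthy:
            return self.backup
        # Both down: the pair itself has become the single point of failure.
        raise RuntimeError("no healthy system available")


main = System("main signals")
backup = System("backup signals")
pair = FailoverPair(main, backup)

print(pair.active().name)  # main signals
main.healthy = False       # main fails, or is taken out for maintenance
print(pair.active().name)  # backup signals
```

While the main system is out, you are running on the backup alone, which is exactly the "increased risk" part: one more failure and the service stops.
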
I get to think about these things on my train-sauna, when I have no seat and have been delayed for half an hour, so forgive me for just spouting on about it now. If you're delayed on a train yourself, however, it may provide a few minutes of mental escape from the drudgery.
