Blogs

ITIL can suck, but shouldn't

The pattern of my career over the past five years or so has involved moving newish, smallish internet/software companies to a post-startup hosting infrastructure. My past three companies were all small companies that developed and hosted internet applications, either for clients (in one case) or their own products. My role in each case was to move things to a more mature infrastructure, with configuration management, monitoring, directory services, and the other pieces needed to be able to manage a growing sprawl of servers and applications.

My focus has been much more on the technical than on the people processes for running and supporting the infrastructures. In my current job, the team I've brought in and the infrastructure we've built has reached a decent level, although there's certainly plenty more to do technically. But looking at what we've done and what I want to do next, I've realized thatrather than moving on and doing the same thing at another company, the more interesting challenge will be to take things to the next level.

The next level for me is going beyond the technical to focus on the people and processes. The technology infrastructure is going to grow in size and sophistication, including spreading to multiple data centres globally, but the technical challenges seem like more of the same to me. The challenges that seem newer and more intriguing to me personally are more along the lines of how the hell we're going to organize and coordinate people doing development, infrastructure, and support in three, four, or more countries.

So I went on a course in ITIL version 3. Yikes. ITIL is basically a blueprint for organizing a huge IT operation with lots of bureaucratic processes, forms, and signoffs that will make it nearly impossible to get anything done, and ensure that responsibilities are divided so that nobody who is doing anything productive sees the big picture.

I don't think it has to be this way. I actually did find the course useful, although not as useful as it could have been given that most of the people on it were more interested in ticking off the certification than getting ideas on how to improve the organisations they work at. There were some pretty interesting people there, some of whom were obviously interested in fixing real problems "back home". If the course had been more of a workshop where we shared war stories and ideas, it would have rocked.

A lot of the concepts in ITIL are useful, I think it's more a matter of using your head when applying them, making sure to adapt the ideas in ways that fit your needs and objectives. It's very easy to see how an organisations, especially large ones, take the ITIL material and use it to build horribly inefficient IT structures. I've worked with companies that use ITIL this way, and the course shed light on how they got this way.

The biggest problem with ITIL is that it's presented with clearly defined "phases" of strategizing, desiging, deploying ("transitioning"), operating, and improving IT services. This is an invitation to a waterfall model, where (as in at least one organization I know of) each phase can even be run by a completely different team of people.

So one group designs the service, hands it off to another than rolls it out (tests and installs it), and then hands off to a completely separate team that supports it. In the organisation I've encountered, the operations team hasn't got the vaguest clue about the service.

Of course the transition process involves "knowledge transfer" where the people who set up the service train the support team, but anybody who's done this stuff in the real world should know better.

Knowledge that is transferred in a handover process is never, ever, ever going to be learned as well as knowledge that comes from actually being involved throughout the whole process. Having some hands-off manager (ahem) overseeing things all the way through doesn't cut it, the people who will actually be diving into runtime problems with an application need to have gotten their hands dirty trying to install the application, and even have pitched into meetings where the details of how the application should be integrated into the infrastructure.

Otherwise, you're going to end up in the situation of my nameless organisation, one which is actually often held up as an exemplar of ITIL. They host an application on their servers, installed by the transition team, and their support team had training on how to log into the server and investigate problems with it. But when users call up with problems, the support people, who probably support dozens of applications, have forgotten all of this. They call up the software vendor - who have no access to the servers.

Can you imagine how incompetent your organisation looks when it's clear that your support people have no idea that the application they support is run by their own company?

But I do think it's possible to take many of the ideas of ITIL and apply them in a more agile manner. A bit of Googling shows I'm not the only one who thinks so, but that there doesn't seem to have been much work done on the idea, at least publicly. It's certainly something that would take a bit of thought and work.

My first thought, clearly, is that an agile IT services process would have to embrace the lean management principle of empowerment by having the "workers" (for lack of a better word) involved throughout the process.

I've also thought that the kanban approach to agile is paricularly suited to a sysadmin team, since it does away with the iterations/release cycle in favor of a queue of tasks that people pull from when they find they've got spare capacity.

Anyway, I'm looking forward to thinking this stuff through and trying out ideas over the next year. Although I'm going to be far less hands-on technically, my focus does need to involve a thorough understanding of the technical aspects of what we're doing, so I don't think I'm going to become a total suit.

SAN Virtualization

Virtualization in the server fram relies heavily on storage, something I understand only to a certain level. This
review by InfoStor discusses virtualization of the SAN itself to get the flexibility you really need to get the best value out of server virtualization. It's somewhat over my head at the moment - I haven't gotten to this point yet - so I'm putting this link here for future reference.

Found via virtualization.info.

Recommended blogs for virtualisation

I've added a few virtualization-oriented blogs to my feed reading. My favorite is VMUNIX Blues, by Mark Mayo. His posts aren't as frequent as the other two below, (although much more frequent than mine!) but the quality is high, he's doing hands-on stuff. I first found him by Googling for thoughts on shared storage approaches, which turned up this post about using NFS with vmware server.

Virtualization.info is by Alessandro Perilli, a consultant who specialises in this stuff.

VMblog seems to be driven by industry press-releases, but is nevertheless a worthwhile resource.

Comments off

Grr. Turn your back on your blog for a minute, and it gets filled up with comment spam. Ok, maybe it's been longer than a minute.

Anyway, I've upgraded to a newer version of Drupal, and turned off comments for the time being. I had to switch off my old theme because it has issues. The current theme will probably stick around for a while. I do plan on tackling the comments though.

Whipping up a solid LDAP infranstructure

I've been much too quiet lately. I'm still hard at work putting together what I hope will be a very strong infrastructure for my company's application hosting operations, with about 15 servers for production, content management, and staging and testing.

One of the core components of this infrastructure is an OpenLDAP server, which I've been working on over the past week. Up until now it's been enough to have a couple of accounts which are created locally on all of the servers by puppet. I've got a chunk of disk space on a SAN which is shared across the machines, which is handy for having a common home area for key accounts I use to login and administer the machines, as well as the puppet templates and manifests.

The cool kids talk about operations

Tim O'Reilly, the boss of O'Reilly publishing and a key booster of the Web 2.0 meme, recently posted an article about operations.

One of the big ideas I have about Web 2.0 [is] that once we move to software as a service, everything we thought we knew about competitive advantage has to be rethought. Operations becomes the elephant in the room.

O'Reilly laments that most of the tools for deploying systems and applications on open source platforms (i.e. Linux) are not themselves open source. Luke Kaines and others have commented on the article with examples of open source deployment and operations management tools, including Puppet, and others I've mentioned for system configuration and network monitoring.

I disagree with Lessig's evaluation of the Net Neutrality camps

Lawrence Lessig declares that the two camps of the Net Neutrality debate are those who built the Net vs. those who never got it. I don't think that's accurate, the telecomms and cable networks have been a pretty key part of the Net. (Found via Rafe, btw).

I think the real division here is content providers vs. pipe owners, and the attempt to do away with network neutrality is essentially a coup attempt by the people who own the pipes.

People use the Net for the content, so that's where the value is. The pipes are just a commodity, which are expected to simply deliver the content. Selling a commodity service means competing on price, which means low margins.

Red Hat Network: How can they charge for less than you can get for free?

My first experience with the RedHat Network reminds me of a major limitation of commercial platforms that doesn't get much press. You actually get less than you do with free alternatives like apt-get and yum.

I'm setting up a new hosting infrastructure for a client which, among other things, involves moving from the free Fedora to commercial Redhat Linux. Although I've managed Redhat machines before, this is my first time using the Redhat Network (RHN) for installing and updating software.

In the past I've used apt-get on Debian, and yum on Fedora, and found them a godsend. Set up properly, it takes minimal effort to keep multiple systems up to date and consistent, whereas when I've had to go the "by-hand" route, machines invariably ended up with older versions of software. It's just too hard to keep up with all the various packages installed on various servers, not to mention the headache of chasing down various dependencies and resolving conflicts when you do upgrade or install a new package.

Ready for disaster

There are a lot of things you can do to make sure that when disaster strikes, you can get back online. Even in environments where you don't have automatic failover, you can take some basic steps so that when you get the alert or the phone call, you can bring things back online.

Let's say you have a single server running a web application with a local database. However, you need to have a second server available. Maybe it's doing something else normally, maybe it's in a less than ideal location, like in your office at the end of a slower Net connection, but as long as you can fire up your application, repoint DNS, and be online, it'll do in a pinch.

First, make sure you have the base server software ready, so your web, application, and database software are installed.

Second, make sure you have a copy of your application code and configuration files handy. I always like to have these in source control, on a server other than my live one, so in the worst case I can pull them down to my emergency location.

Third, you need your live data, that is, your database contents. Take frequent dumps of the data and have them handy, again away from your live server. Do this outside your system backups, use your database tools such as mysqlbackup to dump a file, then zip it and ship it somewhere else. How frequently you do this depends on how often the data changes, and how important it is to have fresh data. In the most extreme case, you might have the database continually dumping a log to a shared file store, where the backup server is reading it in.

A sticking point may be the DNS. You can change the DNS, but users will have the old DNS information cached. How long it takes for them to get the new IP address depends on the TTL you have set in your DNS configuration, and changing this to a lower value after the crash ain't gonna help. A two hour TTL is probably a good setting.

Of course, better yet is if you have multiple servers behind a firewall and/or load balancer, so you don't need to change your DNS at all, just reconfigure and go. But if you're running a budget setup, these are simple steps to follow to make disasters a little less stressful.

Why cheap hosting services are so cheap

Why would someone pay £500 per month for a server when they could pay less than £100 from a different provider? The short answer is support. But it's a more complicated story than that.

One of my clients had a key server die this week, one of many they have with a cheap hosting provider. After investigating a bit, it was clear that it was not a system error - I was briefly able to examine the system logs using the provider's web based recovery tool, with no evidence of problems, but then the server stopped responding even to the tool.

So I called support. The first-line support guy verified that the machine wasn't responding and couldn't be recovered with the online tool, so he referred it to engineering at the data center.

However, he couldn't tell me when they would investigate the issue. It depends on how busy they are. With a real hosting provider, you get an SLA that includes response times. You also get someone who will communicate with you, to make you feel like they're on the case. I only got the promise that I would get an email when it was fixed, no direct contact, no phone call.

In the end it took three hours for them to fix the server. This isn't an unreasonable amount of time, given the cost of the service, but during those three hours my client decided that the three new servers they had decided to add that very morning should be gotten from another provider instead. They'll probably end up paying five times the price, but they know they'll get better support.

The key point here isn't that the service was poor. They fixed the problem fairly quickly. It isn't even just that they could not promise a response time - with thousands of customers paying less than a hundred pounds a month, they can't afford to make guarantees.

But even a cheap and cheerful support organisation should have a way to give updates on their progress, even if it is just via semi-automated emails. Let me know when your engineers have started investigating the problem. Let me know when they've identified the cause, and when they've fixed it.

The frosting with these guys came when they sent the form-email that the problem was fixed. I replied, asking if they could tell me what went wrong. The reply came the next day: look at your server logs.

That's lame. That's passing the buck. It's not in the server logs, because it's not an OS problem. Hard booting didn't fix the issue, and whatever did fix it didn't involve changing any configuration files or such on the server. So it was a hardware or network issue of some kind.

I guess I'll never know. And I guess I'll never use these guys again.

Syndicate content