Luke Kanies Wants to Modernize System Administration

By chromatic
August 11, 2008

Luke Kanies spent some time as a system administrator. That job can have its own tedium -- keeping machines up to date, building new machines, and managing dozens to thousands of individual configurations. Clever administrators automate.

Luke is the primary developer of Puppet, a clever project designed to automate away the tedium of system administration and configuration management, letting you describe what you want rather than telling machines what to do. It's all part of a plan to drag system administration kicking and screaming out of the 20th century.

Luke recently spoke with O'Reilly about Puppet, system administration, and how to provide actual measurable business value to your organization.

I want to talk to you about obviously Puppet, but more the state of operations management, change management and all of that system administrative stuff, when you have more than one machine. You're on record as saying that a lot of the tools for monitoring are pretty good but a lot of the tools we have for managing instances of machines or managing large banks of machines, server farms, desktop farms, whatever are still stuck in a 20 year old mentality.


How so?

If you look at the way that most people think about management, they take an SSH and a for loop mentality. Of the few tools that don't do that, Cfengine is probably the best example of a tool that most people seem to know about. A lot of people will either start using it, or they'll look at it and decide that it just isn't worth the headache. To them it does seem to alleviate the SSH and for loop problem in that it allows you to kind of describe what you want your machines to look like, but at the same time, it's really really low level. If you'd like to provision a given service across multiple machines, multiple platforms, which is a really common problem, [it] forces you to essentially consider that same service to be a completely different service because you have to deal with all these details, you know, "where is the file located and what is the actual service name?" There's no ability to say, I want to step up a level and ignore those details and really focus on the service itself.

For example, I want SSH services running on all my machines, whether it's a FreeBSD server, whether it's a Mac OS X desktop, a Linux device, whatever.

SSH is a great example. Everyone runs SSH. It's produced by the same application. It's always the same service. It's always the same configuration files. But if you look at file location, on OS X they're in /etc. On FreeBSD they're in /usr/local/etc, and on most Linuxes, they're in /etc/ssh. If you look at the service name, sometimes it's SSH, sometimes it's SSHD, sometimes it's OpenSSH, sometimes it's OpenSSHD. So, you want to be able to say, I'd like to talk about the SSH service, and most tools say, "I don't know what you mean. I've only got OpenSSH."
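To make the point concrete, here is a minimal Python sketch of describing "the SSH service" once and letting a lookup table supply the platform details. This is not how Puppet is implemented. The configuration directories are the ones mentioned above, while the service names, the platform keys, and the `sshd_config` file name are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SshDetails:
    """Platform-specific details the administrator should not have to repeat."""
    config_dir: str
    service_name: str  # hypothetical names, for illustration only


# Config directories follow the interview; service names are assumed.
PLATFORM_DETAILS = {
    "darwin": SshDetails("/etc", "ssh"),
    "freebsd": SshDetails("/usr/local/etc", "sshd"),
    "linux": SshDetails("/etc/ssh", "sshd"),
}


def describe_ssh_service(platform: str) -> dict:
    """Return one abstract 'ssh' resource, resolved for the given platform."""
    details = PLATFORM_DETAILS[platform]
    return {
        "resource": "ssh",
        "ensure": "running",
        "service": details.service_name,
        "config_file": f"{details.config_dir}/sshd_config",
    }


if __name__ == "__main__":
    for platform in PLATFORM_DETAILS:
        print(platform, describe_ssh_service(platform))
```

The caller only ever asks for "ssh, running"; where the file lives and what the service is called stay in one table.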

In terms of what we're looking at, the existing state of administration, when you're calling this a 20 year old mentality, the thinking is: I have a whole lot of shell scripts that know how to execute files, I have to hard-code my paths to my configuration files, and if I make a change I have to scp or rsync all of these configuration files to the right bank of machines. The resource concept in Puppet lets you step back and say, "Run SSH. Platform, system, and details, we'll take care of that." Elsewhere you say, "I want SSH running, and I want it to look like this."

The majority of your configuration files across the machines are actually the same. Look at the password file as an example. You don't think of a password file as a configuration file anymore, but clearly it is. 90% of the users in that file are the same users, the system users. And so, we have these good tools to manage the rest of those configuration files. The rest of the files, the rest that are different for us. But for most of the configuration files, we don't have those tools, so we're forced to write our own tools to manage the differences and the similarities. You look at how people distribute config files and what they'll do is they'll have 78 versions of a given file: this one goes to this host, and this one goes to this [other] host and this one goes to all the hosts in this data center, or something like that. For one, you have this huge duplication, right? Because you've got a lot of similarity among all these files. But the more nefarious problem there is that you can look at the configuration that says this host gets this file, and what you don't know is what's actually going on.

You don't know why that host gets that file.

There's the why, but then there's also [the question], what's out there that's actually different between this host's file and the default file? You're missing the intent, but you're also missing the information itself. All you know is that host has a different configuration file. In fact, what most people do is say, "Here's the list of files that a host could receive, and it will be this ordered list that says look for a host specific file, then a class specific file and an operating system specific file and then the default file." And in that situation, literally looking at the configuration, you have no idea if a file even exists for this host, much less what's going to be in that file.

It's completely non-deterministic. It's horrendous. Or, it's deterministic, you just can't, from a given place, know what's going to happen. You have to look in three different configuration locations to determine what's going to happen, and even then, once you say, ah, there's a host specific file, you have to diff that against all the other files to see what's different about this host from other hosts.
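The ordered lookup described above can be sketched in a few lines of Python. The directory layout (`hosts/`, `classes/`, `os/`, `default/`) and the function names are hypothetical. The point is that the rule itself never says which file a host ends up with; that depends on what happens to exist in the file store.

```python
from typing import Iterable, List, Optional


def candidate_paths(name: str, host: str, host_class: str, os_name: str) -> List[str]:
    """Most specific first: host, then class, then operating system, then default."""
    return [
        f"hosts/{host}/{name}",
        f"classes/{host_class}/{name}",
        f"os/{os_name}/{name}",
        f"default/{name}",
    ]


def resolve(name: str, host: str, host_class: str, os_name: str,
            available: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists in the file store, if any."""
    store = set(available)
    for path in candidate_paths(name, host, host_class, os_name):
        if path in store:
            return path
    return None


if __name__ == "__main__":
    # Same rule, same host: the answer changes only because the store changed.
    store = {"default/sshd_config", "os/freebsd/sshd_config"}
    print(resolve("sshd_config", "web1", "web", "freebsd", store))
    store.add("hosts/web1/sshd_config")
    print(resolve("sshd_config", "web1", "web", "freebsd", store))
```

Even once you know which file won, you still have to diff it against the others to learn what is actually different about that host.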

This is a direct consequence of being stuck in "everything in system administration is a shell script or a configuration file", [and] all we can do is work with these specific files as they are. It's a result of that mentality.

Exactly. Then of course, you can throw something like NetInfo into the mix. You've got all these great file tools. Somebody says, I've got a Mac and I want to be able to manage some users and you're like, "Those aren't files." You know? Or LDAP or anything like that! I've got APIs instead of files. I don't know what to do!

We can make the argument that the Unix file structure is in fact an API, but I take your point. That's a very good point. When did you have the realization that managing or modeling these behaviors and these services and these configurations as resources in Puppet was the right way to go? Was it something you knew all along? Was it something you picked up from Cfengine? Or did it gradually come up?

About five or six years ago, I had rewritten a tool called ISconf by a guy named Steve Traugott. He wrote a seminal paper, I believe in '99 for LISA, that kind of went through and said if you're going to have a managed infrastructure, there are nine core services that your infrastructure needs to provide in order to make your infrastructure work. He did some work at Morgan Stanley and other places on what he called ISconf, which is a very simple make-based tool.

His idea was to store all the work you did in make, and then execute these make targets in a specific order, and thus you could always recreate the systems. So, this is my first interaction, my first exposure to somebody else's tool to do this kind of work. So I experimented with this tool and decided that what he had, which was kind of a typical combination of make, Perl, and shell, was just not quite sufficient. I rewrote the whole thing, but quickly found that all the work was still being done in make. So you've got a package to install, and you go, OK, I want to install this package. You make a make target for that and you're like, "Well, now I want to install another package." And the make target looks almost exactly the same except for the package.

I started extracting out what it meant to install a package and I'd create these sort of default make targets. So it'd be like package/% and, using gmake, you could do all this... what are they called? Like stanza matching things. You could extract the replacement parameter for the percent and have your package stanza know what package to install, based on how the stanza was called.

One place you'd say I want package OpenSSH, and another place you'd have a package/% stanza that knew how to extract OpenSSH from the stanza. This is kind of the start, in that it was accidental because make wasn't very powerful. The way to communicate and the way to extract things in make was really simplistic. What I found was, pretty quickly, I had stanzas for all the major things I needed to manage. I had package stanzas and resource stanzas and cron stanzas and host stanzas, and they knew how to do all this stuff.

The problem with this, though, is that packages are easy. They're either installed or not installed. Things like cron jobs or hosts are harder because they have multiple [states]. You don't want to try to extract all the fields of a cron job from the make stanza name. That's just not pleasant. I mean, doing this kind of stuff in make is already unpleasant. So I started building, in one place, a table of all the resources I had, basically a data dump, a stash of all the cron jobs that I'm ever going to install anywhere on my infrastructure. What my make stanzas would do was just say I want that named cron job. I'd name each of the cron jobs and then I could extract one of the cron jobs and say, I want that one installed, I want that one removed. I had this symbolic representation of a resource. You know, I wasn't using the term resource back then, but it was a resource. Then another place I'd say "...and this host gets that resource."
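The two ideas in this stretch, a pattern target such as package/% that pulls the name out of how it was called, and a central table of named resources that hosts refer to symbolically, can be sketched in Python rather than make. All names and data below (`CRON_JOBS`, `HOST_RESOURCES`, the `rotate-logs` job, the `web1` host) are invented for illustration and are not taken from ISconf.

```python
from typing import Dict, List, Tuple

# Central table: every cron job that might be installed anywhere, by name.
CRON_JOBS: Dict[str, dict] = {
    "rotate-logs": {"command": "/usr/local/bin/rotate-logs", "hour": 3, "minute": 0},
}

# Elsewhere: which symbolic resources each host gets.
HOST_RESOURCES: Dict[str, List[str]] = {
    "web1": ["package/openssh", "cron/rotate-logs"],
}


def parse_target(target: str) -> Tuple[str, str]:
    """Split 'package/openssh' into ('package', 'openssh'), like a package/% rule."""
    kind, sep, name = target.partition("/")
    if not sep or not kind or not name:
        raise ValueError(f"expected 'type/name', got {target!r}")
    return kind, name


def expand(host: str) -> List[Tuple[str, str, dict]]:
    """Resolve a host's symbolic targets into (type, name, parameters) entries."""
    resolved = []
    for target in HOST_RESOURCES[host]:
        kind, name = parse_target(target)
        # Packages need only a name; cron jobs look up their fields in the table.
        params = dict(CRON_JOBS[name]) if kind == "cron" else {}
        resolved.append((kind, name, params))
    return resolved


if __name__ == "__main__":
    for entry in expand("web1"):
        print(entry)
```

Keeping the cron fields in the table means the target name stays a plain label, which is the workaround described above for resources with more than an installed/not-installed state.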

Were you able to write out instructions to the makefile? make install ssh or make package openssh? Were you able to describe in code, or maybe a shell script or configuration file, exactly what to install and configure when you set up a new host?

ISconf is kind of weird, in that what you did with the host was you said, "here are the n make stanzas it takes to turn you into what you're supposed to be." If you're an Apache host, there are 25 make stanzas associated with being an Apache host, and then there are stanzas for each of those 25 make stanzas, and they all get run in the correct order every time, so they'll kind of do each thing in turn. It was a really complicated system. And I wrote this paper in, I believe, '01 or, no, it was '02 I think, that basically said ISconf wasn't quite sufficient, and it's because make is a horrible API. So what I did was I integrated ISconf and Cfengine, and initially most of the work was still in ISconf, but what I found is that all of these models that I'd written, packages and cron jobs and hosts as resources, translated really well to Cfengine.

Cfengine has a similarly crappy API, potentially: if you want to integrate Cfengine with an external tool, your only real integration point is [shell script]. With Cfengine I had this similar kind of thing where I had this database of resources and I'd collect what all the resources were and how they were configured and I'd try Cfengine and then I'd make decisions about what resources I wanted. If I'm in this class, then add these five hosts or these five cron jobs to the host. If I'm in this class, add these five packages to the host. I realized after a few years of doing Cfengine and doing consulting that I had a split. All my decisions were in Cfengine but all my actual work was in another application entirely, and there really wasn't much consistency to that application. I'd do cron jobs one way and I'd do packages another way.

When I realized I wanted to fix that, what I really wanted to do was teach Cfengine how to understand abstract resource types. I want to add a cron resource type to Cfengine. I want to add a service resource type, or any of those things. And then you look at the Cfengine code and you realize that it's 60,000 lines of spaghetti C code and that there are 25 unique syntaxes in the language and things like that.

You have the one maintainer and he keeps pretty tight watch over it.

He does. There was this great quote from him around 2005 where he said that he thought version control was overhead. Of course he's a CS professor, right? So he would know. That was kind of frightening in that you're like "Look, let's collaborate on this" and he's like "That stuff's all unnecessary, even though we're getting ... regressions and things like that." But then you also have issues where he would say "This is how it should be." Anybody who starts writing C especially, like most languages, starts with gigantic case statements scattered throughout the whole sequence. Anybody who does very much for very long learns that doesn't scale well. It doesn't work well.

You need to extract... what is it? Refactor conditional to polymorphism.

Exactly. He thought that that was the stupidest thing he'd ever heard. If you wanted to add a new resource type to Cfengine, and Cfengine calls them actions, then you had to find every one of these case statements and add support for your action in that case statement, throughout the whole system. Which meant of course, there's no way you could ever write a redistributable module that could just be dynamically loaded to the device then.

No. You're writing a patch against the whole thing.

Right. When you look at that and you [think], "Yes, I could refactor all of Cfengine to make this work, or I could take a language that I'm more fluent in, that I'm just going to be more productive in in general because it's a much higher level language than C." That's when I started experimenting with a separate tool, and with Puppet the core things that I came in with were: I wanted to have this core idea of a resource. I wanted anybody to be able to add a new resource type. It couldn't just be up to me what resource types existed. And I really wanted to separate the language, the way we're talking about resources, from the library that defines them, so that you could add in your resource type and not have to modify the parser. Which seems obvious, but in the Cfengine world every action needed direct support in the parser.
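The design goal described here, adding a resource type without touching a central case statement or the parser, is the usual "replace conditional with polymorphism" move plus a registry. The Python sketch below illustrates that general technique under assumed names (`RESOURCE_TYPES`, `resource_type`, `apply_resource`); it is not Puppet's actual type system, and the `apply` methods only return a description instead of changing a machine.

```python
from typing import Callable, Dict, Type

RESOURCE_TYPES: Dict[str, Type["Resource"]] = {}


def resource_type(name: str) -> Callable[[Type["Resource"]], Type["Resource"]]:
    """Class decorator that registers a resource type under a name."""
    def register(cls: Type["Resource"]) -> Type["Resource"]:
        RESOURCE_TYPES[name] = cls
        return cls
    return register


class Resource:
    def __init__(self, title: str, **params: object) -> None:
        self.title = title
        self.params = params

    def apply(self) -> str:
        raise NotImplementedError


@resource_type("package")
class Package(Resource):
    def apply(self) -> str:
        return f"install package {self.title}"


@resource_type("cron")
class Cron(Resource):
    def apply(self) -> str:
        return f"schedule {self.title}: {self.params['command']}"


def apply_resource(kind: str, title: str, **params: object) -> str:
    """The dispatcher never changes when a new type is registered."""
    return RESOURCE_TYPES[kind](title, **params).apply()


if __name__ == "__main__":
    print(apply_resource("package", "openssh"))
    print(apply_resource("cron", "rotate-logs", command="/usr/local/bin/rotate-logs"))
```

A third party can ship a new class with the decorator in its own file; nothing in `apply_resource` needs a new branch, which is the property the case-statement design could not offer.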

You mentioned the same thing with your make-based solution as well.

Yeah, you had to go in and modify make and you had to add your code and all three places had to be updated at once.

You worry about shell script quoting and all sorts of other rules there.

Yeah. Shell script quoting is awesome.

Try writing cross-platform make. You can't rely on always having gmake available.

In fact, I had this really fun [time]. I was running this on, at the time I think it was Solaris and HP-UX and that's probably all. It was pleasant enough. We were doing basically a bootstrap, where the first make stanza installed gmake and then the next stanza just exited if gmake wasn't in use. Then you had to have this massive bootstrap.

That was definitely one of my goals going into Puppet. Let's make bootstrap really simple. If you've got Puppet installed, you shouldn't need to do anything else. You know, there are authentication aspects, so you gotta go through this whole key signing process, but key signing in Puppet is really really simple. The story, if you heard the RedMonk podcast with myself and Nigel Kersten from Google last week: he's talking about being at an Apple conference and listening to Jeff McCune talk about Puppet and how awesome it is. During the course of Jeff's actual talk, Nigel VPNs into Google and brings up a virtual machine, sets up Puppet on it and is managing systems by the end of the conference. That's what you want.

There's a competitor to Puppet called bcfg2 by someone else and he likes to say that it's very reasonable to get a bcfg2 installation up and running in as little as three days. And I'm like, 3 days?

Three days is pretty good. It's an improvement anyway. We'll call it that.

The resources were a big part of what led me to discard the existing tools, and I looked at a lot of those tools out there. You've got LCFG, which is probably the oldest and most mature tool, in use by the University of Edinburgh. It's got some really great features, but it's also just a bunch of shell scripts still. A bunch of whatever scripts you want, in the end.

You've got SmartFrog out of HP, which is again really interesting. It's got a great language, but there's no resources, there's no higher level abilities at all. CERN has this tool called Quattor, that was dead for a while, but apparently it's been resurrected recently by somebody. Now that the Large Hadron Collider is actually producing data, I guess they needed to worry about it again.

It kind of has the same thing. I went to all these tools and I kind of went, I want abstraction. They went, "we've got great facilities for something or other". Well that's not really sufficient.

I'm sure any large organization or company has some sort of home grown, if not multiple home grown solutions to this.

That's my 90% competition, what people rolled internally. Fortunately most people don't like what they have. You talk to most IT men and you go, "So what do you use?" And they go, "Ah, God, I got this junky thing that I rolled myself and ugh, I'm embarrassed. I could never publish it because it's such horrible code." But, you can come to them and you can say, "I've got this great tool that everyone loves."

In just three days you can be up and running!

Or in just an hour.

Then people seem to be a lot more interested in that. Everyone's really concerned that system admins are going to have so much ego attached to their own implementation, but the truth is that system admins are pragmatists in almost all cases. Their goal is, can I go home earlier? Is there something that can get more done in less time? You know they're not going to publish their code anyway, and Puppet is actually a good enough tool, it makes it easy enough to write readable simple code, that people are more likely to publish Puppet code than they are their own tools.

I understand you're working on a repository of resources and models that people can reuse and redistribute.

One of our goals for the year is to get what's recognizable, especially to you, is a CPAN of Puppet models. Most people's problems aren't unique. Everyone has some unique problems, but for a given person's problems, most of those problems are shared by a large number of other people. If you can download solutions to those problems, rather than having to figure it out yourself every time, or not even figuring it out. It's easy right? It doesn't take intellectual effort. It just takes you typing out the stupid code, debugging it, making sure it all works across all your platforms and then you hopefully never have to think about it again. Our goal is to make it so you can download these things. We're going to figure, sometimes it's going to make sense to allow people to show solutions on there as opposed to just downloading them. I don't know about that kind of stuff, but it'll be an interesting experiment anyway.

I imagine at this point that people are reading or listening to this and thinking you know, wait a second. I can just make an image on an installed server or I can just have a virtual machine and clone that. Why do we need this configuration management to help me set up things? I suspect I know what you're going to say, but can you respond to that?

This comes up a lot in the cloud thing. People say, you know I don't need Puppet in the cloud because I've got virtual machines and I can just clone as many virtual machines as I want. And that can work, but if you look at say your kind of standard LAMP stack, what you're going to have is at the very minimum, you're going to have a load director in front. You're going to have a web server, an application server and a database server, and this is assuming a single application. Now you've got four different images that you've got to maintain. Anytime you want a new one, you bring up a new version of that image. But if you ever want to change that image, let's say, bring out a new version or you add a new user to your system, how do you add that user to your system? Well you modify that image, you upload the image to EC2 or whatever and then you reboot all of the machines on your network to get that new user.

What if you want to add a new user to all four of your images? Now you're opening up all four of your images, adding that new user and then rebooting your whole network, just to get a new user.

But I can just do this with a for loop over a shell script. Right?

Exactly. Now you're back to square one. The truth is that images make a lot of things a lot easier, but when it all comes down to it, VMware is great for managing the outside of a box. I've been told this is a horrible analogy, but the way I think of it is, all of these virtual machine systems -- they're really good at producing and managing eggs, you know, these self contained, sealed eggs of functionality. But they're not very good at getting inside the system. They can't get inside the egg and manage what's going on there.

You need a tool like Puppet to get inside the machine and say, alright, now that I'm up and I'm running, let's differentiate this host. If you look at most organizations, what they do is use something like Kickstart or JumpStart or Ignite to get the machine from bare metal to functional with as little as possible: kind of an operating system, but one that's completely undifferentiated. Then that installation gets Puppet on there, and from there Puppet does all the rest of the work.

The great thing about this is, you know a machine built six months ago has all the changes that you've made to your configurations and it will still be the same exact configuration as a machine built today. If you add a user to your configuration, all the right machines get that user. You know if you update a security package, all the machines get that security package. If you're using images, your question is always, which image is that guy running? Especially with real hardware, you have that issue of you don't really want to reboot your real hardware all that often.

Virtual machines, in really simplistic cases, the images are great. If you've got a cluster of 1000 nodes that are exactly the same, who cares, right? You spend as much time as you want tuning that image and then you can reboot your whole cluster at night and all the scientists who are using it or whatever, they don't care, they're gone, and that works great. But if you've got four or five different images, or worse, images that are not quite the same but are pretty similar then you've got a real problem on your hands.

You can use Puppet not just for installation and set up, but for continuing maintenance.

Oh yeah, you would. The idea with Puppet is that you would use it from the day you turn the machine on until the day you turn the machine off. You should never have to log onto that machine unless it's to consume the service that it provides, rather than logging in to administer. And most importantly, the key and [inaudible] does this too, but Puppet does it better, the key feature is that you write this application that manages your entire network. You have this infrastructure application that knows how to provision and maintain all the services on your network. As you update that application, Puppet brings your network into sync. If you change what it means to be an Apache server, then Puppet changes your Apache server to meet that definition, or it provides helpful suggestions saying, you know, that package doesn't exist or whatever the error is. So you should never have to worry about services being out of sync, even if one machine was built one way and another machine was built another way, or Johnny built this computer and Billy built this computer and therefore they're configured differently. This application manages all of your services, for their entire lifecycle.
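The "bring the network into sync" idea can be illustrated with a small convergence loop: compare a desired description with what a machine currently has, and change only what differs. The Python below is a sketch of that general pattern, not Puppet's implementation. The state is a plain dictionary, and keys such as `package:httpd` are hypothetical labels.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, str]
Change = Tuple[str, Optional[str], str]  # (resource, current value, desired value)


def plan_changes(desired: State, actual: State) -> List[Change]:
    """List every managed resource whose current value differs from the desired one."""
    changes = []
    for key, want in sorted(desired.items()):
        have = actual.get(key)
        if have != want:
            changes.append((key, have, want))
    return changes


def converge(desired: State, actual: State) -> State:
    """Apply the planned changes in place; unmanaged keys are left alone."""
    for key, _have, want in plan_changes(desired, actual):
        actual[key] = want
    return actual


if __name__ == "__main__":
    desired = {"package:httpd": "installed", "service:httpd": "running"}
    built_six_months_ago = {"package:httpd": "installed", "service:httpd": "stopped"}
    built_today: State = {}
    for machine in (built_six_months_ago, built_today):
        print(plan_changes(desired, machine))
        converge(desired, machine)
    # Both machines now match the definition, however they were built.
    print(built_six_months_ago == built_today)
```

Running `converge` a second time plans no changes, which is what lets the same description be applied from the day a machine is turned on until the day it is turned off.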

And keeps them up to date.

Let's talk about operations. As I said before, you're very much on record as saying we're stuck in the 80's mentality. I very much see how Puppet and other tools like this are leading us away from that. Are there areas, besides configuration management, change management where you see operations leading up to "update or die"?

I think the biggest problem with operations right now is that there's a real disconnect between the people who are doing all the work and providing all the value and the people who are writing the checks. I was talking to some people, including Tim O'Reilly, earlier this week, and you can pick up any system administration book off the shelf and I defy you to find where it talks about metrics. How to provide useful metrics to your bosses. How to provide useful metrics to your executives indicating the value that you're providing. How much have you done to demonstrate that you've reduced the error count for your network? To be able to say, sure, we've got five nines or whatever. That's nice! But how much have you really done to say, "Here's what we're doing. Here's what we've done."

I mean, even simple information -- I wish this were a joke -- I've never joined a company that knew how many computers it had. Never. Let's talk about a really simplistic metric here. Just starting at that metric, if you can provide that to your employers, that's a start. But you go on from there to much better metrics. Like what kinds of things are we doing? How many trouble tickets are we responding to? How many changes were made? Were the changes successful? What is the root cause of all the failures we're having? If you get that information, then you begin really validating your presence.

That's a really simplistic metric but there are other really great metrics you can provide. Obviously, all the changes you're making, while you're making those changes, are they in response to exceptions? Are they adding new features? Are they customer requests? These are the kinds of metrics that executives need to know. They need to know, why are you spending all of this time doing this? Where does all of this time come from? I know system admins constantly complain [and say] "Oh, I go to my boss and I need this or I need that and they just won't give it to me." I ask them, "Well, how did you try to convince them?" They say, "Well I said it very loudly."

What you need to do is show up with the same graphs that the sales guys show up with, what the other organizations show up with, and say, "Look, here's data showing what I need." There are very few bosses who are going to turn you down.

One of my first big goals in Puppet is to build an ecosystem of tools that allows me to really manage what's happening on the network, in a way that I can explain, that I can provide a clean interface to not only my immediate bosses in the IT organizations but to the high level executives saying "Look, we deployed 1000 servers this week. We deployed 1000 services. We did this. We did that. Here's an RRD graph of changes that are going on in the network. Here's where we have this spike of errors because someone's forced us to do this, this quickly." An upgrade that we recommend against. Things like that. That direct feedback between the people writing the checks and making the high level strategic decisions and the people doing the actual work on the ground, that's really missing today.

You look at application organizations [and] software organizations [and] a lot of what they've been doing is, "How can we connect our software development to our business needs?" You listen to ThoughtWorks podcasts, and you know, you read their books. It's all about that. What does the business need? And how can we best meet their needs? And then demonstrate meeting those needs.

Are we as operations people -- I'm not really an operations person anymore, thank goodness -- are the operations people not doing this because they don't know what's necessary? Because it doesn't fit our engineer brain personalities? Because we've never really had the tools? Or is it some combination of all three.

I really think it's a combination of the three. Especially as you get feedback. You go in and you say, you come to a conclusion, you're convinced it's right and the first time you get told, no that's not right, justify it. I shouldn't have to, it's just obvious. I've just convinced you via strong logic.

Here's my engineer brain. It's obvious, just trust me.

Right. There's this real belief especially in the engineer community, and especially especially in the system admin community that you shouldn't need to sell something. I shouldn't need to convince you via anything other than argumentative logic. To some extent that's true, but data isn't necessarily sales. Using good graphs, using good information and using good research to demonstrate isn't really sales. Even if it is sales, it's probably one of the most important things that system admins can do. I'm not that good at it, but if you do a really good job of selling the value of your organization to the leaders of your organization, you'll be more successful than any tools you could ever write.

I'd love to tell you differently, but if you sell well internally, you'll get access to buy whatever tools you want. If you don't sell well internally, then you're going to be starved for resources, you're going to be fighting with users instead of enabling them, and the truth is, if you're a system admin, and you're not there for your users then why are you there? If you're not there for the organization, then why are you there?

It's great to build all these computers, but they're not hiring you to build computers. They're hiring you to solve their business needs.

Do you agree with the assertion that IT is just a sunk cost and that it doesn't really matter?

Oh, no! Not at all. If that was the case, then you couldn't differentiate on IT, you know. Clearly organizations like Google, Amazon -- Amazon especially because it's an operator. They're not a sunk cost. They've spent all this money on IT and not only did they save tons of money on their organization -- they've got a fabulous organization operationally -- but their organization was so fabulous that they were able to start selling it to other people. There's no way. If they had viewed their IT as a cost, then they would have stopped investing in it when it was sufficient. And there'd be no way to have EC2. And who's talking about Amazon? Why is everyone talking about them now? It's certainly not their storefront, right? It's their EC2. It's their S3. That's where all their press is coming from today. They basically brought the cloud to the forefront, and it was because they had the vision to not view IT as a cost.

Tim likes to talk about IT operations as a competitive advantage, and that and in many other cases, it's a clear cut example of somebody said, "If we do it right, and we're out to compete, and it's not just a question of reducing costs or a question of how cheap can I make IT. It's a question of, 'How can I make IT work for me? How much can I make my services deploy faster with fewer exceptions, requiring less human input and a faster ability to respond to issues.'" If you've got all those things and you invest in that, then when things do need to scale, then you're ready.

Why don't people see this now? Is it again, the lack of salesmanship? Or, no one's really asked that question before? Or, everyone tries to put their operations in a corner and says, "Okay, we hope we're able to compete on this, but we're going to keep it a secret now what we're doing, like all the clients on Wall Street."

I really just think it's an uninteresting and unsexy space, and people don't want to talk about it. I've been trying to convince investors and developers and anyone who will listen, anyone who will let me talk for five minutes, that this is an interesting problem and that it's worth spending time on. I seriously get crickets. I was at the Web 2.0 conference in '05, mostly so I could learn how to steal some of their neat ideas and use them in system infrastructure, and I told them what Puppet did, and they were like, "Wait, what kind of consumer application is this?" And you go, "No, no, no. It's the stuff that makes web applications work." And they go, "Well, that's just unnecessary."

It's not sexy. It's not fun. It's not interesting. But, because they're not interested, it's harder to sell to them. Because it's harder to sell to them, the system admins feel like they shouldn't have to sell, and because system admins don't try very hard to sell, what they're doing is uninteresting and the executives go, "Well that's not a very interesting sales pitch."

"Solved problem. They have their shell scripts and makefiles and they're happy."

Certainly there's an expense there too. Everyone I talk to about Puppet, especially anyone who's an executive or who's in sales, says, "Great! So I can install Puppet and then fire a bunch of people." And you go, "No, what happens if you install Puppet is that your service, the service your IT organization provides you, gets ten times as good." What ten times as good means is that they can actually start selling to you. If you look at what it takes for a system admin in an organization to implement the tools it takes to do a good job of demonstrating the value of IT, it's a ton of work. It's a ton of integration, and often today it's a ton of development.

How do you get the resources to do all that work and to do all that development? Well, you have to do sales. How do you sell that to the organization without any data? It's really hard. You have these kind of nasty feedback loops, where it's up to some person internally. Usually the only way to get out of this loop is to have at least one really kick-ass system admin just say, "You know what? I'm not going to do what I'm supposed to do. I'm going to do what I know I should do, but I'm going to focus on sales. I'm going to focus on convincing my organization that this is awesome." Then you need somebody in the managerial chain who says, "I see the value in that. I understand the logic of the sale you're trying to make, and I'm going to give you the time and space that you need to get it done." As long as that time and space is a year or six months, it's reasonable. They're not going to give you five years, you know.

The best example I know of is Xev Gittler from Morgan Stanley, who in the '90s built this cool tool called, I believe, Aurora. It's not a tool, it's an infrastructure. It was published in the proceedings of LISA in, I think, '96. They're still using it to this day and they haven't really changed it much. Somehow, he got the right, internally, to just rebuild their entire infrastructure based on this great design, and as a result, they've had a completely automated infrastructure for ten years. And I think they're the only bank that has this across-the-board, automated infrastructure. And I talk to the bank and I go, "God, that would be awesome." They've had so much success with that that now they're looking to replace it. They're trying to build a new version of this to last the next ten years, and there's no discussion internally. I mean, everyone knows just how you want to do it. No one talks about bringing in big commercial companies to solve the problems, because they've had so much success building this before that the sales job is much easier. Once you get that first bit of success, the rest of your sales come that much easier.

After we get this part solved, selling IT, selling operations to people with checkbooks, what's the next step for system administration, for operations, for integration and configuration? What happens then? What's the next problem to solve?

I think the next problem to solve is making things more manageable. Every system admin can sympathize with being given a tool to manage and trying to figure out, "How can I automate this thing?" Every Oracle person will tell you, "Oh, you cannot automatically install Oracle," or, "You can only do it under these certain circumstances." And you just learn, you know, "I'm not going to listen to you. I'm going to try to find my own way to automate this tool." Right? Oracle has these things, and I don't remember what they call them now, that they claim can do automatic installs but that they claim only work under certain circumstances. Every tool has this problem. Every new application introduces this kind of complexity.

As we get better tools, as we get more sophisticated ecosystems and tools, so that instead of your monitoring system being a silo and your configuration management system being a silo, they start to communicate and you start to get, not an autonomous system, but you start to be able to say "Okay, Puppet upgraded SSH last night. And SSH is broken on all my machines. I wonder if there's a relationship there." Having that kind of work being done automatically, once you have all that, you want to have tools that provide better management interfaces in the first place.
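
The "Puppet upgraded SSH last night and now SSH is broken" correlation described above can be sketched in a few lines. This is only an illustration, not how Puppet or any monitoring product actually works: the `Change` and `Alert` record shapes, the `suspect_changes` function, the host names, and the 12-hour window are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A configuration change reported by a config management run (hypothetical shape)."""
    host: str
    resource: str
    at: datetime


@dataclass(frozen=True)
class Alert:
    """A failure reported by a monitoring system (hypothetical shape)."""
    host: str
    service: str
    at: datetime


def suspect_changes(changes, alerts, window=timedelta(hours=12)):
    """Pair each alert with earlier changes to the same service on the same host.

    A change is a suspect when it touched a resource with the same name as the
    failing service and happened no more than `window` before the alert.
    """
    suspects = []
    for alert in alerts:
        for change in changes:
            same_target = change.host == alert.host and change.resource == alert.service
            recent = timedelta(0) <= alert.at - change.at <= window
            if same_target and recent:
                suspects.append((alert, change))
    return suspects


# Made-up sample data: ssh was changed on web1 the night before it failed,
# while the ssh change on web2 is ten days old and falls outside the window.
changes = [
    Change("web1", "ssh", datetime(2008, 8, 10, 23, 0)),
    Change("web1", "ntp", datetime(2008, 8, 10, 23, 5)),
    Change("web2", "ssh", datetime(2008, 8, 1, 23, 0)),
]
alerts = [
    Alert("web1", "ssh", datetime(2008, 8, 11, 6, 30)),
    Alert("web2", "ssh", datetime(2008, 8, 11, 6, 31)),
]

for alert, change in suspect_changes(changes, alerts):
    print(f"{alert.host}: {alert.service} failed at {alert.at}; {change.resource} changed at {change.at}")
```

Matching on an identical resource and service name is the crudest possible rule. A real integration would need some mapping between what the configuration system calls a resource and what the monitoring system calls a check.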

I was just talking to a pretty big software company this week. They're the first vendor to ever call me and say, "How can I make my software more manageable?" This has been a dream of mine. I want every vendor to ask themselves this question, "How can I make my software more manageable? How can I make it easier to plug my software into an integrated management system?" As you get a completely automated system, as you get the ability to manage more of your network, more of your infrastructure, what you'll find is, partially you're going to always choose your software based on what it does for you. Right? But you're also going to be choosing your software by how manageable it is.

Say you're choosing between two essentially equivalent pieces of software in a relatively commoditized space. One of them is easily manageable and plugs right into your infrastructure, or possibly even ships with the Puppet code to manage that piece of software. One of them does not. One of them is just like, "Well, you've got a [choice], and you've got to put the tarball in that place because it was written by DJB and he's really insistent." When you have those choices, it becomes an easy choice.

At that point, you should need fewer people on the ground doing the nuts and bolts infrastructure kind of development that says, "Here's how you manage that software, and here's how you manage this software." Then it really becomes a question of how pieces of software relate, how you plug them together and it looks a lot more like a jigsaw puzzle than a needle and thread, kind of sewn together mummy like it is right now.

Or, here's a pile of wood. Here's a jigsaw. Go ahead and build your own puzzle.

Exactly. I think when that happens, you'll naturally see a split in system admins between the people who will swap out hard drives and are really on-the-ground, nuts-and-bolts people, and the people who are much higher level. Right now they're the infrastructure architects and the infrastructure developers, but in the future they will be all about integrating these pieces rather than trying to reverse engineer how to manage them.

Every company is reverse engineering those right now. But they won't have to in five years.

That would be the hope, wouldn't it?

I don't ever want to do that again, personally.

Exactly! I keep listening to system admins talking about it on the Puppet channel and I go, "Ugh, I'm so glad I'm not doing that anymore." If you're a sysadmin listening, this is one of the ways out. Write your own automation software to allow other people to solve their problems, and they will hopefully give you enough of a living, so that you don't have to solve it yourself anymore.

Five years later we'll interview you!


Luke, I really appreciate your time. This has been fascinating, and I believe and hope that it's useful to our readers and listeners.

I hope so too.

Looks to me like this was transcribed, and transcribed quite badly. Low balancer? Surely Luke's talking about a Load Balancer. Are we talking about an 80's mentality or a 20's one? This interview really needs to be properly edited - it's not just a small blog, it's O'Reilly.

I must agree with Ian Morrison's comments. This has been very poorly transcribed. For example, the sentence "I had rewritten a tool called IS Comp by a guy named Steve Chodder" would actually refer to ISconf written by Steve Traugott! Some editing would help the flow of this article too. Come on O'Reilly you normally produce brilliant articles!

Ian and Jonathan, thanks for your corrections. I've made them to the article. Yes, the article was transcribed, and some transcription errors made it through to the final piece.

Errors aside, a great article! Puppet changes how you think about system administration. Luke's answers are spot on. Administration is no longer a task, it's a process.

and Zed Gibbler should be Xev Gittler: ( for the Aurora paper.)
