At this year's PuppetConf, I spoke with a few people about how they were using Puppet in their infrastructure, and the general consensus was that they have several staging environments with replica infrastructure where they can test their Puppet code and software patches before going live.
I'm not sure if there's a monetary difference between our organizations, but on explaining to them how we use Puppet for automation, it became quite clear that we go about testing our code quite differently.
If I had to summarize, I would say that while other organizations have test-beds for their infrastructure, we have test-beds for our code. Most companies have separate infrastructure to test against; we don't have the money or personnel for that. We still want to test our code as well as we can, and we want to follow best practice where possible, so we stage our changes.
How this actually works
We use Vagrant. This is essentially the only replicated hardware we have. We write Puppet code against our Vagrant puppet-master(s) and test it against a puppetised Vagrant client which has the "role" of the real client machine. This ensures that the Puppet code is syntactically valid and actually attempts to do what we want it to do.
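Concretely, the inner loop looks something like this. The `vagrant` shell function below is a stand-in so the sketch runs anywhere; drop it to drive the real CLI. The box names are hypothetical — your Vagrantfile defines its own:

```shell
set -eu
# Stand-in so this sketch runs without Vagrant installed; remove it to
# drive the real CLI. Box names ('puppetmaster', 'webrole') are hypothetical.
vagrant() { echo "would run: vagrant $*"; }

vagrant up puppetmaster webrole                    # bring up the master and a role client
vagrant ssh webrole -c 'sudo puppet agent --test'  # full agent run against the Vagrant master
```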
The next step is checking that code into our personal fork of the Puppet code repo and raising a pull request. This allows a second pair of eyeballs to look the code over. We do allow pushing very minor changes directly to production, but only where the change involves no logic (altering an email address in Hiera is OK; adding a new resource or 'if' block in a manifest is not).
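The review gate itself is ordinary git plumbing. A minimal sketch, with hypothetical repo and branch names — the pull request itself is raised in the forge's web UI:

```shell
set -eu
# Hypothetical repo/branch names; the PR is raised in the web UI afterwards.
tmp=$(mktemp -d) && cd "$tmp"
git init -q puppet-control && cd puppet-control
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
# Work happens on a topic branch in a personal fork:
git checkout -q -b hiera/update-alerts-address
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "hiera: point alerts at the new address"
git log --oneline -n 1
```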
Once given the rubber stamp, the change lands in our Puppet repo's staging branch. At this stage we have a group of servers, defined as less important, which run permanently against the staging branch.
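With directory environments, pointing a box at the staging branch can be as small as one agent setting — assuming your setup maps branch names to environment names, which ours does:

```ini
# puppet.conf on a staging-group box (illustrative)
[agent]
environment = staging
```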
No-op all the things!
The one way we've neutered Puppet is no-op: all our machines (including staging) run in no-op mode by default. This lets us view the Puppet Enterprise dashboard and filter by our guinea pig machine group to see what changes would have occurred. The guinea pig machines are carefully chosen to represent a good cross-section of our infrastructure in terms of role, operating system versions, and so on. Effectively, we're using production servers to stage other production servers.
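The default lives in the agent config rather than being passed per-run, so an unattended run can never enforce anything:

```ini
# puppet.conf on every node: report what would change, enforce nothing
[agent]
noop = true
```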
Finally, if the changes on the dashboard are as expected, we do a real run on the guinea pig machines and rubber-stamp the change into production. If the production machines (which are also doing periodic no-op runs) don't report anything scary, we do a real run on all machines via MCollective.
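The promotion sequence can be sketched as below. The `mco` shell function is a stand-in so the sketch runs anywhere; the `puppet runonce` application and fact filters come from the mcollective-puppet-agent plugin, though the `servergroup` fact name here is hypothetical:

```shell
set -eu
# Stand-in so this sketch runs without MCollective installed; remove it to
# drive the real client. The 'servergroup' fact name is hypothetical.
mco() { echo "would run: mco $*"; }

# 1. Enforcing run on the guinea pig group only:
mco puppet runonce --no-noop --with-fact servergroup=guinea_pigs

# 2. Production no-op reports look clean? Enforce estate-wide:
mco puppet runonce --no-noop
```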
This gives us semi-robust change control with Puppet while avoiding the overhead of a second infrastructure. Using Vagrant puppet-masters lets us create and alter Puppet code from our laptops anywhere in the world while still having a good idea of how it'll affect the real machines.
We're not yet confident enough in the system to do real runs by default in production, but as time goes on we'll see how it goes. For now I think it's a decent trade-off, but one that it seems few people are making. Hopefully this will let people know that it does work in the real world!
To make this a fair appraisal, I should mention some of the trade-offs of this approach. Firstly, your guinea pig machines must reflect the whole estate well: it's no good having no DB server in there if your Puppet changes affect DB machines.
Secondly, it's slow. There's human eyeballing, and there are several rubber-stamping stages. Because we use no-op heavily, this means there are several wait-stages built into the process.
This second issue can be fixed by having robust role-testing. That is to say, if you have a scriptable way to prove that a machine is doing what it's meant to (i.e. proper monitoring) for all systems, you can let the guinea pig machines do real Puppet runs by default and use the scripted testing to ensure there are no deal-breaking mistakes in the code. This is an area we need to improve before we can do that. We've got good monitoring coverage, but its integrity needs bumping up a bit.
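A sketch of what that scripted role-testing could look like: run a list of probes against a box, report pass/fail, and exit status tells the pipeline whether the role still works. The `true`/`false` probes here are placeholders so the sketch runs anywhere; real ones would be port checks, health URLs, or your monitoring system's check plugins:

```shell
set -u
# Sketch of scriptable role-testing: run a list of probes, report pass/fail.
# 'true' and 'false' are placeholder probes so this runs anywhere; real ones
# would be nc port checks, curl'd health URLs, or monitoring plugins.
failures=0
check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    failures=$((failures + 1))
  fi
}

check "daemon answers on its port" true    # e.g. nc -z db01 5432
check "health URL returns 200"     false   # e.g. curl -fsS http://web01/health

echo "$failures probe(s) failing"
```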