Change control done wrong

Change control has been with the IT industry for a long time. As systems have increased in number, the need for change has become more and more pressing. A modern company makes hundreds of changes to its infrastructure every day, and the complexity of these environments has increased dramatically.

Change control is one of the original keystones of managing IT. It's codified in frameworks like ITIL and many other industry standards, and over time it has become the norm at many places. As time has progressed, however, the notion of controlling change has slowly become more awkward. We cannot increase the number of IT staff infinitely, and systems have become bigger every year. This leaves us in a situation where change control could become the bottleneck of the business.

With configuration management systems we've now given our IT teams a force multiplier. No longer constrained by the speed of human endeavour on a 1:1 ratio, we can now safely manage thousands of systems with a single operator.

This comes at an expense of course: change control. You still need to know what's occurring in the infrastructure. That information is still available, and in fact is more available than ever, if you can make the culture shift of understanding that the configuration held within the config management system is the infrastructure.
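If the configuration is the infrastructure, then a record of what changed can be derived from the configuration itself rather than written by hand. The following is a minimal sketch of that idea, not a real tool: it assumes the configuration can be flattened into key/value snapshots, and the keys, addresses and the `diff_config` helper are all hypothetical.

```python
from typing import Any


def diff_config(old: dict[str, Any], new: dict[str, Any]) -> list[str]:
    """Return human-readable change records between two config snapshots."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(f"added {key} = {new[key]!r}")
        elif key not in new:
            changes.append(f"removed {key} (was {old[key]!r})")
        elif old[key] != new[key]:
            changes.append(f"changed {key}: {old[key]!r} -> {new[key]!r}")
    return changes


# Hypothetical snapshots of the managed configuration at two points in time.
before = {
    "dns/www.example.com": "192.0.2.10",
    "disk/app01/size_gb": 50,
}
after = {
    "dns/www.example.com": "192.0.2.20",
    "dns/api.example.com": "192.0.2.30",
    "disk/app01/size_gb": 50,
}

for record in diff_config(before, after):
    print(record)
# added dns/api.example.com = '192.0.2.30'
# changed dns/www.example.com: '192.0.2.10' -> '192.0.2.20'
```

In practice the version control history behind the config management system plays this role. The point is only that the record describes what actually changed, not what somebody intended to change.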

Without naming names, I recently dealt with a client that held that all changes to the systems must be logged: new DNS records, alterations in IP addresses, etc. These 'changes' were then approved by a single person, and only then actioned. As you can imagine, several breakdowns in this system occurred: a) the person doing the approvals often didn't log their own changes; b) changes were often logged after the change was made, to save time; c) changes logged before the scope of the issue was fully understood produced inaccurate change records; d) the approval process had an obvious single point of failure.

These issues were mostly failures in the execution of the idea, not in the idea itself. In my opinion there are more fundamental issues with such a system though, mainly that the system does not log changes at all: it logs intentions. While an intention to make a change may seem at first glance like a change, it isn't.

For instance, say you have an issue with disk space filling up. Under such a system, this is how it has to be handled:

  • log a change that the disk will be increased in size: approve change: make change
  • find out the issue wasn't the disk being too small, but the app logging too much
  • log a revert of the previous change: approve revert: revert change
  • log a change to change app logging: approve change: make change
  • rinse, repeat.

When written down like this it seems nonsensical, and it really is - but this scenario plays out every day in a badly managed change control ecosystem. There are a few key aspects missing from the example scenario's environment which would greatly improve the situation:

  • configuration management
  • testbed environment

In short, you need to be making alterations in the testbed environment and then duplicating these changes into the production one. The approval comes at the time of applying the now-known-correct fix, not at the break-fix stage. The main issue, however, is not having a testing environment. You simply cannot conjure up fixes to issues in the production environment without knowing the fix worked in a testbed one, but you cannot have a testbed environment if you've no way to easily spin one up, namely configuration management.
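The workflow described here can be sketched in a few lines. This is an illustration under stated assumptions rather than a real deployment tool: the environment is modelled as a plain dictionary, a deep copy stands in for spinning up a testbed, and `promote`, the two fixes, the check and the approver are all hypothetical names.

```python
import copy
from typing import Callable

Config = dict[str, object]


def promote(
    production: Config,
    change: Callable[[Config], None],
    check: Callable[[Config], bool],
    approve: Callable[[Config], bool],
) -> bool:
    """Try a change on a testbed copy; touch production only if proven and approved."""
    testbed = copy.deepcopy(production)  # stand-in for spinning up a testbed
    change(testbed)
    if not check(testbed):
        return False  # the fix did not work; production is untouched
    if not approve(testbed):
        return False  # approval is requested only for a known-correct fix
    change(production)
    return True


# Hypothetical environment and fixes, following the disk space example.
production: Config = {"disk_gb": 50, "log_level": "debug"}


def grow_disk(cfg: Config) -> None:
    cfg["disk_gb"] = 100


def quieten_logging(cfg: Config) -> None:
    cfg["log_level"] = "warning"


def disk_stops_filling(cfg: Config) -> bool:
    # Assumed root cause for this example: the app logs too much at debug level.
    return cfg["log_level"] != "debug"


def always_approve(cfg: Config) -> bool:
    return True


print(promote(production, grow_disk, disk_stops_filling, always_approve))        # False
print(promote(production, quieten_logging, disk_stops_filling, always_approve))  # True
print(production)  # {'disk_gb': 50, 'log_level': 'warning'}
```

The wrong guess (growing the disk) never reaches production and never needs an approval or a logged revert. Only the fix that passed in the testbed is put forward.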

This circular dependency makes deploying configuration management into an existing production environment very hard indeed. However, I believe the improvements to the workflow that occur when a testbed environment exists are invaluable.