Change control done wrong

Change control has been a tool of the IT industry for a long time. As systems have grown in number and density, the need for change, and for controlling it, has on the surface become ever more pressing. Modern companies make hundreds of changes to their infrastructure every day, even as the complexity of these environments increases dramatically.

Change control is one of the original keystones of managing IT. It has been codified into standards like ITIL and many others, and over time has become the norm at many places of work. As time has progressed, however, the notion of controlling change has slowly become more... awkward. We cannot increase the number of IT staff indefinitely, while systems grow more complex every year - this leaves us in a situation where change control itself could become the bottleneck of the business.

With configuration management systems we've now given our IT teams a force multiplier. No longer constrained to a 1:1 ratio of human effort to machines, we can safely manage thousands of systems with a single operator. This comes at a cost, of course, and that cost flies under the flag of change control: you still need to know what's occurring in the infrastructure. That information is still available, but a cultural shift is needed to understand that the configuration held within the config management system is the infrastructure.
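The "configuration is the infrastructure" idea boils down to declaring a desired state once and converging every host toward it. A minimal sketch of that reconcile loop, with purely illustrative file names (this stands in for what tools like Puppet or Ansible do, it is not any real tool's API):

```python
# Desired state is declared once, centrally. These entries are
# illustrative examples, not a real configuration.
desired_state = {
    "ntp.conf": "server pool.ntp.org iburst\n",
    "motd": "Managed by config management -- do not edit by hand\n",
}

def reconcile(host_files: dict) -> list:
    """Converge one host's files to the desired state; return what changed."""
    changed = []
    for path, content in desired_state.items():
        if host_files.get(path) != content:
            host_files[path] = content   # converge the host
            changed.append(path)
    return changed

# One operator, many hosts: the same declaration converges them all,
# whether a host is drifted, empty, or already correct.
fleet = [{"motd": "stale banner\n"}, {}, dict(desired_state)]
for host in fleet:
    reconcile(host)
```

Because reconciliation is idempotent - applying it to an already-correct host changes nothing - the declared state, not a hand-written log entry, is the record of what the infrastructure looks like.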

Without naming names, I recently dealt with a client who held to the belief that all changes to their systems must be manually logged: new DNS records, alterations to IP addresses, etc. We're talking 90's computing gone wild here. These 'changes' were then manually approved by a single person, and then performed. As you can imagine, this system failed in several ways:

  • the person doing the approvals often didn't log their own changes at all
  • changes were often logged after the change was made, to save time
  • to appear proactive, changes were logged before the scope of the issue was fully understood, producing inaccurate logs
  • the approval process had an obvious single point of failure
  • the change descriptions were often high-level, rarely containing the needed detail

These issues were mostly failures in the execution of the change control concept, but they reflect a larger, more fundamental issue: old-school thinking meeting modern computing. At a base level the system does not enforce the logging of changes at all - it logs intentions. While an intention or outline of a change may seem at first glance like the change itself, it really isn't.

For instance, suppose you have an issue with disk space filling up - this is how it would be handled:

  • log a change that the disk will be increased in size: approve change: make change
  • find out that the issue wasn't actually the disk being too small, but the app logging too much
  • log a revert of the previous change: approve revert: revert change
  • log a change to change app logging: approve change: make change
  • rinse, repeat.

When written down like this it seems nonsensical and highly inefficient - and it is - but this scenario plays out every day in a badly managed change control ecosystem. The infrastructure moves forward at a glacial pace. A few key aspects are missing from the example scenario's environment that would greatly improve the situation:

  • configuration management
  • test-bed environment
  • infrastructure resiliency

In short, you need to make alterations in the test-bed environment and then duplicate those changes into production via the config management system. The approval comes at the time of applying the now-known-correct fix, not at the break-fix stage. The main obstacle, however, is not having a testing environment at all. You cannot conjure up fixes in the production environment without knowing the fix actually worked in a test-bed one - but you cannot have a test-bed environment if you've no way to easily spin one up, namely configuration management!
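The test-bed-first workflow described above can be sketched in a few lines: a fix is applied and verified in the test environment, and only a fix that passed there can be promoted (and approved) into production. All function names and the log-level "fix" are illustrative assumptions, not any specific tool:

```python
def apply_fix(env: dict, fix: dict) -> None:
    """Apply a configuration change to an environment (a plain dict here)."""
    env.update(fix)

def verified_in_test(test_env: dict, fix: dict, check) -> bool:
    """Apply the fix to the test-bed and confirm it actually worked."""
    apply_fix(test_env, fix)
    return check(test_env)

def promote(prod_env: dict, fix: dict, tested: bool, approved: bool) -> bool:
    # Approval happens here, against the now-known-correct fix,
    # not at the break-fix stage.
    if tested and approved:
        apply_fix(prod_env, fix)
        return True
    return False

test_env = {"log_level": "debug"}
prod_env = {"log_level": "debug"}
fix = {"log_level": "warn"}          # cure the over-chatty app logging
ok = verified_in_test(test_env, fix, lambda e: e["log_level"] == "warn")
promote(prod_env, fix, tested=ok, approved=True)
```

The point of the sketch is the ordering: an untested fix can never reach `promote`, so the approve/revert/approve churn from the disk-space scenario disappears.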

This chicken-and-egg situation makes deploying configuration management into an existing production environment very hard indeed, particularly if you're working without buy-in from all parties concerned. However, I believe the improvements to the work-flow that a test-bed environment brings are simply invaluable.

The last point, resiliency, should not be underestimated. Large companies like Facebook and Google not only test all their changes, but do staged roll-outs and run systems which can survive partial failure. You need systems designed to be upgraded in multiple stages, which can survive one of those upgrades going awry. It's no longer good enough to treat carefully curated production systems as sacred ground - as the saying goes, "systems are cattle, not pets".
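A staged roll-out is simple to express: upgrade one batch of hosts at a time, health-check the batch, and stop before the blast radius grows if the upgrade goes awry. A minimal sketch, with batch size and health check as illustrative assumptions:

```python
def staged_rollout(hosts, upgrade, healthy, batch_size=2):
    """Upgrade hosts in batches; abort on the first unhealthy batch."""
    done = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            upgrade(host)
        if not all(healthy(h) for h in batch):
            return done, False        # partial failure: stop, don't spread it
        done.extend(batch)
    return done, True

# Illustrative fleet: six web hosts on version 1, upgraded to version 2.
hosts = [{"name": f"web{i}", "v": 1} for i in range(6)]
def upgrade(h): h["v"] = 2
done, ok = staged_rollout(hosts, upgrade, healthy=lambda h: h["v"] == 2)
```

If a batch fails its health check, only `batch_size` hosts are affected and the rest of the fleet keeps running the old version - which is exactly the "survive one of those upgrades going awry" property.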