Freedom versus Security in Devops

April 25th, 2015

I've worked with organizations in which operations support took the form of a human black box that responded to any request with "no", and in which experienced development teams were forbidden from launching a server or opening a cloud account to try something out. It could take months to request anything new. I've also worked with organizations in which every individual engineering team was responsible for managing its own devops support, all the way down to arranging technical vendors and managing cloud accounts. In the former case, there is a huge invisible ball and chain attached to progress. In the latter case there is duplication, competition, and the need for people to spend time on integrating disparate internal efforts that were performed in complete isolation from one another. Personally, I'm more in favor of the second approach. Internal competition, internal diversity of approaches, and the ability to experiment on a whim, decided by the people who know the most about the matter at hand, provide benefits that far outweigh the disadvantages. Forcibly imposed monocultures of any sort, and oppressive, distant operations organizations in particular, stifle innovation and cause the best and brightest people to leave in frustration.

What to do about security, however? Innovation is fine and good, but one has to accept that most good developers, and even most devops specialists, are not knowledgeable enough about security to reliably produce bulletproof deployments, bulletproof cloud services, or bulletproof anything else for that matter. The state of knowledge regarding security of online services in the development community is a lot better than it used to be, thanks to many high-profile issues in recent years, but at the same time there is so very much more to know than there used to be. The picture has become enormously complex, and even experts in the field of online security now have their specialties, the scope of their experience narrowing as a fraction of the whole.

It doesn't matter how rapid a pace of innovation a company keeps up if it is taken advantage of by attackers. The overall cost of any successful attack is very high, and that threat is exactly why we see so many companies adopt the ball and chain of a stifling operations organization. Those in charge can't see a path forward that has a higher expected value of producing successful armor while still letting them look like they did the right thing if the armor fails. I am not convinced, however, that the operations straitjacket, rendering developers unable to experiment, actually produces a better outcome in practice. First, when you suppress innovation, all you get is a black market in the tools of innovation: now you have the most ambitious people doing risky things in the dark without asking for help or guidance. Second, look at all the companies with iron-fisted operations and security groups that still get hacked.

In considering all of this, and after working in numerous different environments, my take is that the best approach for a company in which there is development freedom is to establish (a) a fortified network with (b) monitored deployments and (c) gentle automated enforcement.

A Fortified Network

A good example of what I mean by this is a Virtual Private Cloud (VPC) as implemented in AWS. Access to servers in the VPC can be funneled through bastion servers and access to applications made via Elastic Load Balancers. In the former case heavy-duty hardening and specialist security knowledge can be restricted to the setup of a few key systems, while in the latter case firewalling and related security concerns are outsourced to Amazon. In either case access can be further restricted via standard network or firewall approaches.

There are many other approaches to achieve much the same end, not all of which use clouds, but the end result will be similar. If developers decide to deploy an experiment, it will be invisible and inaccessible to the world, lurking behind the network fortifications and thus far safer than would otherwise be the case.

It is worth noting that another important aspect of more sophisticated cloud systems such as AWS is the ability to easily enumerate and control access to APIs from deployed servers without placing any sort of secret or key upon that server. No deployed application needs to be able to perform more than a fraction of the operations that an authorized development IAM user can perform within AWS, and keeping deployments locked down to the minimum of needed permissions throws up additional barriers to any attacker who gets as far as gaining access to a cloud server within the VPC.
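As a rough illustration of what "the minimum of needed permissions" can look like, here is a small Python sketch of an IAM policy document for an application that only reads objects from one S3 bucket, plus a check that flags overly broad Allow statements. In AWS such a policy would typically be attached to an IAM role that the server assumes through its instance profile, which is how the server avoids holding a stored key. The bucket name and the `wildcard_statements` helper are assumptions made for this example.

```python
import json

# Hypothetical policy for an application that only reads one S3 bucket.
# The bucket name is a placeholder invented for this example.
APP_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/*",
        }
    ],
}


def as_list(value):
    """IAM allows a bare string or object where a list is also accepted."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy):
    """Return Allow statements whose Action or Resource is overly broad."""
    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = any(r == "*" for r in resources)
        if broad_action or broad_resource:
            flagged.append(stmt)
    return flagged


if __name__ == "__main__":
    print(json.dumps(APP_POLICY, indent=2))
    print("overly broad statements:", len(wildcard_statements(APP_POLICY)))
```

A check like this is deliberately crude: it only catches the most obvious wildcards, but it is cheap enough to run against every policy a team proposes.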

Monitored Deployments

Within the fortified network, every deployed server should be monitored. It should be possible to obtain a list of all running servers in the network, and every server should contain an agent of some sort that exposes or reports useful information about the server inventory. In AWS it is also possible to tag and inventory deployed instances and other resources independently of their contents. There are many network monitoring systems, open source and commercial, that provide useful capabilities including agents on the servers that report status and server inventory. It isn't hard to roll your own agent if the requirements are simple: write a bash script that reports distribution, open ports, and installed packages and hook it up to xinetd, for example.
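To give a sense of how little is needed for a simple agent, here is a minimal sketch in Python rather than bash. It assumes a Linux server with `ss` available and a Debian-family package manager; the function names and the shape of the report are invented for this example, and an RPM-based system would need a different package query.

```python
import json
import platform
import socket
import subprocess


def parse_os_release(text):
    """Parse the KEY=value format of /etc/os-release into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def read_distribution(path="/etc/os-release"):
    """Return the distribution's display name, or 'unknown' if unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return parse_os_release(handle.read()).get("PRETTY_NAME", "unknown")
    except OSError:
        return "unknown"


def run_lines(command):
    """Run a command and return its output lines, or [] if it cannot run."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return []
    return result.stdout.splitlines()


def build_report():
    """Collect the inventory fields this sketch reports."""
    return {
        "hostname": socket.gethostname(),
        "kernel": platform.release(),
        "distribution": read_distribution(),
        # `ss -tln` lists listening TCP sockets; drop its header line.
        "listening": run_lines(["ss", "-tln"])[1:],
        # Debian-family query; assumed here, not required by the approach.
        "packages": run_lines(["dpkg-query", "-W", "-f", "${Package} ${Version}\n"]),
    }


if __name__ == "__main__":
    print(json.dumps(build_report(), indent=2))
```

Because the script writes its report to standard output, it could be registered as an xinetd service so that a collector connecting to the chosen port receives the JSON, or it could be run from cron and pushed to the collector instead.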

Access to agent packages, and to base boxes, images, or distributions that include the agent, is provided to the organization, and every group is expected to build their own deployments from that basis - and of course always deploy into the fortified network. Beyond that, they are free to do as they will, experiment as they see fit, and otherwise have fun.

Gentle Automated Enforcement

All of this reported information will wind up in one service inside the fortified network for processing. That service can flag servers without an agent, or servers lacking a necessary patch to a critical package such as SSH, or other undesirable circumstances. Automated warnings can be issued and servers removed from the network if the situation isn't repaired within a few days. This works well to manage expectations without making anyone feel singled out or put upon, especially if warnings are tied to clear, easily available documentation.
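The warn-then-remove rule can be sketched in a few lines of Python. This is only an outline of the decision logic under stated assumptions: the three-day grace period stands in for "a few days", the minimum OpenSSH version is a made-up threshold, and the `Server` record is a hypothetical summary of what the agents and the server inventory report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=3)  # assumed value for "a few days"
MIN_OPENSSH = (6, 7)  # hypothetical minimum version, for illustration only


@dataclass
class Server:
    name: str
    has_agent: bool
    openssh_version: Optional[tuple] = None  # None when nothing was reported
    first_flagged: Optional[datetime] = None  # when a problem was first seen


def problems(server):
    """List the policy problems found for one server."""
    found = []
    if not server.has_agent:
        found.append("no monitoring agent")
    elif server.openssh_version is None or server.openssh_version < MIN_OPENSSH:
        found.append("OpenSSH missing or below required version")
    return found


def decide(server, now):
    """Return 'ok', 'warn', or 'remove' for one server."""
    if not problems(server):
        return "ok"
    if server.first_flagged is None:
        return "warn"  # first sighting: warn and start the clock
    if now - server.first_flagged >= GRACE_PERIOD:
        return "remove"
    return "warn"


if __name__ == "__main__":
    now = datetime(2015, 4, 25)
    stale = Server("build-box", has_agent=False,
                   first_flagged=now - timedelta(days=4))
    print(stale.name, decide(stale, now), problems(stale))
```

Keeping the decision a pure function of the server record and the current time makes it easy to test, and easy to run in a report-only mode before any server is actually removed from the network.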

The agent approach combines well with third party scanning services and tools, which can be aimed at exposed network addresses and applications and run regularly. Warnings can be automated and connected to underlying systems so that the right developers are notified that they have possible issues. It also works well in situations like the Heartbleed OpenSSL issue where almost everything in the average company's infrastructure is found to have a critical issue requiring immediate patching. It is tremendously helpful to be able to have confidence in where exactly progress stands in a rapid, widespread patch effort.
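For the patch-progress question, a sketch along the following lines may help. It assumes the collected reports have already been reduced to a mapping from server name to owning team and reported OpenSSL version; that shape, and the `patch_progress` name, are assumptions for this example. The version set covers the upstream releases affected by Heartbleed (1.0.1 through 1.0.1f).

```python
from collections import defaultdict

# Upstream OpenSSL releases affected by Heartbleed (CVE-2014-0160).
# Distributions often backport fixes without changing the upstream version,
# so a real check would compare distribution package versions instead.
VULNERABLE = {"1.0.1"} | {"1.0.1" + letter for letter in "abcdef"}


def patch_progress(reports):
    """Summarise reports shaped like {server: {"team": ..., "openssl": ...}}."""
    pending = defaultdict(list)
    for server, report in sorted(reports.items()):
        if report["openssl"] in VULNERABLE:
            pending[report["team"]].append(server)
    total = len(reports)
    patched = total - sum(len(servers) for servers in pending.values())
    return {"total": total, "patched": patched, "pending_by_team": dict(pending)}


if __name__ == "__main__":
    # Invented sample data, not real inventory.
    sample = {
        "web-1": {"team": "storefront", "openssl": "1.0.1e"},
        "web-2": {"team": "storefront", "openssl": "1.0.1g"},
        "api-1": {"team": "platform", "openssl": "0.9.8y"},
    }
    print(patch_progress(sample))
```

Grouping the unpatched servers by team is what lets the warnings reach the right developers, and the patched count gives a single number to watch during the patch effort.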

I've seen this sort of approach work well in a company where a great deal of freedom of action was provided to individual development groups. That said, a lot of work is involved in building a comprehensive monitoring system and the automation to go with it, even though many of the component parts exist in mature forms. Being in AWS or a similar cloud system does make it easier. It is also the case that some monitoring and enforcement is a lot better than none: the full envisaged monitoring system can be built out incrementally, with benefits accruing at each stage.