What’s needed to reliably release large-scale web applications? How do you ensure zero downtime (and quick recovery) for mission critical apps? Last week I participated in an online panel on the subject of Strategies for Deploying Mission Critical Web Applications at Scale, as part of Continuous Discussions (#c9d9), a series of community panels about Agile, Continuous Delivery and DevOps. Watch a recording of the panel:
Continuous Discussions is a community initiative by Electric Cloud, which powers Continuous Delivery at businesses like SpaceX, Cisco, GE and E*TRADE by automating their build, test and deployment processes.
Below are a few insights from my contribution to the panel:
The people factor
"At the BBC, when I was working there, people were very important. Taking pride in what you do, always writing clean code, making sure the code is cleaner than the last time you left it. Another side is having people on the team who think differently. Having people who all think the same way isn't great for a mission critical app."
"For the BBC Live system, the technical lead was great, he thought out of the box while we were stuck in the code. He saw the big issues that might occur down the line. One example was a long tail problem: Google crawls your website, and we had live pages that could each make hundreds of HTTP calls; being kept alive by a Google bot, they could bring the system down pretty fast."
"We had to be defensive against that in the system design very early on, to make sure that never happened, making sure pages were cached for longer: if they hadn't been live for 24 hours we'd cache them for an hour or two, and so on. Nobody has time to test and discover these types of edge cases in the day-to-day of development, so you have to think out of the box and see it coming."
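The defensive caching described above can be sketched as a TTL policy keyed on how recently a page was live. This is a minimal illustration, not the BBC's actual implementation; the function name and the specific thresholds are assumptions chosen to match the quote (short TTLs for live pages, an hour or two for pages dormant past 24 hours):

```python
from datetime import datetime, timedelta

def cache_ttl_seconds(last_live_at: datetime, now: datetime) -> int:
    """Hypothetical TTL policy: the longer a page has been dormant,
    the longer we cache it, so a crawler hitting stale 'live' pages
    cannot force hundreds of origin requests per page."""
    idle = now - last_live_at
    if idle < timedelta(minutes=5):
        return 30             # genuinely live: short, poll-friendly TTL
    if idle < timedelta(hours=24):
        return 15 * 60        # recently live: moderate TTL
    return 2 * 60 * 60        # dormant a day or more: cache for hours

now = datetime(2016, 1, 1, 12, 0)
assert cache_ttl_seconds(now - timedelta(minutes=1), now) == 30
assert cache_ttl_seconds(now - timedelta(hours=30), now) == 7200
```

The point of the sketch is that the policy is decided at design time, in one place, rather than discovered under load.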
"When you have creative people who think differently from each other, and you all keep thinking, 'how can I break it?', that's extremely important. You should spend time as a team discussing what might break the site; that's great for a large scale system."
The process factor
"The BBC is an 80-year-old corporation, so systems you have to support can definitely be considered legacy. In terms of process, every year new products are introduced; for example, when Premier League football was announced, there was a very lengthy, well documented process for how to get it out. You might have messaging, you might have to turn things off in a certain order, go from test to stage to live. It was a lengthy process that was practiced well in advance."
The tools factor
"In the system I worked on, monitoring was a key aspect, we had standardized alerting, aggregated data from many different systems, we’d report on error codes in a standard way. It became an early warning system for any system in the BBC, because it would alert as soon as there was a problem in the system itself."
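The standardized alerting described above can be illustrated with a small aggregator. This is a hedged sketch, not the BBC's monitoring stack; the event shape, the `find_alerts` name, and the threshold are all assumptions, chosen only to show the idea of many systems reporting error codes in one standard format so that a spike anywhere triggers an early warning:

```python
from collections import Counter

def find_alerts(events, threshold=100):
    """Aggregate (system, error_code) events from many systems and
    return the error codes that cross the alert threshold."""
    counts = Counter(code for _system, code in events)
    return {code: n for code, n in counts.items() if n >= threshold}

# Illustrative data: one system spiking, another healthy.
events = [("live", "HTTP_500")] * 120 + [("search", "TIMEOUT")] * 5
assert find_alerts(events) == {"HTTP_500": 120}
```

Because every system reports in the same shape, the same aggregator serves as an early-warning system for all of them.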
"We were addicted to tools. We were using Docker containers to build lots of different things: dashboards, helpful tools to debug things and get live information, because there are live events and you need to know what's going on to solve problems. We invested a lot of resources in tools; in fact we had a tooling team who created many different tools."
"The most important thing in a large corporation is communication; teams are encouraged to talk to other teams if they think they're doing something similar. You meet someone in a pub and learn about duplication of effort, so you say, 'hey, we're doing the same thing, let's have a meeting and talk about it.' Having a wiki where you can search and see what other people are doing can be very helpful."
"At the BBC we didn't do that; we were on the legacy system, working towards the cloud, but hadn't gotten to blue-green deployments. What we did do was perform deployments to each server in turn, taking it out of the pool, then putting it back in the pool. It's not blue-green, but it is a rolling deployment."
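The rolling deployment described above can be sketched as a loop over the server pool. This is a minimal illustration under assumptions: `rolling_deploy`, the pool-as-set representation, and the stand-in `deploy` and `healthy` callables are all hypothetical, standing in for whatever load balancer and deployment tooling is actually in use:

```python
import time

def rolling_deploy(servers, pool, deploy, healthy, wait=0.01):
    """Upgrade servers one at a time so the pool always serves traffic."""
    for server in servers:
        pool.discard(server)          # take the server out of rotation
        deploy(server)                # upgrade it while no traffic hits it
        while not healthy(server):    # never re-add an unhealthy server
            time.sleep(wait)
        pool.add(server)              # back into rotation; next server

# Illustrative run with stand-in deploy and health-check functions.
pool = {"web1", "web2", "web3"}
deployed = set()
rolling_deploy(sorted(pool), pool, deployed.add, lambda s: s in deployed)
assert deployed == {"web1", "web2", "web3"} and len(pool) == 3
```

The key property is the invariant: at any moment, at most one server is out of the pool, so capacity dips by one server rather than dropping to zero as in a big-bang deploy.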
Feature flags and dark launches
"We used dark launches quite a bit. As a corporation we couldn't just show things to the user and then take them away; people care passionately about the product, and even small changes, like changing a color, could seriously affect it."
"So we'd have a dark launch, say for changing the styling on a page, or for a new non-critical back-end system; for example, we had a hot cache layer behind BBC Live that could be turned on and off. We used feature toggles for caching as well; we toggled cache times, because the system was initially designed for a client polling mechanism, and if we were under load we increased the cache time, fixed the problem, and brought the cache time back down."
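The cache-time toggle described above can be sketched as a runtime flag that widens the TTL under load and restores it afterwards. This is a hypothetical illustration, not the BBC's mechanism; the class name, the two TTL values, and the flag interface are assumptions:

```python
class CacheTimeToggle:
    """Hypothetical feature toggle: serve a longer cache TTL while the
    system is under load, then restore the normal TTL once fixed."""

    def __init__(self, normal_ttl: int = 30, load_ttl: int = 300):
        self.normal_ttl = normal_ttl
        self.load_ttl = load_ttl
        self.under_load = False

    def set_under_load(self, flag: bool) -> None:
        self.under_load = flag

    @property
    def ttl(self) -> int:
        return self.load_ttl if self.under_load else self.normal_ttl

toggle = CacheTimeToggle()
assert toggle.ttl == 30
toggle.set_under_load(True)      # incident: shed load onto the cache
assert toggle.ttl == 300
toggle.set_under_load(False)     # fixed: bring cache time back down
assert toggle.ttl == 30
```

Raising the TTL shifts polling traffic onto the cache instead of the origin, buying time to fix the underlying problem without a deploy.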
"You have to pick and choose when to use a feature toggle or a dark launch when working on very large systems; using feature toggles for certain things is just too complex."