Cloud Architecture Part Five: Avoiding a single point of failure
(This is the fifth in our series on cloud architecture)
It’s obvious really; a ‘single point of failure’ is any point in your computing system where you have only one component doing the job. In short, you have no Plan B if something goes wrong.
The problem with having a single point of failure built in to your system is that, despite your best efforts, from time to time components fail. It could be an application server, a web server or a database server, for example. What’s more, you’d swear computer systems somehow have a way of finding any single points of failure and breaking them.
I’ll share a good example with you. Recently, an organisation I’m familiar with had a site that followed cloud architecture best practices – multiple servers, monitoring, load balancers, N-tier backups, solid cloud security – they’d gone the whole nine yards, so to speak. There was, however, only a single instance of the database. So when that server was temporarily unavailable, the entire site came down.
It’s such a simple thing to avoid and it’s so costly in terms of productivity. The best course is to avoid, at all costs, any single point of failure in your system.
Here are some tips to avoid these types of outages:
- Have two of everything – when you design for this, you get a bonus: scalability becomes much easier as your application grows.
- Go for N-tier – it’s going to make building and testing systems much simpler.
- Load balance the web servers on HTTP and HTTPS – I’ll talk about load balancers more in the next blog post. If you need HTTPS, encode any variables in cookies or on the page itself so you can maintain stateless web servers.
- Configure your web servers to talk to the application server pair – if that is not an option with your application, point each web server at a single application server and make sure the health check on the load balancer tests something that shows the application server is up as well.
- As for avoiding a single point of failure in a database, as in the example above, that warrants an entire blog post of its own, so stay tuned.
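The health check tip above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any load balancer's real API: `health_status` and the probe names are hypothetical, and it assumes your load balancer polls a URL on each web server and treats anything other than HTTP 200 as 'down'.

```python
from typing import Callable, Dict, Tuple


def health_status(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (200, "OK") only when every probe passes, else (503, reason)."""
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises counts as a failed dependency.
            healthy = False
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "DOWN: " + ", ".join(failed)
    return 200, "OK"


# Hypothetical probes: the web server is fine, the application server is not.
status, body = health_status({"web": lambda: True, "app": lambda: False})
print(status, body)
```

In a real deployment the `app` probe would make an actual request to the application server, so the load balancer pulls a web server out of the pool whenever the tier behind it stops responding.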
By way of example, here’s my architecture for two different, highly available systems:
- Application: World-famous blog
- Recovery Time Objective: Near instant
- Recovery Point Objective: 15 minutes
Approach: Configure the load balancer to have two servers in the pool. Configure one (Web1) as ‘Active’ and the other (Web2) as ‘Passive’. Set a custom monitor to watch for a result that shows the database server MySQL1 is responding. Ninefold Support can help you set up this custom configuration. At this point, if Web1 or MySQL1 is down, the load balancer will divert traffic to Web2 and subsequently MySQL2. To keep MySQL2 up to date, a management server runs a job every 15 minutes to dump the database to file, copy it to MySQL2 and restore it. This means you could lose up to 15 minutes of data. For a fairly static site, this is probably fine.
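A rough sketch of the 15-minute sync job, in Python, might look like this. It rests on several assumptions that are not part of the setup described above: the database is called `blog` (a hypothetical name), the dump path is a placeholder, credentials live in a MySQL option file on the management server, and the restore is sent to MySQL2 over the network rather than copying the file across first. The scheduling itself (for example, a cron entry every 15 minutes) is left out.

```python
import subprocess
from typing import Callable, List


def dump_command(source_host: str, database: str, dump_path: str) -> List[str]:
    """Build the mysqldump call that writes one database to a file."""
    return [
        "mysqldump",
        f"--host={source_host}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={dump_path}",
        "--databases",
        database,
    ]


def restore_command(target_host: str) -> List[str]:
    """Build the mysql client call; the dump is fed in on standard input."""
    return ["mysql", f"--host={target_host}"]


def sync_once(
    source_host: str = "MySQL1",
    target_host: str = "MySQL2",
    database: str = "blog",             # hypothetical database name
    dump_path: str = "/tmp/blog.sql",   # hypothetical path
    run: Callable = subprocess.run,     # injectable so the job can be dry-run
) -> None:
    """Dump the source database to a file, then replay it on the target."""
    run(dump_command(source_host, database, dump_path), check=True)
    with open(dump_path, "rb") as dump_file:
        run(restore_command(target_host), stdin=dump_file, check=True)
```

Because `check=True` raises if either command fails, a scheduler running `sync_once()` can alert on a broken sync instead of silently leaving MySQL2 further behind than 15 minutes.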
- Application: E-Commerce site
- Recovery Time Objective: 15 minutes
- Recovery Point Objective: 0 minutes (no data loss)
Approach: Configure the load balancer to have two servers in the pool. If Web1 is down, the load balancer will divert traffic to Web2. Here MySQL2 runs as a replication slave of MySQL1. If MySQL1 goes down, the management server needs to make a few changes. First, promote the slave to master. Next, change the configuration file on Web1 and Web2 to connect to MySQL2. If MySQL1 comes back up again, don’t worry: MySQL2 will not exchange data with it, and the web servers won’t access the stale data because of the configuration file change. These changes could be manual, or you could monitor for the failure via Nagios and run a script as a result.
Either way, test the various scenarios and scripts before moving to production. This will go a long way towards building a cloud solution that keeps running, whatever goes bump in the night.
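The failover steps above could be scripted along these lines. This is a sketch only: the INI-style config with a `[database]` section and `host` key is a hypothetical format, and `promote` and `update_config` are placeholder callables standing in for whatever actually promotes MySQL2 and pushes the new config to each web server in your environment.

```python
import configparser
import io
from typing import Callable, Iterable


def point_at_new_master(config_text: str, new_host: str) -> str:
    """Return the web server config with the database host swapped."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    parser["database"]["host"] = new_host  # hypothetical section and key
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def fail_over(
    promote: Callable[[str], None],
    update_config: Callable[[str, str], None],
    new_master: str = "MySQL2",
    web_servers: Iterable[str] = ("Web1", "Web2"),
) -> None:
    """Promote the slave first, then repoint every web server at it."""
    promote(new_master)
    for server in web_servers:
        update_config(server, new_master)


# Example: rewrite a hypothetical config so it points at MySQL2.
old_config = "[database]\nhost = MySQL1\nname = shop\n"
print(point_at_new_master(old_config, "MySQL2"))
```

The ordering matters: promoting MySQL2 before touching the web servers means they never get pointed at a database that is still read-only or still replicating.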
The 12 Cloud Architecture Principles
- Shared nothing architecture
- N-tier is for wimps
- Preproduction: Get one
- Monitor everything
- No SPOFs
- Load balancers are king
- Design with security in mind
- Who & where are the players?
- Power to the people
- Test it. But not for what it’s supposed to do
- Scale the easy things
- Understand your limitations. Experience usually means a lot of tears