Designing to Fail, Developers Win
Our story starts around the meteoric rise of iOS and Android devices, circa 2009, when they quickly grew to be the primary way people interact with technology. "Apps" quickly became a household word, and individual developers and small teams were publishing them faster, and to a larger audience, than ever before.
As the app revolution continued, companies like Facebook and Twitter noticed their users interacting with and relying on mobile devices above all else, which eventually forced those organizations to shift to a mobile-first strategy.
It is fair to say that the IT world is now application-centric. The reality is that our applications are always on and always within reach, so they need a reliable place to run their backend code.
Where do they live?
Data centers are expensive and hard to run. This new kind of application needed access to large amounts of resources when usage was high, but less during slow times. Applications needed agility and the flexibility to scale.
In 2006, Amazon released its first set of Amazon Web Services, which included the Elastic Compute Cloud enabling developers to access servers and storage on demand. The new resources were ready within minutes, and when they were no longer needed, they went away and billing stopped. Agility.
Fast-forward a few years. New players emerged in the cloud computing space like Microsoft’s Azure, Rackspace and Google Compute Engine.
Enormous web properties like Netflix running their infrastructure in the cloud, together with an explosion of number-crunching Big Data workloads, demanded the scalability to grow from the smallest development machine to huge clusters. Of course, it was only a matter of time before CIOs in enterprises and government organizations started to test the waters and entrust the cloud with their apps.
What's the catch?
Instant gratification for infrastructure and predictable pricing. There must be a catch…
Cloud computing is built on lower-cost commodity hardware, and every individual component is expected to fail routinely. Amazon decommissions and recycles thousands of drives per day. Racks of servers are taken offline for maintenance.
That means applications must be designed to fail.
Customers have little control over service interruptions caused by tasks like network maintenance in a given data center. The expectation set in cloud computing is that your virtual server may be briefly interrupted and quickly restarted in another part of the network.
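In practice, "designed to fail" often starts with treating transient infrastructure errors as normal and retrying them. As a minimal sketch (the helper and the simulated backend below are illustrative, not from any particular vendor's SDK), an application might wrap calls to its backend in exponential backoff with jitter:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter.

    Hypothetical helper: `operation` is any callable that may raise a
    transient ConnectionError (e.g. a server briefly interrupted and
    restarted elsewhere in the network).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter spreads retries out so a
            # recovering service is not hammered by every client at once.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Simulate a backend that fails twice during maintenance, then recovers.
failures = iter([ConnectionError, ConnectionError, None])
def flaky_backend():
    err = next(failures)
    if err:
        raise err()
    return "ok"

print(call_with_retries(flaky_backend, base_delay=0.01))  # prints "ok"
```

The jitter is a deliberate design choice: without it, many clients that failed at the same moment would all retry at the same moment too.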
This is a huge contrast to traditional data centers, where every infrastructure component was extremely expensive, contained dual or triple redundancy for each network device's electronics, and included another duplicate device to take over in the event of a total failure.
The applications in the legacy data center were designed with the assumption that the network would never be down, and most network professionals spend their careers aiming to achieve that goal.
Now that developers call up infrastructure "on demand," often outside of their company's private data center, the rules have changed.
With great power...
To realize the benefits of cloud-native applications, an organization must shift resilience responsibility from IT infrastructure teams to developers.
Applications are becoming more distributed. Many workloads are stateless, and can quickly recover from major changes or outages in infrastructure.
Heroku, a pioneer in cloud services for developers, published the twelve-factor app methodology, detailing the twelve steps developers should take to design a resilient, scalable application for today's infrastructure.
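One of those twelve factors is keeping configuration in the environment rather than in the code base, so the same application can run unchanged across laptops, data centers, and cloud providers. A minimal sketch of that idea (the variable names and defaults here are assumptions for illustration, not part of the methodology itself):

```python
import os

def load_config(env=os.environ):
    """Read deploy-specific settings from environment variables.

    Illustrative names: DATABASE_URL, PORT, and LOG_LEVEL stand in for
    whatever settings vary between a developer's machine and production.
    """
    return {
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "port": int(env.get("PORT", "8080")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }

# The same code serves a local dev machine (defaults) and a staging
# deploy (values injected by the platform) without modification.
print(load_config({"PORT": "5000", "LOG_LEVEL": "debug"}))
```

Because nothing deploy-specific is baked into the code, moving the app to another provider is a matter of setting variables, not editing source.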
How about monitoring?
For many of us, a number of factors fall outside of these twelve steps, creating additional challenges and complexities. For example, companies are dealing with legacy apps dating back twenty years or more that need to interoperate with the latest micro-service-architecture, web-scale, client-side-rendered applications. Or part of a company's service portfolio runs on one cloud provider, some at a traditional outsourcer, and the rest in house.
Some IT shops are starting to replicate cloud style computing infrastructures in their own data centers, with self-service access to resources. Major suppliers like Cisco, Juniper and VMware are beginning to develop turnkey solutions to enable this style of application hosting, or what we know as private cloud.
As connectivity has gone from convenience to mission critical, it has never been more important to accurately and regularly monitor and measure everything about your application's performance. The trouble with having two decades of apps running on disparate, varied cloud platforms, however, is that it's incredibly challenging to look across diverse infrastructures and understand performance in a holistic way. Most performance monitoring solutions were designed in a pre-app, pre-cloud world, without the speed or scalability to keep up with today's networks.
Performance monitoring tools for our connected, app-driven world need to go one step further and deliver speed at scale. As new technologies and systems are incorporated into our network infrastructure, these solutions should be able to monitor the new types of metrics that will inevitably arise, and they should present performance data in a way that gives actionable context and meaning so that you can visualize and understand where data is coming from and how it relates to your overall performance.
Staying a step ahead of network issues means using performance monitoring tools that aren't older than those legacy apps you have. Just because it's designed to fail doesn't mean it has to.
How can you decrease the likelihood of application and service failure?
Download a free whitepaper on 6 Steps to an Effective Performance Monitoring Strategy.