Failure, Repeated Failures is a myth — Failure as a Service

Jul 22, 2019

Failures can be fatal for any service or system. Customers will not wait for you to fix the problem: your reviews go down and they move to your competitors. As Murphy’s Law puts it:

“Things will go wrong in any given situation, if you give them a chance” and “Whatever can go wrong, will go wrong”

Failure services will get more attention in the coming days. A Forbes article highlighted them as one of the top 10 trends in digital transformation for 2018, and I believe they will influence how software is developed and tested.

The intent of this article is to promote making failure a first-class citizen for cloud and non-cloud services, rather than waiting for unexpected failures to happen in production. The concept also applies to smaller projects and smaller enterprises.

Industry Downtime for Failures

The industry average cost of IT downtime depends on many factors. The monetary loss varies with your revenue, your industry, the duration of the outage, the number of people impacted, the time of day and so on. The following statistics show how much business the industry loses.

According to Gartner analysis, businesses worldwide lose over $5,600 per minute, and up to $300,000 per hour, in web application downtime.
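To get a feel for what a per-minute figure means for a single outage, here is a minimal back-of-the-envelope sketch. It assumes the $5,600 per-minute rate quoted above is a flat average, which real outages will not follow exactly; the function name and default are illustrative only.

```python
# Rough outage cost estimate. The default rate is the per-minute figure
# quoted above, treated here as a flat average (an assumption).
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated loss in USD for an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for minutes in (1, 15, 60):
        print(f"{minutes:>3} min outage: about ${downtime_cost(minutes):,.0f}")
```

Multiplying the per-minute rate out to a full hour gives $336,000, slightly above the hourly figure quoted, so both numbers are best read as rough averages. Plug in your own rate to size the budget for failure testing.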

Average cost per hour of enterprise’s server downtime worldwide in 2017–2018

The statistic shows the average hourly cost of critical server outages. As of December 2017, 24% of respondents worldwide reported an average hourly server downtime cost of between $300,000 and $400,000.

https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/

Average cost per hour per server downtime worldwide in 2017, by vertical industry (in million U.S. dollars)

But actual server downtime brings other costs with it, such as the cost of manual intervention by IT engineers. Downtime is extremely expensive, in ways that can make or break the success of your organization and business services. It is not 100% avoidable, but much of it can be avoided by implementing a seamless, automated failure process. Various tools can automate failure testing; the subsequent sections detail the process.

Famous System Failures in Big Business

The following are a few big businesses that were unable to deliver required functionality because of a failure:

HSBC Suffers IT Outage: HSBC Online Banking suffered an IT outage that caused significant disruption to its customers. Those with a Banking for Business account who wanted to use online business banking were hit especially hard.
Airbus Software Bug Alert: Airbus issued an alert calling for urgent checks of its A400M aircraft after a report identified a software bug that had caused a fatal crash.
Ola App: Two bugs that were unearthed allowed unethical individuals with basic programming knowledge to enjoy unlimited free cab rides.

For more examples of business failures, please refer here

Types of Failures

Failures can occur in an application’s infrastructure, network or API, where errors typically cause the downfall of the application. The following are a few causes of failure:

Human Error: Miscommunication, insufficient training, or spotting a defect only at the end of the deployment lifecycle can all lead to IT failure in an application. Take an example: banking, which requires the highest level of regulatory oversight, isn’t immune to human error. Ulster Bank, which is owned by RBS, has caught the ire of its customers many times over the last few years. Many claim this is the result of poor governance. The most recent incident, caused by a single employee, left thousands of customers unable to see or access the money in their accounts; find more details here
Insufficient Testing: While migrating from legacy systems, TSB Bank experienced an extreme outage that left 1.9 million customers without access to their accounts for a total of three days. This could have been avoided if machine-learning-led automation had been in place to identify the root cause and pinpoint the problem.
Code Error: Failure points are fairly common in web applications. A developer can write vulnerable code that creates many failure points, such as database access code, file server access, API access, or sending messages to third-party components.
Failure during Software Update: In the Nest thermostat freeze, a software update for the thermostat went wrong and literally left users in the cold. The faulty update drained the device’s batteries, which led to a drop in temperature.
Insufficient Test Data: Without enough test data, test teams cannot cover all scenarios, which leads to application failures such as problems with unsupported file formats.
Configuration Error: Wrong configuration is the source of many major errors on sites. Very often, configuration values are not tested as part of the application. Configuration errors can result in severe failures because they often involve settings that control critical behavior such as fail-over, backup, error and exception handling, auditing and security. They are often detected too late to limit the damage; you can find more here
Hardware Errors: Unavailable devices, incompatibility between device and application, incorrect addresses, timeout issues and so on.

Many more types of failure can occur in any kind of system. You will never know how stable your platform is until a failure happens.

How to avoid Failures

It is true that eliminating 100% of failure scenarios is practically impossible, but failure testing can remove a large part of them. The following sections cover the how, what, when and where.

Failure testing is an approach that allows architects, designers, engineers and testers to discover weak spots that can lead to failure. The probability of failure plays a major role in understanding the health of an application, because some features have a higher probability of failure than others. Consider a site with two pieces of functionality: one logs a user in, the other determines the user’s browsing history. As a tester, you can identify the second as more likely to break because it involves many more business rules and much more logic than the login page.
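One way to turn that judgment into a test plan is to score each feature and test the riskiest first. The sketch below is only an illustration: the probabilities and impact scores are made-up values, and weighting probability by impact is my assumption, not something the approach above prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    failure_probability: float  # team estimate between 0.0 and 1.0 (hypothetical)
    impact: int                 # relative cost of a failure, 1 (low) to 5 (high)

    @property
    def risk(self):
        # Simple risk score: likelihood weighted by impact (an assumption).
        return self.failure_probability * self.impact


def prioritize(features):
    """Return features ordered from highest to lowest risk score."""
    return sorted(features, key=lambda feature: feature.risk, reverse=True)


# Hypothetical estimates for the two features discussed above.
FEATURES = [
    Feature("login", failure_probability=0.05, impact=5),
    Feature("browsing_history", failure_probability=0.30, impact=3),
]

if __name__ == "__main__":
    for feature in prioritize(FEATURES):
        print(f"{feature.name}: risk {feature.risk:.2f}")
```

With these numbers the browsing history feature comes out ahead of login, which matches the tester's intuition in the example.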

In the era of cloud computing, the cloud is pervasive, yet cloud service outages still take place. The main reason major outages still occur is that there are many unknown large-scale failure scenarios in which recovery might fail.

According to Netflix engineers, “there are many unknown real-production scenarios in which a failure recovery might not work”. Amazon has used “GameDay” exercises that inject real failures such as EC2 failures and regional power outages.

The basic principles of failure testing

Define your system’s Normal Behavior: Define the normal state of your application and services. Without that definition, you have nothing to measure against.
Test Continually: Use tools to test randomly and continuously, with various scenarios, throughout the application lifecycle. Continual testing surfaces issues automatically and leaves the team more time to build features.
Creation of Hypothesis: Always hypothesize about the expected outcome of an event before running it in live production. Look into all possible scenarios.
Always have a rollback plan: Things can go wrong, so always have a backup plan to revert the impact of a disaster.
Always Automate: Automate failure scenarios with tools (described in the next section) as part of the CI/CD pipeline.
Shift Left Failure: Continuous testing and continuous deployment involve running automated tests as early and as often as possible during development. Shifting left means operations works side by side with development and testing.
Similar environment in cycle: Make every environment in the DevOps pipeline look as much like production as possible. Cloud environments make this feasible.
Define consistent environment: Keep the environment consistent across the lifecycle. This eliminates failures that occur simply because of configuration inconsistencies.
Fail Fast: Failing fast makes bugs and failures appear sooner. Bugs are detected earlier, are easier to reproduce and faster to fix, and the cost of failure goes down.
Failing Securely: When a system fails, it should fail securely. This involves secure defaults, always checking return values for failure, and preserving the confidentiality and integrity of the system.
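Two of the principles above, failing fast and failing securely, are easy to show in a few lines. This is a minimal sketch under assumptions of my own: the config keys (`db_url`, `timeout_seconds`) and the `lookup_permission` callback are hypothetical names, not part of any real product.

```python
def load_config(raw):
    """Fail fast: reject a bad configuration at startup, not at first use."""
    required = ("db_url", "timeout_seconds")  # hypothetical keys
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    timeout = raw["timeout_seconds"]
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        raise ValueError("timeout_seconds must be a positive number")
    return dict(raw)


def is_allowed(user, lookup_permission):
    """Fail securely: any error while checking permissions means 'deny'."""
    try:
        # Only an explicit True grants access; anything else is a denial.
        return lookup_permission(user) is True
    except Exception:
        return False
```

The first function surfaces a configuration mistake the moment the service starts, which is when it is cheapest to fix. The second uses a secure default: if the permission backend is down or returns something unexpected, access is denied instead of silently granted.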

How to test Failure scenarios

There are many ways to test for failure:

Failure as a Service
Shift Left with Continuous Delivery
Chaos Engineering.

FaaS (Failure as a Service) Architecture

A group of researchers from the University of California at Berkeley has proposed a Failure as a Service (FaaS) model for introducing common failures into large-scale distributed cloud services.

The architecture includes four main components:

FaaS Controller: A service that sends failure commands to the agents running in the VMs.
FaaS Agent: An agent must be installed on every machine or service. It receives failure commands from the controller and carries out the instructions.
Target System: The system, as part of the architecture, whose services are tested.
Monitoring Service: A dashboard that monitors the health and behavior of the target system, services, VMs or containers, collects information and creates a visualization.
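To make the four roles concrete, here is a toy, in-process sketch of how they could fit together. It is not the researchers' implementation: the class names, the `kill` and `restore` commands and the boolean health flag are all simplifying assumptions, and a real system would talk over the network.

```python
class TargetService:
    """Stand-in for a service under test; health is just a boolean flag."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Agent:
    """Runs next to one target and applies commands sent by the controller."""

    def __init__(self, target):
        self.target = target

    def handle(self, command):
        if command == "kill":        # hypothetical command name
            self.target.healthy = False
        elif command == "restore":   # hypothetical command name
            self.target.healthy = True
        else:
            raise ValueError(f"unknown command: {command}")


class Controller:
    """Keeps a registry of agents and forwards failure commands to them."""

    def __init__(self):
        self.agents = {}

    def register(self, agent):
        self.agents[agent.target.name] = agent

    def send(self, target_name, command):
        self.agents[target_name].handle(command)


class Monitor:
    """Collects a health snapshot of every target for display."""

    def __init__(self, targets):
        self.targets = list(targets)

    def snapshot(self):
        return {target.name: target.healthy for target in self.targets}
```

A typical run registers one agent per target, sends a `kill` to one of them through the controller, and then reads the monitor's snapshot to see how the rest of the system reacted.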

FaaS Architecture

Non-Production testing — shift left testing with DevOps Pipeline
Failures should be deliberately injected into actual deployments in real production systems, an approach that has proven effective in many ways. The exercise drives the identification and mitigation of risks from failures, and it builds confidence in recovering systems after failures under load and stress.

Critically, Failure as a Service engineering should be treated as a scientific discipline that follows a strict process. Project teams and clients who want to adopt failure services must follow these four steps:

Form a hypothesis: Ask yourself, “What could go wrong?”
Plan your experiment: Determine how you can recreate that problem in a safe way that won’t impact users.
Minimize the blast radius: Start with the smallest experiment that teaches you something.
Run the experiment: Make sure to carefully observe the result.
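The four steps above map naturally onto a small experiment runner. The following is a rough sketch, not a production harness: `inject`, `rollback` and `steady_state` are callbacks you would supply, and the 10% default blast radius is an arbitrary assumption.

```python
import random


def run_experiment(instances, inject, rollback, steady_state,
                   blast_radius=0.1, seed=None):
    """Inject a failure into a small sample of instances and observe the result.

    steady_state() encodes the hypothesis: it returns True while the system
    still behaves normally. blast_radius is the fraction of instances to hit.
    """
    if not instances:
        raise ValueError("no instances to experiment on")
    if not steady_state():
        raise RuntimeError("system is not in steady state; refusing to start")

    rng = random.Random(seed)
    # Minimize the blast radius: at least one instance, never more than asked.
    count = max(1, int(len(instances) * blast_radius))
    victims = rng.sample(list(instances), count)

    try:
        for victim in victims:
            inject(victim)
        hypothesis_held = steady_state()  # observe while the failure is active
    finally:
        for victim in victims:
            rollback(victim)              # always undo, even on errors

    return {"victims": victims, "hypothesis_held": hypothesis_held}
```

The `finally` block is the important design choice: whatever happens during injection or observation, the rollback runs, so a failed experiment does not turn into a real outage.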

Non-Production Testing with shift left approach

As I said, you will never know how stable your platform is until a failure happens. Instead of waiting until the end for a failure to occur, be proactive and make failure part of the platform. When a failure occurs, an automated procedure should kick in to remediate the problem.
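As a rough illustration of such an automated procedure, the sketch below restarts unhealthy services a limited number of times and flags anything that stays down for a human. The `is_healthy` and `restart` callbacks and the status strings are hypothetical placeholders for whatever your platform provides.

```python
def remediate(services, is_healthy, restart, max_attempts=3):
    """Restart unhealthy services up to max_attempts times and report the outcome."""
    report = {}
    for service in services:
        attempts = 0
        while not is_healthy(service) and attempts < max_attempts:
            restart(service)
            attempts += 1
        # Anything still unhealthy after the retries needs a human.
        report[service] = "healthy" if is_healthy(service) else "escalate"
    return report
```

Capping the number of restarts matters: a service that will not come back should be escalated, not restarted in an endless loop.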

DevOps Pipeline with Failure as a Use Case

The diagram below shows how each microservices project can implement Failure as a Service as a use case and inject it into the DevOps pipeline. This helps you test your services effectively for failure use cases.

DevOps Pipeline with Trouble Maker

Trouble Maker is an open source tool that can be configured as part of a DevOps pipeline (a shift-left setup) and can also be deployed in production and non-production environments, in the cloud or outside it.

Trouble Maker is a Java Spring Boot application. It communicates with a small servlet registered in the client, a Java API-based service application. The Spring Boot application is configured as a cron job in the DevOps pipeline and can be triggered automatically or on demand.

By default, Trouble Maker uses Netflix Eureka to discover services and runs on a cron task, but it can be configured to use other service registries too. For more information, please visit here

Turn Failure into Resilience and implement Failure as a Service

Gremlin is a framework for safely and securely simulating outages by running various attack scenarios. It offers several different attacks that you can use to inject failure into your application.

Gremlin allows the controlled injection of resource, network and state failures so that the project team can view the health and behavior of the system. The team can also “undo” an attack, and Gremlin cleans up automatically if things go wrong.

The main categories of Gremlin attacks are:

Resources: Test your application against the loss of critical resources.
State: Change the state of the environment your application is running in.
Network: Simulate the inherently unreliable behavior of the network.
Request/Response: Impact individual requests and responses as they hit the API.

The Gremlin dashboard is used to send attack commands and to view clients, users and team reports.

Create Attack: When you create an attack, you choose the type of attack and the target, and you specify hosts or containers. You can run an attack immediately or schedule it.

Active Attacks: You can see which attacks are active and unleash an attack.

Schedules: You can schedule an attack by time and by day.

Reports: You can view reports by date range or by month. They display a summary report and a client statistics report.

Gremlin provides 12 different attack types:

Resource Exhaustion (Memory, CPU, Disk, IO)
Network Issues (Black Hole, Latency, Packet Loss, DNS)
Behavior related (Time travel, process killer, Application level, Shutdown)
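To show what a network-style attack does to a caller, here is a small, tool-independent sketch that injects latency and failures into a function call. To be clear, this is not Gremlin's API: the decorator, its parameters and the `ConnectionError` it raises are my own illustrative choices.

```python
import functools
import random
import time


def inject_faults(latency_seconds=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that calls are delayed and randomly fail.

    failure_rate is the probability (0.0 to 1.0) that a call raises
    ConnectionError instead of reaching the wrapped function.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                sleep(latency_seconds)         # simulated network latency
            if rng.random() < failure_rate:
                raise ConnectionError("injected failure")  # simulated packet loss
            return func(*args, **kwargs)
        return wrapper

    return decorator


@inject_faults(latency_seconds=0.05, failure_rate=0.2)
def fetch_profile(user_id):
    """Hypothetical downstream call used only for the demo."""
    return {"id": user_id}
```

Wrapping a client call like this in a test environment lets you check that timeouts, retries and fallbacks behave as intended before a real attack tool is pointed at the service.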

Gremlin Compatibility

You can find the compatibility matrix for the Linux platforms Gremlin supports here

Chaos Engineering

Chaos Monkey, developed by Netflix engineers, terminates virtual machine instances and containers that run in production environments. It doesn’t run as a service like Gremlin and Trouble Maker, but it can be deployed in the cloud like other cloud services. Chaos Monkey is part of the Simian Army.

The latest version of Chaos Monkey integrates with Spinnaker, a multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence.

Chaos Monkey lets you plan instance failures for times when you and your team are available to handle them, and it lets you schedule terminations. It encourages redundancy and, most importantly, it is built on Spinnaker, which works across multi-cloud environments (i.e. if your application is deployed on GCP and AWS, you can use Chaos Monkey with Spinnaker to attack it).
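The core idea, picking a random instance to terminate but only while the team is around, can be sketched in a few lines. This is not Chaos Monkey's code: the weekday 9-to-15 window, the group names and the instance IDs are assumptions made for the illustration.

```python
import random
from datetime import datetime


def within_work_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour and end_hour (assumed window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, now, rng, probability=1.0):
    """Choose at most one instance per group to terminate.

    groups maps a group name to its list of instance IDs. Nothing is chosen
    outside working hours, so failures happen while people can respond.
    """
    if not within_work_hours(now):
        return {}
    victims = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    demo_groups = {"api": ["i-001", "i-002", "i-003"], "worker": ["i-101"]}  # hypothetical IDs
    print(pick_victims(demo_groups, datetime.now(), random.Random()))
```

Limiting the selection to one instance per group keeps redundancy meaningful: if a group cannot survive losing a single member, that is exactly the weakness the exercise is meant to reveal.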

The main problems with Chaos Monkey are that it requires a MySQL database to manage attacks, that it does not restore normal status if something goes wrong (no recovery capability), and that it lacks a wide range of failure and behavioral modes. You can find more information here

Benefits of Failure as a Service implementation

Reduce the hypothetical defenses against failure
See how the system behaves in case of failure, and avoid situations that would otherwise seem unavoidable
Save time and cost on failures

Cons of Failure Testing

Analyze the application behavior: You cannot get a direct yes or no, so you need to analyze what went wrong based on the output.
Costs: The application team needs to budget extra for failure testing, but it is worth it compared with the cost of actual downtime.
Customer Data: This is one of the biggest concerns. Because you are testing in real time, you need to take extra care with data.

Conclusion

Failure testing is a prominent approach that helps clients and application teams test failure scenarios in advance. That leads to stable application performance in production and reduces later maintenance costs.

The Failure as a Service model does a lot to improve the performance of cloud and non-cloud service applications, because it helps the project team find and fix problems.

Shift-left failure testing can be implemented before moving into production, and Trouble Maker can be implemented as part of a continuous delivery pipeline with Jenkins and Spinnaker.