Software testing typically involves a variety of techniques and methodologies to cover every aspect of the software, including functionality, performance, defects, and security. This is where resilience testing comes in.

Resilience testing is essential to ensure that applications run smoothly in real-world or chaotic conditions. It tests the resilience of the application and its ability to withstand difficult situations.

So, we asked industry experts to explore the role and importance of resilience testing.

What is resilience testing?

According to Paul Davison, Managing Director of the Seriös group, resilience testing is a non-functional testing technique that tests the ability of a solution to continue providing an acceptable level of service to the business when it is under stress and/or when problems affect one or more of the system’s components. It can also help us make sure we are better equipped to deal with and recover from failures.

Annarita De Biase, QA Manager at Soldo, adds that resilience tests are a particular category of test whose goal is to observe the systems we work on under certain boundary conditions.

For example, do you always know what happens if just one of your core services is no longer available? What about a non-core service? Or if a server goes offline for some reason? Or if data is no longer accessible?

These are real nightmares for a company (developers, QA engineers, sysadmins, business people), but by simulating them and observing how our platform reacts, we can be ready to deal with and compensate for problems.

How to test resilience?

Annarita points out that resilience tests are based on observability under particular conditions.

First of all, it requires business analysis. What are the essential features of your platform? What kind of data do you need for them?

Once this information is clear, it is possible to start planning resilience tests. Well, “plan” is a particularly strange word in this case, since what we want is to “plan for chaos”, or at least for as many random incidents as possible.

The next step is the simulation of chaos, whose objective is to break the system. Ideally in a dedicated environment very similar to production, people start shutting down some servers, inject malicious code into the system to simulate a hacker attack, make some core data unavailable, or do whatever else they think might cause trouble.

After these tests, all the data on the consequences of the simulation is collected and analyzed to plan how to deal with them.
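
The analysis step can be illustrated with a small in-memory simulation. This is only a toy sketch under stated assumptions: the feature names, the service names, and the `FEATURE_DEPENDENCIES` map are hypothetical, and a real experiment would shut down real components in a production-like environment rather than remove names from a set.

```python
# Hypothetical map from essential business features to the services each needs.
FEATURE_DEPENDENCIES = {
    "checkout": {"payments", "catalog", "database"},
    "browse": {"catalog", "database"},
    "recommendations": {"recommender", "catalog"},
}

# Every service mentioned by at least one feature.
ALL_SERVICES = set().union(*FEATURE_DEPENDENCIES.values())


def features_broken_by(outage):
    """Return the sorted list of features that lose at least one dependency."""
    down = set(outage)
    return sorted(
        feature
        for feature, needs in FEATURE_DEPENDENCIES.items()
        if needs & down
    )


def run_what_if_report():
    """Take each service down in turn and record which features break."""
    return {
        service: features_broken_by({service})
        for service in sorted(ALL_SERVICES)
    }


if __name__ == "__main__":
    for service, broken in run_what_if_report().items():
        print(f"{service} down -> broken features: {broken or 'none'}")
```

A report like this makes it easier to see which outages hit the essential features identified during the business analysis, and therefore which scenarios deserve a real test first.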

For his part, Paul shares an analogy for resilience testing: if you have a flat tire on your car and you know you have a spare tire in the trunk but:

  • Do you have the tools to change it?
  • Do you know how to change it?
  • Is the tire inflated?

You could do a trial run and change the wheel on your driveway to test this, but would it be the same as doing it in the rain, in the dark, beside a busy road?

Therefore, resilience testing means exploring “what if” scenarios to determine what the impacts would be on system capacity in the event of failures. There are many ways to do this.

Netflix has taken an interesting approach to testing resilience by building Chaos Monkey, a tool that randomly disables production instances to test common failure types without impacting the customer. The name comes from the idea of unleashing a wild monkey in a data center to bring down servers and chew on cables.

Perhaps more interesting, he continues, is their approach of running these tests during the business day, with engineers on standby to troubleshoot issues. This allows them to learn lessons and create automatic recovery mechanisms for “what if” scenarios that would cause them significant problems. The success of this approach prompted them to expand the concept, and they created a virtual “Simian Army” comprising tools that test latency, identify nonconforming components, and verify component health.

By testing to identify weaknesses, they can proactively remove these potential points of failure, such as components not configured for autoscaling, correct them, and put them back into service before they cause problems or failures.
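
The idea of randomly disabling instances and checking whether the service recovers can be sketched in a few lines. This is not Netflix’s Chaos Monkey code or API; `Fleet`, `kill_random_instance`, and `autoscale` are hypothetical names for an in-memory toy, and the fixed random seed is only there to make the run repeatable.

```python
import random


class Fleet:
    """Toy model of a group of instances with a desired size (hypothetical)."""

    def __init__(self, desired_size, autoscaling_enabled=True):
        self.desired_size = desired_size
        self.autoscaling_enabled = autoscaling_enabled
        self.instances = [f"i-{n}" for n in range(desired_size)]
        self._next_id = desired_size

    def kill_random_instance(self, rng):
        """Remove one running instance chosen at random and return its name."""
        victim = rng.choice(self.instances)
        self.instances.remove(victim)
        return victim

    def autoscale(self):
        """Replace missing instances, but only if autoscaling is configured."""
        if not self.autoscaling_enabled:
            return
        while len(self.instances) < self.desired_size:
            self.instances.append(f"i-{self._next_id}")
            self._next_id += 1

    def is_healthy(self):
        """The fleet is healthy when it is back at its desired size."""
        return len(self.instances) >= self.desired_size


def chaos_round(fleet, rng):
    """Kill one instance, let the fleet react, and report whether it recovered."""
    victim = fleet.kill_random_instance(rng)
    fleet.autoscale()
    return victim, fleet.is_healthy()


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the experiment is repeatable
    fleets = [
        ("with autoscaling", Fleet(3)),
        ("without autoscaling", Fleet(3, autoscaling_enabled=False)),
    ]
    for name, fleet in fleets:
        victim, recovered = chaos_round(fleet, rng)
        print(f"{name}: killed {victim}, recovered={recovered}")
```

The fleet without autoscaling stays below its desired size after the kill, which is the kind of weakness such a test is meant to surface before a real failure does.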

Other organizations take a more structured approach based on non-functional requirements. Typically, this involves testing each potential point of failure within a solution to validate that when a component fails, all requests made to it are redirected to one or more alternative components that perform the same function.

These tests would involve taking down elements of the solution to simulate a failure and would include scenarios to ensure that:

  • The alternative component(s) can process the required volume of requests within the required timeframe
  • Any relevant monitoring/alerting tool reacts as expected
  • Recovery actions can be taken to return the failing component to service in a timely manner
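
The three checks above can be expressed as assertions in an automated failover test. The following is a minimal sketch against an in-memory stand-in; `Component`, `Pool`, the capacity figures, and the `alerts` list are all hypothetical, and a real test would drop actual components and query the real monitoring tool.

```python
class Component:
    """Hypothetical component that can serve a fixed number of requests per second."""

    def __init__(self, name, capacity_rps):
        self.name = name
        self.capacity_rps = capacity_rps
        self.up = True


class Pool:
    """Routes requests to healthy components and records alerts on failure."""

    def __init__(self, components):
        self.components = components
        self.alerts = []

    def fail(self, name):
        """Simulate a component failure and raise an alert for it."""
        for component in self.components:
            if component.name == name:
                component.up = False
                self.alerts.append(f"{name} is down")

    def recover(self, name):
        """Simulate the recovery action that returns a component to service."""
        for component in self.components:
            if component.name == name:
                component.up = True

    def available_capacity(self):
        """Sum the capacity of the components that are still up."""
        return sum(c.capacity_rps for c in self.components if c.up)

    def can_serve(self, required_rps):
        """True if the healthy components can absorb the required volume."""
        return self.available_capacity() >= required_rps


def run_failover_test(pool, victim, required_rps):
    """Drop one component and run the three checks from the list above."""
    pool.fail(victim)
    results = {
        "alternatives_handle_load": pool.can_serve(required_rps),
        "alert_raised": f"{victim} is down" in pool.alerts,
    }
    pool.recover(victim)
    results["recovered"] = all(c.up for c in pool.components)
    return results


if __name__ == "__main__":
    pool = Pool([Component("app-a", 600), Component("app-b", 600)])
    print(run_failover_test(pool, "app-a", required_rps=500))
    print(run_failover_test(pool, "app-a", required_rps=900))
```

This sketch only checks request volume; the “required timeframe” and “timely” recovery conditions would need real latency and recovery-time measurements, which an in-memory model cannot provide.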

Why and when should you use resilience testing?

No system or application will run without failure indefinitely, no matter how well it is built or designed, Paul stresses.

Indeed, the key is to understand how well, if at all, the solution will perform under failure conditions. This lets service management teams understand how long and how well the service can run when any given component is unavailable. Testing also ensures that the team understands what needs to be done to recover from failure and has proven its ability to do so.

Resiliency is critical in any IT solution these days, he says, but individual organizations will determine how critical it is to their IT strategy based on their approach to risk. A retail store selling 2% of its products online may not want to invest a lot of money in building a fully resilient infrastructure and testing it.

Conversely, financial institutions that could suffer severe reputational damage if their systems were to fail are very likely to prioritize resilience. In public sector areas, such as welfare or health care, a lack (or failure) of resilience within a solution could lead to human hardship or loss of life, again making it a high-priority area from a testing perspective.

For Annarita, there are no valid reasons not to use resilience tests. Maybe in the case of extremely simple platforms in very young organizations they can be postponed, but with an architecture of more than a few services or more than 10 users, “resilience” becomes a real need.

The advantages and challenges

Resilience testing gives you in-depth knowledge of your platform and allows testers to take quick action if something goes wrong, notes Annarita.

You can be well prepared for a very simple release, and then, during deployment to a production environment or shortly thereafter, “something” happens that wipes the smile off everyone’s face. In test environments, all unit tests, all functional tests, and all integration tests can pass, and still something goes wrong because of “something”, some random incident that we could not prevent.

Well, “chaos” cannot be totally planned out, but resilience testing can help make our platforms more stable and “ready”, and make our business more efficient with quick reactions.

Additionally, Annarita points out that, as with any type of testing, the hardest part is planning.

As a technical QA, Annarita enjoys being involved in architecture and DevOps work, and while she thinks planning is the hardest and most complex part of it, it is also the most interesting. Developers, QA engineers, DevOps, business people, and others must try to simulate chaos, and while they will never cover 100% of the possible scenarios (as in all other types of tests), they must try. Collaboration is key to making sense of the “chaos” everyone is trying to cope with.

In a world where consumer expectations continue to rise, it is essential that organizations ensure that any problem or failure in their software or service causes minimal disruption and, where possible, is invisible to the end user or customer. Paul points out that undertaking resilience tests will not prevent failures, but it does mean that when they do occur, you know how the system will perform and what corrective actions are needed.

Many organizations still consider resiliency built into the design to be sufficient. If there are two load balancers of the correct size, why do we need to test? They do not take into account the implications of network bandwidth or other downstream components.
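
The load balancer question can be made concrete with a little arithmetic. The sketch below uses invented numbers and hypothetical tier names; it simply shows that the capacity left after a failure is bounded by the slowest tier on the request path, not by the redundant component alone.

```python
def surviving_capacity(capacities_rps, failed_index):
    """Total capacity (requests/second) left after one unit in a tier fails."""
    return sum(c for i, c in enumerate(capacities_rps) if i != failed_index)


def end_to_end_capacity(tiers):
    """A request path is only as fast as its slowest tier."""
    return min(tiers.values())


if __name__ == "__main__":
    peak_load_rps = 800  # assumed peak traffic
    tiers = {
        # two load balancers sized for 1000 rps each, one of them failed
        "load_balancers": surviving_capacity([1000, 1000], failed_index=0),
        "network_link": 700,  # assumed bandwidth limit, expressed in rps
        "database": 900,      # assumed downstream limit
    }
    limit = end_to_end_capacity(tiers)
    print(f"capacity after failure: {limit} rps, peak load: {peak_load_rps} rps")
    if limit >= peak_load_rps:
        print("the design absorbs the failure")
    else:
        print("service degrades despite the redundant load balancers")
```

With these assumed figures the surviving load balancer is not the problem; the network link is, which is exactly the kind of downstream constraint that only shows up when the failure is actually tested.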

Even in organizations where the need to undertake resilience testing is fully understood, there can be challenges. This is normally because the testing is unlikely to be carried out entirely by the test team. Resilience testing is a significant undertaking that requires a representative environment and skilled, knowledgeable resources to support it. The necessary resources often come from the platform management teams and departments that keep the lights on for the production platform.

The future of resilience testing

For Paul, as consumer expectations grow, organizations providing systems or services that experience frequent failures or reduced functionality risk losing customers. This, together with the shift to cloud-based infrastructure, which brings new options in terms of resiliency capabilities, will help ensure that resilience testing remains an area of growth.

It is critical that organizations understand the risk areas within their solution and how they handle those “what if” scenarios.

Annarita believes “resilience” is more important than ever. We as human beings have faced a time in which we have made incredible sacrifices to continue with our lives even though there was a pandemic affecting the whole world. Now we have learned to live in a potentially different way, with the specific goal of returning to normal life, but with the awareness that we can live under different conditions.

Well, it’s the same with our platforms (software and hardware). We must act so that they are prepared for the unexpected, that they are strong enough and that their end users can continue to use them.

Special thanks to Paul Davison and Annarita De Biase for their insight on the subject!


