The traffic to an app or website doesn’t stay constant. Sharp spikes are common during sales, promotions, and seasonal events. While most organizations focus on marketing and advertising to bring traffic to their app or website, they often underestimate the sudden load such events put on their systems. As a result, the site may crash or suffer technical glitches, leading to a poor customer experience and, ultimately, revenue loss.
This is where load testing plays an important role. It helps organizations understand the capacity limits of their tech systems and architecture, and it ensures that the app and website stay stable and perform normally even when a sudden surge in traffic pushes those systems to their limits. That makes load testing an essential part of the app and web design and optimization process.
Like most digital businesses, cure.fit often experiences surges in traffic, especially during sales. And now, with our digital fitness offering, cure.fit Live, traffic to our website and app has increased 10-fold compared to pre-COVID days. In this blog, we will talk about how our load and performance testing systems and processes evolved to help us ensure a good app and web experience despite sudden surges in customer traffic.
Here’s a glimpse of our current load test setup:
- We have implemented our distributed load test pipeline using Spinnaker, which enables us to execute load tests with higher load targets.
- We build a Docker image for executing our load tests, which does the following:
- Executes the Gatling load test scripts using Maven
- Includes all the Maven dependencies in offline mode
- We have created a Spinnaker pipeline that upscales the relevant services (deployed using Kubernetes) before the load test and downscales them afterwards to bring down infrastructure costs.
- Load tests are integrated with the dev continuous integration (CI) and continuous delivery (CD) pipeline, so a load test is executed to ensure that a specific commit will not increase latency.
- If a commit does increase latency, the build fails and the deployment is rolled back.
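At a high level, the pipeline wires three stages together: upscale, run the test, downscale. A heavily simplified Spinnaker pipeline skeleton might look like the following (the stage names and wiring are illustrative only; real scaleManifest and runJobManifest stages also need account, manifest, and replica settings):

```json
{
  "stages": [
    {
      "refId": "1",
      "type": "scaleManifest",
      "name": "Upscale target services before the test"
    },
    {
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "runJobManifest",
      "name": "Run the Gatling load test from the Docker image"
    },
    {
      "refId": "3",
      "requisiteStageRefIds": ["2"],
      "type": "scaleManifest",
      "name": "Downscale target services after the test"
    }
  ]
}
```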
But, how did we get here? What did it take us and what did we learn along the way? Let us give you the whole story!
Our Previous Load Testing Process
In 2019, our three main verticals (cult.fit, mind.fit, and eat.fit) had users coming to the app primarily to book classes and order meals. These transactions were distributed throughout the day, with each user spending nearly five minutes on the app. Our regular traffic averaged approximately 100 RPS (requests per second), and before every sale, we would load test at 5X that load based on our marketing campaigns.
However, executing the load tests was a highly manual process that required developers from various teams to upscale and downscale services on ASG (AWS auto-scaling groups) manually before and after the load test, respectively.
We used Locust for load testing, as it is an easy-to-use, scriptable, and scalable performance testing tool. Nevertheless, this system and our process started to fail as we expanded our business.
With Business Growth Came the Need for a New Load Testing Infrastructure
As we started expanding our business, launching new verticals like care.fit, cult gear, and whole.fit, the traffic to our app and website started increasing significantly and customers started to spend more time on the app and website as they browsed through various products. Thus, we had to scale up our load testing infrastructure and tools to meet the new requirements.
For this, we needed tools that aligned with our requirements and would help us scale up quickly in the future. Listed below are the essential requirements we had in mind while selecting the new load test tool. The tool should:
- Help reduce the infrastructure cost
- Provide inbuilt test reports
- Allow writing clear and readable tests
- Have inbuilt integration with continuous integration pipelines
- Have smooth integration with real-time monitoring tools like Grafana
- Be VCS-friendly so that we can check in the code and access and use it across the company
- Offer performance trend charts to help us keep track of latency
Taking all these aspects into consideration, we chose Gatling as our new load testing tool.
Let’s Take A Look At The Key Features Of Gatling
Gatling comes with the following advantages:
- Developed in Scala, Gatling is built on:
- Netty for non-blocking HTTP
- Akka for virtual user orchestration
- An expressive self-explanatory DSL (Domain Specific Language) for test development
- An asynchronous, non-blocking approach that lets a single machine generate high load
- Full support of HTTP(S) protocols, with the ability to also run JDBC and JMS load tests
- Multiple input sources for data-driven tests
- Powerful and flexible validation and assertions system
- Comprehensive and informative load reports
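To give a feel for the DSL, here is a minimal, hypothetical simulation of the kind Gatling makes easy to write. The host, endpoints, injection profile, and thresholds below are illustrative, not our production values, and the class compiles against Gatling 3's Scala DSL rather than running standalone:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Hypothetical browse scenario: ramp from 10 to 100 users/second
// and fail the run if latency or error-rate thresholds are breached.
class BrowseSimulation extends Simulation {

  val httpProtocol = http.baseUrl("https://api.example.com") // placeholder host

  val browse = scenario("Browse catalogue")
    .exec(http("home").get("/home"))
    .pause(2)
    .exec(http("classes").get("/classes"))

  setUp(
    browse.inject(rampUsersPerSec(10).to(100).during(10.minutes))
  ).protocols(httpProtocol)
    .assertions(
      global.responseTime.percentile3.lt(500), // 95th percentile under 500 ms
      global.failedRequests.percent.lt(1)      // under 1% failed requests
    )
}
```

The assertions block is what makes a run pass or fail, which is also what a CI integration hooks into.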
But With COVID-19 Came Unexpected Scale and Yet Another Change in Our Load Testing Methodology
While Gatling would have sufficed under normal circumstances, the pandemic brought an overnight change in scale. Since all cure.fit centers had to be shut down in March 2020, we had to move all our fitness classes online. Suddenly, our services were available not only to those who went to our centers but to practically anyone with a laptop or mobile device and an internet connection. As a result, our customer traffic grew 10X. Not only did we have more people visiting, but they were also spending a considerable amount of time, nearly 30-50 minutes, on our app and website, all at the same time.
Our existing load test infrastructure could only test up to 700 RPS, so our load testing processes and tools no longer supported our use case; we needed a distributed load test environment.
Since we were already using Spinnaker as our continuous delivery platform, we built the distributed load test pipeline on Spinnaker as well.
To accommodate this large-scale change in traffic, we adapted and tweaked our load testing process as listed below:
- The entire process to upscale and downscale the services for executing the load test was automated.
- We created a Spinnaker pipeline script to upscale the relevant services (deployed using Kubernetes) before the load test and downscale them afterwards to bring down infrastructure costs
- A distributed load test pipeline was implemented using Spinnaker, which enabled us to execute load tests with higher load targets
- We created a Docker image for executing our load tests
Our Docker image was developed to do the following:
- Bundle the Gatling load test scripts
- Execute the tests using Maven
- Include all the Maven dependencies, resolved while building the Docker image itself so they are available offline, which saves time during test execution
- Load tests were integrated with the dev continuous integration (CI) and continuous delivery (CD) pipeline: the code is deployed to the pre-prod environment and the load test is executed to ensure that a specific commit will not increase latency. If it does, the build fails and the deployment is rolled back
- Performance trend charts were implemented to keep track of latency
- Real-time monitoring of load tests was enabled to measure latency and error rates, and to automatically kill a test if a latency or error-rate threshold is breached
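For the Docker image, a sketch of the kind of Dockerfile that fits this setup is shown below. The base image, paths, and the environment variable name are illustrative assumptions, not our actual build:

```dockerfile
# Sketch of a self-contained load-test image (versions and paths are
# illustrative). Built on the official Maven image with a JDK.
FROM maven:3.8-openjdk-11

WORKDIR /loadtest
COPY pom.xml .
# Resolve every dependency at image-build time so test runs need no network
RUN mvn -B dependency:go-offline

COPY src ./src
# Run Gatling via the gatling-maven-plugin in offline mode; the simulation
# class is passed in by the pipeline as an environment variable
ENTRYPOINT ["sh", "-c", "mvn -o gatling:test -Dgatling.simulationClass=$SIMULATION_CLASS"]
```

Baking the dependencies into the image is what makes offline execution possible and keeps test start-up fast.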
And, the advantages of this new process?
- No manual intervention is required to execute the load tests anymore
- All the parameters (environment variables) are passed at run time using the Spinnaker UI, so it is much easier for anyone to execute load tests as per their requirements without making any significant changes to the code. Sensible defaults are provided, and users can override them through the Spinnaker pipeline
- A code change or bug that would cause latency to spike in production can be caught in the pre-prod environment, preventing outages and bad customer experiences
- The latency trends can be tracked over the past couple of days or weeks, which helps us narrow down the cause of latency spikes and identify the commits that caused them
- Monitoring the dashboards in real time provides valuable insights into metrics like error rates and latency spikes, which helps us take proactive measures in time instead of waiting for a test to complete before analyzing the results
- By using Datadog to identify performance bottlenecks, we get detailed insights into performance metrics like error rate, CPU, and latency, and can pinpoint the downstream services that are degrading performance
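The pass/fail decision behind this CI gate can be sketched as a small pure function. The names, the 10% tolerance, and the 1% error-rate ceiling below are illustrative assumptions, not our actual thresholds:

```scala
// Hypothetical latency gate: compares a load-test run's p95 latency and
// error rate against a baseline and decides whether to roll back.
object LatencyGate {
  case class RunStats(p95Millis: Double, errorRate: Double)

  // Roll back if p95 regresses more than `tolerance` over the baseline,
  // or if the error rate crosses `maxErrorRate`.
  def shouldRollBack(baseline: RunStats,
                     current: RunStats,
                     tolerance: Double = 0.10,
                     maxErrorRate: Double = 0.01): Boolean =
    current.p95Millis > baseline.p95Millis * (1 + tolerance) ||
      current.errorRate > maxErrorRate
}
```

In the real pipeline the inputs would come from the load test report and the trend charts rather than being hard-coded.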
Here’s what we learned
Load testing helped us figure out and fix the performance bottlenecks quickly instead of waiting for problems to appear during the actual spike in traffic, which saved us from potential revenue loss and poor app ratings.
Our recommendation: even if you are a small startup that receives low traffic, include load testing as part of your development life cycle. The benefits it can bring along the way can be massive, as they were in our case. Also, when choosing load testing tools, opt for scalable ones so you won’t have to recreate the entire process as you grow.
Where do we go from here?
We plan to get load testing integrated into the dev CI/CD pipeline for all our services to ensure that no bad commits (that can potentially cause latency spikes) are deployed to the production environment.