
Generate Automated Resilience Testing Scenarios

This guide walks you through generating lueur resilience scenarios that you can run automatically to validate how well your endpoints cope with network issues.


Create Single Call Scenarios

In this guide, we will demonstrate how to create a single call scenario against the lueur demo application. Single call scenarios make only one request to the target endpoint.

  • Start the demo application provided by lueur

    lueur demo run
    
  • Create the scenario file

    The following scenario runs a single HTTP request against the /ping endpoint of the demo application. That endpoint in turn makes a request to https://postman-echo.com, which is the call our scenario will impact with a light latency.

    scenario.yaml
    ---  # (1)!
    title: "Add 80ms latency to ingress from the remote service and expects we verify our expectations"
    description: "Our endpoint makes a remote call which may not respond appropriately, we need to decide how this impacts our own users"
    items:  # (2)!
      - call:
          method: GET
          url: http://localhost:7070/ping
        context:
          upstreams:
            - https://postman-echo.com  # (3)!
          faults:  # (4)!
            - type: latency
              mean: 80
              stddev: 5
        expect:
          status: 200  # (5)!
          response_time_under: 500  # (6)!
    
    1. A scenario file may have as many scenarios as you want
    2. You may group several calls, and their own context, per scenario
    3. This is the host impacted by the latency
    4. You may apply multiple faults at the same time (see the sketch after this list)
    5. We do not tolerate the call failing
    6. We expect the overall response time to stay under 500ms
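
Since several faults can be applied at once (annotation 4), the faults list could, for instance, stack a blackhole on top of the latency. This is only a sketch: it borrows the blackhole type from the load test guide further down, and the exact fields each fault type accepts are documented in the scenario reference.

    faults:
      - type: latency
        mean: 80
        stddev: 5
      - type: blackhole  # drops matching traffic entirely (sketch; see the load test guide below)

Once the file is saved, you run it with the lueur CLI. Assuming a run subcommand that mirrors generate (verify with lueur scenario --help), this would look like lueur scenario run --scenario scenario.yaml.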

Create Repeated Call Scenarios

In this guide, we will demonstrate how to create a repeated scenario against the lueur demo application. Repeated call scenarios make a deterministic number of requests to the target endpoint, optionally increasing some of the fault parameters by a step on each iteration.

  • Start the demo application provided by lueur

    lueur demo run
    
  • Create the scenario file

    The following scenario runs several HTTP requests against the /ping endpoint of the demo application. That endpoint in turn makes a request to https://postman-echo.com, which is the call our scenario will impact with a light latency.

    scenario.yaml
    ---  # (1)!
    title: "Start with 80ms latency and increase it by 30ms to ingress from the remote service and expects we verify our expectations"
    description: "Our endpoint makes a remote call which may not respond appropriately, we need to decide how this impacts our own users"
    items:  # (2)!
      - call:
          method: GET
          url: http://localhost:7070/ping
        context:
          upstreams:
            - https://postman-echo.com  # (3)!
          strategy:  # (4)!
            mode: repeat
            step: 30  # (5)!
            count: 3  # (6)!
            add_baseline_call: true  # (7)!
          faults:  # (8)!
            - type: latency
              mean: 80
              stddev: 5
        expect:
          status: 200  # (9)!
          response_time_under: 500  # (10)!
    
    1. A scenario file may have as many scenarios as you want
    2. You may group several calls, and their own context, per scenario
    3. This is the host impacted by the latency
    4. The strategy block defines how lueur should run this scenario's call
    5. The step by which we increase the latency on each iteration (see the worked breakdown after this list)
    6. The number of iterations to run
    7. Whether to run a baseline call, without any fault, first
    8. You may apply multiple faults at the same time
    9. We do not tolerate the call failing
    10. We expect the overall response time to stay under 500ms
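
With add_baseline_call: true, step: 30, and count: 3, the scenario above issues four calls in total. Assuming the step is added to the latency's mean on each iteration, the breakdown reads:

    call 0 (baseline): no fault injected
    call 1: latency mean = 80ms
    call 2: latency mean = 80 + 30 = 110ms
    call 3: latency mean = 80 + (2 * 30) = 140ms

Since the expectation requires every call to respond under 500ms, the scenario shows whether the endpoint still holds up as the injected latency grows.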

Create Load Test Call Scenarios

In this guide, we will demonstrate how to create a load test scenario against the lueur demo application. Load test call scenarios make a number of requests to the target endpoint over a given duration.

Warning

lueur is not a full-blown load-testing tool and doesn't aim to become one. The facility provided by this strategy is merely a convenience for very small load tests. It can prove very useful nonetheless.

  • Start the demo application provided by lueur

    lueur demo run
    
  • Create the scenario file

    The following scenario runs several HTTP requests against the / endpoint of the demo application.

    scenario.yaml
    ---  # (1)!
    title: "Sustained latency with a short loss of network traffic"
    description: "Over a period of 10s, inject a 90ms latency. After 3s and for a period of 2s also send traffic to nowhere."
    items:  # (2)!
      - call:
          method: GET
          url: http://localhost:7070/
        context:
          upstreams:
            - http://localhost:7070  # (3)!
          strategy:  # (4)!
            mode: load
            duration: 10  # (5)!
            clients: 3  # (6)!
            rps: 2  # (7)!
          faults:
            - type: latency
              global: false  # (8)!
              mean: 90
            - type: blackhole
              period: "start:30%,duration:20%"  # (9)!
        slo:  # (10)!
          - type: latency
            title: "P95 Latency < 110ms"
            objective: 95
            threshold: 110.0
          - type: latency
            title: "P99 Latency < 200ms"
            objective: 99
            threshold: 200.0
          - type: error
            title: "P98 Error Rate < 1%"
            objective: 98
            threshold: 1
    
    1. A scenario file may have as many scenarios as you want
    2. You may group several calls, and their own context, per scenario
    3. This is the host impacted by the latency
    4. The strategy block defines how lueur should run this scenario's call
    5. The total duration, in seconds, of our test
    6. The number of connected clients
    7. The number of requests per second per client
    8. Inject latency for each read/write operation, not just once
    9. Schedule the blackhole fault for only a portion of the total duration (see the worked example after this list)
    10. Rather than a single status code and latency, we evaluate SLOs against the load results
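
The period string is expressed relative to the strategy's duration. For this 10s test, the schedule works out as follows, matching the scenario's description:

    start:30%    -> 30% of 10s = blackhole begins 3s into the run
    duration:20% -> 20% of 10s = blackhole lasts for 2s

As we read them, each SLO pairs an objective percentile with a threshold: the first latency SLO expects at least 95% of requests to complete under 110.0ms, while the error SLO expects at least 98% of requests to succeed, keeping the error rate below the 1% threshold.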

The load strategy is powerful because it allows you to explore the application's behavior over a period of time while keeping a similar approach to other strategies.

Notably, faults can be applied on a schedule so you can observe how they impact the application as they come and go. Also note the use of SLOs to review the results in light of service expectations over a period of time.

Please read more about these capabilities in the scenario reference.

Generate Scenarios from an OpenAPI Specification

This guide shows how you can swiftly generate common basic scenarios for a large number of endpoints discovered from an OpenAPI specification.

Info

lueur can generate scenarios from OpenAPI v3.0.x and v3.1.x.

  • Generate from a specification file

    lueur scenario generate --scenario scenario.yaml --spec-file openapi.yaml
    
  • Generate from a specification URL

    lueur scenario generate --scenario scenario.yaml --spec-url http://myhost/openapi.json
    
  • Generate one scenario file per endpoint

    lueur scenario generate \
        --scenario scenarios/ \  # (1)!
        --split-files \  # (2)!
        --spec-url http://myhost/openapi.json
    
    1. Pass a directory where the files will be stored
    2. Tell lueur to split scenarios into files

This approach is nice for quickly generating scenarios, but if your specification is large you will end up with hundreds of them: for each endpoint, lueur creates single shot, repeated call, and load test scenarios, each combined with a variety of faults.

We suggest you trim them down to what you really want to explore. You will also need to edit the scenarios to fill in placeholders and add any headers required to make the calls, as sketched after the example below.

Below is an example of a generated scenario against the Reliably platform:

title: Single high-latency spike (client ingress)
description: A single 800ms spike simulates jitter buffer underrun / GC pause on client network stack.
items:
- call:
    method: GET
    url: http://localhost:8090/api/v1/organization/{org_id}/experiments/all
    meta:
      operation_id: all_experiments_api_v1_organization__org_id__experiments_all_get
  context:
    upstreams:
    - http://localhost:8090/api/v1/organization/{org_id}/experiments/all
    faults:
    - type: latency
      side: client
      mean: 800.0
      stddev: 100.0
      direction: ingress
    strategy: null
  expect:
    status: 200
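
Before running this generated scenario, you would substitute the {org_id} placeholder and add whatever authentication the API requires. The values below are purely hypothetical:

    - call:
        method: GET
        url: http://localhost:8090/api/v1/organization/abc123/experiments/all  # hypothetical org id
        headers:
          Authorization: Bearer <your-token>  # hypothetical; see the headers guide below

If you generated one file per endpoint, you could then run them in a loop. Again, this assumes a run subcommand (verify with lueur scenario --help):

    for f in scenarios/*.yaml; do
        lueur scenario run --scenario "$f"
    done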

Pass Headers to the Scenario

In this guide, you will learn how to provide HTTP headers to the request made for a scenario.

  • Start the demo application provided by lueur

    lueur demo run
    
  • Create the scenario file

    The following scenario runs a single HTTP request against the /ping endpoint of the demo application. That endpoint in turn makes a request to https://postman-echo.com, which is the call our scenario will impact with a light latency.

    scenario.yaml
    ---
    title: "Add 80ms latency to ingress from the remote service and expects we verify our expectations"
    description: "Our endpoint makes a remote call which may not respond appropriately, we need to decide how this impacts our own users"
    items:
      - call:
          method: GET
          url: http://localhost:7070/ping
          headers:
            Authorization: bearer token  # (1)!
        context:
          upstreams:
            - https://postman-echo.com
          faults:
            - type: latency
              mean: 80
              stddev: 5
        expect:
          status: 200
          response_time_under: 500
    
    1. Pass headers as a mapping of key: value pairs. Note that, in the particular case of the Authorization header, its value will not be shown as part of the report but replaced by an opaque placeholder string.

Make Requests With a Body

In this guide, you will learn how to pass a body string to the request.

  • Start the demo application provided by lueur

    lueur demo run
    
  • Create the scenario file

    The following scenario runs a single HTTP request against the /ping endpoint of the demo application. That endpoint in turn makes a request to https://postman-echo.com, which is the call our scenario will impact with a light latency.

    scenario.yaml
    ---
    title: "Add 80ms latency to ingress from the remote service and expects we verify our expectations"
    description: "Our endpoint makes a remote call which may not respond appropriately, we need to decide how this impacts our own users"
    items:
      - call:
          method: POST  # (1)!
          url: http://localhost:7070/ping
          headers:
            Content-Type: application/json  # (2)!
          body: '{"message": "hello there"}'  # (3)!
        context:
          upstreams:
            - https://postman-echo.com
          faults:
            - type: latency
              mean: 80
              stddev: 5
        expect:
          status: 200
          response_time_under: 500
    
    1. Set the method to POST
    2. Set the Content-Type header to match the body
    3. Pass the body as an encoded string
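
To sanity-check what the scenario sends, you can issue the equivalent request yourself. A minimal sketch with curl, assuming the demo's /ping endpoint accepts a POST body:

    curl -X POST http://localhost:7070/ping \
      -H 'Content-Type: application/json' \
      -d '{"message": "hello there"}'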
