
Scenario Analysis

This guide shows you how to analyze scenario results from a resilience and reliability angle, using an LLM.

The analysis aims to give you a sound report of potential issues, threats, and remediations to consider for your application.

Prerequisites

Windows not supported

Unfortunately, the agent feature is not supported on Windows because the framework lueur uses to interact with LLMs does not support that platform.

Experimental feature

This feature is still experimental and subject to change. Working with LLMs means accepting a degree of fuzziness and adjustment. Engineering is still very much a human endeavour!

Review a Python Web Application

In this scenario we take a very basic Python application built with the FastAPI and SQLAlchemy (SQLite) libraries, and see what we can learn about it.

  • Source code of the application

    app.py
    from fastapi import FastAPI, HTTPException, Depends
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker, Session
    from sqlalchemy.exc import SQLAlchemyError
    
    # Database configuration
    DATABASE_URL = "sqlite:///./test.db"
    engine = create_engine(DATABASE_URL)
    SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
    Base = declarative_base()
    
    # Define the User model
    class User(Base):
        __tablename__ = "users"
    
        id = Column(Integer, primary_key=True, index=True)
        name = Column(String, index=True)
        secret_password = Column(String)
    
    # Create the database tables
    Base.metadata.create_all(bind=engine)
    
    app = FastAPI(
        servers=[
            {"url": "http://localhost:9090", "description": "Staging environment"}
        ]
    )
    
    # Dependency to get the database session
    def get_db():
        db = SessionLocal()
        try:
            yield db
        finally:
            db.close()
    
    @app.get("/")
    async def read_root():
        return {"message": "Hello, World!"}
    
    @app.post("/users/")
    async def create_user(name: str, secret_password: str, db: Session = Depends(get_db)):
        try:
            db_user = User(name=name, secret_password=secret_password)
            db.add(db_user)
            db.commit()
            db.refresh(db_user)
            return db_user
        except SQLAlchemyError as e:
            db.rollback()
            raise HTTPException(status_code=500, detail="Database error occurred")
    
    @app.get("/users/{user_id}")
    async def read_user(user_id: int, db: Session = Depends(get_db)):
        try:
            user = db.query(User).filter(User.id == user_id).first()
            if user is None:
                raise HTTPException(status_code=404, detail="User not found")
            return user
        except SQLAlchemyError as e:
            raise HTTPException(status_code=500, detail="Database error occurred")
    

    You may now install the dependencies to run it:

    pip install fastapi sqlalchemy uvicorn
    
    uv pip install fastapi sqlalchemy uvicorn
    

    Finally, run the application as follows:

    fastapi dev --port 9090
    

    This application has only a couple of endpoints and is purposefully not optimised.

  • Generate a scenario for this application

    We must first generate and run a scenario so we get a mapping of the application.

    lueur scenario generate --scenario scenario.yaml --spec-url http://localhost:9090/openapi.json
    
    Generate scenario

    The following scenario file is generated by lueur (trimmed down to a single endpoint for clarity):

     title: Single high-latency spike (client ingress)
     description: A single 800ms spike simulates jitter buffer underrun / GC pause on client network stack.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: latency
           side: client
           mean: 800.0
           stddev: 100.0
           direction: ingress
         strategy: null
       expect:
         status: 200
     ---
     title: Stair-step latency growth (5 x 100 ms)
     description: Latency increases 100 ms per call; emulate slow congestion build-up or head-of-line blocking.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: latency
           side: client
           mean: 100.0
           stddev: 30.0
           direction: ingress
         strategy:
           mode: repeat
           step: 100.0
           count: 5
           add_baseline_call: true
       expect:
         status: 200
     ---
     title: Periodic 150-250 ms latency pulses during load
     description: Three latency bursts at 10-40-70% of a 10s window; good for P95 drift tracking.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: latency
           mean: 150.0
           period: start:10%,duration:15%
         - type: latency
           mean: 250.0
           period: start:40%,duration:15%
         - type: latency
           mean: 150.0
           period: start:70%,duration:15%
         strategy:
           mode: load
           duration: 10s
           clients: 3
           rps: 2
         slo:
         - slo_type: latency
           title: P95 < 300ms
           objective: 95.0
           threshold: 300.0
         - slo_type: error
           title: P99 < 1% errors
           objective: 99.0
           threshold: 1.0
     ---
     title: 5% packet loss for 4s
     description: Simulates flaky Wi-Fi or cellular interference.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         timeout: 500
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: packetloss
           direction: egress
           period: start:30%,duration:40%
         strategy: null
       expect:
         status: 200
         response_time_under: 100.0
     ---
     title: High jitter (±80ms @ 8Hz)
     description: Emulates bursty uplink, measuring buffering robustness.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: jitter
           amplitude: 80.0
           frequency: 8.0
           direction: ingress
           side: server
         strategy: null
       expect:
         status: 200
     ---
     title: 512 KBps bandwidth cap
     description: Models throttled 3G link; validates handling of large payloads.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: bandwidth
           rate: 512
           unit: KBps
           direction: ingress
         strategy:
           mode: load
           duration: 15s
           clients: 2
           rps: 1
       expect:
         status: 200
     ---
     title: Random 500 errors (5% of calls)
     description: Backend flakiness under load; ensures retry / circuit-breaker logic.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: httperror
           status_code: 500
           probability: 0.05
         strategy:
           mode: load
           duration: 8s
           clients: 5
           rps: 4
       expect:
         response_time_under: 100.0
     ---
     title: Full black-hole for 1 s
     description: Simulates router drop / Pod eviction causing 100% packet loss for a second.
     items:
     - call:
         method: GET
         url: http://localhost:9090/
         timeout: 500
         meta:
           operation_id: read_root__get
       context:
         upstreams:
         - http://localhost:9090/
         faults:
         - type: blackhole
           direction: egress
           period: start:45%,duration:10%
         strategy:
           mode: load
           duration: 10s
           clients: 2
           rps: 3
    
  • Run the scenario against this application

    lueur scenario run --scenario scenario.yaml
    
    ================ Running Scenarios ================
    
       1/1  Single high-latency spike (client ingress)
       6/6  Stair-step latency growth (5 x 100 ms) ▮▮▮▮▮▮
       1/1  Periodic 150-250 ms latency pulses during load
       1/1  5% packet loss for 4s
       1/1  High jitter (±80ms @ 8Hz)
       1/1  512 KBps bandwidth cap
       1/1  Random 500 errors (5% of calls)
       1/1  Full black-hole for 1 s
    ===================== Summary =====================
    
    Tests run: 13, Tests failed: 0
    Total time: 45.3s
    
  • Analyze the generated results

    lueur agent advise --results results.json
    

    The generated report looks like this:

    Generated scenario analysis

    lueur resilience report analysis



    Executive Summary

    Findings

    • Our current SQLite setup uses default durability (DELETE journal mode) and a single-threaded pool, limiting concurrent reads/writes and resilience under load.
    • Blocking DB calls have no per-call timeouts, risking orphaned threads and full thread‐pool exhaustion.
    • Transient SQL errors surface directly as user‐facing failures—there’s no structured retry for DB operations.
    • We rely on a synchronous client (SQLAlchemy + sqlite3), which can exacerbate latency spikes and resource contention.

    Recommendations

    1. Enforce per-call timeouts on DB operations
      Cap query runtime (e.g. via asyncio.wait_for or driver‐level timeouts) to prevent slow queries from monopolizing threads.

    2. Enable WAL with synchronous=NORMAL & tune checkpointing
      Switch to Write-Ahead Logging for better reader-writer concurrency and schedule regular PRAGMA wal_checkpoint(TRUNCATE) to bound WAL growth.

    3. Add retry logic for transient DB failures
      Wrap all DB calls in tenacity-style exponential backoff (stop after ~3 attempts) to recover from brief locks or I/O hiccups.

    4. Migrate to a server-based or async DB client
      Evaluate PostgreSQL/MySQL or an asyncpg-based driver to offload locking and I/O, improve scalability, and reduce blocking.

    Key Trade-offs & Threats

    • Durability vs. Performance: synchronous=NORMAL may lose sub-second commits on crash.
    • Hidden Failures: Excessive retries can mask schema drift, full-disk, or logic bugs.
    • Premature Aborts: Timeouts might kill valid long-running operations and leak threads.
    • Migration Complexity: New DB server or async stack introduces operational overhead and potential misconfiguration.

    Next Steps & Validation

    • Fault Injection:
      Kill processes mid-commit to measure acceptable data-loss window.
    • Load & Chaos Testing:
      Simulate heavy write loads and inject SQLAlchemyError to tune timeouts and backoff intervals.
    • Monitoring & Alerts:
      – Track 504 Gateway Timeout rates and thread-pool queue length
      – Monitor WAL file size, checkpoint lag, and “database is locked” errors
      – Expose retry count, retry latency, and ultimate failure rates

    Implementing these changes and validating them through targeted tests will significantly improve throughput, reliability, and predictability—while maintaining clear rollback paths.

    Overall Resilience Posture

    The root endpoint proved highly resilient—handling single and ramped latency spikes, jitter, packet loss, bandwidth throttling, and injected HTTP-500 errors with zero expectation failures or status-code breaches. The only SLO miss was under periodic 150–250 ms latency pulses (P95 > 300 ms), and a 1 s black-hole produced a few timeouts, suggesting targeted timeout/retry tuning for latency-critical paths.

    SLO Failures Deep Dive

    In-depth look at scenarios breaching one or more SLOs.

     | Scenario | Endpoint | SLO Violated | Objective | Observed | Margin | Failure Pattern |
     | --- | --- | --- | --- | --- | --- | --- |
     | Periodic 150–250 ms latency pulses during load | GET / | p95 < 300 ms | 95% < 300 ms | 603.34 ms | +303.34 ms | Tail latency uplift: 56/62 (90.3%) requests over threshold across all bursts |
     | Full black-hole for 1 s | GET / | error rate < 1% | < 1% errors | 4/62 (6.5%) | +5.5 pp | Failures concentrated during the 1 s outage (p95=500.55 ms, p99=500.81 ms) |

    Dashboard Summary

     | Scope | Total Scenarios | Passed | Failed |
     | --- | --- | --- | --- |
     | All Scenarios | 8 | 6 | 2 |
     | GET / | 8 | 6 | 2 |

    Potential Cause Hypotheses

     Based on the periodic latency spikes and the 1 s black-hole on GET /, here are the most plausible, developer-actionable root causes:

     1. Thread-pool saturation from sync DB calls in async routes
       • Symptom mapping: tail-latency pulses on GET / (which is async, yet still experiences delay)
       • Hypothesis: even though read_root() is an async def, other endpoints (/users/) execute blocking SQLAlchemy calls in the default thread-pool executor. Under load, these threads become fully occupied, starving Uvicorn’s worker threads and delaying all incoming requests—including the root route—for short bursts. (See the reproduction sketch after this list.)

     2. SQLite file-lock contention during concurrent writes
       • Symptom mapping: periodic 150–250 ms latency spikes and a single 1 s outage burst
       • Hypothesis: the app uses a single SQLite file without WAL or proper connection pool sizing. When many POST /users/ commits hit the DB simultaneously, the underlying file lock serializes operations, causing some FastAPI workers to block until the lock is released. This serialization shows up as both tail-latency uplift and a black-hole when the lock persists.

     3. No per-call timeouts on blocking operations
       • Symptom mapping: prolonged service unavailability for the duration of the lock (≈1 s) rather than fast failure
       • Hypothesis: neither the SQLAlchemy engine nor the blocking calls in endpoints have explicit timeouts. When a write or commit stalls on I/O (e.g., lock contention or fsync), the request hangs indefinitely (up to the HTTP client timeout), magnifying the outage window and violating the p95/p99 latency SLOs.
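
     To sanity-check hypothesis 1 before changing any code, a small load probe can compare GET / latency at rest and while concurrent writes keep the thread pool busy. A minimal sketch, assuming httpx is installed and the application is running on port 9090:

     # Hypothetical probe (not part of the generated report): times GET / before and during
     # a burst of POST /users/ calls. A large jump points at thread-pool saturation.
     import asyncio
     import time

     import httpx

     BASE = "http://localhost:9090"

     async def hammer_writes(client: httpx.AsyncClient, n: int = 50) -> None:
         # name/secret_password are query parameters in the sample application
         await asyncio.gather(*[
             client.post(f"{BASE}/users/", params={"name": f"u{i}", "secret_password": "x"})
             for i in range(n)
         ])

     async def time_root(client: httpx.AsyncClient) -> float:
         start = time.perf_counter()
         await client.get(f"{BASE}/")
         return time.perf_counter() - start

     async def main() -> None:
         async with httpx.AsyncClient(timeout=10.0) as client:
             baseline = await time_root(client)
             writes = asyncio.create_task(hammer_writes(client))
             under_load = await time_root(client)
             await writes
             print(f"baseline: {baseline * 1000:.1f} ms, under write load: {under_load * 1000:.1f} ms")

     asyncio.run(main())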

    Recommendations

    Actionable proposals to address DB thread‐pool saturation, SQLite lock contention, and lack of timeouts

    Below are four targeted recommendation sets. Each includes PR-style diffs, priority levels, and a summary table to help you weigh cost, complexity, and expected benefits.


    1. Enforce Per-Call Timeouts on Blocking DB Operations

    Priority: Critical
    Rationale: Bounding each synchronous DB call prevents unbounded growth of the thread pool, protects the event loop under spikes, and converts “black-hole” waits into fast failovers.

    Proposed Changes

    --- a/app/main.py
    +++ b/app/main.py
     import asyncio
     from functools import partial
     from fastapi import FastAPI, HTTPException, Depends
     from sqlalchemy.exc import SQLAlchemyError
    +from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
    
     @app.post("/users/")
     async def create_user(
         name: str,
         secret_password: str,
         db: Session = Depends(get_db),
     ):
    -    try:
    -        db_user = User(name=name, secret_password=secret_password)
    -        db.add(db_user)
    -        db.commit()
    -        db.refresh(db_user)
    -        return db_user
    -    except SQLAlchemyError:
    -        db.rollback()
    -        raise HTTPException(status_code=500, detail="Database error occurred")
    +    # Run the blocking DB write in a thread with a 5s timeout
    +    task = asyncio.get_event_loop().run_in_executor(
    +        None, partial(_sync_create_user, db, name, secret_password)
    +    )
    +    try:
    +        return await asyncio.wait_for(task, timeout=5.0)
    +    except asyncio.TimeoutError:
    +        # Abort long‐running or stuck operations
    +        raise HTTPException(status_code=504, detail="Database operation timed out")
    +    except SQLAlchemyError:
    +        db.rollback()
    +        raise HTTPException(status_code=500, detail="Database error occurred")
    
    +def _sync_create_user(db: Session, name: str, secret_password: str) -> User:
    +    """Helper: performs sync DB logic."""
    +    user = User(name=name, secret_password=secret_password)
    +    db.add(user)
    +    db.commit()
    +    db.refresh(user)
    +    return user
    

    Discussion:

    • asyncio.wait_for bounds each DB interaction to 5 seconds.
    • A timeout surfaces as a 504, preventing head‐of‐line blocking.
    • Moves sync logic into a dedicated helper (_sync_create_user).
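
     The generated diff only touches the write path, but GET /users/{user_id} blocks in exactly the same way. A minimal sketch applying the identical wrapper to the read path, assuming the imports and helpers introduced in the diff above (the _sync_read_user helper is named by analogy):

     # Hypothetical extension of the patch above: same executor + timeout pattern for reads.
     @app.get("/users/{user_id}")
     async def read_user(user_id: int, db: Session = Depends(get_db)):
         task = asyncio.get_event_loop().run_in_executor(
             None, partial(_sync_read_user, db, user_id)
         )
         try:
             user = await asyncio.wait_for(task, timeout=5.0)
         except asyncio.TimeoutError:
             raise HTTPException(status_code=504, detail="Database operation timed out")
         except SQLAlchemyError:
             raise HTTPException(status_code=500, detail="Database error occurred")
         if user is None:
             raise HTTPException(status_code=404, detail="User not found")
         return user

     def _sync_read_user(db: Session, user_id: int):
         """Helper: performs the sync DB lookup."""
         return db.query(User).filter(User.id == user_id).first()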

    2. Enable WAL & Tune SQLite Connection Parameters

    Priority: Recommended
    Rationale: Write–read lock contention is drastically reduced with WAL mode and tuned timeouts, minimizing serialized commits under concurrency.

    Proposed Changes

    --- a/app/db.py
    +++ b/app/db.py
    -from sqlalchemy import create_engine
    +from sqlalchemy import create_engine, event
     from sqlalchemy.orm import sessionmaker
     from sqlalchemy.pool import SingletonThreadPool
    
     DATABASE_URL = "sqlite:///./test.db"
    -engine = create_engine(DATABASE_URL)
    +engine = create_engine(
    +    DATABASE_URL,
    +    connect_args={
    +        "timeout": 15,         # max wait for file lock
    +        "check_same_thread": False,
    +    },
    +    poolclass=SingletonThreadPool,  # serialize SQLite access
    +)
    
     # Enable WAL + relaxed sync on each new connection
     @event.listens_for(engine, "connect")
     def _set_sqlite_pragma(dbapi_conn, record):
         cursor = dbapi_conn.cursor()
         cursor.execute("PRAGMA journal_mode=WAL;")
         cursor.execute("PRAGMA synchronous=NORMAL;")
         cursor.close()
    
     SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
    

    Discussion:

     • The timeout connect argument lets SQLite wait for the file lock instead of failing immediately with “database is locked”.
    • SingletonThreadPool serializes access, preventing pool thrashing.
    • WAL journal and synchronous=NORMAL yield faster commits with minimal data‐loss risk.
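
     The diff enables WAL but leaves checkpointing to SQLite’s automatic defaults. A minimal sketch of the scheduled PRAGMA wal_checkpoint(TRUNCATE) mentioned in the executive summary, assuming the engine from app/db.py and the FastAPI app object; the interval value is only an example:

     # Hypothetical sketch: truncate the WAL periodically so it cannot grow without bound.
     import asyncio

     from sqlalchemy import text

     CHECKPOINT_INTERVAL_SECONDS = 60  # tune from observed WAL growth

     def _truncate_wal() -> None:
         with engine.connect() as conn:
             conn.execute(text("PRAGMA wal_checkpoint(TRUNCATE);"))

     async def _checkpoint_wal_forever() -> None:
         while True:
             await asyncio.sleep(CHECKPOINT_INTERVAL_SECONDS)
             # run the blocking PRAGMA in a worker thread so the event loop stays free
             await asyncio.to_thread(_truncate_wal)

     @app.on_event("startup")
     async def _start_wal_checkpointer() -> None:
         asyncio.create_task(_checkpoint_wal_forever())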

    3. Add Retry Logic for Transient DB Failures

    Priority: Recommended
    Rationale: Brief file‐system or lock‐related errors surface as 500s; automatic retries with exponential back-off recover many of these without user impact.

    Proposed Changes

    --- a/app/crud.py
    +++ b/app/crud.py
     from sqlalchemy.exc import SQLAlchemyError
    +from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
    
     @retry(
         retry=retry_if_exception_type(SQLAlchemyError),
         wait=wait_exponential(multiplier=0.5, max=2.0),
         stop=stop_after_attempt(3),
         reraise=True,
     )
     def _sync_create_user(db: Session, name: str, secret_password: str) -> User:
         user = User(name=name, secret_password=secret_password)
         db.add(user)
         db.commit()
         db.refresh(user)
         return user
    
     # no changes to the FastAPI endpoint signature;
     # retry is applied transparently in the helper
    

    Discussion:

    • Tenacity retries up to 3 times with exponential back-off on any SQLAlchemyError.
    • Many brief lock‐contention errors auto-resolve before surfacing a 500.
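
     One way to validate the back-off before shipping it is a unit test that forces transient failures. A minimal sketch, assuming _sync_create_user is importable from app.crud as in the diff above:

     # Hypothetical test: the first two commits raise a transient OperationalError
     # (a SQLAlchemyError subclass), the third succeeds, and the caller sees no error.
     from unittest import mock

     from sqlalchemy.exc import OperationalError

     from app.crud import _sync_create_user

     def test_create_user_retries_transient_failures():
         db = mock.MagicMock()
         db.commit.side_effect = [
             OperationalError("INSERT", {}, Exception("database is locked")),
             OperationalError("INSERT", {}, Exception("database is locked")),
             None,  # third attempt succeeds
         ]
         user = _sync_create_user(db, name="alice", secret_password="s3cret")
         assert db.commit.call_count == 3
         assert user is not None

     Because the real wait_exponential delays run, this test takes roughly 1.5 seconds; patch the decorator’s wait if that matters in your suite.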

    4. Long-Term: Migrate to Server-Based or Async DB Client

    Priority: Nice-to-have
     Rationale: Eliminates SQLite’s file-lock limitations; true async I/O avoids thread-pool blocking. Consider Postgres/MySQL or an asyncpg/databases stack for high-throughput writes.

    Example Snippet (using databases)

    # install: pip install databases asyncpg
    
     import databases
     import sqlalchemy
     from fastapi import FastAPI
    
    DATABASE_URL = "postgresql://user:pass@localhost:5432/appdb"
    database = databases.Database(DATABASE_URL)
    metadata = sqlalchemy.MetaData()
    
    users = sqlalchemy.Table(
        "users", metadata,
        sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
        sqlalchemy.Column("name", sqlalchemy.String),
        sqlalchemy.Column("secret_password", sqlalchemy.String),
    )
    
    engine = sqlalchemy.create_engine(DATABASE_URL)
    metadata.create_all(engine)
    
    app = FastAPI()
    
    @app.on_event("startup")
    async def startup():
        await database.connect()
    
    @app.on_event("shutdown")
    async def shutdown():
        await database.disconnect()
    
    @app.post("/users/")
    async def create_user(name: str, secret_password: str):
        query = users.insert().values(name=name, secret_password=secret_password)
        user_id = await database.execute(query)
        return { "id": user_id, "name": name }
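
     The snippet only covers the write path. A minimal sketch of the matching read endpoint, assuming the same database and users objects (HTTPException comes from fastapi, as in the original application):

     from fastapi import HTTPException

     @app.get("/users/{user_id}")
     async def read_user(user_id: int):
         query = users.select().where(users.c.id == user_id)
         row = await database.fetch_one(query)
         if row is None:
             raise HTTPException(status_code=404, detail="User not found")
         return {"id": row["id"], "name": row["name"]}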
    

    Summary & Prioritization Table

     | Recommendation | Priority | Complexity | Implementation Cost | Expected Benefit |
     | --- | --- | --- | --- | --- |
     | 1. Timeout per-call DB ops | Critical | Medium | Medium | Prevent thread-pool exhaustion, fast failure |
     | 2. Enable WAL & tune SQLite (timeout, poolclass) | Recommended | Low | Low | Reduced lock waits, fewer p99 spikes |
     | 3. Add retry logic via Tenacity | Recommended | Low | Low | Higher success rate under transient contention |
     | 4. Migrate to server-based or async DB client | Nice-to-have | High | Medium–High | True concurrency, no SQLite lock contention |

    Threats & Next Steps

    Outline of business risks, manifestation scenarios, monitoring metrics, and validation steps for each DB hardening recommendation

     | Recommendation | Potential Business Risk | How It Can Materialize | Monitoring & Validation |
     | --- | --- | --- | --- |
     | 1. Enforce per-call timeouts on blocking DB operations | User-facing 504s during traffic spikes • orphaned threads hogging the pool, causing broader slowdowns | Legitimate bulk writes exceed 5 s → automatic abort → 504s • threads stuck in OS I/O linger indefinitely | Track 504 Gateway Timeout rate • instrument thread-pool queue length and active threads (APM, psutil) • run load tests with slow queries to tune timeout |
     | 2. Enable WAL & tune SQLite connection parameters | Looser durability (lost last-millisecond writes on crash) • WAL file growth filling disk • latency spikes under heavy writes | Power loss before WAL checkpoint flush drops recent transactions • WAL > disk quota → DB write errors • many concurrent writers queue up | Monitor WAL file size and checkpoint frequency (PRAGMA wal_checkpoint(TRUNCATE)) • alert on “database is locked” or disk-full errors • measure p50/p99 write latencies under load |
     | 3. Add retry logic for transient DB failures | Genuine errors masked (schema drift, full disk) • increased CPU/I/O during systemic failures, raising costs | Continuous retries on misconfiguration stall service • retry storms amplify outage impact | Expose metrics: retry count, retry latency, ultimate failures • alert if retries > 5% of requests • inject transient SQLAlchemyError in staging to validate back-off behavior |
     | 4. Migrate to server-based or async DB client | Migration complexity leading to production incidents • new single point of failure or network latency issues | Botched data migration causes downtime • async client misconfiguration leaks connections under load | Canary rollout with shadow reads/writes • run end-to-end latency benchmarks vs. current baseline • build rollback scripts and test in staging |

    Detailed Threats & Next Steps

    1. Enforce per-call timeouts on blocking DB operations
      • Business risk/trade-off
      – Premature 504s degrade user experience for valid but slow operations
      – Orphaned threads may accumulate if underlying calls ignore cancellation
      • Next steps / tests
      – Load-test with varying query complexities to find optimal timeout
      – Monitor thread pool metrics via APM or psutil
      – Implement safeguards to kill or recycle stuck threads after timeout

    2. Enable WAL & tune SQLite connection parameters
      • Business risk/trade-off
      – synchronous=NORMAL may lose last-written transactions on crash
      – WAL file growth can fill storage if checkpoints lag
      • Next steps / tests
      – Schedule periodic checkpoints and monitor WAL size
      – Simulate power-failure scenarios to measure acceptable data loss window
      – Generate concurrent writes (e.g., 100× clients) and chart p50/p99 latencies

    3. Add retry logic for transient DB failures
      • Business risk/trade-off
      – Retries may hide underlying bugs or data corruption issues
      – Excessive retries increase load during broader outages
      • Next steps / tests
      – Expose custom metrics for retry attempts, back-off intervals, and ultimate failures (see the instrumentation sketch after this list)
      – Alert when retry rate or cumulative retry latency exceeds threshold
      – Conduct chaos tests by injecting SQLAlchemyError and verify exponential back-off

    4. Migrate to server-based or async DB client
      • Business risk/trade-off
      – Migration introduces complexity: schema conversion, connection pooling, error handling changes
      – New async stack adds a learning curve and potential misconfigurations
      • Next steps / tests
      – Perform canary deployments with partial traffic routing
      – Validate data consistency with shadow reads/writes against existing SQLite
      – Benchmark end-to-end request latency and scale tests before full cutover
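
     A minimal instrumentation sketch for step 3, assuming prometheus_client is available; the metric names are illustrative and the decorated helper mirrors recommendation 3 (User is the model from app.py):

     # Hypothetical sketch: count each tenacity retry and measure per-attempt helper latency.
     from prometheus_client import Counter, Histogram
     from sqlalchemy.exc import SQLAlchemyError
     from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

     DB_RETRIES = Counter("db_retry_total", "DB operations that needed a retry")
     DB_CALL_SECONDS = Histogram("db_call_seconds", "Per-attempt latency of the DB helper")

     def _record_retry(retry_state) -> None:
         # tenacity invokes this before sleeping between attempts
         DB_RETRIES.inc()

     @retry(
         retry=retry_if_exception_type(SQLAlchemyError),
         wait=wait_exponential(multiplier=0.5, max=2.0),
         stop=stop_after_attempt(3),
         before_sleep=_record_retry,
         reraise=True,
     )
     def _sync_create_user(db, name: str, secret_password: str):
         with DB_CALL_SECONDS.time():
             user = User(name=name, secret_password=secret_password)
             db.add(user)
             db.commit()
             db.refresh(user)
             return user

     Attempts that exhaust all retries still surface as 500/504 responses, so the error-rate alerting above doubles as the “ultimate failures” signal.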

    By continuously tracking these metrics, running targeted failure scenarios in staging, and maintaining clear rollback paths, you can ensure each mitigation delivers real resilience improvements without introducing new business risks.


    Generated on 2025-05-09 12:51:53.290886837 UTC