Scenario Analysis
This guide shows you how to analyze scenario results from a resilience and reliability angle, using an LLM.
The analysis aims to give you a sound report of potential issues, threats, and remediations to consider for your application.
Prerequisites
- Install lueur
  If you haven’t installed lueur yet, follow the installation instructions.
Windows not supported
Unfortunately, the agent feature is not supported on Windows because the framework used by lueur to interact with LLMs does not support that platform.
Experimental feature
This feature is still experimental and subject to change. Working with LLMs requires accepting a degree of fuzziness and manual adjustment. Engineering is still very much a human endeavour!
Review a Python Web Application
In this scenario we take a very basic Python application built with the FastAPI and SQLAlchemy (SQLite) libraries. We want to learn what we can about this application.
- Source code of the application
app.py

```python
from fastapi import FastAPI, HTTPException, Depends
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, Session
from sqlalchemy.exc import SQLAlchemyError

# Database configuration
DATABASE_URL = "sqlite:///./test.db"
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

# Define the User model
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, index=True)
    secret_password = Column(String)

# Create the database tables
Base.metadata.create_all(bind=engine)

app = FastAPI(
    servers=[
        {"url": "http://localhost:9090", "description": "Staging environment"}
    ]
)

# Dependency to get the database session
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/")
async def read_root():
    return {"message": "Hello, World!"}

@app.post("/users/")
async def create_user(name: str, secret_password: str, db: Session = Depends(get_db)):
    try:
        db_user = User(name=name, secret_password=secret_password)
        db.add(db_user)
        db.commit()
        db.refresh(db_user)
        return db_user
    except SQLAlchemyError:
        db.rollback()
        raise HTTPException(status_code=500, detail="Database error occurred")

@app.get("/users/{user_id}")
async def read_user(user_id: int, db: Session = Depends(get_db)):
    try:
        user = db.query(User).filter(User.id == user_id).first()
        if user is None:
            raise HTTPException(status_code=404, detail="User not found")
        return user
    except SQLAlchemyError:
        raise HTTPException(status_code=500, detail="Database error occurred")
```
You may now install the dependencies to run it:
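A minimal set of dependencies, assuming you serve the app with uvicorn, might look like this:

```bash
pip install fastapi uvicorn sqlalchemy
```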
Finally, run the application as follows:
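One way to do this, assuming uvicorn as the ASGI server and the port declared in the app's `servers` entry (9090), is:

```bash
uvicorn app:app --port 9090
```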
This application has only a couple of endpoints and is purposefully not optimised.
- Generate a scenario for this application
We must first generate and run a scenario so we get a mapping of the application.
Generate scenario
The following scenario is created by lueur (we also trimmed it down to a single endpoint for clarity):
```yaml
title: Single high-latency spike (client ingress)
description: A single 800ms spike simulates jitter buffer underrun / GC pause on client network stack.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: latency
      side: client
      mean: 800.0
      stddev: 100.0
      direction: ingress
    strategy: null
  expect:
    status: 200
---
title: Stair-step latency growth (5 x 100 ms)
description: Latency increases 100 ms per call; emulate slow congestion build-up or head-of-line blocking.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: latency
      side: client
      mean: 100.0
      stddev: 30.0
      direction: ingress
    strategy:
      mode: repeat
      step: 100.0
      count: 5
      add_baseline_call: true
  expect:
    status: 200
---
title: Periodic 150-250 ms latency pulses during load
description: Three latency bursts at 10-40-70% of a 10s window; good for P95 drift tracking.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: latency
      mean: 150.0
      period: start:10%,duration:15%
    - type: latency
      mean: 250.0
      period: start:40%,duration:15%
    - type: latency
      mean: 150.0
      period: start:70%,duration:15%
    strategy:
      mode: load
      duration: 10s
      clients: 3
      rps: 2
  slo:
  - slo_type: latency
    title: P95 < 300ms
    objective: 95.0
    threshold: 300.0
  - slo_type: error
    title: P99 < 1% errors
    objective: 99.0
    threshold: 1.0
---
title: 5% packet loss for 4s
description: Simulates flaky Wi-Fi or cellular interference.
items:
- call:
    method: GET
    url: http://localhost:9090/
    timeout: 500
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: packetloss
      direction: egress
      period: start:30%,duration:40%
    strategy: null
  expect:
    status: 200
    response_time_under: 100.0
---
title: High jitter (±80ms @ 8Hz)
description: Emulates bursty uplink, measuring buffering robustness.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: jitter
      amplitude: 80.0
      frequency: 8.0
      direction: ingress
      side: server
    strategy: null
  expect:
    status: 200
---
title: 512 KBps bandwidth cap
description: Models throttled 3G link; validates handling of large payloads.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: bandwidth
      rate: 512
      unit: KBps
      direction: ingress
    strategy:
      mode: load
      duration: 15s
      clients: 2
      rps: 1
  expect:
    status: 200
---
title: Random 500 errors (5% of calls)
description: Backend flakiness under load; ensures retry / circuit-breaker logic.
items:
- call:
    method: GET
    url: http://localhost:9090/
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: httperror
      status_code: 500
      probability: 0.05
    strategy:
      mode: load
      duration: 8s
      clients: 5
      rps: 4
  expect:
    response_time_under: 100.0
---
title: Full black-hole for 1 s
description: Simulates router drop / Pod eviction causing 100% packet loss for a second.
items:
- call:
    method: GET
    url: http://localhost:9090/
    timeout: 500
    meta:
      operation_id: read_root__get
  context:
    upstreams:
    - http://localhost:9090/
    faults:
    - type: blackhole
      direction: egress
      period: start:45%,duration:10%
    strategy:
      mode: load
      duration: 10s
      clients: 2
      rps: 3
```
- Run the scenario against this application
```console
lueur scenario run --scenario scenario.yaml

================ Running Scenarios ================

⠏ 1/1 Single high-latency spike (client ingress) ▮
⠏ 6/6 Stair-step latency growth (5 x 100 ms) ▮▮▮▮▮▮
⠏ 1/1 Periodic 150-250 ms latency pulses during load ▮
⠏ 1/1 5% packet loss for 4s ▮
⠏ 1/1 High jitter (±80ms @ 8Hz) ▮
⠏ 1/1 512 KBps bandwidth cap ▮
⠏ 1/1 Random 500 errors (5% of calls) ▮
⠙ 1/1 Full black-hole for 1 s ▮

===================== Summary =====================

Tests run: 13, Tests failed: 0
Total time: 45.3s
```
- Analyze the generated results
The generated report looks like this:
Generated scenario analysis
lueur resilience report analysis
Table of Contents
- Overall Resilience Posture
- SLO Failures Deep Dive
- Potential Cause Hypotheses
- Recommendations
- 1. Enforce Per-Call Timeouts on Blocking DB Operations
- 2. Enable WAL & Tune SQLite Connection Parameters
- 3. Add Retry Logic for Transient DB Failures
- 4. Long-Term: Migrate to Server-Based or Async DB Client
- Summary & Prioritization Table
- Threats & Next Steps
- Detailed Threats & Next Steps
Executive Summary
Findings
- Our current SQLite setup uses default durability (`DELETE` journal mode) and a single-threaded pool, limiting concurrent reads/writes and resilience under load.
- Blocking DB calls have no per-call timeouts, risking orphaned threads and full thread-pool exhaustion.
- Transient SQL errors surface directly as user-facing failures—there’s no structured retry for DB operations.
- We rely on a synchronous client (`SQLAlchemy` + `sqlite3`), which can exacerbate latency spikes and resource contention.
Recommendations
- Enforce per-call timeouts on DB operations: Cap query runtime (e.g. via `asyncio.wait_for` or driver-level timeouts) to prevent slow queries from monopolizing threads.
- Enable WAL with `synchronous=NORMAL` & tune checkpointing: Switch to Write-Ahead Logging for better reader-writer concurrency and schedule regular `PRAGMA wal_checkpoint(TRUNCATE)` to bound WAL growth.
- Add retry logic for transient DB failures: Wrap all DB calls in `tenacity`-style exponential backoff (stop after ~3 attempts) to recover from brief locks or I/O hiccups.
- Migrate to a server-based or async DB client: Evaluate PostgreSQL/MySQL or an `asyncpg`-based driver to offload locking and I/O, improve scalability, and reduce blocking.
Key Trade-offs & Threats
- Durability vs. Performance: `synchronous=NORMAL` may lose sub-second commits on crash.
- Hidden Failures: Excessive retries can mask schema drift, full-disk, or logic bugs.
- Premature Aborts: Timeouts might kill valid long-running operations and leak threads.
- Migration Complexity: A new DB server or async stack introduces operational overhead and potential misconfiguration.
Next Steps & Validation
- Fault Injection: Kill processes mid-commit to measure the acceptable data-loss window.
- Load & Chaos Testing: Simulate heavy write loads and inject `SQLAlchemyError` to tune timeouts and backoff intervals.
- Monitoring & Alerts:
  - Track 504 Gateway Timeout rates and thread-pool queue length
  - Monitor WAL file size, checkpoint lag, and “database is locked” errors
  - Expose retry count, retry latency, and ultimate failure rates
Implementing these changes and validating them through targeted tests will significantly improve throughput, reliability, and predictability—while maintaining clear rollback paths.
Overall Resilience Posture
The root endpoint proved highly resilient—handling single and ramped latency spikes, jitter, packet loss, bandwidth throttling, and injected HTTP-500 errors with zero expectation failures or status-code breaches. The only SLO miss was under periodic 150–250 ms latency pulses (P95 > 300 ms), and a 1 s black-hole produced a few timeouts, suggesting targeted timeout/retry tuning for latency-critical paths.
SLO Failures Deep Dive
In-depth look at scenarios breaching one or more SLOs.
| Scenario | Endpoint | SLO Violated | Objective | Observed | Margin | Failure Pattern |
|---|---|---|---|---|---|---|
| Periodic 150–250 ms latency pulses during load | GET / | p95 < 300 ms | 95% < 300 ms | 603.34 ms | +303.34 ms | Tail latency uplift: 56/62 (90.3%) requests over threshold across all bursts |
| Full black-hole for 1 s | GET / | error rate < 1% | < 1% errors | 4/62 (6.5%) | +5.5 pp | Failures concentrated during the 1 s outage (p95=500.55 ms, p99=500.81 ms) |

Dashboard Summary

| Scope | Total Scenarios | Passed | Failed |
|---|---|---|---|
| All Scenarios | 8 | 6 | 2 |
| GET / | 8 | 6 | 2 |

Potential Cause Hypotheses
Based on the periodic latency spikes and the 1 s black-hole on `GET /`, here are the most plausible, developer-actionable root causes:

- Thread-pool saturation from sync DB calls in async routes
  - Symptom mapping: tail-latency pulses on `GET /` (which is async, yet still experiences delay)
  - Hypothesis: even though `read_root()` is an `async def`, other endpoints (`/users/`) execute blocking SQLAlchemy calls in the default thread-pool executor. Under load, these threads become fully occupied, starving Uvicorn’s worker threads and delaying all incoming requests—including the root route—for short bursts.
- SQLite file-lock contention during concurrent writes
  - Symptom mapping: periodic 150–250 ms latency spikes and a single 1 s outage burst
  - Hypothesis: the app uses a single SQLite file without WAL or proper connection pool sizing. When many `POST /users/` commits hit the DB simultaneously, the underlying file lock serializes operations, causing some FastAPI workers to block until the lock is released. This serialization shows up as both tail-latency uplift and a black-hole when the lock persists.
- No per-call timeouts on blocking operations
  - Symptom mapping: prolonged service unavailability for the duration of the lock (≈1 s) rather than fast failure
  - Hypothesis: neither the SQLAlchemy engine nor the blocking calls in endpoints have explicit timeouts. When a write or commit stalls on I/O (e.g., lock contention or fsync), the request hangs indefinitely (up to the HTTP client timeout), magnifying the outage window and violating the p95/p99 latency SLOs.
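If you want to sanity-check the first hypothesis against your running instance, a small probe can fire concurrent write traffic while timing the root endpoint. This is a minimal sketch, not part of the generated report; it assumes the app is reachable on http://localhost:9090 and that `httpx` is installed.

```python
import asyncio
import statistics
import time

import httpx  # assumption: pip install httpx

BASE = "http://localhost:9090"

async def hammer_writes(client: httpx.AsyncClient, n: int = 200) -> None:
    # Each POST triggers blocking SQLAlchemy work server-side.
    for i in range(n):
        await client.post(f"{BASE}/users/", params={"name": f"u{i}", "secret_password": "x"})

async def probe_root(client: httpx.AsyncClient, n: int = 100) -> list[float]:
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        await client.get(f"{BASE}/")
        timings.append((time.perf_counter() - start) * 1000)  # ms
        await asyncio.sleep(0.05)
    return timings

async def main() -> None:
    async with httpx.AsyncClient(timeout=5.0) as client:
        writers = [asyncio.create_task(hammer_writes(client)) for _ in range(5)]
        timings = await probe_root(client)
        for w in writers:
            w.cancel()
        await asyncio.gather(*writers, return_exceptions=True)
    print(f"GET / p50={statistics.median(timings):.1f}ms "
          f"p95={statistics.quantiles(timings, n=20)[18]:.1f}ms")

asyncio.run(main())
```

If the root endpoint's p95 climbs noticeably while writes are in flight, that is consistent with the saturation and lock-contention hypotheses above.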
Recommendations
Actionable proposals to address DB thread‐pool saturation, SQLite lock contention, and lack of timeouts
Below are four targeted recommendation sets. Each includes PR-style diffs, priority levels, and a summary table to help you weigh cost, complexity, and expected benefits.
1. Enforce Per-Call Timeouts on Blocking DB Operations
Priority: Critical
Rationale: Bounding each synchronous DB call prevents unbounded growth of the thread pool, protects the event loop under spikes, and converts “black-hole” waits into fast failovers.

Proposed Changes
```diff
--- a/app/main.py
+++ b/app/main.py
 import asyncio
 from functools import partial
 from fastapi import FastAPI, HTTPException, Depends
 from sqlalchemy.exc import SQLAlchemyError
+from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

 @app.post("/users/")
 async def create_user(
     name: str,
     secret_password: str,
     db: Session = Depends(get_db),
 ):
-    try:
-        db_user = User(name=name, secret_password=secret_password)
-        db.add(db_user)
-        db.commit()
-        db.refresh(db_user)
-        return db_user
-    except SQLAlchemyError:
-        db.rollback()
-        raise HTTPException(status_code=500, detail="Database error occurred")
+    # Run the blocking DB write in a thread with a 5s timeout
+    task = asyncio.get_event_loop().run_in_executor(
+        None, partial(_sync_create_user, db, name, secret_password)
+    )
+    try:
+        return await asyncio.wait_for(task, timeout=5.0)
+    except asyncio.TimeoutError:
+        # Abort long-running or stuck operations
+        raise HTTPException(status_code=504, detail="Database operation timed out")
+    except SQLAlchemyError:
+        db.rollback()
+        raise HTTPException(status_code=500, detail="Database error occurred")
+
+
+def _sync_create_user(db: Session, name: str, secret_password: str) -> User:
+    """Helper: performs sync DB logic."""
+    user = User(name=name, secret_password=secret_password)
+    db.add(user)
+    db.commit()
+    db.refresh(user)
+    return user
```
Discussion:
- `asyncio.wait_for` bounds each DB interaction to 5 seconds.
- A timeout surfaces as a 504, preventing head-of-line blocking.
- Moves sync logic into a dedicated helper (`_sync_create_user`).
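The timeout-plus-executor pattern can be exercised in isolation before wiring it into the endpoint. The sketch below is not from the report; `slow_db_write` is a hypothetical stand-in for a stalled commit.

```python
import asyncio
import time

def slow_db_write() -> str:
    time.sleep(3)  # hypothetical stand-in for a commit stuck on a file lock
    return "committed"

async def main() -> None:
    loop = asyncio.get_running_loop()
    task = loop.run_in_executor(None, slow_db_write)
    try:
        print(await asyncio.wait_for(task, timeout=1.0))
    except asyncio.TimeoutError:
        # This is the point where the endpoint would return a 504.
        # Note: the worker thread keeps running until time.sleep() returns,
        # which is exactly the orphaned-thread trade-off called out later.
        print("timed out after 1s")

asyncio.run(main())
```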
2. Enable WAL & Tune SQLite Connection Parameters
Priority: Recommended
Rationale: Write–read lock contention is drastically reduced with WAL mode and tuned timeouts, minimizing serialized commits under concurrency.

Proposed Changes
```diff
--- a/app/db.py
+++ b/app/db.py
-from sqlalchemy import create_engine
+from sqlalchemy import create_engine, event
 from sqlalchemy.orm import sessionmaker
 from sqlalchemy.pool import SingletonThreadPool

 DATABASE_URL = "sqlite:///./test.db"

-engine = create_engine(DATABASE_URL)
+engine = create_engine(
+    DATABASE_URL,
+    connect_args={
+        "timeout": 15,               # max wait for file lock
+        "check_same_thread": False,
+    },
+    poolclass=SingletonThreadPool,   # serialize SQLite access
+)

 # Enable WAL + relaxed sync on each new connection
 @event.listens_for(engine, "connect")
 def _set_sqlite_pragma(dbapi_conn, record):
     cursor = dbapi_conn.cursor()
     cursor.execute("PRAGMA journal_mode=WAL;")
     cursor.execute("PRAGMA synchronous=NORMAL;")
     cursor.close()

 SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
```
Discussion:
- `timeout` back-off reduces immediate “database is locked” errors.
- `SingletonThreadPool` serializes access, preventing pool thrashing.
- WAL journal and `synchronous=NORMAL` yield faster commits with minimal data-loss risk.
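Because `journal_mode=WAL` is persisted in the database file, you can confirm the change took effect with a quick standalone check. This is a sketch assuming the default `./test.db` path; note that `synchronous` is a per-connection setting applied by the event listener, so it is not visible from an unrelated connection.

```python
import sqlite3

# Opens its own connection, so it only sees settings persisted in the file.
conn = sqlite3.connect("./test.db")
mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
print(f"journal_mode={mode}")  # expected: "wal" once the engine has connected at least once
conn.close()
```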
3. Add Retry Logic for Transient DB Failures
Priority: Recommended
Rationale: Brief file-system or lock-related errors surface as 500s; automatic retries with exponential back-off recover many of these without user impact.

Proposed Changes
```diff
--- a/app/crud.py
+++ b/app/crud.py
 from sqlalchemy.exc import SQLAlchemyError
+from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

 @retry(
     retry=retry_if_exception_type(SQLAlchemyError),
     wait=wait_exponential(multiplier=0.5, max=2.0),
     stop=stop_after_attempt(3),
     reraise=True,
 )
 def _sync_create_user(db: Session, name: str, secret_password: str) -> User:
     user = User(name=name, secret_password=secret_password)
     db.add(user)
     db.commit()
     db.refresh(user)
     return user

 # no changes to the FastAPI endpoint signature;
 # retry is applied transparently in the helper
```
Discussion:
- Tenacity retries up to 3 times with exponential back-off on any `SQLAlchemyError`.
- Many brief lock-contention errors auto-resolve before surfacing a 500.
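The back-off behaviour is easy to validate in isolation. The snippet below is a hedged sketch, not part of the report: it fakes two transient failures before succeeding so you can watch the retry policy in action.

```python
from sqlalchemy.exc import SQLAlchemyError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

attempts = 0

@retry(
    retry=retry_if_exception_type(SQLAlchemyError),
    wait=wait_exponential(multiplier=0.5, max=2.0),
    stop=stop_after_attempt(3),
    reraise=True,
)
def flaky_write() -> str:
    global attempts
    attempts += 1
    if attempts < 3:
        # Emulate a transient "database is locked" failure
        raise SQLAlchemyError("database is locked")
    return "committed"

print(flaky_write(), "after", attempts, "attempts")  # committed after 3 attempts
```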
4. Long-Term: Migrate to Server-Based or Async DB Client
Priority: Nice-to-have
Rationale: Eliminates SQLite’s file-lock limitations, while true async I/O avoids thread-pool blockers. Consider Postgres/MySQL or an `asyncpg`/`databases` stack for high-throughput writes.

Example Snippet (using `databases`)

```python
# install: pip install databases asyncpg
import databases
import sqlalchemy
from fastapi import FastAPI

DATABASE_URL = "postgresql://user:pass@localhost:5432/appdb"

database = databases.Database(DATABASE_URL)
metadata = sqlalchemy.MetaData()

users = sqlalchemy.Table(
    "users",
    metadata,
    sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column("name", sqlalchemy.String),
    sqlalchemy.Column("secret_password", sqlalchemy.String),
)

engine = sqlalchemy.create_engine(DATABASE_URL)
metadata.create_all(engine)

app = FastAPI()

@app.on_event("startup")
async def startup():
    await database.connect()

@app.on_event("shutdown")
async def shutdown():
    await database.disconnect()

@app.post("/users/")
async def create_user(name: str, secret_password: str):
    query = users.insert().values(name=name, secret_password=secret_password)
    user_id = await database.execute(query)
    return {"id": user_id, "name": name}
```
Summary & Prioritization Table
| Recommendation | Priority | Complexity | Implementation Cost | Expected Benefit |
|---|---|---|---|---|
| 1. Timeout per-call DB ops | Critical | Medium | Medium | Prevent thread-pool exhaustion, fast failure |
| 2. Enable WAL & tune SQLite (`timeout`, poolclass) | Recommended | Low | Low | Reduced lock waits, fewer p99 spikes |
| 3. Add retry logic via Tenacity | Recommended | Low | Low | Higher success rate under transient contention |
| 4. Migrate to server-based or async DB client | Nice-to-have | High | Medium–High | True concurrency, no SQLite lock contention |

Threats & Next Steps
Outline of business risks, manifestation scenarios, monitoring metrics, and validation steps for each DB hardening recommendation
| Recommendation | Potential Business Risk | How It Can Materialize | Monitoring & Validation |
|---|---|---|---|
| 1. Enforce per-call timeouts on blocking DB operations | • User-facing 504s during traffic spikes<br>• Orphaned threads hogging the pool, causing broader slowdowns | • Legitimate bulk writes exceed 5 s → automatic abort → 504s<br>• Threads stuck in OS I/O linger indefinitely | • Track 504 Gateway Timeout rate<br>• Instrument thread-pool queue length and active threads (APM, `psutil`)<br>• Run load tests with slow queries to tune timeout |
| 2. Enable WAL & tune SQLite connection parameters | • Looser durability (lost last-millisecond writes on crash)<br>• WAL file growth filling disk<br>• Latency spikes under heavy writes | • Power loss before WAL checkpoint flush drops recent transactions<br>• WAL > disk quota → DB write errors<br>• Many concurrent writers queue up | • Monitor WAL file size and checkpoint frequency (`PRAGMA wal_checkpoint(TRUNCATE)`)<br>• Alert on “database is locked” or disk-full errors<br>• Measure p50/p99 write latencies under load |
| 3. Add retry logic for transient DB failures | • Genuine errors masked (schema drift, full disk)<br>• Increased CPU/I/O during systemic failures, raising costs | • Continuous retries on misconfiguration stall service<br>• Retry storms amplify outage impact | • Expose metrics: retry count, retry latency, ultimate failures<br>• Alert if retries > 5% of requests<br>• Inject transient `SQLAlchemyError` in staging to validate back-off behavior |
| 4. Migrate to server-based or async DB client | • Migration complexity leading to production incidents<br>• New single point of failure or network latency issues | • Botched data migration causes downtime<br>• Async client misconfiguration leaks connections under load | • Canary rollout with shadow reads/writes<br>• Run end-to-end latency benchmarks vs. current baseline<br>• Build rollback scripts and test in staging |
Detailed Threats & Next Steps
- Enforce per-call timeouts on blocking DB operations
  • Business risk/trade-off
    – Premature 504s degrade user experience for valid but slow operations
    – Orphaned threads may accumulate if underlying calls ignore cancellation
  • Next steps / tests
    – Load-test with varying query complexities to find the optimal timeout
    – Monitor thread-pool metrics via APM or `psutil`
    – Implement safeguards to kill or recycle stuck threads after timeout
- Enable WAL & tune SQLite connection parameters
  • Business risk/trade-off
    – `synchronous=NORMAL` may lose last-written transactions on crash
    – WAL file growth can fill storage if checkpoints lag
  • Next steps / tests
    – Schedule periodic checkpoints and monitor WAL size
    – Simulate power-failure scenarios to measure the acceptable data-loss window
    – Generate concurrent writes (e.g., 100× clients) and chart p50/p99 latencies (a sketch follows this list)
- Add retry logic for transient DB failures
  • Business risk/trade-off
    – Retries may hide underlying bugs or data corruption issues
    – Excessive retries increase load during broader outages
  • Next steps / tests
    – Expose custom metrics for retry attempts, back-off intervals, and ultimate failures
    – Alert when retry rate or cumulative retry latency exceeds threshold
    – Conduct chaos tests by injecting `SQLAlchemyError` and verify exponential back-off
- Migrate to server-based or async DB client
  • Business risk/trade-off
    – Migration introduces complexity: schema conversion, connection pooling, error handling changes
    – A new async stack adds a learning curve and potential misconfigurations
  • Next steps / tests
    – Perform canary deployments with partial traffic routing
    – Validate data consistency with shadow reads/writes against the existing SQLite
    – Benchmark end-to-end request latency and scale tests before full cutover
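As one way to run that concurrent-write measurement, the following is a minimal sketch (not from the report): it assumes `httpx` is installed and the app is listening on http://localhost:9090.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import httpx  # assumption: pip install httpx

BASE = "http://localhost:9090"

def timed_write(i: int) -> float:
    start = time.perf_counter()
    httpx.post(f"{BASE}/users/", params={"name": f"load-{i}", "secret_password": "x"}, timeout=10.0)
    return (time.perf_counter() - start) * 1000  # ms

# Roughly 100 concurrent writers hammering POST /users/
with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(timed_write, range(500)))

cuts = statistics.quantiles(latencies, n=100)
print(f"p50={statistics.median(latencies):.1f}ms p99={cuts[98]:.1f}ms")
```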
By continuously tracking these metrics, running targeted failure scenarios in staging, and maintaining clear rollback paths, you can ensure each mitigation delivers real resilience improvements without introducing new business risks.
Generated on 2025-05-09 12:51:53.290886837 UTC