
fault is an Open-Source Rust-powered CLI and AI-MCP server that lets engineers and AI agents team up to ship production-ready code.

Plug fault into your favorite AI-Editor

`fault agent tool fault_score_reliability`

```json
{
  "explanation": "The function has no error handling, timeout, retry or fallback logic, making it prone to upstream failures causing unhandled exceptions.",
  "score": 0.2
}
```
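
Because the tool emits plain JSON, it is easy to script against. As a purely illustrative sketch (the exact invocation and any required arguments may differ in your setup), a CI step could gate merges on the score:

```python
import json
import subprocess

# Hypothetical CI gate: run the reliability scorer and fail the build when
# the score falls below a team-chosen cutoff (0.5 here is an assumption).
result = subprocess.run(
    ["fault", "agent", "tool", "fault_score_reliability"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)
if report["score"] < 0.5:
    raise SystemExit(f"Reliability score too low: {report['explanation']}")
```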

`fault agent code-review`

Here’s a small, focused refactor of your `/` handler that will immediately improve resilience, observability and error handling without drastically changing your app:
1. **Single, long-lived HTTP client**  
   Re-using an `httpx.Client` gives you connection pooling and lets you centrally configure timeouts.

2. **Per-request timeouts**  
   By setting `timeout=httpx.Timeout(5.0, connect=5, read=10)` we fail fast on network issues.

3. **Retries with exponential backoff**  
   The [`backoff`](https://pypi.org/project/backoff/) library wraps `_fetch_todo()` so that transient errors and 5xx status codes automatically retry (up to 5× or ~30 s total); see the sketch after this list.

4. **Error propagation & HTTPExceptions**  
   We catch `RequestError` (network) vs `HTTPStatusError` (non-2xx) and translate them into appropriate 5xx/4xx FastAPI responses.

5. **Logging**  
   Warnings on upstream 4xxs and errors on network failures let you trace the failure mode in your logs.
    
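To make point 3 above concrete, here is a minimal sketch of what the retry wrapper could look like, assuming the third-party [`backoff`](https://pypi.org/project/backoff/) package and the demo's JSONPlaceholder upstream:

```python
import backoff
import httpx

# Single long-lived client (point 1) with explicit per-phase timeouts (point 2).
client = httpx.Client(timeout=httpx.Timeout(connect=5, read=10, write=10, pool=5))

# Point 3: retry transient network errors and 5xx responses with exponential
# backoff, giving up after 5 attempts or ~30 seconds overall; 4xx responses
# are not retried because they are unlikely to be transient.
@backoff.on_exception(
    backoff.expo,
    (httpx.RequestError, httpx.HTTPStatusError),
    max_tries=5,
    max_time=30,
    giveup=lambda exc: isinstance(exc, httpx.HTTPStatusError)
    and exc.response.status_code < 500,
)
def _fetch_todo() -> dict:
    response = client.get("https://jsonplaceholder.typicode.com/todos/1")
    response.raise_for_status()
    return response.json()
```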

## Next steps

- **Metrics & tracing**:  
  Plug in Prometheus counters or OpenTelemetry spans around `_fetch_todo` to observe retry counts, latencies, and success/fail rates.
- **Circuit breaker**:  
  If your upstream is flapping, consider a circuit-breaker pattern (e.g. [`pybreaker`](https://pypi.org/project/pybreaker/)) to avoid overwhelming it.
- **Async support**:  
  If you expect high concurrency, convert to `httpx.AsyncClient` and make `index` an `async def`.
    
These minimal changes will immediately make your endpoint far more reliable under network glitches or upstream outages, yet they remain straightforward to slot into your existing `app.py`.

`fault agent tool fault_make_reliability_and_perf_changes`

```diff
--- app.py
+++ app.py
@@ -1,15 +1,54 @@
 import os
+import logging
 
 import httpx
-from fastapi import FastAPI
+from fastapi import FastAPI, HTTPException
 
 UPSTREAM_URL = os.getenv("UPSTREAM_URL", "https://jsonplaceholder.typicode.com")
+REQUEST_TIMEOUT = float(os.getenv("REQUEST_TIMEOUT", 5.0))
+MAX_RETRIES = int(os.getenv("MAX_RETRIES", 3))
 
 app = FastAPI()
+logger = logging.getLogger("uvicorn.error")
 
+# httpx's built-in `retries` option only retries failed connection
+# attempts; status-code-aware retries with backoff need a third-party
+# transport or a decorator such as `backoff`.
+transport = httpx.AsyncHTTPTransport(retries=MAX_RETRIES)
 
+@app.on_event("startup")
+async def startup_event():
+    '''
+    Initialize the HTTPX client with retry and timeout settings for resilience.
+    '''
+    app.state.client = httpx.AsyncClient(
+        base_url=UPSTREAM_URL,
+        timeout=httpx.Timeout(REQUEST_TIMEOUT),
+        transport=transport,
+        headers={"Host": "jsonplaceholder.typicode.com"},
+    )
+
+@app.on_event("shutdown")
+async def shutdown_event():
+    '''
+    Close the HTTPX client on application shutdown to release resources.
+    '''
+    await app.state.client.aclose()
+
 @app.get("/")
-def index():
-    return httpx.get(f"{UPSTREAM_URL}/todos/1", headers={
-        "Host": "jsonplaceholder.typicode.com"
-    }).json()
+async def index():
+    '''
+    Retrieves a todo item from the upstream service.
+    Implements retry logic, timeouts, and error handling for resilience.
+    '''
+    try:
+        response = await app.state.client.get("/todos/1")
+        response.raise_for_status()
+    except httpx.RequestError as exc:
+        logger.error(f"Request error: {exc}")
+        raise HTTPException(status_code=503, detail="Upstream service unavailable")
+    except httpx.HTTPStatusError as exc:
+        logger.error(f"Invalid response {exc.response.status_code}: {exc}")
+        raise HTTPException(status_code=502, detail="Upstream returned an error")
+
+    return response.json()
```
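
A quick way to validate the refactor is a smoke test that boots the app and exercises the handler. A minimal sketch, assuming the refactored module is saved as `app.py` and importable:

```python
from fastapi.testclient import TestClient

from app import app

# Entering the context manager runs the startup handler, so the shared
# AsyncClient exists before the first request is made.
with TestClient(app) as client:
    response = client.get("/")
    # A healthy upstream yields 200; a failing one should now surface as a
    # clean 502/503 instead of an unhandled exception.
    assert response.status_code in (200, 502, 503)
    print(response.status_code)
```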

`fault agent tool fault_suggest_service_level_objectives_slo`

```json
[
  {
    "type": "latency",
    "title": "P95 Latency for /todos/1 API",
    "objective": 95.0,
    "explanation": "95th percentile of request latency must stay under threshold to ensure responsiveness",
    "threshold": 0.3,
    "unit": "s",
    "window": "7d",
    "sli": {
      "prometheus": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{handler=\"/todos/1\"}[5m])) by (le))",
      "gcp/cloudrun": {
        "displayName": "95% - P95 Latency - Calendar day",
        "goal": 0.95,
        "calendarPeriod": "DAY",
        "serviceLevelIndicator": {
          "windowsBased": {
            "windowPeriod": "300s",
            "goodTotalRatioThreshold": {
              "basicSliPerformance": {
                "latency": {
                  "threshold": "0.3s"
                }
              },
              "threshold": 0.95
            }
          }
        }
      }
    }
  },
  {
    "type": "availability",
    "title": "2xx Success Rate for /todos/1 API",
    "objective": 99.9,
    "explanation": "Percentage of requests returning HTTP 2xx status codes",
    "threshold": 99.9,
    "unit": "%",
    "window": "7d",
    "sli": {
      "prometheus": "sum(rate(http_requests_total{handler=\"/todos/1\",status=~\"2..\"}[5m]))/sum(rate(http_requests_total{handler=\"/todos/1\"}[5m]))*100",
      "gcp/cloudrun": {
        "displayName": "99.9% - Availability - Calendar day",
        "goal": 0.999,
        "calendarPeriod": "DAY",
        "serviceLevelIndicator": {
          "windowsBased": {
            "windowPeriod": "300s",
            "goodTotalRatioThreshold": {
              "threshold": 0.999
            }
          }
        }
      }
    }
  },
  {
    "type": "error",
    "title": "5xx Error Rate for /todos/1 API",
    "objective": 1.0,
    "explanation": "Percentage of requests resulting in HTTP 5xx errors must remain low",
    "threshold": 1.0,
    "unit": "%",
    "window": "7d",
    "sli": {
      "prometheus": "sum(rate(http_requests_total{handler=\"/todos/1\",status=~\"5..\"}[5m]))/sum(rate(http_requests_total{handler=\"/todos/1\"}[5m]))*100",
      "gcp/cloudrun": {
        "displayName": "1% - 5xx Error Rate - Calendar day",
        "goal": 0.01,
        "calendarPeriod": "DAY",
        "serviceLevelIndicator": {
          "windowsBased": {
            "windowPeriod": "300s",
            "goodTotalRatioThreshold": {
              "threshold": 0.01
            }
          }
        }
      }
    }
  }
]
```
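
Once the Prometheus expression above is scraping real traffic, checking the SLI against its objective is a small script away. A minimal sketch, assuming a Prometheus server reachable at `http://localhost:9090` (a placeholder address):

```python
import httpx

PROMETHEUS = "http://localhost:9090/api/v1/query"  # assumed address
SLI = (
    'sum(rate(http_requests_total{handler="/todos/1",status=~"2.."}[5m]))'
    '/sum(rate(http_requests_total{handler="/todos/1"}[5m]))*100'
)

# Query the instant value of the availability SLI and compare it with the
# 99.9% objective suggested above.
resp = httpx.get(PROMETHEUS, params={"query": SLI}, timeout=5.0)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    availability = float(result[0]["value"][1])
    print(f"availability={availability:.3f}%  meets SLO: {availability >= 99.9}")
```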

`fault agent tool fault_run_latency_impact_scenario`

# Report

## Scenario: Evaluating runtime performance of http://localhost:9090 (items: 1)

### 🎯 `GET` http://localhost:9090 | Passed

**Call**:

- Method: `GET`
- Timeout: 10000ms
- Headers: -
- Body?: No

**Strategy**: load for 10s with 1 client @ 3 RPS

**Faults Applied**:

| Type | Timeline | Description |
|------|----------|-------------|
| latency | 0% `xxxxxxxxxx` 100% | Latency: ➡️🖧, Per Read/Write Op.: false, Distribution: normal, Mean: 300.00 ms, Stddev: 0.00 ms |

**Run Overview**:

- Num. Requests: 29
- Num. Errors: 0 (0.0%)
- Min. Response Time: 334.14 ms
- Max. Response Time: 462.50 ms
- Mean Latency: 339.96 ms
- Total Time: 10 seconds and 214 ms

| Latency Percentile | Latency (ms) | Num. Requests (% of total) |
|------------|--------------|-----------|
| p25 | 335.78 | 8 (27.6%) |
| p50 | 339.96 | 15 (51.7%) |
| p75 | 342.26 | 23 (79.3%) |
| p95 | 406.86 | 29 (100.0%) |
| p99 | 462.50 | 29 (100.0%) |

| SLO | Pass? | Objective | Margin | Num. Requests Over Threshold (% of total) |
|-----------|-------|-----------|--------|--------------------------|
| 99% @ 350ms | ❌ | 99% < 350ms | Above by 112.5ms | 2 (6.9%) |
| 95% @ 200ms | ❌ | 95% < 200ms | Above by 206.9ms | 29 (100.0%) |
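
The margins in the SLO table are simple arithmetic on the measured percentiles; a short sketch reproducing both rows from the numbers above:

```python
# Latency percentiles from the run overview, in milliseconds.
percentiles = {"p95": 406.86, "p99": 462.50}

def slo_margin(observed_ms: float, objective_ms: float) -> tuple[bool, float]:
    """Return (passes, amount by which observed latency exceeds the objective)."""
    return observed_ms <= objective_ms, observed_ms - objective_ms

# "99% @ 350ms": 462.50 - 350 = 112.5 ms over;
# "95% @ 200ms": 406.86 - 200 = 206.9 ms over.
for pct, objective in (("p99", 350.0), ("p95", 200.0)):
    ok, margin = slo_margin(percentiles[pct], objective)
    print(f"{pct} @ {objective:.0f}ms:", "pass" if ok else f"above by {margin:.1f}ms")
```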

AI-Code Generators Need Guidance

We believe software engineers' experience has never been more necessary.


Lack of operational context

AI-Editors focus on your source code but are limited by what they can see. Operations remain out of their reach, which leads to code that isn't always appropriate for your unique needs.

Refactoring pain is real

The acceleration AI-generated code brings to bootstrapping is undeniable. Yet delivery pipelines take as long as ever, because the output requires attention and refactoring before it can be pushed to production.

Modern Features for Hybrid Engineering Teams

AI is an evolution in the engineering industry's journey, not a rupture from its past.


  • Fully Open Source


    fault is open-sourced under a permissive license and part of the Rebound family. Feel free to contribute to it.

  • Made for Engineers


    fault is designed to work equally well with engineers and AI agents in a team effort.

  • Choose Your LLM


    fault supports OpenAI, Gemini, OpenRouter and ollama. Pick the LLM that best matches your working style.

  • MCP Server


    Plug fault into any AI-Code editor and help it produce sound production output with fault's various tools.

  • Batteries Included & Extensible


    Inject common network faults such as latency, blackholes or packet loss out of the box.

  • Complex Scenarios


    Easily schedule complex scenarios by configuring timelines of fault injection.

  • Support Any TCP Protocol


    While fault is well-suited for HTTP, it can be just as easily applied to any TCP-based protocol. Explore how your application reacts.

  • Observability Ready


    fault can be configured to send OpenTelemetry traces to your favorite provider. Observe the effect of fault injection on your system.

  • eBPF Support


    Intercept traffic without changing anything in your application with fault's eBPF support. Turn fault injection into a stealth game.

Ready?

Download fault and build production-grade applications now.


Frequently Asked Questions

Everything you need to know about bringing fault into your organization.


What is fault?

fault is a developer product that aims to help engineers keep their high standards while onboarding AI into their pipelines.

How is fault delivered?

fault is a Rust CLI. It runs natively on Linux, macOS and Windows.

Is fault free?

The fault CLI is free and open-source and will remain so. You pay for any LLM usage through your own subscription.

Do you offer enterprise commercial support?

As part of the Rebound family, we do indeed support fault commercially. Please feel free to reach out to us.

Do you upload my data anywhere?

The fault CLI doesn't send your data or code anywhere itself. If you use a cloud-based LLM such as OpenAI, Gemini or OpenRouter, parts of your code will be sent to that provider. If you want to remain fully private, we suggest using ollama, which fault also supports.

What programming languages do you support?

Currently fault supports Python, JavaScript, TypeScript, Go and Rust. More languages will likely be added.

Can I use fault without its AI features?

Absolutely. While fault provides AI-specific features, its core capabilities are not tied to AI at all. You can use fault to explore your system's reliability directly from the CLI. fault is made for all kinds of engineers!

Can I contribute to fault?

Yes! fault is open-source and loves contributions of any kind, from typo fixes to bug fixes and feature requests. Please join us.