Case studies
Marketing copy is easy to produce. Measured behavior on real codebases is harder. This section is the harder kind.
Each case study below is a real run of cix against real code, with methodology, raw findings, and honest discussion of where cix excelled and where it fell short.
Pre-release cleanup: cix vs. a less-capable indexer
A Laravel 12 + Vue 3 SPA, mid-rename, with a half-propagated refactor in flight. The same cleanup prompt was run twice — once with a symbol-only indexer and once with cix — by the same evaluator, cold start each time.
Highlights:
- Eight findings in common.
- Three additional findings that only cix surfaced, including a forgotten Python script in a public-served directory with hardcoded root MySQL credentials (the pattern is sketched after this list).
- Roughly half the tool calls (29 vs. 55).
- Roughly a third of the tokens (30–40k vs. 80–100k).
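Why does a symbol-only indexer miss a finding like the credentials script? A standalone maintenance script is never imported by application code, so it has no incoming edge in a symbol or call graph; a file-level index that also walks web-served directories still reaches it. The sketch below is a hypothetical reconstruction of that kind of file, not the actual script — the path, names, and values are invented for illustration.

```python
# public/scripts/migrate_legacy.py  (hypothetical path, invented for illustration)
#
# Nothing in the application imports this file, so a symbol-only index has no
# edge leading to it. It still ships inside a web-served directory.
import pymysql  # assumes the PyMySQL client is installed

# Credentials hardcoded at module level -- the pattern behind the finding.
connection = pymysql.connect(
    host="127.0.0.1",
    user="root",
    password="example-root-password",  # invented placeholder value
    database="app",
)

with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM legacy_orders")
    print(cursor.fetchone())

connection.close()
```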
The 25-repo benchmark
A standardized evaluation run against 25 real-world open source projects spanning Python, TypeScript, Go, Java, C#, PHP, and Ruby. Each repo was scored on a 15-point rubric covering project understanding, change-impact analysis, and breakage analysis.
Highlights:
- 15/15 on a small Flask CRUD app — clean sweep across all rubric dimensions.
- 14/15 on standard backend stacks — Flask variants, FastAPI, Django, monolithic Rails-style apps.
- 13–14/15 on multi-thousand-file backends — Spring Boot, ASP.NET Core, large Laravel apps, Apache Airflow.
- 10–13/15 on large unconventional codebases — Loki, NetBox, Strapi, Saleor.
- 1–4/15 on Lua-heavy codebases — Kong scored low because cix has no Lua parser yet. The system was honest about why.
The benchmark also surfaced specific gaps worth tracking: dynamic-route registration patterns where the extracted route view is incomplete, same-name symbol disambiguation in some impact queries, and the missing Lua parser.
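The dynamic-route gap is easiest to see with a concrete pattern. The sketch below is illustrative, not taken from any benchmarked repo: when routes are registered in a loop from a data structure, the concrete paths never appear as decorator literals at the registration site, so an extractor keyed on route decorators reports only part of the surface.

```python
# Illustrative only: a registration pattern that static route extraction
# sees as one call site rather than three concrete routes.
from flask import Flask, jsonify

app = Flask(__name__)

def make_list_view(table_name):
    # Each generated view closes over its resource name.
    def list_view():
        return jsonify({"table": table_name, "rows": []})
    list_view.__name__ = f"list_{table_name}"  # keep endpoints unique
    return list_view

# The concrete paths exist only in data, not as @app.route literals.
RESOURCES = ["users", "orders", "invoices"]

for resource in RESOURCES:
    app.add_url_rule(
        f"/{resource}",
        view_func=make_list_view(resource),
        methods=["GET"],
    )
```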
Apache Airflow deep-dive
A focused walkthrough of cix on Apache Airflow: a large Python monorepo with a hundred-plus provider packages, multiple FastAPI applications, and 74 database tables.
Highlights:
- Detected the FastAPI/Alembic/SQLAlchemy stack correctly.
- Surfaced 209 HTTP routes and the schema across 74 tables in a single orientation query.
- Identified entry points, request surfaces, and architecture boundaries without falling back to grep.
- Honestly flagged where the schema parser was incomplete: it misses some Alembic migrations that mutate existing tables rather than create them (sketched below).
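To make that gap concrete, the sketch below shows the two migration shapes side by side. It is an invented example, not code from Airflow: a creation-style migration declares the table's full column set in one place, while a mutation-style migration leaves the resulting schema implicit, which is harder for a migration-reading schema parser to reconstruct.

```python
# Illustrative Alembic migration bodies (not from the Airflow codebase).
import sqlalchemy as sa
from alembic import op

def upgrade_create():
    # Creation-style migration: the table's full schema is explicit here,
    # so a parser can recover every column from this single operation.
    op.create_table(
        "task_note",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("body", sa.Text, nullable=False),
    )

def upgrade_mutate():
    # Mutation-style migration: the table's final shape is the sum of this
    # change and every earlier migration, so reading this file alone yields
    # only a partial picture -- the incompleteness the system flagged.
    op.add_column("task_note", sa.Column("author", sa.String(250)))
    op.alter_column("task_note", "body", nullable=True)
```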
This case is included to show how the system behaves on a project at the upper end of size and complexity — and where the partial-coverage signals show up.
What these have in common
A consistent pattern emerges across the cases:
- On standard stacks, cix dramatically reduces the work an assistant has to do: fewer file reads, fewer tokens, more structured analysis, and more issues caught.
- On non-standard stacks, the picture is mixed. The system is honest about what it can and can't see; the user is on solid ground when deciding how much to trust each query.
- On unsupported languages, the system fails openly. There is no scenario where cix returns confident-looking nonsense when the underlying parser doesn't exist.
That last property is what makes the others trustworthy. A tool that gracefully says "I don't know" is a tool you can build a workflow around.
Where to go next
- Benchmarks — the aggregate scoring view.
- Limitations — explicitly where the system underperforms.
- For evaluators — how to run your own measurement.