Case studies
Marketing copy is easy to produce. Measured behavior on real codebases is harder. This section is the harder kind.
Each case study below is a real run of cix against real code, with methodology, raw findings, and honest discussion of where cix excelled and where it fell short.
Pre-release cleanup: cix vs. a less-capable indexer
A Laravel 12 + Vue 3 SPA, mid-rename, with a half-propagated refactor in flight. The same cleanup prompt was run twice — once with a symbol-only indexer and once with cix — by the same evaluator, cold start each time.
Highlights:
- Eight findings in common.
- Three additional findings that only cix surfaced, including a forgotten Python script in a public-served directory with hardcoded root MySQL credentials (the pattern is sketched after this list).
- Roughly half the tool calls (29 vs. 55).
- Roughly a third of the tokens (30–40k vs. 80–100k).
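Why does a symbol-only indexer miss a finding like the credentials script? A standalone maintenance script is never imported by application code, so it has no incoming edge in a symbol or call graph; a file-level index that also walks web-served directories still reaches it. The sketch below is a hypothetical reconstruction of that kind of file, not the actual script — the path, names, and values are invented for illustration.

```python
# public/scripts/migrate_legacy.py  (hypothetical path, invented for illustration)
#
# Nothing in the application imports this file, so a symbol-only index has no
# edge leading to it. It still ships inside a web-served directory.
import pymysql  # assumes the PyMySQL client is installed

# Credentials hardcoded at module level -- the pattern behind the finding.
connection = pymysql.connect(
    host="127.0.0.1",
    user="root",
    password="example-root-password",  # invented placeholder value
    database="app",
)

with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM legacy_orders")
    print(cursor.fetchone())

connection.close()
```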
The 25-repo benchmark
A standardized evaluation run against 25 real-world open source projects spanning Python, TypeScript, Go, Java, C#, PHP, and Ruby. Each repo was scored on a 15-point rubric covering project understanding, change-impact analysis, and breakage analysis.
Highlights:
- 15/15 on a small Flask CRUD app — clean sweep across all rubric dimensions.
- 14/15 on standard backend stacks — Flask variants, FastAPI, Django, monolithic Rails-style apps.
- 13–14/15 on multi-thousand-file backends — Spring Boot, ASP.NET Core, large Laravel apps, Apache Airflow.
- 10–13/15 on large unconventional codebases — Loki, NetBox, Strapi, Saleor.
- 1–4/15 on Lua-heavy codebases — Kong scored low because cix has no Lua parser yet. The system was honest about why.
The benchmark also surfaced specific gaps worth tracking: dynamic-route registration patterns where the extracted route view is incomplete, same-name symbol disambiguation in some impact queries, and the missing Lua parser.
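The dynamic-route gap is easiest to see with a concrete pattern. The sketch below is illustrative, not taken from any benchmarked repo: when routes are registered in a loop from a data structure, the concrete paths never appear as decorator literals at the registration site, so an extractor keyed on route decorators reports only part of the surface.

```python
# Illustrative only: a registration pattern that static route extraction
# sees as one call site rather than three concrete routes.
from flask import Flask, jsonify

app = Flask(__name__)

def make_list_view(table_name):
    # Each generated view closes over its resource name.
    def list_view():
        return jsonify({"table": table_name, "rows": []})
    list_view.__name__ = f"list_{table_name}"  # keep endpoints unique
    return list_view

# The concrete paths exist only in data, not as @app.route literals.
RESOURCES = ["users", "orders", "invoices"]

for resource in RESOURCES:
    app.add_url_rule(
        f"/{resource}",
        view_func=make_list_view(resource),
        methods=["GET"],
    )
```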
Apache Airflow deep-dive
A focused walkthrough of cix on Apache Airflow: a large Python monorepo with a hundred-plus provider packages, multiple FastAPI applications, and 74 database tables.
Highlights:
- Detected the FastAPI/Alembic/SQLAlchemy stack correctly.
- Surfaced 209 HTTP routes and the schema across 74 tables in a single orientation query.
- Identified entry points, request surfaces, and architecture boundaries without falling back to grep.
- Honestly flagged where the schema parser was incomplete: it misses some Alembic migrations that mutate existing tables rather than create them (sketched below).
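To make that gap concrete, the sketch below shows the two migration shapes side by side. It is an invented example, not code from Airflow: a creation-style migration declares the table's full column set in one place, while a mutation-style migration leaves the resulting schema implicit, which is harder for a migration-reading schema parser to reconstruct.

```python
# Illustrative Alembic migration bodies (not from the Airflow codebase).
import sqlalchemy as sa
from alembic import op

def upgrade_create():
    # Creation-style migration: the table's full schema is explicit here,
    # so a parser can recover every column from this single operation.
    op.create_table(
        "task_note",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("body", sa.Text, nullable=False),
    )

def upgrade_mutate():
    # Mutation-style migration: the table's final shape is the sum of this
    # change and every earlier migration, so reading this file alone yields
    # only a partial picture -- the incompleteness the system flagged.
    op.add_column("task_note", sa.Column("author", sa.String(250)))
    op.alter_column("task_note", "body", nullable=True)
```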
This case is included to show how the system behaves on a project at the upper end of size and complexity — and where the partial-coverage signals show up.
What these have in common
A consistent pattern emerges across the cases:
- On standard stacks, cix dramatically reduces the work an assistant has to do: fewer file reads, fewer tokens, more structured analysis, and more issues caught.
- On non-standard stacks, the picture is mixed. The system is honest about what it can and can't see; the user is on solid ground when deciding how much to trust each query.
- On unsupported languages, the system fails openly. There is no scenario where cix returns confident-looking nonsense when the underlying parser doesn't exist.
That last property is what makes the others trustworthy. A tool that gracefully says "I don't know" is a tool you can build a workflow around.
Where to go next
- Benchmarks — the aggregate scoring view.
- Limitations — explicitly where the system underperforms.
- For evaluators — how to run your own measurement.