Case study: the 25-repo benchmark
This is the most important page on this site. If cix worked beautifully on a couple of demo projects but fell apart on real codebases, the demo projects wouldn't matter. So we ran it on twenty-five real codebases and scored every run on a fixed rubric.
The result, summarized in one sentence: cix performs at a high, consistent level on standard backend stacks across multiple languages, performs well on large unconventional codebases, and falls down — honestly — on languages where the underlying parser doesn't exist yet.
This page lays out the rubric, the projects, the scores, and what we learned.
The rubric
Each project was evaluated on five dimensions, scored 0–3, for a 15-point maximum.
| Dimension | What it measures |
|---|---|
| Project understanding (general) | Does the system correctly identify language, framework, entry points, and major directories? |
| Architecture | Can it describe how the project is layered, what depends on what, and where the architectural boundaries are? |
| Schema, routes, handlers, external interfaces | Does it surface the database schema, HTTP routes, GraphQL handlers, and external interfaces accurately? |
| Change-impact | When asked to plan a non-trivial change, does it correctly identify the affected files, symbols, and tests? |
| Breakage analysis | When asked "what would break if X changed," does it produce an accurate, structured impact report? |
A project that scored 3/3 on a dimension required no fallback to file-reading or grep — cix-only queries produced the right answer. A project that scored 0 needed full fallback because cix couldn't help.
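For concreteness, here is a minimal sketch of the rubric arithmetic in Python. The dimension names mirror the table above; the dataclass and everything around it are illustrative, not part of the benchmark tooling.

```python
from dataclasses import dataclass

# The five rubric dimensions from the table above, as illustrative
# shorthand identifiers (not names from the actual benchmark tooling).
DIMENSIONS = (
    "project_understanding",
    "architecture",
    "schema_routes_interfaces",
    "change_impact",
    "breakage_analysis",
)

@dataclass
class ProjectScore:
    project: str
    scores: dict[str, int]  # dimension -> 0..3

    def total(self) -> int:
        # 5 dimensions x 3 points = the 15-point maximum.
        assert set(self.scores) == set(DIMENSIONS)
        assert all(0 <= s <= 3 for s in self.scores.values())
        return sum(self.scores.values())

# A clean sweep scores 3 on every dimension:
flask_crud = ProjectScore("example-flask-crud", {d: 3 for d in DIMENSIONS})
print(flask_crud.total())  # 15
```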
The projects
Twenty-five projects were chosen to span:
- Languages: Python, TypeScript, JavaScript, Go, Java, C#, PHP, Ruby, Lua.
- Sizes: From small CRUD demos (40 files) to multi-million-line monorepos (Apache Airflow, Kong, Loki, Saleor, Strapi, Cal.com).
- Frameworks: Flask, FastAPI, Django, Spring Boot, ASP.NET Core, Laravel, Rails-style, Express, Next.js, Vue, Nuxt, Strapi, Saleor, Directus.
- Architectures: Standard MVC monoliths, modular monoliths (Loki), microservice fragments, dynamic plugin hosts (Kong, rclone), monorepos (Airflow, Cal.com, Turbo).
The set was deliberately chosen to include projects where cix would do well and projects where it would struggle. A benchmark that only included favorable cases would not be useful.
The aggregate result
| Score | Projects | Description |
|---|---|---|
| 15/15 | 1 | example-flask-crud — clean sweep |
| 14/15 | 8 | flask-sample-app, fastapi-realworld, gestion, python-demoapp, monica, node-express-realworld, spring-boot-realworld, directus-v2 |
| 13/15 | 4 | aspnetcore-realworld, netbox, hackathon-starter, airflow |
| 12/15 | 1 | kong (Python re-eval — excluding the Lua surface) |
| 11/15 | 1 | turbo |
| 10/15 | 6 | loki, directus-v1, cal.com, strapi, vue-starter-kit, saleor |
| 9/15 | 1 | rclone |
| 4/15 | 1 | kong (Lua surface, dedicated parser run) |
| 2/15 | 1 | kong (earlier Lua run) |
| 1/15 | 1 | kong (earliest Lua run) |
Reading the table: thirteen of the twenty-five evaluations scored 13/15 or higher, and twenty-two scored 9/15 or higher. The three lowest scores all belong to a single project, Kong, the Lua-based API gateway, at successive stages of attempting to evaluate its Lua surface; the 12/15 row is a later Kong re-evaluation restricted to the Python surface.
Where cix shines
Standard backends in supported languages. Flask, FastAPI, Django, Spring Boot, ASP.NET Core, Express, and Laravel projects all scored 13–15. The system understood the stack, surfaced the routes, parsed the schema, and produced clean impact reports.
Schema-driven workflows. Across every project with parsed migrations, the schema view was the highest-leverage feature — a single query returned tables and columns that grep would have needed twenty file reads to assemble.
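To make the grep comparison concrete, here is a rough sketch of what assembling that schema by hand looks like: walk every migration file and pattern-match table creations. The regex targets Alembic-style `op.create_table` calls and the path is hypothetical; real migrations vary far more than one pattern admits, which is exactly why a pre-parsed schema view saves so many file reads.

```python
import re
from pathlib import Path

# Alembic-style table creation, e.g. op.create_table("users", ...).
# Illustrative only: Django, Rails, and raw-SQL migrations would each
# need their own patterns, multiplying the manual effort.
CREATE_TABLE = re.compile(r"op\.create_table\(\s*['\"](\w+)['\"]")

def tables_from_migrations(migrations_dir: str) -> set[str]:
    tables: set[str] = set()
    for path in Path(migrations_dir).rglob("*.py"):
        tables |= set(CREATE_TABLE.findall(path.read_text(errors="ignore")))
    return tables

# Usage (the directory path is hypothetical):
# print(sorted(tables_from_migrations("myapp/migrations")))
```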
Large monorepos with conventional structure. Apache Airflow scored 13/15 despite being a multi-million-line codebase. cix correctly identified the FastAPI entry points, surfaced 209 routes, exposed 74 tables, and described the architectural boundaries — all without falling back to file-reading. (See the Airflow deep-dive.)
Honest behavior at the boundary of capability. On every project where cix's coverage was partial, the system flagged it explicitly — "schema is partial," "language detection low confidence," "X% of files unsupported." This is what makes the high scores trustworthy: when cix says it knows, it knows.
Where cix struggles
Lua codebases. Kong is the canonical case. The project is 97% Lua. cix has no Lua parser today. Result: scores in the 1–4/15 range. The system was honest about why — orientation flagged "this codebase is mostly Lua, parsing is unsupported, expect to fall back" — but the practical value to the assistant on this project is low.
Dynamic route registration. Loki, rclone, and to a lesser extent Kong all have their primary HTTP surface registered programmatically at runtime rather than declaratively. The route view captures only the static portion. The system surfaces what it can find but cannot enumerate the dynamic surface — and clearly marks the partial coverage.
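The distinction is easy to see in miniature. In this hedged Flask sketch (Flask as a familiar stand-in; the plugin list is illustrative of how Kong, Loki, and rclone wire up routes at startup), the first route is statically enumerable and the second set is not:

```python
from flask import Flask

app = Flask(__name__)

# Statically visible: a decorator at module level. A static analyzer can
# enumerate this route without running the program.
@app.route("/health")
def health():
    return "ok"

# Dynamically registered: the resulting paths depend on runtime data, so a
# static route view sees the registration mechanism but not the routes.
def register_plugin_routes(app: Flask, plugins: list[str]) -> None:
    for name in plugins:
        app.add_url_rule(
            f"/plugins/{name}/status",
            endpoint=f"{name}_status",
            view_func=lambda name=name: f"{name}: ok",
        )

# In a real system this list would come from config or discovery at startup.
register_plugin_routes(app, ["rate-limit", "auth"])
```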
Same-name symbol disambiguation. When two unrelated types share a name across packages (rclone has both fs.Object and lib/http/serve.Object), the impact analysis sometimes resolves to one when the user meant the other. The output flags this as a confidence issue, but in time-pressed workflows it's a real friction point.
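A sketch of the failure mode, with the candidate paths mirroring the rclone collision above; the resolver logic and confidence labels are illustrative, not cix's actual algorithm:

```python
# Two unrelated symbols share a bare name; impact analysis must pick one.
CANDIDATES = {
    "Object": ["fs.Object", "lib/http/serve.Object"],
}

def resolve(name: str, context_path: str | None = None) -> tuple[str, str]:
    matches = CANDIDATES.get(name, [])
    if not matches:
        raise KeyError(name)
    if len(matches) == 1:
        return matches[0], "high"
    if context_path:
        # Prefer the candidate whose package prefix matches the file the
        # user is working in: a cheap heuristic, not a guarantee.
        for qualified in matches:
            if context_path.startswith(qualified.rsplit(".", 1)[0]):
                return qualified, "medium"
    # No context: fall back to the first match and flag low confidence,
    # which is the friction point described above.
    return matches[0], "low"

print(resolve("Object"))                    # ('fs.Object', 'low')
print(resolve("Object", "lib/http/serve"))  # ('lib/http/serve.Object', 'medium')
```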
Test fixtures bleeding into route lists. A few projects had test-only or example-only routes that appeared in the route view alongside production routes. cix has no native concept of "test fixture" beyond directory heuristics. This is solvable; today it requires the user to filter.
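A directory-heuristic filter of the kind described above is simple to sketch; the marker set and route shape here are illustrative, not cix's actual model:

```python
from pathlib import PurePosixPath

# A route counts as a test fixture if its defining file sits under a
# test-ish directory. Marker names are illustrative.
TEST_DIR_MARKERS = {"test", "tests", "fixtures", "examples", "__tests__"}

def is_test_fixture(source_file: str) -> bool:
    return any(part in TEST_DIR_MARKERS
               for part in PurePosixPath(source_file).parts)

routes = [
    ("/api/users", "app/handlers/users.py"),
    ("/api/ping", "tests/fixtures/fake_server.py"),
]

production = [(path, f) for path, f in routes if not is_test_fixture(f)]
print(production)  # [('/api/users', 'app/handlers/users.py')]
```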
These are real limitations, not benchmarking artifacts. They are documented explicitly on the Limitations page and tracked as open work.
What the benchmark proves
- High floor on standard stacks. If your project is a Python/Java/C#/JS/PHP/Go backend in a common framework, cix delivers a high level of grounded, useful behavior.
- Honest behavior on edge cases. The system is not lying about what it can see. The "I don't know" responses are valuable signal, not failure.
- Coverage gaps are gaps, not catastrophic failures. Even on the partial-coverage cases (Loki, rclone), cix still scored 9–10/15 because the parts it did cover were accurate.
What the benchmark does not prove
- It does not prove the score on your project. Projects vary. The benchmark is suggestive, not predictive. We recommend running cix on a real project of yours before making adoption decisions. (How to do that.)
- It does not capture every workflow. The rubric covers project understanding and refactoring. It does not measure raw IDE-style "go to definition" speed or every possible feature interaction.
- It does not measure long-tail languages. Projects in less-common languages (Erlang, Elixir, Crystal, Zig) were not in the test set and would likely score much as the Lua runs did today.
Methodology notes
- Each project was evaluated cold, with a fresh index built from the current state of the repo.
- The same evaluator and the same prompt were used across every run.
- Where fallback to file-reading was needed, it was explicitly noted in the per-project writeup.
- Per-project writeups document which features helped, which fell short, and what fallback was used.
- A standard scoring rubric (0–3 per dimension, 5 dimensions, 15 total) was applied uniformly.
The full per-project writeups (all 25, with raw findings, scores, and discussion) are maintained internally and can be made available on request.
Related
- Pre-release cleanup case study — a measured side-by-side on a single project.
- Apache Airflow deep-dive — focused look at one of the most demanding projects in the set.
- Limitations — the gaps surfaced by this benchmark, in our own words.
- Benchmarks — aggregate scoring view.