
Code Graph (MCP)

@ttsc/graph has its own benchmark, separate from the compiler timings above.

It asks the two questions codegraph publishes: how completely does the graph resolve a real codebase, and does an agent spend less when it has the graph. The agent-cost runner also benchmarks codegraph itself as a comparator. Both run on real repositories through experimental/benchmark/graph.mjs and the lower-level harnesses in experimental/graph-bench, and the raw per-run samples are published to website/public/benchmark/graph.json.

The agent-cost benchmark now splits into two prompt families. Shared onboarding prompts start by reading exported symbols and folder structure, then narrow into graph queries. Project-specific prompts live in files like experimental/graph-bench/questions/typeorm.md. The runner stores promptFamily in graph.json, and the dashboard reports those families separately.

The agent-cost question is measured across two AI coding agents, each acting as a harness. An AI agent here is an autonomous CLI that reads the repository and answers a question by calling tools (shell, grep, read, or the @ttsc/graph MCP), measured once with the MCP server configured and once without it. The two harnesses are Claude Code (Anthropic’s CLI) running Sonnet and Opus, and Codex (OpenAI’s CLI) running GPT-5.5 and GPT-5.4 mini.

The savings below are one axis. They are not the whole value.

The graph fuses static structure and live diagnostics into one checker-resolved ontology, so an agent can also see a change’s reach, overlaid on what is currently broken, before it edits. See Code Graph.


Agent cost: does the graph save an agent’s tokens

This is a faithful port of codegraph’s headline benchmark. For codegraph’s own question, it runs the agent CLI headless twice against the same repository: once with a graph MCP server configured, once with an empty MCP config, both under --strict-mcp-config so the only difference is the graph tool. It records codegraph’s metrics: total tokens summed across every assistant turn, tool-call count, cost, and wall time. Within each repository, tool, and model, every metric is the median of N runs, which is how codegraph aggregates.
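The median-per-cell reduction described above can be sketched as follows. This is a minimal illustration, not the runner's code: the `RunSample` shape and its field names are assumptions and do not reflect the actual schema of graph.json.

```typescript
// Hypothetical shape of one run sample; field names are assumptions.
interface RunSample {
  repository: string;
  tool: string;
  model: string;
  tokens: number;
  toolCalls: number;
  costUsd: number;
  wallMs: number;
}

type Metric = "tokens" | "toolCalls" | "costUsd" | "wallMs";
const METRICS: Metric[] = ["tokens", "toolCalls", "costUsd", "wallMs"];

// Median of a non-empty list: the middle value, or the mean of the two middle values.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group samples by (repository, tool, model) and take each metric's median independently.
function aggregate(samples: RunSample[]): Map<string, Record<Metric, number>> {
  const groups = new Map<string, RunSample[]>();
  for (const s of samples) {
    const key = `${s.repository}|${s.tool}|${s.model}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(s);
    groups.set(key, bucket);
  }
  const cells = new Map<string, Record<Metric, number>>();
  groups.forEach((runs, key) => {
    const cell = {} as Record<Metric, number>;
    for (const m of METRICS) cell[m] = median(runs.map((r) => r[m]));
    cells.set(key, cell);
  });
  return cells;
}
```

With an even N such as 6, the median is the mean of the two middle runs, which is how a fractional value like 2.5 tool calls can arise.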

The tool axis is @ttsc/graph versus codegraph. @ttsc/graph builds from the live TypeScript Program; codegraph is served from a .codegraph/ index produced by codegraph init. The runner records that index time as toolSetupMs beside the codegraph cells.

The Excalidraw table below is Claude Sonnet 4.6 (claude-sonnet-4-6) at reasoning effort high. Within each model both arms hold the model and effort fixed, so the only variable is the graph. The cross-model section repeats the same A/B on Opus 4.8, because the figures move with the model.

A checker-resolved graph is TypeScript-only, so of codegraph’s seven benchmark repositories only the two TypeScript ones are runnable: Excalidraw and VS Code. The other five are Python, Rust, Java, Go, and Swift.

Project-specific prompt file results

These cells use repository-specific question files like experimental/graph-bench/questions/typeorm.md.

Excalidraw

Codegraph’s question, “How does Excalidraw render and update canvas elements?”, median of 6 runs on Claude Sonnet 4.6:

| Metric | Empty-MCP baseline | @ttsc/graph | Saved |
| --- | --- | --- | --- |
| Tokens | 1,875,668 | 264,438 | 86% |
| Tool calls | 50 | 2.5 | 95% |
| Cost | $0.335 | $0.116 | 65% |
| Wall time | 245s | 41s | 83% |

The agent read zero files in every graph run. It named the relevant symbols in one or two query_nodes calls and answered from the result, against a baseline that read 19 to 31 files per run to reconstruct the same flow.

The graph hands the agent the relevant declarations, their checker-resolved relationships, and their source in a couple of calls, so it stops fanning out across files.
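The Saved column in the Excalidraw table above is consistent with a plain relative reduction, rounded to a whole percent. The sketch below assumes that formula (the page does not state the rounding rule) and recomputes the column from the two published medians.

```typescript
// Percentage saved by the graph arm relative to the baseline arm, rounded to a whole percent.
function savedPercent(baseline: number, withGraph: number): number {
  if (baseline <= 0) throw new Error("baseline must be positive");
  return Math.round((1 - withGraph / baseline) * 100);
}

// Medians copied from the Excalidraw table above.
const rows = [
  { metric: "Tokens", baseline: 1_875_668, withGraph: 264_438 },
  { metric: "Tool calls", baseline: 50, withGraph: 2.5 },
  { metric: "Cost (USD)", baseline: 0.335, withGraph: 0.116 },
  { metric: "Wall time (s)", baseline: 245, withGraph: 41 },
];

for (const row of rows) {
  console.log(`${row.metric}: ${savedPercent(row.baseline, row.withGraph)}% saved`);
}
```

Under that assumption the four rows come out to 86, 95, 65, and 83 percent, matching the table.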

The win holds across models

The same Excalidraw A/B on Opus 4.8, alongside the Sonnet 4.6 run above, each the median of 6 runs in graph.json.

| Model (Excalidraw) | Tokens saved | Tool calls saved | Graph behavior |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 86% | 95% | answers in ~2 calls, reads 0 files |
| Claude Opus 4.8 | 77% | 94% | answers in ~3 calls, reads 0 files |

Both models name the symbols they need in one broad query and answer from it, reading zero files. Opus, the more thorough reader, used to drill the flow one symbol at a time in nine calls; once a broad query returns the whole cluster, it batches like Sonnet and lands in three or four.

Codegraph’s own cross-repository headline, over its seven repositories, is 47% fewer tokens and 58% fewer tool calls. On this Excalidraw question a checker-resolved graph saves more, because the resolved relationships are exact enough that the agent stops re-reading the source.

VS Code, the large-repository path

VS Code is codegraph’s other TypeScript repository, around 7,000 src files. A full TypeScript-Go Program over it builds in about three minutes, too slow to repeat for every agent session.

So @ttsc/graph serves it through a daemon. ttscgraph --daemon type-checks the project once and serves every session behind a localhost port; each session spawns a thin ttscgraph --connect proxy. The one-time build is the price of a real type checker over tree-sitter, paid once and amortized across every session.

The VS Code A/B is not in the published graph.json yet. Run it with the daemon (see Reproduce) to add it.

What made the difference

Running this benchmark surfaced a real bug. The server originally type-checked the whole project before answering the MCP handshake, so on any sizable repository it sat in the pending state with no tools advertised, and the agent fell back to grep and read. The fix is lazy init: the server answers initialize and tools/list immediately and builds the resident Program in the background, so the agent always sees the tools and only the first query blocks on the build.

Two smaller fixes followed. The first ranks declarations against a natural-language query by name match and edge centrality, so a flow question surfaces the right symbols. The second bounds the response (edge cap, per-node source cap, total budget), so a broad match returns a few kilobytes instead of tens.
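The ranking and bounding fixes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `GraphNode` shape, the use of node degree as a stand-in for edge centrality, the scoring weights, and the default caps. The edge cap is omitted for brevity, and the server's actual scoring is not shown on this page.

```typescript
// Hypothetical node shape; the real graph schema is not shown on this page.
interface GraphNode {
  name: string;
  source: string;
  degree: number; // edges touching the node, used here as a stand-in for edge centrality
}

// Name hits dominate the score; degree adds a log-scaled bonus. The weights are arbitrary.
function score(node: GraphNode, query: string): number {
  const words = query.toLowerCase().split(/\W+/).filter((w) => w.length > 2);
  const name = node.name.toLowerCase();
  const nameHits = words.filter((w) => name.includes(w)).length;
  return nameHits * 10 + Math.log2(1 + node.degree);
}

// Rank, truncate each node's source to a cap, and stop once the total budget would be exceeded.
function boundedAnswer(
  nodes: GraphNode[],
  query: string,
  perNodeCap = 400,
  totalBudget = 4000,
): GraphNode[] {
  const ranked = [...nodes].sort((a, b) => score(b, query) - score(a, query));
  const out: GraphNode[] = [];
  let used = 0;
  for (const node of ranked) {
    const source = node.source.slice(0, perNodeCap);
    if (used + source.length > totalBudget) break;
    out.push({ ...node, source });
    used += source.length;
  }
  return out;
}
```

The budget check runs after truncation, so a single very long declaration cannot consume the whole response on its own.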

Codegraph’s own harness excludes cold-start-raced runs by default, which confirms the race is a known MCP failure mode and not a quirk of this port. Lazy init removes it at the source, so no run here is excluded.
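The lazy-init pattern can be sketched as a server that answers the handshake immediately and lets the first query await a build that is already in flight. This is a minimal sketch under stated assumptions: `LazyGraphServer`, `buildProgram`, and the `Graph` type are hypothetical stand-ins rather than the real server's API, and the real server may advertise more tools than the single `query_nodes` listed here.

```typescript
// Hypothetical stand-in for the resident graph; not the real data structure.
type Graph = { nodeCount: number };

class LazyGraphServer {
  private graph: Promise<Graph> | undefined;

  // buildProgram stands in for the expensive type-check and graph build.
  constructor(private readonly buildProgram: () => Promise<Graph>) {}

  // Handshake: reply at once and kick off the build without awaiting it.
  initialize(): { ready: true } {
    // A build failure is surfaced later by queryNodes, so it is ignored here.
    this.startBuild().catch(() => undefined);
    return { ready: true };
  }

  // Tool listing never depends on the build, so the agent always sees the tools.
  listTools(): string[] {
    return ["query_nodes"];
  }

  // The first query awaits the in-flight build; later queries reuse the same promise.
  async queryNodes(): Promise<number> {
    const graph = await this.startBuild();
    return graph.nodeCount;
  }

  // Memoize the build promise so the Program is built at most once.
  private startBuild(): Promise<Graph> {
    this.graph ??= this.buildProgram();
    return this.graph;
  }
}
```

Because the build promise is memoized, a query that arrives before the build finishes waits on the same build instead of starting a second one.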

Structural: how completely does the graph resolve

The structural benchmark loads a project’s resident Program, builds the graph, and reports the node and edge counts plus the fair coverage, the share of symbol-bearing source files that have at least one resolved cross-file edge. Counts and coverage are deterministic; timings are indicative and only meaningful on a quiet host.
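Fair coverage, as defined above, can be computed with a short pass over the edges. The sketch below assumes hypothetical `SymbolNode` and `ResolvedEdge` shapes and treats every node as belonging to a project source file; how the real harness handles external boundary leaves is not stated on this page, so they are left out.

```typescript
// Hypothetical shapes; the real graph schema is not shown on this page.
interface SymbolNode {
  id: string;
  file: string;
}
interface ResolvedEdge {
  from: string;
  to: string;
}

function fairCoverage(
  nodes: SymbolNode[],
  edges: ResolvedEdge[],
): { linked: number; total: number; percent: number } {
  const fileOf = new Map<string, string>();
  for (const n of nodes) fileOf.set(n.id, n.file);

  // A file is symbol-bearing if it declares at least one node.
  const symbolBearing = new Set(nodes.map((n) => n.file));

  const linked = new Set<string>();
  for (const e of edges) {
    const a = fileOf.get(e.from);
    const b = fileOf.get(e.to);
    // Only edges whose endpoints resolve to two different files count as cross-file.
    if (a !== undefined && b !== undefined && a !== b) {
      linked.add(a);
      linked.add(b);
    }
  }

  const total = symbolBearing.size;
  const percent = total === 0 ? 0 : (linked.size / total) * 100;
  return { linked: linked.size, total, percent };
}
```

Files with no declarations are excluded from the denominator, which is what keeps the ratio fair: a file cannot count against coverage if it has nothing to link.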

Run against this repository’s packages/ttsc (53 source files):

| Measure | Value |
| --- | --- |
| Nodes | 605 (71 external boundary leaves) |
| Edges | 1583 (heritage 2, value-call 1153, type-ref 428) |
| Fair coverage | 100.0% (50 of 50 symbol-bearing files cross-linked) |
| Load time | 69 ms (median) |
| Graph build | 42 ms (median) |

The 100% is the coverage flex: every symbol-bearing file has at least one checker-resolved cross-file edge. Modeling methods and their calls closed the gaps a top-level-only graph left (coverage was 92% before), because a method-to-method call is exactly the relationship an agent would otherwise open a file to find. On a small tree, graph build time is a real fraction of the already fast load; on larger trees its share shrinks as type-checking dominates. Extraction rides the Program the compiler already built, with no second compile.

Reproduce

Two harnesses back this page, and they cost very differently. The structural harness is deterministic and cheap: it loads a project’s resident Program, builds the graph, and counts. The agent-cost harness drives real agent CLIs against real repositories and spends model credits, so it is deliberately kept out of CI and run by hand.

Both start from a clone of this repository:

git clone https://github.com/samchon/ttsc.git
cd ttsc
corepack enable
pnpm install

Structural harness

experimental/graph-bench loads the resident Program for a project, builds the graph, and reports the node and edge counts plus the fair coverage. The counts and coverage are deterministic; the timings are indicative and only meaningful on a quiet host.

node experimental/graph-bench/bench.mjs --project=/abs/path/to/project --runs=5

--runs repeats the load-and-build and reports the median time; the counts do not move between runs. Point --project at any TypeScript project with a tsconfig.json, including this repository’s own packages/ttsc.

Agent-cost harness

experimental/benchmark/graph.mjs measures the A/B. For each cell it checks out the fixture’s prepared branch, then runs the agent CLI headless twice against the same repository under --strict-mcp-config: once with the graph MCP server configured, once with an empty MCP config, so the graph tool is the only difference between the two arms. It records the metrics codegraph publishes (total tokens across every assistant turn, tool-call count, cost, and wall time) and reduces the N runs of each cell to their median.
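To make the two arms concrete, the sketch below builds an MCP config and a Claude Code argument list for each one. It is an illustration under assumptions, not the runner's code: the server name, the `ttscgraph --connect` command line (borrowed from the daemon path described earlier, with any port argument left out because the page does not state one), the config file names, and the choice of `--output-format json` are all placeholders.

```typescript
// MCP client config for each arm. The server entry is a placeholder, not the runner's value.
const emptyConfig = { mcpServers: {} };
const graphConfig = {
  mcpServers: {
    "ttsc-graph": { command: "ttscgraph", args: ["--connect"] }, // assumption: daemon proxy form
  },
};

// Build the headless Claude Code argv for one arm. Only the config path differs between arms.
function armArgs(prompt: string, model: string, configPath: string): string[] {
  return [
    "-p", prompt,
    "--model", model,
    "--mcp-config", configPath,
    "--strict-mcp-config", // ignore any other MCP configuration on the machine
    "--output-format", "json",
  ];
}

const question = "How does Excalidraw render and update canvas elements?";
const baselineArm = armArgs(question, "sonnet", "empty.mcp.json"); // hypothetical file names
const graphArm = armArgs(question, "sonnet", "graph.mcp.json");
console.log(JSON.stringify({ emptyConfig, graphConfig, baselineArm, graphArm }, null, 2));
```

Holding the prompt, model, and flags constant while swapping only the config file is what makes the comparison an A/B on the graph tool alone.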

It needs go on PATH to build the native binary and the agent CLI for each model arm you select (claude for the Claude arms). The codegraph comparator arm additionally builds a .codegraph/ index with codegraph init, recorded as toolSetupMs. Because every cell spends real model credits, this runner is not wired into CI.

# One fixture, the Claude arms, both tools, then update graph.json
node experimental/benchmark/graph.mjs --project=typeorm --models=sonnet,opus --tools=ttsc-graph,codegraph
# Full sweep: projects run sequentially, since the agent cells are expensive
node experimental/benchmark/graph.mjs --all --models=sonnet,opus --tools=ttsc-graph,codegraph --prompt-family=all

The runner writes the raw per-run samples to experimental/benchmark/.work/graph/<timestamp>/report.json and upserts only the cells it measured into website/public/benchmark/graph.json, leaving every other cell untouched. Pass --no-website for a local-only run that does not touch the published JSON.
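The upsert behavior described above can be sketched as a keyed merge. The `Cell` shape and the choice of key fields (project, tool, model, prompt family) are assumptions made for illustration; the published graph.json schema may differ.

```typescript
// Hypothetical cell shape and key; the published graph.json schema may differ.
interface Cell {
  project: string;
  tool: string;
  model: string;
  promptFamily: string;
  medianTokens: number;
}

const cellKey = (c: Cell): string => [c.project, c.tool, c.model, c.promptFamily].join("|");

// Replace cells that were re-measured, append new ones, and keep every other cell untouched.
function upsertCells(published: Cell[], measured: Cell[]): Cell[] {
  const fresh = new Map<string, Cell>();
  for (const c of measured) fresh.set(cellKey(c), c);

  const merged = published.map((c) => {
    const key = cellKey(c);
    const replacement = fresh.get(key);
    if (replacement !== undefined) fresh.delete(key); // consumed, so it is not appended again
    return replacement ?? c;
  });

  // Whatever is left in `fresh` had no published counterpart and is appended.
  return merged.concat(Array.from(fresh.values()));
}
```

Returning a new array instead of mutating the input keeps a partial run from damaging the published data if the write fails afterwards.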
