01
The Experiment
The programming language you write in determines how much of your code an LLM can actually see. Not metaphorically. Literally. A Python function and its C# equivalent encode the same logic, but the C# version consumes up to 90% more tokens — which means it takes up 90% more of the model's context window, costs 90% more to process, and leaves less room for everything else the model needs to reason about.
I set out to quantify this. Using over 600 tasks from the Rosetta Code dataset — the same logic implemented across multiple languages, yielding thousands of individual implementations — I tokenized every solution with OpenAI's o200k_harmony tokenizer (an extension of the o200k encoding used by recent OpenAI models, including o3, o4-mini, and the GPT-5 series). Using the shortest correct implementation per task — a floor measurement whose limitations I discuss in Section 06 — the question was simple: for the same logic, how many tokens does each language need?
The gap between the most and least efficient languages was not 10-15% — it was nearly 2x. And the pattern held across hundreds of tasks.
02
The Rankings
I computed a Relative Conciseness Index for each language: for each task, I divided the language's token count by the median token count across all languages for that task, then took the median of those per-task ratios. A ratio below 1.0 means the language is more concise than average. Above 1.0, more verbose.
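The index described above can be sketched in a few lines. The token counts below are illustrative placeholders, not the study's data — the point is the normalization, not the numbers:

```python
from statistics import median

def conciseness_index(counts_by_task):
    """Relative Conciseness Index: for each task, divide each language's
    token count by the median count across all languages for that task,
    then take the median of those per-task ratios per language."""
    ratios = {}  # language -> list of per-task ratios
    for counts in counts_by_task.values():
        task_median = median(counts.values())
        for lang, n in counts.items():
            ratios.setdefault(lang, []).append(n / task_median)
    return {lang: median(r) for lang, r in ratios.items()}

# Illustrative token counts (not the study's measurements)
tasks = {
    "fizzbuzz": {"Python": 50, "F#": 70, "C#": 120},
    "lcm":      {"Python": 38, "F#": 38, "C#": 233},
}
print(conciseness_index(tasks))
```

A ratio below 1.0 means the language needed fewer tokens than the per-task median; above 1.0, more.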
Token density ranking — lower is better
| Rank | Language | Type System | Median Ratio | Tier |
|---|---|---|---|---|
| 1 | Python | Dynamic | 0.657 | High Density |
| 2 | JavaScript | Dynamic | 0.691 | High Density |
| 3 | F# | Static | 0.839 | High Density |
| 4 | TypeScript | Static | 1.080 | Medium |
| 5 | Rust | Static | 1.229 | Low Density |
| 6 | C# | Static | 1.249 | Low Density |
Read the table this way: if you normalize C# (the least dense language here) to a baseline of 1.0x, Python gives you 1.90x the effective context window — nearly double the logic in the same token budget. JavaScript gets 1.81x. F# gets 1.49x. Even TypeScript, which adds type annotations on top of JavaScript, gets 1.16x.
Effective context window multiplier — baseline: C#
Python: 1.90x
JavaScript: 1.81x
F#: 1.49x
TypeScript: 1.16x
Rust: 1.02x
C#: 1.00x
For a 128k context window, this is not academic. A C# codebase hits the context ceiling with roughly half the business logic loaded compared to a Python one. The model has less to work with. Not because the code is worse, but because the language spends more tokens saying the same thing.
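The multipliers follow directly from the median ratios in the ranking table: normalize every language against the least dense one. A minimal sketch, using the published ratios:

```python
# Median Relative Conciseness Index per language, from the ranking table
ratios = {
    "Python": 0.657, "JavaScript": 0.691, "F#": 0.839,
    "TypeScript": 1.080, "Rust": 1.229, "C#": 1.249,
}

# Effective context multiplier relative to the least dense language:
# a language at half the ratio fits twice the logic per token budget.
baseline = max(ratios.values())  # C#, 1.249
multipliers = {lang: round(baseline / r, 2) for lang, r in ratios.items()}
print(multipliers)  # Python -> 1.9, C# -> 1.0
```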
What this means practically
Choosing a low-density language means your context window fills faster with syntax and slower with logic. Every curly brace, explicit type declaration, and boilerplate class definition consumes tokens that could hold actual reasoning material. The model must process your syntax to reach your intentions — and verbose syntax means more processing before it gets there.
03
Case Study: F# vs C# on the Same Runtime
The most controlled comparison in the dataset is F# versus C#. Same runtime (.NET), same standard library, same ecosystem — the primary variable is the language paradigm. Functional-first (F#) versus traditionally object-oriented (C#), though recent C# versions have adopted many functional features like pattern matching and records. Across 607 shared task implementations, the pairwise comparison is clear:
C# median: 146 tokens per task (p10–p90: 38–419)
F# median: 96 tokens per task (p10–p90: 31–276)
F# requires roughly 34% fewer tokens to express the same logic (96 vs 146). Inverted: switching from C# to F# lets you fit roughly 52% more code into the same context window.
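Both figures are the same comparison read in different directions, which is easy to verify from the two medians:

```python
# Median tokens per task across the 607 shared Rosetta Code tasks
csharp_median, fsharp_median = 146, 96

token_saving = 1 - fsharp_median / csharp_median    # F# needs ~34% fewer tokens
extra_capacity = csharp_median / fsharp_median - 1  # ~52% more code fits

print(f"{token_saving:.0%} fewer tokens; {extra_capacity:.0%} more code per window")
```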
But the density data alone does not capture the full picture — the following observation goes beyond what the token counts measure, but it is relevant to practical outcomes. Unlike Python — which tops the density rankings but offers no compile-time guarantees — F# pairs its conciseness with one of the strongest type systems in mainstream use. Every function signature carries type information that the compiler verifies before the code runs. For AI-assisted development, this creates a feedback loop that dynamic languages cannot match: the LLM generates code, the compiler immediately tells it what is wrong, and the model fixes it — without ever executing the program or waiting for a runtime exception buried three calls deep in a stack trace.
The gains are not uniform. F# excels in algorithmic and mathematical domains where pattern matching, type inference, and immutable data structures replace verbose imperative constructs:
Largest efficiency gaps — F# vs C#
| Task | F# Tokens | C# Tokens | Gain |
|---|---|---|---|
| Catalan Numbers | 54 | 649 | 12x |
| Least Common Multiple | 38 | ~233 | ~5x |
| Digital Root | 90 | ~400 | ~4.5x |
The token counts above come from the full Rosetta Code implementations, which are often longer than the abbreviated examples below — particularly the C# version, which includes additional formatting, error handling, and documentation in the original. The 12x gap on Catalan Numbers is an extreme outlier driven by this disparity, but it illustrates the structural pattern: C# requires a class declaration, a namespace, explicit type annotations, and braces around every block. The F# version is a compact recursive definition with pattern matching. Same output. Same correctness. Dramatically fewer tokens for the LLM to parse.
let rec catalan = function
    | 0 | 1 -> 1
    | n ->
        List.init n (fun i -> catalan i * catalan (n - 1 - i))
        |> List.sum

[0..14] |> List.map catalan |> printfn "%A"

Every line of boilerplate in the C# version — the using, the namespace, the class wrapper, the explicit return types, the braces — is a token the LLM must process that carries zero algorithmic information. The F# version encodes only the logic. The language itself provides the structure.
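For scale, the same recursive definition translates almost line-for-line into Python, the density leader. This is an illustrative translation, not an implementation drawn from the dataset:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # memoize: the naive recursion is exponential
def catalan(n):
    # C(0) = C(1) = 1; C(n) = sum of C(i) * C(n-1-i) for i in 0..n-1
    if n <= 1:
        return 1
    return sum(catalan(i) * catalan(n - 1 - i) for i in range(n))

print([catalan(n) for n in range(15)])
```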
04
Where the Tokens Go
The token density gap comes from three categories of overhead, each contributing differently depending on the language:
1. Structural Boilerplate
Namespaces, class declarations, access modifiers, explicit
Main() entry points, using/import statements.
→ C# and Rust pay heavily here
→ Python and F# pay almost nothing
2. Type Ceremony
Explicit type annotations, generic type parameters,
return type declarations, interface implementations.
→ Rust and TypeScript pay here (though TypeScript less so)
→ Python (without type hints) and JavaScript pay nothing
→ F# minimizes this via type inference
3. Syntactic Verbosity
Curly braces, semicolons, parentheses around conditions,
explicit "return" keywords, "var"/"let"/"const" keywords.
→ C# and Rust: braces + semicolons on every line
→ Python: significant whitespace eliminates all of this
→ F#: minimal — uses indentation + pipe operators

This decomposition explains the rankings. Python wins because it pays almost nothing in any category — no structural boilerplate, no type ceremony, and minimal syntax. F# beats C# despite both being statically typed because type inference and pattern matching eliminate categories 2 and 3 nearly completely.
TypeScript is the interesting middle case. It adds type annotations on top of JavaScript (category 2 cost), but avoids the structural boilerplate of C# (no classes required, no namespaces). The result is a 1.08x ratio — almost exactly at the median. The types cost you tokens, but the flexible syntax saves them back.
The Rust surprise
Rust lands in the low-density tier despite its reputation for expressiveness. The culprit is not the borrow checker — it is the syntactic overhead. Explicit lifetime annotations, turbofish syntax, match arms with braces, and verbose error handling patterns all add tokens. Rust's macro system helps on some tasks, but across 600+ benchmarks, the overhead wins.
05
What This Actually Costs You
Token density is not an abstract metric. It has direct consequences for three things that matter when you are building with LLMs:
Context Capacity
A denser language means more relevant code fits in the same context budget. When your AI agent retrieves code for analysis, that translates directly to broader coverage of the codebase in a single prompt — reducing the "lost in the middle" problem where models ignore information buried deep in long prompts.
Inference Cost
Token-based pricing means verbose languages cost more to process. The density gap between the top and bottom of the rankings translates directly to higher input token costs per request — a markup that compounds across every prompt your team sends. At tens of thousands of prompts per month, this becomes a visible line item in your infrastructure budget.
RAG Quality
Codebase retrieval systems chunk code into fixed-token windows. A 500-token chunk of Python contains a complete function or module. A 500-token chunk of C# might capture imports, a class header, and half a method — forcing the model to retrieve more chunks to understand the same logic.
To make this concrete: take a team running an AI coding assistant against a large codebase. At C#-level density, the retrieval system needs to pull significantly more chunks to give the model equivalent coverage compared to Python. That means more tokens per prompt, higher latency, and a higher chance that the model misses a relevant function because it got bumped out of the context window by boilerplate from another file.
Available context: 128,000 tokens
System prompt + history: 15,000 tokens
Remaining for code: 113,000 tokens
Python codebase (~65 tokens / function avg):
→ ~1,738 functions fit in context
→ Full module-level understanding
C# codebase (~120 tokens / function avg):
→ ~941 functions fit in context
→ 46% less code visible to the model
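The budget arithmetic above is easy to reproduce. The per-function averages (65 and 120 tokens) are illustrative assumptions for this worked example, not measured values:

```python
context_window = 128_000
overhead = 15_000                    # system prompt + conversation history
budget = context_window - overhead   # 113,000 tokens left for code

# Assumed average tokens per function (illustrative, not measured)
avg_tokens = {"Python": 65, "C#": 120}

fits = {lang: budget // t for lang, t in avg_tokens.items()}
print(fits)  # {'Python': 1738, 'C#': 941}

shortfall = 1 - fits["C#"] / fits["Python"]
print(f"C# codebase: {shortfall:.0%} less code visible to the model")
```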
The model is not less capable with C#. It simply sees less.
06
Limitations and Honest Caveats
Rosetta Code is not production code. These are small, self-contained algorithmic tasks. Real-world C# projects include dependency injection, Entity Framework queries, and ASP.NET middleware patterns. Real-world Python includes Django model definitions, dataclass hierarchies, and — increasingly — full type annotations. The ratios for production codebases will differ. The question is by how much, and in which direction.
Shortest valid implementation ≠ idiomatic code. The dataset uses the shortest correct implementation per task. Nobody writes production C# as terse as a code golf submission. Nobody writes production Python without docstrings and error handling either. This measures the floor of each language's verbosity, not the typical level. I expect the relative rankings to hold (a verbose language stays verbose), but the absolute ratios are probably compressed — real-world gaps may be larger or smaller depending on team conventions.
One tokenizer does not rule them all. I used o200k_harmony (OpenAI). Claude uses a different tokenizer. Gemini uses another. The relative rankings should be stable across tokenizers — all tokenizers must encode the same structural characters — but the exact ratios will shift by a few percentage points. Running this analysis across multiple tokenizers would strengthen the findings significantly.
Types might help the model reason — and the compiler feedback loop matters. This is the strongest counterargument to raw density rankings. Explicit type annotations consume tokens, yes — but they also give the LLM structured information about the code's semantics. A function signature with full types may cost more tokens but produce better completions because the model understands the contract. And as discussed in Section 03, a strong compiler creates a tight correctness feedback loop that dynamic languages cannot match. I did not measure output quality or iteration speed — only input efficiency. A complete analysis would need to weigh token cost against reasoning benefit and correction speed.
The honest takeaway
This data does not say "rewrite your C# codebase in Python." It says: language choice has a measurable impact on how effectively LLMs can work with your code, and that impact compounds across every prompt, every retrieval, and every dollar spent on inference. For new projects and greenfield modules, it is a factor worth considering alongside all the other factors that drive language choice.
07
What to Do With This Information
I am not suggesting anyone rewrite their codebase. But this data points to several concrete actions that are worth considering:
For new projects: If your codebase will be heavily consumed by LLMs — whether through AI coding assistants, automated code review, or agent-driven development — token density deserves a seat at the language selection table alongside performance, ecosystem, and team expertise. It is a new axis of evaluation that did not exist two years ago.
For existing codebases: Invest in your retrieval layer. If you are working in a verbose language, your RAG pipeline needs to be smarter about chunking — using AST-aware splitting rather than fixed-token windows, stripping import blocks and boilerplate before embedding, and prioritizing function bodies over structural scaffolding. The language tax can be mitigated with better tooling.
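For Python sources, the standard library's ast module is enough for a first pass at this kind of preprocessing. A minimal sketch (the helper name is mine, and ast.unparse requires Python 3.9+) that drops import blocks before embedding:

```python
import ast

def strip_imports(source: str) -> str:
    """Remove import statements before chunking/embedding.
    As a side effect, ast.unparse normalizes formatting and
    drops comments. Requires Python 3.9+."""
    tree = ast.parse(source)
    tree.body = [
        node for node in tree.body
        if not isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    return ast.unparse(tree)

src = """
import os
from typing import List

def head(xs: List[int]) -> int:  # comment disappears too
    return xs[0]
"""
print(strip_imports(src))
```

Note that the output is meant for embedding, not execution: annotations like List[int] survive even though their import was stripped.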
For teams choosing a language for AI-native development: Python tops the density rankings, but density is not the only axis. The highest-density language with no compiler gives your AI agent more room to work with — and no guardrails when it generates incorrect code. F# offers a different tradeoff: near-Python density with full compiler verification. The type system catches errors at compile time, giving autonomous agents a fast, deterministic feedback loop that Python cannot provide. For codebases where AI agents will generate, refactor, and iterate on code autonomously, this combination of density and correctness guarantees may be the stronger choice.
For the .NET ecosystem specifically: F# is worth evaluating for new modules. It shares the runtime and interop story, so adoption is incremental. The data suggests a roughly 50% density improvement over C# with no loss in type safety — a meaningful gain if your workflow relies heavily on LLM-driven development.
For tooling builders: If you are building AI development tools, context compression should be part of your pipeline. Strip comments, collapse import blocks, normalize formatting before tokenizing. The density gap between languages shrinks when you preprocess intelligently — and your tool becomes more effective across all languages.
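One such compression pass, sketched for Python sources using only the standard library's tokenize module (the function name is mine): strip comment tokens while preserving line and column structure, so indentation-sensitive code still parses afterwards.

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Remove comment tokens before sending code to a model.
    untokenize rebuilds from token positions, so indentation
    is preserved and the result remains valid Python."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

src = "# module header\nx = 1  # inline note\ny = x + 1\n"
out = strip_comments(src)
print(out)
compile(out, "<stripped>", "exec")  # still valid Python
```

Unlike a regex pass, the tokenizer will not mangle "#" characters inside string literals.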
The New Cost of Syntax
For fifty years, the cost of programming language verbosity was measured in keystrokes and screen real estate. Developers chose languages based on performance, ecosystem, and personal preference. Verbosity was a style choice, not a cost driver.
LLMs changed the economics. Every token of syntactic overhead is now a literal cost — in dollars, in context window capacity, and in model reasoning quality. A curly brace is no longer free. An explicit type annotation has a price. A namespace declaration occupies space that could hold business logic.
Consider an inversion that would have sounded absurd five years ago: in terms of sheer volume, AI agents now process your code far more frequently than human engineers do. Code review bots, autonomous coding assistants, CI/CD agents that analyze diffs, retrieval systems that index your repository — these systems read every file in your codebase on every commit, every pull request, every prompt.
If the most frequent consumer of your code has changed, the criteria for what makes a language "readable" deserve reexamination. Optimizing for AI agent comprehension — dense, high-signal, low-ceremony code — may matter more for tooling efficiency than optimizing for the visual scanning of curly braces and explicit type annotations. That does not mean human readability stops mattering. It means the balance has shifted, and most teams have not updated their priors.
The data is open source. The methodology is reproducible. Add your own languages, swap tokenizers. If the findings hold across your own codebases and tokenizer choices, then token density belongs in every language evaluation from here forward.