What Is Tree-sitter?
Tree-sitter is a parser generator and incremental parsing library that builds concrete syntax trees (CSTs) for source code without requiring a full compiler toolchain. Originally developed for syntax highlighting in code editors, it can parse code in 100+ programming languages using community-maintained grammars. The resulting parse trees enable structural analysis of source code whether or not the code compiles or is complete.
Why It Matters
Traditional code analysis tools often rely on compiler-based parsing, which requires the code to compile successfully — with all dependencies installed, build tools configured, and environment variables set. This requirement makes it difficult to analyze codebases from outside the development environment, which is exactly the scenario in due diligence, compliance auditing, and cross-team architectural assessment.
Tree-sitter solves this by parsing source code at the syntax level without executing it or resolving its full dependency tree. It can parse incomplete files, files with syntax errors, and files from projects where the build environment is not available. This robustness makes it suitable for analyzing codebases as they exist, not as they would exist in an ideal build environment.
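The contrast between all-or-nothing parsing and error-tolerant parsing can be illustrated with a small sketch. It does not use Tree-sitter: Python's standard `ast` module stands in for a strict compiler front end, and a deliberately naive chunk-by-chunk fallback stands in for error recovery. The sample source, the `split_top_level` heuristic, and the function names are all assumptions made for illustration. Tree-sitter's own recovery happens inside the parser and marks damaged regions with `ERROR` nodes rather than splitting the text beforehand.

```python
import ast

# Hypothetical sample: the second function is missing a closing parenthesis.
SOURCE = """\
import os

def load(path):
    return open(path).read()

def broken(x:
    return x

class Config:
    debug = False
"""


def strict_parse(source):
    """All-or-nothing parse, standing in for a compiler front end."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None  # one bad construct loses the whole file


def split_top_level(source):
    """Toy heuristic: a new chunk starts at every non-indented line."""
    chunks, current = [], []
    for line in source.splitlines():
        starts_block = bool(line) and not line[0].isspace()
        if starts_block and current:
            chunks.append("\n".join(current))
            current = []
        if line.strip() or current:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def tolerant_parse(source):
    """Parse each chunk on its own; a bad chunk becomes a local ERROR entry."""
    results = []
    for chunk in split_top_level(source):
        try:
            body = ast.parse(chunk).body
            label = type(body[0]).__name__ if body else "Empty"
        except SyntaxError:
            label = "ERROR"
        results.append((label, chunk.splitlines()[0]))
    return results
```

On the sample above, `strict_parse` returns `None`, while `tolerant_parse` still reports the import, the valid function, and the class, with only the malformed function labeled `ERROR`. The useful property is that damage stays local, which is what makes partial or broken files analyzable.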
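Tree-sitter's incremental parsing works within a file: after an edit, it reuses the unchanged parts of the previous syntax tree and re-parses only the region affected by the change. Combined with re-parsing only the files that actually changed, this keeps continuous analysis workflows efficient on large codebases.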
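Within a single file, the idea behind incremental re-parsing can be sketched as a reuse plan: given the previous tree and a description of an edit, decide which nodes survive unchanged and which must be rebuilt. The following is a simplified model, not Tree-sitter's algorithm. It inspects top-level nodes only, whereas Tree-sitter reuses subtrees at any depth. The `Node`, `Edit`, and `plan_reparse` names are hypothetical; the edit fields are modeled on the byte-offset fields (`start_byte`, `old_end_byte`, `new_end_byte`) that Tree-sitter's edit description uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str
    start_byte: int
    end_byte: int  # exclusive


@dataclass(frozen=True)
class Edit:
    start_byte: int    # where the change begins
    old_end_byte: int  # end of the replaced span in the old text
    new_end_byte: int  # end of the replacement in the new text


def plan_reparse(nodes, edit):
    """Split top-level nodes into reusable ones and ones needing a re-parse."""
    delta = edit.new_end_byte - edit.old_end_byte
    reused, dirty = [], []
    for node in nodes:
        if node.end_byte < edit.start_byte:
            # Strictly before the edit: reusable as-is.
            reused.append(node)
        elif node.start_byte > edit.old_end_byte:
            # Strictly after the edit: same structure, offsets shifted.
            reused.append(
                Node(node.kind, node.start_byte + delta, node.end_byte + delta)
            )
        else:
            # Overlaps or touches the edit: conservatively re-parse.
            dirty.append(node)
    return reused, dirty


# Hypothetical previous tree for a three-declaration file.
old_nodes = [
    Node("import_statement", 0, 9),
    Node("function_definition", 11, 60),
    Node("class_definition", 62, 120),
]
# Replace 4 bytes inside the function body with 10 bytes.
edit = Edit(start_byte=30, old_end_byte=34, new_end_byte=40)
reused, dirty = plan_reparse(old_nodes, edit)
```

Here only the function definition lands in `dirty`. The import is reused untouched, and the class is reused with its offsets shifted by six bytes, so the cost of the update tracks the size of the edit rather than the size of the file.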
How It Works
Tree-sitter operates through a grammar-driven parsing pipeline.
Grammar definition: Each programming language has a Tree-sitter grammar — a formal specification of the language's syntax written in Tree-sitter's JavaScript-based grammar DSL. The grammar defines the production rules that describe how tokens combine into syntax tree nodes.
Parser generation: The grammar is compiled into a C-based parser using Tree-sitter's parser generator. The generated parser is a state machine optimized for incremental parsing, capable of re-parsing only the portions of a file that changed.
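Parsing: The generated parser processes source code and produces a concrete syntax tree (CST), a representation of the tokens and structural elements in the file. Each node records its exact byte range and row/column position, and comments typically appear as nodes in the tree, so whitespace and every other syntactic detail can be recovered from the original source text.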
Querying: Tree-sitter provides a query language (S-expression pattern matching) that allows extracting specific structural patterns from the CST — function definitions, import statements, class hierarchies, and other patterns relevant to architectural analysis.
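The last two steps can be sketched together: a tree whose nodes carry byte ranges into the source, and a query that captures one field of every node of a given type. This is a toy model in plain Python, not Tree-sitter's API. The tree is built by hand and contains far fewer nodes than a real parse would, and `Node`, `function_node`, and `capture_field` are hypothetical helpers. The matcher does roughly what a Tree-sitter pattern such as `(function_definition name: (identifier) @name)` expresses for a Python grammar.

```python
from dataclasses import dataclass, field

SOURCE = (
    b"import os\n\n"
    b"def add(a, b):\n    return a + b\n\n"
    b"def sub(a, b):\n    return a - b\n"
)


@dataclass
class Node:
    kind: str
    start_byte: int
    end_byte: int  # exclusive
    # Each child is (field_name or None, Node), mimicking named fields.
    children: list = field(default_factory=list)


def text(node):
    """Recover a node's text by slicing the original source."""
    return SOURCE[node.start_byte:node.end_byte].decode()


def function_node(name):
    """Hand-build a simplified function_definition node for `name`."""
    start = SOURCE.index(b"def " + name)
    end = SOURCE.index(b"\n", SOURCE.index(b"return", start))
    ident = Node("identifier", start + 4, start + 4 + len(name))
    return Node("function_definition", start, end, [("name", ident)])


ROOT = Node("module", 0, len(SOURCE), [
    (None, Node("import_statement", 0, len(b"import os"))),
    (None, function_node(b"add")),
    (None, function_node(b"sub")),
])


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for _, child in node.children:
        yield from walk(child)


def capture_field(root, kind, field_name):
    """Capture the text of `field_name` on every node of type `kind`."""
    captures = []
    for node in walk(root):
        if node.kind != kind:
            continue
        for name, child in node.children:
            if name == field_name:
                captures.append(text(child))
    return captures


names = capture_field(ROOT, "function_definition", "name")
```

With this hand-built tree, `names` holds the two function names, `add` and `sub`. Because nodes store positions rather than copies of the text, the same tree supports both structural queries and exact recovery of the underlying source.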
The combination of robust parsing (it handles broken code), breadth (100+ languages), and efficiency (incremental re-parsing) makes Tree-sitter a strong foundation for polyglot code analysis.
How Axiom Refract Addresses This
- Axiom Refract's parsing layer is built entirely on Tree-sitter, with 145+ language grammars, of which 103 are hand-hardened for production-grade depth
- Hand-hardening involves testing each grammar against thousands of real-world repositories and tuning extraction rules for edge cases
- Tree-sitter enables Axiom to analyze any codebase without requiring build tools, compilers, or development environment setup