Substitutability-Based Grammar Induction
Overview
This project tries to induce grammar by looking for substitutability: tokens or spans that can replace one another in similar contexts. The motivating example is that words like eat and ate can behave as structurally related objects because of the contexts they share, even when frequency-based tokenization would not give that relationship any special status.
The goal is to discover grammatical categories from use, then export those categories as readable grammar rules.
Core Idea
Frequency-based tokenizers such as BPE and WordPiece are excellent engineering tools, but frequency is not the same thing as structure. A rare word can still occupy a clear grammatical role; a common fragment can still be syntactically uninteresting.
Substitutability gives a different signal:
- Find contexts where multiple spans can occur.
- Treat the interchangeable spans as evidence for a latent category.
- Use those categories to compress held-out data.
- Prefer grammars that explain new data with shorter descriptions.
The formal lens is Kolmogorov conditional complexity: K(Y | X) asks how much information is still needed to describe held-out data Y after learning patterns from X.
Implementation
The implementation combines Python experimentation with Rust components for the heavier matching work. The matcher searches for hierarchical patterns, parallelizes the expensive parts with Rayon, and emits BNF-style grammars for inspection.
Why It Matters
The project is really about whether grammar can be recovered from relational evidence instead of surface frequency. If substitutability gives a useful compression signal, it becomes a way to learn structure that is closer to syntax than token statistics.