Last indexed: 13 February 2026 (50f4d4)

Tokenization

Overview

Tokenization is the lexical analysis phase of the query translation pipeline. It transforms a raw query string into a sequence of structured Token objects, each representing the smallest syntactic unit of the Galach query language. This phase precedes parsing and is responsible for recognizing language constructs through pattern matching.

The tokenization system consists of three primary components: the Tokenizer class that performs the tokenization algorithm, the TokenExtractor abstract class that defines recognition patterns, and the resulting TokenSequence value object. The system is designed to be resilient: when no valid token can be recognized at a position, a special BAILOUT token is created so that processing can continue.

For details on the specific token types recognized by the system, see Token Types. For information on implementing custom token recognition, see Token Extractors. For the algorithmic details of the tokenization process, see Tokenization Process.

Sources: lib/Languages/Galach/README.md:33-74

Component Architecture

The tokenization system follows a strategy pattern where the Tokenizer class delegates token recognition rules to a TokenExtractor implementation. This separation allows customization of the language syntax without modifying the tokenization algorithm.

Core Component Diagram

Key Components:

Component	Type	Role
`Tokenizer`	Final class	Orchestrates tokenization algorithm, marked final to prevent inheritance
`TokenExtractor`	Abstract class	Extension point for defining token recognition patterns via regex
`Full`	Concrete extractor	Supports complete Galach syntax including tags, users, domains
`Text`	Concrete extractor	Supports text-focused subset: words, phrases, groups, basic operators
`Token`	Value object	Immutable representation of a lexical unit with position and type
`TokenSequence`	Value object	Immutable collection of tokens produced by tokenization

Sources: lib/Languages/Galach/README.md:66-72, lib/Languages/Galach/README.md:255-261, lib/Languages/Galach/README.md:422-426

Tokenization Flow

The tokenization process transforms a query string into a TokenSequence through pattern matching. The Tokenizer iterates through the input string, applying regular expressions provided by the TokenExtractor to identify and extract tokens sequentially.

Tokenization Process Flow

The algorithm ensures that no input is considered invalid. When no pattern matches at the current position (such as an unclosed phrase delimiter "), a single character is extracted as a BAILOUT token, allowing processing to continue. The parser will later ignore these tokens and apply corrections.
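
The loop described above can be sketched in a few lines. This is a minimal illustration written in Python rather than the library's PHP, assuming three simplified patterns; the names PATTERNS and tokenize and the regular expressions themselves are hypothetical and are not taken from the library. Positions in this sketch are Python string indices.

```python
import re

# Hypothetical, simplified patterns tried in order; not the library's own expressions.
PATTERNS = [
    (re.compile(r"\s+"), "WHITESPACE"),
    (re.compile(r'"[^"]*"'), "TERM"),   # a closed phrase
    (re.compile(r'[^\s"]+'), "TERM"),   # a bare word
]


def tokenize(text):
    """Return (type, lexeme, position) tuples covering every character of text."""
    tokens, pos = [], 0
    while pos < len(text):
        for regex, token_type in PATTERNS:
            match = regex.match(text, pos)
            if match:
                tokens.append((token_type, match.group(0), pos))
                pos = match.end()
                break
        else:
            # No pattern matched here: emit one character as BAILOUT and move on.
            tokens.append(("BAILOUT", text[pos], pos))
            pos += 1
    return tokens


for token in tokenize('one "two'):
    print(token)
```

With the input `one "two`, the unclosed quote matches neither the phrase nor the word pattern, so it becomes a single-character BAILOUT token at position 4 and the word after it is still recognized. Concatenating the lexemes always reproduces the input, which is what makes every input tokenizable.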

Sources: lib/Languages/Galach/README.md:115-126, lib/Languages/Galach/README.md:230-239

Token Representation

Each Token is an immutable value object that encapsulates three key properties:

Property	Type	Description
`lexeme`	`string`	The actual text matched from the input
`position`	`int`	Byte offset in the input string where the token starts
`type`	`int`	Token type constant (`Tokenizer::TOKEN_*`)

The position property uses byte offsets rather than character offsets, which is important for handling multi-byte UTF-8 characters correctly. The type property corresponds to one of 11 predefined constants defined in the Tokenizer class.
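
The difference between the two kinds of offset is easy to check. The snippet below is plain Python used only to illustrate UTF-8 arithmetic; it does not call the library, and the sample string is made up.

```python
# "naïve café", written with escapes so the accented letters are single code points.
text = "na\u00efve caf\u00e9"
word = "caf\u00e9"

# Offset counted in characters (code points).
char_offset = text.index(word)

# Offset counted in bytes of the UTF-8 encoding.
byte_offset = text.encode("utf-8").index(word.encode("utf-8"))

print(char_offset, byte_offset)
```

The two numbers differ by one because the ï before the word occupies two bytes in UTF-8. Any consumer that slices the original string by token position needs to know which of the two units it is given.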

Token Type Categories

Term tokens can carry additional semantic information. For example, Word and Phrase tokens may have an optional domain property (e.g., `title:word`), while GroupBegin tokens can also have a domain prefix (e.g., `description:(`).

Sources: lib/Languages/Galach/README.md:258-387, lib/Languages/Galach/SYNTAX.md:1-66

TokenExtractor Architecture

The TokenExtractor abstract class is the primary extension point for customizing the lexical rules of the language. It defines which tokens are recognized and how they are extracted through regular expression patterns.

TokenExtractor Interface

Implementation Requirements:

getExpressionTypeMap(): Returns an associative array mapping regular expressions to Tokenizer::TOKEN_* constants. The order of patterns matters: they are applied in sequence, and the first match wins.
createTermToken(): Factory method that receives matched data and constructs a concrete Token instance. This method receives data from named capture groups in the regex, allowing extraction of semantic information (e.g., domain prefixes).
createGroupBeginToken() (optional): Similar factory for GROUP_BEGIN tokens, useful when groups can have domain prefixes or other custom properties.

The regex patterns use named capture groups to extract structured data from the matched text. For example, a pattern for domain-prefixed words might capture both the domain and the word separately.
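
The two ideas above, an ordered expression map and named capture groups, can be sketched together. The example is in Python rather than PHP and is an assumption-laden illustration: EXPRESSION_TYPE_MAP, extract, and the three patterns are hypothetical and far simpler than anything in the bundled extractors.

```python
import re

# Hypothetical patterns, not the library's. Dicts preserve insertion order,
# so iteration order is the order in which the patterns are tried.
EXPRESSION_TYPE_MAP = {
    r"\s+": "WHITESPACE",
    r"AND\b": "LOGICAL_AND",
    r"(?:(?P<domain>[a-zA-Z_]+):)?(?P<word>\w+)": "TERM",
}


def extract(text, pos, expression_type_map=EXPRESSION_TYPE_MAP):
    """Return (type, lexeme, named_groups) for the first pattern matching at pos."""
    for expression, token_type in expression_type_map.items():
        match = re.compile(expression).match(text, pos)
        if match:
            return token_type, match.group(0), match.groupdict()
    # Nothing matched: fall back to a single-character BAILOUT.
    return "BAILOUT", text[pos], {}


print(extract("title:galach", 0))
print(extract("AND more", 0))
```

The first call yields a TERM whose named groups separate the domain (title) from the word (galach). The second yields LOGICAL_AND only because that pattern is listed before the word pattern; with the order reversed, AND would be read as an ordinary word.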

Sources: lib/Languages/Galach/README.md:241-253, lib/Languages/Galach/README.md:392-426

Provided Implementations

Two TokenExtractor implementations are provided out of the box, serving both as production-ready components and as reference implementations for custom extensions.

Implementation Comparison

Feature	Full TokenExtractor	Text TokenExtractor
File Path	`lib/Languages/Galach/TokenExtractor/Full.php`	`lib/Languages/Galach/TokenExtractor/Text.php`
Word Terms	✓ With domain prefix	✓ Without domain prefix
Phrase Terms	✓ With domain prefix	✓ Without domain prefix
Tag Terms	✓ `#tag` syntax	✗ Not supported
User Terms	✓ `@user` syntax	✗ Not supported
Domain Prefixes	✓ `domain:term`	✗ Not supported
Logical Operators	✓ All forms	✓ All forms
Unary Operators	✓ `+` `-` `!`	✓ `+` `-` `!`
Grouping	✓ With domain prefix	✓ Without domain prefix
Whitespace	✓	✓
Use Case	Complete Galach syntax	Simple text search

Example Usage

The Full implementation recognizes all 11 token types and supports the complete Galach syntax including domain prefixes, tag terms (#tag), and user terms (@user). The Text implementation provides a simpler subset focused on text search with words, phrases, basic operators, and grouping, but without special syntax like tags, users, or domain prefixes.

Sources: lib/Languages/Galach/README.md:422-426, lib/Languages/Galach/README.md:66-72

Customization Through Extension

Developers can customize tokenization behavior by extending the TokenExtractor abstract class. This allows modification of:

Special characters and sequences: Change operators (`AND`, `&&`, `OR`, `||`, `NOT`, `!`, `+`, `-`), delimiters (`(`, `)`, `"`), markers (`@`, `#`), or the domain separator (`:`)
Language subset selection: Choose which token types to recognize by omitting their regex patterns from getExpressionTypeMap()
Custom term tokens: Implement custom Token subtypes with additional properties by returning them from createTermToken()
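
Language subset selection can be illustrated with two pattern lists, one of which simply leaves some patterns out. This is a Python sketch under stated assumptions: FULL, SUBSET, token_types, and the patterns are hypothetical, and the type labels such as TERM:tag are invented for readability. How the bundled Text extractor actually treats such input depends on its own patterns, which this sketch does not reproduce.

```python
import re

# Hypothetical pattern list loosely mirroring a richer syntax.
FULL = [
    (r"\s+", "WHITESPACE"),
    (r"#\w+", "TERM:tag"),
    (r"@\w+", "TERM:user"),
    (r"\w+", "TERM:word"),
]
# The subset omits the tag and user patterns and changes nothing else.
SUBSET = [(p, t) for p, t in FULL if t not in ("TERM:tag", "TERM:user")]


def token_types(text, patterns):
    """Return the sequence of token types produced for text by the given patterns."""
    types, pos = [], 0
    while pos < len(text):
        for pattern, token_type in patterns:
            match = re.compile(pattern).match(text, pos)
            if match:
                types.append(token_type)
                pos = match.end()
                break
        else:
            # Unrecognized character: one-character BAILOUT.
            types.append("BAILOUT")
            pos += 1
    return types


print(token_types("#php rocks", FULL))
print(token_types("#php rocks", SUBSET))
```

In this sketch the reduced list no longer recognizes the `#` marker, so it falls through to BAILOUT and the rest of the input is tokenized as plain words. The scanning loop is identical in both calls; only the pattern list differs.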

The Tokenizer class itself is marked final and is not intended for extension; all customization happens through the TokenExtractor abstract class. The parser downstream examines only token types (via the type property), so custom token implementations pass through it transparently.

Extension Example Structure

When implementing custom term tokens, use named capture groups in regular expressions to extract structured data. Pass this data to the token constructor in createTermToken(). The token's properties will be available for custom processing in generators, while the parser treats all term tokens uniformly based on their type property.
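
A custom term token built from named capture groups might look like the following. The sketch is in Python, not PHP, and everything in it is hypothetical: the PriceToken class, the PRICE pattern, the create_term_token function (named only in the spirit of createTermToken()), and the string placeholder used for the term type.

```python
import re
from dataclasses import dataclass

TOKEN_TERM = "TERM"  # placeholder standing in for a term type constant


@dataclass(frozen=True)
class Token:
    """Immutable base token: type, matched text, and start position."""
    type: str
    lexeme: str
    position: int


@dataclass(frozen=True)
class PriceToken(Token):
    """Hypothetical custom term token carrying two extra properties."""
    currency: str
    amount: int


# Named groups expose the parts of the match that the factory needs.
PRICE = re.compile(r"(?P<currency>[A-Z]{3})(?P<amount>\d+)")


def create_term_token(match, position):
    """Build a custom token from the named groups of a successful match."""
    return PriceToken(
        type=TOKEN_TERM,
        lexeme=match.group(0),
        position=position,
        currency=match.group("currency"),
        amount=int(match.group("amount")),
    )


token = create_term_token(PRICE.match("EUR25"), 0)
print(token.type, token.currency, token.amount)
```

The token still reports the ordinary term type, so a consumer that looks only at type treats it like any other term, while a later stage that knows about PriceToken can read currency and amount.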

Sources: lib/Languages/Galach/README.md:241-253, lib/Languages/Galach/README.md:392-420, lib/Languages/Galach/README.md:428-435

Integration with Parsing

The output of tokenization, a TokenSequence instance, serves as the input to the parser. The parser operates on token types only, ignoring custom token properties during syntax analysis. This design provides a clean separation between the lexical analysis and syntax analysis phases.

Pipeline Integration

During parsing, only the type property of each token is examined to drive the shift-reduce algorithm. However, the resulting SyntaxTree nodes maintain references to the original Token objects with all their properties intact. This allows generators to access custom token properties when producing output, enabling semantic information extracted during tokenization to flow through to code generation.

Key behaviors:

WHITESPACE tokens are ignored by the parser (not processed, not included in the syntax tree)
BAILOUT tokens trigger correction type Parser::CORRECTION_BAILOUT_TOKEN_IGNORED
Term tokens are wrapped in Term nodes that reference the original token
Group delimiter tokens are referenced by Group nodes
Operator tokens are consumed to create operator nodes (e.g., LogicalAnd, Mandatory)
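
The first two behaviors in the list can be sketched as a filtering pass. This is a simplified Python illustration, not the parser's implementation: the real parser is described as a shift-reduce algorithm and handles these tokens as part of that process, whereas the hypothetical prepare_for_parsing function below only shows which tokens are dropped and which produce a correction record. Tokens are modeled as (type, lexeme) pairs and the correction label is a made-up string.

```python
def prepare_for_parsing(tokens):
    """Split a token stream into tokens to act on and recorded corrections."""
    significant, corrections = [], []
    for token_type, lexeme in tokens:
        if token_type == "WHITESPACE":
            # Skipped entirely: never processed, never placed in the tree.
            continue
        if token_type == "BAILOUT":
            # Ignored as well, but the fact is recorded as a correction.
            corrections.append(("BAILOUT_TOKEN_IGNORED", lexeme))
            continue
        significant.append((token_type, lexeme))
    return significant, corrections


stream = [("TERM", "one"), ("WHITESPACE", " "), ("BAILOUT", '"'), ("TERM", "two")]
significant, corrections = prepare_for_parsing(stream)
print(significant)
print(corrections)
```

Only the two term tokens remain for syntax analysis, and the stray quote is reported once as a correction instead of causing a failure.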

Sources: lib/Languages/Galach/README.md:41-49, lib/Languages/Galach/README.md:428-435


URL: https://deepwiki.com/netgen/query-translator/4-tokenization
