Last indexed: 13 February 2026 (50f4d4)

Tokenization Process

This document details the internal algorithm used by the Tokenizer class to convert raw query strings into sequences of Token objects. It explains how regex patterns from TokenExtractor implementations are applied, how named capture groups map to token types, how byte offsets are calculated for multi-byte UTF-8 characters, and how unrecognized input is handled through BAILOUT tokens.

For an overview of token types and their properties, see Token Types. For information about implementing custom TokenExtractor classes, see Token Extractors.

Tokenization Algorithm Overview

The Tokenizer class implements a regex-based scanning algorithm that processes the input string from left to right. The class accepts a TokenExtractor instance that provides the regex patterns, then applies these patterns iteratively to extract tokens until the entire input is consumed.

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1240-1250

TokenExtractor Integration

The Tokenizer relies entirely on its injected TokenExtractor instance to define lexical rules. The extractor provides a compiled regex pattern with named capture groups, where each group name corresponds to a token type constant.

Component	Responsibility
`TokenExtractor`	Defines regex patterns with named capture groups
`Tokenizer`	Applies patterns iteratively, creates Token objects
`Token`	Value object representing a lexeme with type and position
`TokenSequence`	Container for all tokens plus original source string

The tokenizer calls TokenExtractor::extract(string $string, int $position) at each step, which returns either a matched token or null if no pattern matches at the current position.

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1242-1243, tests/Galach/Tokenizer/TextTokenizerTest.php:180-183

Named Capture Groups and Token Type Mapping

The TokenExtractor implementations construct regex patterns where each alternative is wrapped in a named capture group. The group name directly maps to a token type constant defined in the Tokenizer class.

For example, when the Full extractor encounters the string "AND" at a word boundary, the (?<LOGICAL_AND>\bAND\b) named group captures it, resulting in a token with type Tokenizer::TOKEN_LOGICAL_AND.
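The mapping from group name to token type can be sketched with plain preg_match(). The pattern, the WORD group, and the matchAt() helper below are illustrative assumptions, not the library's actual regex or API; only the LOGICAL_AND group mirrors the example above.

```php
<?php
// Hypothetical group names for illustration; the library defines its own constants.
const TYPE_LOGICAL_AND = 'LOGICAL_AND';
const TYPE_WORD = 'WORD';

/**
 * Return [groupName, lexeme] for the named group that matched at byte
 * offset $offset, or null when nothing matches there.
 */
function matchAt(string $input, int $offset): ?array
{
    // \G anchors the match at $offset; /u treats pattern and subject as UTF-8.
    $pattern = '/\G(?:(?<LOGICAL_AND>\bAND\b)|(?<WORD>[^\s"()]+))/u';

    if (preg_match($pattern, $input, $matches, 0, $offset) !== 1) {
        return null;
    }

    foreach ([TYPE_LOGICAL_AND, TYPE_WORD] as $name) {
        // Groups that did not participate are absent or empty strings.
        if (isset($matches[$name]) && $matches[$name] !== '') {
            return [$name, $matches[$name]];
        }
    }

    return null;
}

$and = matchAt('one AND two', 4);  // ['LOGICAL_AND', 'AND']
$word = matchAt('ANDROID', 0);     // ['WORD', 'ANDROID'], since \b fails after "AND"
```

Because the group name is the only thing that distinguishes the alternatives, adding a token type in a design like this means adding one named alternative and one entry in the lookup list.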

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:521-528

Byte Offset Calculation for Multi-Byte Characters

The tokenizer tracks position as byte offsets, not character offsets, which is critical for UTF-8 strings where characters may occupy multiple bytes. This ensures that token positions can be used to extract substrings from the original source string using byte-based operations.

Multi-Byte Character Handling

The test suite explicitly verifies multi-byte character handling:

Test Input	Byte Length	Character Count
`'šđčćž'`	10	5
`🍳` (emoji)	4	1
Complex emoji sequence	Variable	1 (grapheme)

When calculating positions, the tokenizer uses strlen() (byte length) rather than mb_strlen() (character length), ensuring positions align with byte offsets in the original string.
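The difference between the two length functions is easy to check directly. This short script uses only standard PHP string functions (mb_strlen() and mb_strpos() require the mbstring extension) and the inputs from the table above; the query string 'šđčćž AND' is an example made up for this sketch.

```php
<?php
$word = 'šđčćž';
$emoji = '🍳';

$wordBytes = strlen($word);               // counts bytes: 10
$wordChars = mb_strlen($word, 'UTF-8');   // counts code points: 5
$emojiBytes = strlen($emoji);             // 4
$emojiChars = mb_strlen($emoji, 'UTF-8'); // 1

// A token that follows multi-byte text has a byte offset that differs
// from its character offset.
$query = 'šđčćž AND';
$byteOffset = strpos($query, 'AND');                // 11 (10 bytes + 1 space)
$charOffset = mb_strpos($query, 'AND', 0, 'UTF-8'); // 6 (5 characters + 1 space)

printf("%d bytes, %d characters\n", $wordBytes, $wordChars);
```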

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:66-76

Whitespace and Position Tracking

The tokenizer tracks whitespace tokens and their byte positions, which is essential for accurate position reporting and source reconstruction.

The position field in each token represents the byte offset where the token's lexeme begins in the source string. This allows downstream components to:

- Extract the original lexeme from the source: substr($source, $token->position, strlen($token->lexeme))
- Highlight syntax in user interfaces by mapping each token to its byte range in the source
- Report error positions accurately
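As a rough illustration of byte-based extraction, the sketch below slices lexemes back out of a source string. SketchToken is a hypothetical stand-in with only the two fields used here, not the library's Token class, and the token positions are written by hand for this example rather than produced by the tokenizer.

```php
<?php
// Hypothetical stand-in for a token: just a lexeme and its byte offset.
final class SketchToken
{
    public $lexeme;
    public $position;

    public function __construct(string $lexeme, int $position)
    {
        $this->lexeme = $lexeme;
        $this->position = $position;
    }
}

// Slice the lexeme out of the source using byte offset and byte length.
function lexemeFromSource(string $source, SketchToken $token): string
{
    return substr($source, $token->position, strlen($token->lexeme));
}

$source = 'šđčćž AND 🍳';
$tokens = [
    new SketchToken('šđčćž', 0),  // 10 bytes
    new SketchToken(' ', 10),
    new SketchToken('AND', 11),
    new SketchToken(' ', 14),
    new SketchToken('🍳', 15),    // 4 bytes
];

// Concatenating the lexemes in order reproduces the source.
$rebuilt = implode('', array_map(function (SketchToken $t) {
    return $t->lexeme;
}, $tokens));
```

A consumer that works in characters rather than bytes (for example, a browser-side highlighter) would need to convert these offsets before using them.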

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:102-107

BAILOUT Token Generation

When no regex pattern matches at the current position, the tokenizer generates a TOKEN_BAILOUT token for a single byte and advances by one byte. This ensures the tokenizer always makes progress and never enters an infinite loop, even with completely invalid input.

BAILOUT Scenarios

The test suite demonstrates several scenarios that produce BAILOUT tokens:

Input	Tokens Produced	Explanation
`'word"'`	`[WordToken('word', 0), BAILOUT('"', 4)]`	Quote not part of phrase
`'one"two'`	`[WordToken('one', 0), BAILOUT('"', 3), WordToken('two', 4)]`	Quote mid-string
`'AND"'`	`[LOGICAL_AND('AND', 0), BAILOUT('"', 3)]`	After operator

BAILOUT tokens allow the parser to identify problematic input sections while continuing to tokenize the remainder of the string. The parser's correction system (see Error Handling and Corrections) can then handle these tokens appropriately.

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1252-1338

Escape Sequence Processing

The tokenizer handles escape sequences (backslash-prefixed characters) as part of token extraction. When an escape sequence is encountered within a word, the TokenExtractor's regex patterns recognize it, and the resulting token stores both the original lexeme (with backslash) and the unescaped value.

Common escape sequences:

Input Sequence	Lexeme	Value	Token Type
`\+`	`'\+'`	`'+'`	Word
`\-`	`'\-'`	`'-'`	Word
`\!`	`'\!'`	`'!'`	Word
`\(`	`'\('`	`'('`	Word
`\)`	`'\)'`	`')'`	Word
`\\`	`'\\'`	`'\'`	Word
`\ ` (backslash, space)	`'\ '`	`' '`	Part of Word

The value property contains the unescaped text, while the lexeme preserves the original input including escape characters.
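The lexeme-to-value step can be approximated with a single regex replacement. The unescapeLexeme() helper and its set of escapable characters are assumptions drawn from the table above; the library's own list of escapable characters may differ.

```php
<?php
/**
 * Strip the escaping backslash from each escaped character in a lexeme.
 * Escapable characters here are + - ! ( ) \ and space (an assumption of
 * this sketch, taken from the table above).
 */
function unescapeLexeme(string $lexeme): string
{
    // Regex: a backslash followed by one escapable character; keep only
    // the character. Scanning left to right means "\\+" (escaped backslash,
    // then a plus) becomes "\+" rather than "+".
    return preg_replace('/\\\\([+\\-!()\\\\ ])/', '$1', $lexeme);
}

$value = unescapeLexeme('word\\ word'); // 'word word'
```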

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1013-1028, tests/Galach/Tokenizer/FullTokenizerTest.php:110-113

TokenSequence Construction

After all tokens are extracted, the tokenizer constructs a TokenSequence value object that bundles the token array with the original source string.

The TokenSequence object provides:

- tokens - Array of Token objects in order of appearance
- source - Original input string for reference

This bundling ensures that downstream components (like the Parser) always have access to both the tokenized representation and the original source, enabling features like:

- Error message generation with context
- Syntax highlighting in user interfaces
- Correction suggestion generation
- Source reconstruction with modifications

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1247-1249

Domain Prefix Recognition

The Full tokenizer recognizes domain prefixes in the form domain: or domain.subdomain: that precede words, phrases, or groups. This recognition happens during token extraction, where the domain becomes part of the token's metadata.

The domain information is stored in specialized token classes:

Token Type	Domain Property	Example
`WordToken`	Constructor parameter	`new WordToken('domain:word', 0, 'domain', 'word')`
`PhraseToken`	Constructor parameter	`new PhraseToken('domain:"phrase"', 0, 'domain', '"', 'phrase')`
`GroupBegin`	Constructor parameter	`new GroupBegin('domain:(', 0, '(', 'domain')`

The Text tokenizer does not support domain recognition; it treats colons as regular word characters.
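The contrast between the two behaviours can be sketched with two small patterns. Both functions and both regexes below are hypothetical and much simpler than whatever the Full and Text extractors actually use; the returned array simply follows the field order of the WordToken example above (lexeme, position, domain, word).

```php
<?php
/**
 * Full-style sketch: an optional "domain:" prefix followed by a word.
 * Returns [lexeme, position, domain, word], or null when nothing matches.
 */
function splitDomain(string $input, int $offset = 0): ?array
{
    $pattern = '/\G(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_.-]*):)?(?<word>[^\s:"()]+)/';

    if (preg_match($pattern, $input, $m, 0, $offset) !== 1) {
        return null;
    }

    // An optional group that did not participate yields an empty string.
    return [$m[0], $offset, $m['domain'] ?? '', $m['word']];
}

/**
 * Text-style sketch: the colon is an ordinary word character, so no
 * domain is ever split off.
 */
function textWord(string $input, int $offset = 0): ?string
{
    return preg_match('/\G[^\s"()]+/', $input, $m, 0, $offset) === 1 ? $m[0] : null;
}

$full = splitDomain('domain:word'); // ['domain:word', 0, 'domain', 'word']
$text = textWord('domain:word');    // 'domain:word'
```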

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:474-518, tests/Galach/Tokenizer/TextTokenizerTest.php:143-172

Implementation Summary

The tokenization process can be summarized as:

1. Initialization - Create empty token array, set position to 0
2. Extraction Loop - While position < string length:
   - Call TokenExtractor::extract() at current position
   - If match found: create appropriate token, add to array, advance by lexeme length
   - If no match: create BAILOUT token for one byte, advance by 1
3. Completion - Return TokenSequence with tokens and source

This algorithm ensures:

- All input is consumed (no infinite loops)
- Multi-byte characters are handled correctly
- Invalid input produces BAILOUT tokens rather than failures
- Position tracking enables accurate error reporting
- Source string is preserved for downstream use
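The loop summarized above can be sketched in a few lines. Everything here is an assumption made for illustration: tokenizeSketch(), the closure standing in for a TokenExtractor, the three-group pattern, and the plain arrays used in place of Token and TokenSequence objects. It is a minimal model of the described behaviour, not the library's implementation.

```php
<?php
/**
 * Scan $input left to right. $extract($input, $offset) returns
 * [type, lexeme] or null; a null or empty match becomes a one-byte BAILOUT.
 */
function tokenizeSketch(string $input, callable $extract): array
{
    $tokens = [];
    $position = 0;
    $length = strlen($input); // byte length

    while ($position < $length) {
        $match = $extract($input, $position);

        if ($match === null || $match[1] === '') {
            // No progress possible via the patterns: consume exactly one byte.
            $match = ['BAILOUT', $input[$position]];
        }

        [$type, $lexeme] = $match;
        $tokens[] = ['type' => $type, 'lexeme' => $lexeme, 'position' => $position];
        $position += strlen($lexeme); // always advances by at least one byte
    }

    return ['tokens' => $tokens, 'source' => $input];
}

// Hypothetical extractor: whitespace, the AND operator, and bare words.
$extract = function (string $input, int $offset): ?array {
    $pattern = '/\G(?:(?<WHITESPACE>\s+)|(?<LOGICAL_AND>\bAND\b)|(?<WORD>[^\s"()]+))/u';
    if (preg_match($pattern, $input, $m, 0, $offset) !== 1) {
        return null;
    }
    foreach (['WHITESPACE', 'LOGICAL_AND', 'WORD'] as $name) {
        if (isset($m[$name]) && $m[$name] !== '') {
            return [$name, $m[$name]];
        }
    }
    return null;
};

$result = tokenizeSketch('one"two', $extract);
// Token types in order: WORD at 0, BAILOUT at 3, WORD at 4
```

Since every iteration consumes at least one byte, the loop terminates for any input, which is the progress guarantee the summary relies on.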

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1240-1250

URL: https://deepwiki.com/netgen/query-translator/4.3-tokenization-process