URL: https://deepwiki.com/netgen/query-translator/4.3-tokenization-process

Tokenization Process

This document details the internal algorithm used by the Tokenizer class to convert raw query strings into sequences of Token objects. It explains how regex patterns from TokenExtractor implementations are applied, how named capture groups map to token types, how byte offsets are calculated for multi-byte UTF-8 characters, and how unrecognized input is handled through BAILOUT tokens.

For an overview of token types and their properties, see Token Types. For information about implementing custom TokenExtractor classes, see Token Extractors.


Tokenization Algorithm Overview

The Tokenizer class implements a regex-based scanning algorithm that processes the input string from left to right. The class accepts a TokenExtractor instance that provides the regex patterns, then applies these patterns iteratively to extract tokens until the entire input is consumed.


Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1240-1250


TokenExtractor Integration

The Tokenizer relies entirely on its injected TokenExtractor instance to define lexical rules. The extractor provides a compiled regex pattern with named capture groups, where each group name corresponds to a token type constant.

| Component | Responsibility |
|---|---|
| TokenExtractor | Defines regex patterns with named capture groups |
| Tokenizer | Applies patterns iteratively, creates Token objects |
| Token | Value object representing a lexeme with type and position |
| TokenSequence | Container for all tokens plus original source string |

The tokenizer calls TokenExtractor::extract(string $string, int $position) at each step, which returns either a matched token or null if no pattern matches at the current position.

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1242-1243, tests/Galach/Tokenizer/TextTokenizerTest.php:180-183


Named Capture Groups and Token Type Mapping

The TokenExtractor implementations construct regex patterns where each alternative is wrapped in a named capture group. The group name directly maps to a token type constant defined in the Tokenizer class.


For example, when the Full extractor encounters the string "AND" at a word boundary, the (?<LOGICAL_AND>\bAND\b) named group captures it, resulting in a token with type Tokenizer::TOKEN_LOGICAL_AND.
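The group-name-to-type mapping described above can be sketched with plain `preg_match`. The pattern below is a hypothetical, much-reduced stand-in for whatever the Full extractor actually compiles, and `matchTokenType` is an illustrative helper, not part of the library. It assumes PHP 7.4 or later, where `PREG_UNMATCHED_AS_NULL` reports every non-participating group as `null`.

```php
<?php
// Hypothetical pattern: group names mirror the token types named in the text,
// but the alternatives themselves are simplified assumptions.
$pattern = '/\G(?:(?<LOGICAL_AND>\bAND\b)|(?<LOGICAL_OR>\bOR\b)|(?<WHITESPACE>\s+)|(?<WORD>[^\s"]+))/u';

/**
 * Illustrative helper: returns the name of the group that matched at
 * $byteOffset together with its lexeme, or null when nothing matches.
 */
function matchTokenType(string $pattern, string $input, int $byteOffset): ?array
{
    if (preg_match($pattern, $input, $matches, PREG_UNMATCHED_AS_NULL, $byteOffset) !== 1) {
        return null;
    }
    foreach ($matches as $name => $lexeme) {
        // Skip numeric keys and groups that did not take part in the match.
        if (is_string($name) && $lexeme !== null) {
            return ['type' => $name, 'lexeme' => $lexeme];
        }
    }
    return null;
}

var_export(matchTokenType($pattern, 'one AND two', 4));
// array ('type' => 'LOGICAL_AND', 'lexeme' => 'AND')
var_export(matchTokenType($pattern, 'ANDROID', 0));
// array ('type' => 'WORD', 'lexeme' => 'ANDROID')
```

The `\G` anchor pins the match to the requested offset, and `\bAND\b` keeps `ANDROID` from being read as an operator, which is the word-boundary behaviour the example in the text relies on.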

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:521-528


Byte Offset Calculation for Multi-Byte Characters

The tokenizer tracks position as byte offsets, not character offsets, which is critical for UTF-8 strings where characters may occupy multiple bytes. This ensures that token positions can be used to extract substrings from the original source string using byte-based operations.

Multi-Byte Character Handling


The test suite explicitly verifies multi-byte character handling:

| Test Input | Byte Length | Character Count | Token Position |
|---|---|---|---|
| 'šđčćž' | 10 | 5 | 0 |
| 🍳 (emoji) | 4 | 1 | 0 |
| Complex emoji sequence | Variable | 1 (grapheme) | 0 |

When calculating positions, the tokenizer uses strlen() (byte length) rather than mb_strlen() (character length), ensuring positions align with byte offsets in the original string.
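The difference between the two length functions is easy to check directly. The snippet below is a standalone illustration using only PHP built-ins (it needs the mbstring extension and a UTF-8 encoded source file); the sample query string is made up for the example.

```php
<?php
// Byte length vs. character length for the inputs listed in the table.
foreach (['šđčćž', '🍳'] as $sample) {
    printf("%s: bytes=%d chars=%d\n", $sample, strlen($sample), mb_strlen($sample, 'UTF-8'));
}
// šđčćž: bytes=10 chars=5
// 🍳: bytes=4 chars=1

// The same gap shows up in the offset of whatever follows a multi-byte word.
$query = 'šđčćž AND';
$byteOffset = strpos($query, 'AND');                // 11 (10 bytes + 1 space)
$charOffset = mb_strpos($query, 'AND', 0, 'UTF-8'); // 6  (5 characters + 1 space)
printf("byte offset=%d, character offset=%d\n", $byteOffset, $charOffset);
```

A position of 11 is only meaningful to byte-based functions such as `substr()`; passing it to `mb_substr()` would point past the intended token.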

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:66-76


Whitespace and Position Tracking

The tokenizer tracks whitespace tokens and their byte positions, which is essential for accurate position reporting and source reconstruction.


The position field in each token represents the byte offset where the token's lexeme begins in the source string. This allows downstream components to:

  1. Extract the original lexeme from the source: substr($source, $token->position, strlen($token->lexeme))
  2. Highlight syntax in user interfaces by mapping tokens to byte ranges in the source
  3. Report error positions accurately
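As a rough illustration of the first use, the sketch below recovers each lexeme from a source string by byte offset. The token list is hand-written for the example (plain arrays rather than the library's Token objects), and `sliceToken` is a hypothetical helper.

```php
<?php
$source = 'čaj AND 🍳';

// Hand-written tokens: 'č' takes 2 bytes and '🍳' takes 4, so positions are byte offsets.
$tokens = [
    ['lexeme' => 'čaj', 'position' => 0],
    ['lexeme' => ' ',   'position' => 4],
    ['lexeme' => 'AND', 'position' => 5],
    ['lexeme' => ' ',   'position' => 8],
    ['lexeme' => '🍳',  'position' => 9],
];

/** Hypothetical helper: cut a token's lexeme back out of the source by byte offset. */
function sliceToken(string $source, array $token): string
{
    return substr($source, $token['position'], strlen($token['lexeme']));
}

$slices = array_map(fn (array $t): string => sliceToken($source, $t), $tokens);

// Concatenating the slices reproduces the original source exactly.
var_dump(implode('', $slices) === $source); // bool(true)

// Reading the same offset as a character index lands in the wrong place.
var_dump(mb_substr($source, 5, 3, 'UTF-8')); // string(3) "ND "
```

The last line shows why byte offsets and multibyte string functions should not be mixed: the same position 5 yields `AND` with `substr()` but `ND ` with `mb_substr()`.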

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:102-107


BAILOUT Token Generation

When no regex pattern matches at the current position, the tokenizer generates a TOKEN_BAILOUT token for a single byte and advances by one byte. This ensures the tokenizer always makes progress and never enters an infinite loop, even with completely invalid input.

BAILOUT Handling Flow


BAILOUT Scenarios

The test suite demonstrates several scenarios that produce BAILOUT tokens:

| Input | Tokens Produced | Explanation |
|---|---|---|
| 'word"' | [WordToken('word', 0), BAILOUT('"', 4)] | Quote not part of phrase |
| 'one"two' | [WordToken('one', 0), BAILOUT('"', 3), WordToken('two', 4)] | Quote mid-string |
| 'AND"' | [LOGICAL_AND('AND', 0), BAILOUT('"', 3)] | After operator |

BAILOUT tokens allow the parser to identify problematic input sections while continuing to tokenize the remainder of the string. The parser's correction system (see Error Handling and Corrections) can then handle these tokens appropriately.

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1252-1338


Escape Sequence Processing

The tokenizer handles escape sequences (backslash-prefixed characters) as part of token extraction. When an escape sequence is encountered within a word, the TokenExtractor's regex patterns recognize it, and the resulting token stores both the original lexeme (with backslash) and the unescaped value.


Common escape sequences:

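| Input Sequence | Lexeme | Value | Token Type |
|---|---|---|---|
| \+ | '\+' | '+' | Word |
| \- | '\-' | '-' | Word |
| \! | '\!' | '!' | Word |
| \( | '\(' | '(' | Word |
| \) | '\)' | ')' | Word |
| \\ | '\\' | '\' | Word |
| \ (space) | '\ ' | ' ' | Part of Word |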

The value property contains the unescaped text, while the lexeme preserves the original input including escape characters.
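One way to derive the value from the lexeme is a single regex replacement that drops the backslash in front of an escapable character. This is a sketch covering only the sequences listed in the table; `unescapeWord` is a hypothetical name, and the library's own rule set may be broader.

```php
<?php
/**
 * Hypothetical helper: strip the backslash from the escape sequences listed
 * in the table (\+ \- \! \( \) \\ and backslash-space). Other backslashes are kept.
 */
function unescapeWord(string $lexeme): string
{
    // Regex after PHP string parsing: /\\([+\-!()\\ ])/
    return preg_replace('/\\\\([+\-!()\\\\ ])/', '$1', $lexeme) ?? $lexeme;
}

$lexeme = 'one\ two\+';
$value  = unescapeWord($lexeme);

echo $lexeme, "\n"; // one\ two\+   (lexeme keeps the escapes)
echo $value, "\n";  // one two+     (value has them removed)
```

Keeping both forms means the lexeme still lines up with the source by length, while the value is what downstream code should treat as the search term.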

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1013-1028, tests/Galach/Tokenizer/FullTokenizerTest.php:110-113


TokenSequence Construction

After all tokens are extracted, the tokenizer constructs a TokenSequence value object that bundles the token array with the original source string.


The TokenSequence object provides:

  • tokens - Array of Token objects in order of appearance
  • source - Original input string for reference

This bundling ensures that downstream components (like the Parser) always have access to both the tokenized representation and the original source, enabling features like:

  • Error message generation with context
  • Syntax highlighting in user interfaces
  • Correction suggestion generation
  • Source reconstruction with modifications

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1247-1249


Domain Prefix Recognition

The Full tokenizer recognizes domain prefixes in the form domain: or domain.subdomain: that precede words, phrases, or groups. This recognition happens during token extraction, where the domain becomes part of the token's metadata.


The domain information is stored in specialized token classes:

| Token Type | Domain Property | Example |
|---|---|---|
| WordToken | Constructor parameter | new WordToken('domain:word', 0, 'domain', 'word') |
| PhraseToken | Constructor parameter | new PhraseToken('domain:"phrase"', 0, 'domain', '"', 'phrase') |
| GroupBegin | Constructor parameter | new GroupBegin('domain:(', 0, '(', 'domain') |

The Text tokenizer does not support domain recognition; it treats colons as regular word characters.
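To make the lexeme/domain split concrete, here is a small sketch that separates a leading `domain:` prefix from the rest of a lexeme. The character class allowed in a domain name is an assumption made for the example, and `splitDomain` is a hypothetical helper rather than the extractor's actual pattern.

```php
<?php
/**
 * Hypothetical helper: split 'domain:rest' into its two parts.
 * Returns an empty domain when the lexeme carries no prefix.
 */
function splitDomain(string $lexeme): array
{
    // Assumed domain syntax: a letter or underscore, then letters, digits, '_', '.', '-'.
    $pattern = '/^(?<domain>[a-zA-Z_][a-zA-Z0-9_.\-]*):(?<rest>.+)$/s';
    if (preg_match($pattern, $lexeme, $m) === 1) {
        return ['domain' => $m['domain'], 'rest' => $m['rest']];
    }
    return ['domain' => '', 'rest' => $lexeme];
}

var_export(splitDomain('domain:word'));
// array ('domain' => 'domain', 'rest' => 'word')
var_export(splitDomain('domain.subdomain:"phrase"'));
// array ('domain' => 'domain.subdomain', 'rest' => '"phrase"')
var_export(splitDomain('word'));
// array ('domain' => '', 'rest' => 'word')
```

The two returned parts correspond to the separate domain and word (or phrase) arguments shown in the constructor examples, with the full lexeme kept alongside them.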

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:474-518, tests/Galach/Tokenizer/TextTokenizerTest.php:143-172


Implementation Summary

The tokenization process can be summarized as:

  1. Initialization - Create empty token array, set position to 0
  2. Extraction Loop - While position < string length:
    • Call TokenExtractor::extract() at current position
    • If match found: create appropriate token, add to array, advance by lexeme length
    • If no match: create BAILOUT token for one byte, advance by 1
  3. Completion - Return TokenSequence with tokens and source

This algorithm ensures:

  • All input is consumed (no infinite loops)
  • Multi-byte characters are handled correctly
  • Invalid input produces BAILOUT tokens rather than failures
  • Position tracking enables accurate error reporting
  • Source string is preserved for downstream use
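The three steps can be sketched as a short loop. This is a minimal illustration of the algorithm as summarized here, not the library's source: `extractAt` is a hypothetical stand-in for `TokenExtractor::extract()` with a reduced pattern, and tokens are plain `[type, lexeme, position]` arrays instead of Token objects.

```php
<?php
/** Hypothetical extractor: returns [type, lexeme] at the byte position, or null. */
function extractAt(string $input, int $position): ?array
{
    // Reduced, assumed pattern; the double quote is deliberately left unmatched.
    $pattern = '/\G(?:(?<WHITESPACE>\s+)|(?<LOGICAL_AND>\bAND\b)|(?<WORD>[^\s"]+))/u';
    if (preg_match($pattern, $input, $m, PREG_UNMATCHED_AS_NULL, $position) !== 1) {
        return null;
    }
    foreach ($m as $name => $lexeme) {
        if (is_string($name) && $lexeme !== null) {
            return [$name, $lexeme];
        }
    }
    return null;
}

function tokenize(string $input): array
{
    $tokens = [];
    $position = 0;              // 1. Initialization
    $length = strlen($input);   // byte length, matching byte-based positions

    while ($position < $length) {   // 2. Extraction loop
        $match = extractAt($input, $position);
        if ($match === null) {
            // No pattern matched: emit a one-byte BAILOUT and move on.
            $tokens[] = ['BAILOUT', $input[$position], $position];
            $position += 1;
            continue;
        }
        [$type, $lexeme] = $match;
        $tokens[] = [$type, $lexeme, $position];
        $position += strlen($lexeme);   // every alternative matches at least one byte
    }

    // 3. Completion: bundle tokens with the untouched source.
    return ['tokens' => $tokens, 'source' => $input];
}

print_r(tokenize('one"two')['tokens']);
// WORD 'one' at 0, BAILOUT '"' at 3, WORD 'two' at 4
```

Because both branches advance the position by at least one byte, the loop terminates for any input, which is the progress guarantee the summary describes.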

Sources: tests/Galach/Tokenizer/FullTokenizerTest.php:1240-1250