Tokenization

Overview

Tokenization is the lexical analysis phase of the query translation pipeline. It transforms a raw query string into a sequence of structured Token objects, each representing the smallest syntactic unit of the Galach query language. This phase precedes parsing and is responsible for recognizing language constructs through pattern matching.

The tokenization system consists of three primary components: the Tokenizer class that performs the tokenization algorithm, the TokenExtractor abstract class that defines recognition patterns, and the resulting TokenSequence value object. The system is designed to be resilient: when no valid token can be recognized at a position, a special BAILOUT token is created so that processing can continue.

For details on the specific token types recognized by the system, see Token Types. For information on implementing custom token recognition, see Token Extractors. For the algorithmic details of the tokenization process, see Tokenization Process.

Sources: lib/Languages/Galach/README.md:33-74


Component Architecture

The tokenization system follows the strategy pattern: the Tokenizer class delegates token recognition to a TokenExtractor implementation. This separation allows the language syntax to be customized without modifying the tokenization algorithm.

Core Component Diagram


Key Components:

| Component | Type | Role |
| --- | --- | --- |
| Tokenizer | Final class | Orchestrates tokenization algorithm, marked final to prevent inheritance |
| TokenExtractor | Abstract class | Extension point for defining token recognition patterns via regex |
| Full | Concrete extractor | Supports complete Galach syntax including tags, users, domains |
| Text | Concrete extractor | Supports text-focused subset: words, phrases, groups, basic operators |
| Token | Value object | Immutable representation of a lexical unit with position and type |
| TokenSequence | Value object | Immutable collection of tokens produced by tokenization |

Sources: lib/Languages/Galach/README.md:66-72, lib/Languages/Galach/README.md:255-261, lib/Languages/Galach/README.md:422-426


Tokenization Flow

The tokenization process transforms a query string into a TokenSequence through pattern matching. The Tokenizer iterates through the input string, applying regular expressions provided by the TokenExtractor to identify and extract tokens sequentially.

Tokenization Process Flow


The algorithm never rejects input as invalid. When no pattern matches at the current position (for example, at an unclosed phrase delimiter "), a single character is extracted as a BAILOUT token and scanning resumes at the next character. The parser later ignores these tokens and records a corresponding correction.
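The scan described above can be sketched in a few lines. This is a minimal Python illustration of the technique, not the library's PHP implementation: the three patterns, the string type codes, and the `tokenize` function are assumptions made for the example. It shows the first-match-wins scan and the single-character BAILOUT fallback.

```python
import re
from typing import NamedTuple

# Hypothetical type codes for this sketch; the real constants live on the Tokenizer class.
WHITESPACE, WORD, PHRASE, BAILOUT = "WHITESPACE", "WORD", "PHRASE", "BAILOUT"


class Token(NamedTuple):
    type: str
    lexeme: str
    position: int  # index into the Python string (characters, not bytes)


# Ordered (pattern, type) pairs: the first pattern that matches at the cursor wins.
PATTERNS = [
    (re.compile(r"\s+"), WHITESPACE),
    (re.compile(r'"[^"]*"'), PHRASE),  # matches only a phrase with a closing quote
    (re.compile(r'[^\s"]+'), WORD),
]


def tokenize(source: str) -> list:
    tokens = []
    position = 0
    while position < len(source):
        for pattern, token_type in PATTERNS:
            match = pattern.match(source, position)
            if match:
                tokens.append(Token(token_type, match.group(), position))
                break
        else:
            # No pattern matched: emit one character as BAILOUT so the scan always advances.
            tokens.append(Token(BAILOUT, source[position], position))
        position += len(tokens[-1].lexeme)
    return tokens


for token in tokenize('one "two'):
    print(token)
```

In this sketch the unclosed quote at index 4 becomes a BAILOUT token and the rest of the input is still tokenized. Concatenating the lexemes reproduces the input exactly, which is a handy invariant to test in any tokenizer of this shape.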

Sources: lib/Languages/Galach/README.md:115-126, lib/Languages/Galach/README.md:230-239


Token Representation

Each Token is an immutable value object that encapsulates three key properties:

| Property | Type | Description |
| --- | --- | --- |
| lexeme | string | The actual text matched from the input |
| position | int | Byte offset in the input string where the token starts |
| type | int | Token type constant (Tokenizer::TOKEN_*) |

The position property uses byte offsets rather than character offsets, which is important for handling multi-byte UTF-8 characters correctly. The type property corresponds to one of 11 predefined constants defined in the Tokenizer class.
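To make the byte-versus-character distinction concrete, here is a small Python sketch. It assumes, as the paragraph above states, that positions are UTF-8 byte offsets. The `Token` dataclass, the `byte_offset` helper, and the `TERM` value are hypothetical stand-ins, not the library's PHP classes or constants.

```python
from dataclasses import dataclass

TERM = 1  # placeholder type code for this sketch, not the library's constant value


@dataclass(frozen=True)  # frozen=True makes instances immutable, like a value object
class Token:
    lexeme: str
    position: int
    type: int


def byte_offset(source: str, char_index: int) -> int:
    """Convert a character index into the matching UTF-8 byte offset."""
    return len(source[:char_index].encode("utf-8"))


source = "žuti čaj"
char_index = source.index("čaj")  # character index of the second word
token = Token(lexeme="čaj", position=byte_offset(source, char_index), type=TERM)

print(char_index, token.position)
```

The two numbers differ because "ž" occupies two bytes in UTF-8, so every token after it starts one byte later than its character index suggests.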

Token Type Categories


Term tokens can carry additional semantic information. For example, Word and Phrase tokens may have an optional domain property (e.g., title:word), while GroupBegin tokens can also have a domain prefix (e.g., description:().

Sources: lib/Languages/Galach/README.md:258-387, lib/Languages/Galach/SYNTAX.md:1-66


TokenExtractor Architecture

The TokenExtractor abstract class is the primary extension point for customizing the lexical rules of the language. It defines which tokens are recognized and how they are extracted through regular expression patterns.

TokenExtractor Interface


Implementation Requirements:

  1. getExpressionTypeMap(): Returns an associative array mapping regular expressions to Tokenizer::TOKEN_* constants. The order of patterns matters: they are tried in sequence at the current position, and the first one that matches wins.

  2. createTermToken(): Factory method that receives matched data and constructs a concrete Token instance. This method receives data from named capture groups in the regex, allowing extraction of semantic information (e.g., domain prefixes).

  3. createGroupBeginToken() (optional): Similar factory for GROUP_BEGIN tokens, useful when groups can have domain prefixes or other custom properties.

The regex patterns use named capture groups to extract structured data from the matched text. For example, a pattern for domain-prefixed words might capture both the domain and the word separately.
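The two ideas above, ordered patterns and named capture groups, can be illustrated with a short Python sketch. The patterns, the type codes, and the `extract` function are assumptions invented for the example; they are not the expressions used by the library's extractors.

```python
import re

# Hypothetical type codes for this sketch.
LOGICAL_AND, TERM = "LOGICAL_AND", "TERM"

# Python dicts keep insertion order, so this map doubles as a priority list.
# The operator must come before the generic word pattern, which would also match "AND".
EXPRESSION_TYPE_MAP = {
    r"(?P<lexeme>AND)(?=\s|$)": LOGICAL_AND,
    r"(?P<lexeme>(?:(?P<domain>[a-zA-Z_]+):)?(?P<word>[^\s:]+))": TERM,
}


def extract(source: str, position: int):
    """Return (type, named groups) for the first pattern matching at position."""
    for expression, token_type in EXPRESSION_TYPE_MAP.items():
        match = re.compile(expression).match(source, position)
        if match:
            return token_type, match.groupdict()
    return None


print(extract("AND", 0))
print(extract("title:word", 0))
print(extract("word", 0))
```

For `title:word` the named groups separate the domain from the word, which is the kind of structured data a factory method such as createTermToken() can consume. When the optional domain prefix is absent, the `domain` group is simply `None`.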

Sources: lib/Languages/Galach/README.md:241-253, lib/Languages/Galach/README.md:392-426


Provided Implementations

Two TokenExtractor implementations are provided out of the box, serving both as production-ready components and as reference implementations for custom extensions.

Implementation Comparison

| Feature | Full TokenExtractor | Text TokenExtractor |
| --- | --- | --- |
| File Path | lib/Languages/Galach/TokenExtractor/Full.php | lib/Languages/Galach/TokenExtractor/Text.php |
| Word Terms | ✓ With domain prefix | ✓ Without domain prefix |
| Phrase Terms | ✓ With domain prefix | ✓ Without domain prefix |
| Tag Terms | #tag syntax | ✗ Not supported |
| User Terms | @user syntax | ✗ Not supported |
| Domain Prefixes | domain:term | ✗ Not supported |
| Logical Operators | ✓ All forms | ✓ All forms |
| Unary Operators | + - ! | + - ! |
| Grouping | ✓ With domain prefix | ✓ Without domain prefix |
| Whitespace | | |
| Use Case | Complete Galach syntax | Simple text search |

Example Usage


The Full implementation recognizes all 11 token types and supports the complete Galach syntax including domain prefixes, tag terms (#tag), and user terms (@user). The Text implementation provides a simpler subset focused on text search with words, phrases, basic operators, and grouping, but without special syntax like tags, users, or domain prefixes.

Sources: lib/Languages/Galach/README.md:422-426, lib/Languages/Galach/README.md:66-72


Customization Through Extension

Developers can customize tokenization behavior by extending the TokenExtractor abstract class. This allows modification of:

  1. Special characters and sequences: Change operators (AND, &&, OR, ||, NOT, !, +, -), delimiters ((, ), "), markers (@, #), or domain separator (:)

  2. Language subset selection: Choose which token types to recognize by omitting their regex patterns from getExpressionTypeMap()

  3. Custom term tokens: Implement custom Token subtypes with additional properties by returning them from createTermToken()

The Tokenizer class itself is marked final and is not intended for extension. All customization happens through the TokenExtractor abstract class. The downstream parser only inspects token types (via the type property), so custom token implementations pass through it transparently.

Extension Example Structure


When implementing custom term tokens, use named capture groups in regular expressions to extract structured data. Pass this data to the token constructor in createTermToken(). The token's properties will be available for custom processing in generators, while the parser treats all term tokens uniformly based on their type property.
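The extension mechanism can be sketched as follows. This is a Python analogy under stated assumptions, not the library's PHP API: the method names mirror the ones described above in snake case, while `RangeToken`, `RangeExtractor`, the `10..20` range syntax, and the string type codes are all hypothetical.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical type codes for this sketch.
WHITESPACE, TERM, BAILOUT = "WHITESPACE", "TERM", "BAILOUT"


@dataclass(frozen=True)
class Token:
    type: str
    lexeme: str
    position: int


@dataclass(frozen=True)
class RangeToken(Token):
    # Custom properties carried alongside the standard ones.
    low: int
    high: int


class TokenExtractor(ABC):
    @abstractmethod
    def expression_type_map(self) -> dict: ...

    @abstractmethod
    def create_term_token(self, position: int, data: dict) -> Token: ...

    def extract(self, source: str, position: int) -> Token:
        for expression, token_type in self.expression_type_map().items():
            match = re.compile(expression).match(source, position)
            if match:
                if token_type == TERM:
                    return self.create_term_token(position, match.groupdict())
                return Token(token_type, match.group("lexeme"), position)
        return Token(BAILOUT, source[position], position)


class RangeExtractor(TokenExtractor):
    """Recognizes only whitespace and numeric ranges; everything else bails out."""

    def expression_type_map(self) -> dict:
        return {
            r"(?P<lexeme>\s+)": WHITESPACE,
            r"(?P<lexeme>(?P<low>\d+)\.\.(?P<high>\d+))": TERM,
        }

    def create_term_token(self, position: int, data: dict) -> Token:
        return RangeToken(TERM, data["lexeme"], position, int(data["low"]), int(data["high"]))


extractor = RangeExtractor()
print(extractor.extract("10..20 #x", 0))
print(extractor.extract("10..20 #x", 7))
```

Because this extractor's map contains no tag pattern, the `#` at index 7 yields a BAILOUT token, which is how omitting a pattern selects a language subset. The custom token still reports the ordinary `TERM` type, so a consumer that looks only at `type` needs no changes.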

Sources: lib/Languages/Galach/README.md:241-253, lib/Languages/Galach/README.md:392-420, lib/Languages/Galach/README.md:428-435


Integration with Parsing

The output of tokenization, a TokenSequence instance, serves as the input to the parser. The parser operates on token types only, ignoring custom token properties during syntax analysis. This design provides a clean separation between the lexical analysis and syntax analysis phases.

Pipeline Integration


During parsing, only the type property of each token is examined to drive the shift-reduce algorithm. However, the resulting SyntaxTree nodes maintain references to the original Token objects with all their properties intact. This allows generators to access custom token properties when producing output, enabling semantic information extracted during tokenization to flow through to code generation.

Key behaviors:

  • WHITESPACE tokens are ignored by the parser (not processed, not included in the syntax tree)
  • BAILOUT tokens trigger correction type Parser::CORRECTION_BAILOUT_TOKEN_IGNORED
  • Term tokens are wrapped in Term nodes that reference the original token
  • Group delimiter tokens are referenced by Group nodes
  • Operator tokens are consumed to create operator nodes (e.g., LogicalAnd, Mandatory)
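The first two behaviors in the list can be isolated in a short Python sketch. It is only an illustration: the real parser is a shift-reduce parser that applies these rules while it consumes tokens, whereas the hypothetical `preprocess` function below handles them in a separate pass. The `CORRECTION_BAILOUT_TOKEN_IGNORED` string is a local stand-in named after the constant mentioned above.

```python
from typing import NamedTuple

# Hypothetical type codes for this sketch.
WHITESPACE, TERM, BAILOUT = "WHITESPACE", "TERM", "BAILOUT"

# Local stand-in for Parser::CORRECTION_BAILOUT_TOKEN_IGNORED.
CORRECTION_BAILOUT_TOKEN_IGNORED = "bailout token ignored"


class Token(NamedTuple):
    type: str
    lexeme: str
    position: int


def preprocess(tokens):
    """Drop whitespace silently; drop bailout tokens and record a correction for each."""
    kept, corrections = [], []
    for token in tokens:
        if token.type == WHITESPACE:
            continue
        if token.type == BAILOUT:
            corrections.append((CORRECTION_BAILOUT_TOKEN_IGNORED, token))
            continue
        kept.append(token)
    return kept, corrections


tokens = [
    Token(TERM, "one", 0),
    Token(WHITESPACE, " ", 3),
    Token(BAILOUT, '"', 4),
    Token(TERM, "two", 5),
]
kept, corrections = preprocess(tokens)
print([t.lexeme for t in kept])
print(corrections)
```

Only the `type` field drives the decisions here. The tokens that are kept remain intact, so any extra properties they carry stay available to later stages.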

Sources: lib/Languages/Galach/README.md:41-49, lib/Languages/Galach/README.md:428-435