Last indexed: 13 February 2026 (50f4d4)

Query Processing Pipeline

This document describes the complete query processing pipeline that transforms an input query string into backend-specific output. The pipeline follows a classic compiler architecture with three distinct phases: tokenization (lexical analysis), parsing (syntax analysis), and generation (code generation). Each phase produces an intermediate representation that serves as input to the next phase, ensuring clean separation of concerns.

For details on individual token types, see Token Types. For tokenization internals, see Tokenization Process. For parser algorithms, see Parser. For generation details, see Query Generation.

Sources: lib/Languages/Galach/README.md:32-50, lib/Languages/Galach/Parser.php:1-644

Pipeline Overview

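The query translation system processes a query in three sequential phases. Each phase hands an intermediate representation to the next: the Tokenizer turns the query string into a TokenSequence, the Parser turns the TokenSequence into a SyntaxTree, and a generator turns the SyntaxTree into the output string.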

Key Data Structures:

| Structure | Type | Purpose | Produced By |
|---|---|---|---|
| Query String | `string` | Raw user input | External input |
| `TokenSequence` | Value object | Array of tokens + source string | `Tokenizer` |
| `SyntaxTree` | Value object | Root node + corrections + token sequence | `Parser` |
| Output String | `string` | Backend-specific query | Generator |

Sources: lib/Languages/Galach/README.md:42-50, lib/Values/TokenSequence.php:1-36, tests/Galach/IntegrationTest.php:28-34

Phase 1: Tokenization

The tokenization phase converts the input query string into a TokenSequence, which contains an array of Token objects and preserves the original source string.

The Tokenizer uses a TokenExtractor to define lexical rules via regular expressions. Two implementations are provided:

- Full TokenExtractor: Supports the complete Galach syntax, including tags, users, and domain prefixes
- Text TokenExtractor: Supports a simplified subset for basic text search

The tokenization process is fault-tolerant: when no token can be extracted at the current position, a single character is read as a TOKEN_BAILOUT token. This ensures tokenization always completes successfully.

Sources: lib/Languages/Galach/README.md:66-73, lib/Values/TokenSequence.php:6-35, lib/Languages/Galach/README.md:121-125
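The bailout behavior described above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the library's Tokenizer: the rule table, the type names, the array-based token shape, and the `tokenize()` function are all hypothetical, and the input is assumed to be single-byte text.

```php
<?php
// Simplified model of regex-driven tokenization with a bailout fallback.
// RULES, the type names and the token shape are illustrative assumptions,
// not the lexical rules of the Full or Text TokenExtractor.
const RULES = [
    'WHITESPACE'  => '/\G\s+/',
    'LOGICAL_AND' => '/\GAND\b/',
    'LOGICAL_OR'  => '/\GOR\b/',
    'WORD'        => '/\G[A-Za-z0-9_]+/',
];

function tokenize(string $source): array
{
    $tokens = [];
    $position = 0;
    $length = strlen($source);

    while ($position < $length) {
        $token = null;
        foreach (RULES as $type => $pattern) {
            // \G anchors the match at the offset passed as the fifth argument.
            if (preg_match($pattern, $source, $match, 0, $position) === 1) {
                $token = ['type' => $type, 'lexeme' => $match[0], 'position' => $position];
                break;
            }
        }
        if ($token === null) {
            // No rule matched: consume exactly one character as a bailout token,
            // so the loop always advances and tokenization always completes.
            $token = ['type' => 'BAILOUT', 'lexeme' => $source[$position], 'position' => $position];
        }
        $tokens[] = $token;
        $position += strlen($token['lexeme']);
    }

    return $tokens;
}
```

Because every iteration consumes at least one character, concatenating the lexemes of the returned tokens reproduces the input exactly, which is what lets later phases map tokens back to positions in the source string.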

Phase 2: Parsing

The parsing phase constructs a SyntaxTree from the TokenSequence. The parser implements a shift-reduce algorithm that processes tokens sequentially, building a hierarchical tree structure representing the query's logical structure.

The two operations work as follows:

- Shift: Read the next token and either push it onto the stack or convert it to a Node
- Reduce: Combine stack elements into higher-level Node structures according to grammar rules

The output SyntaxTree contains three components:

- root: The top-level Query node containing the parse tree
- tokenSequence: The original TokenSequence (preserved for reference)
- corrections: An array of Correction objects documenting any syntax errors that were corrected

Sources: lib/Languages/Galach/Parser.php:159-174, lib/Languages/Galach/Parser.php:176-207, lib/Languages/Galach/README.md:75-84
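To make the shift and reduce steps concrete, here is a deliberately tiny model that understands only words and a binary AND. It is a sketch, not the library's Parser: the `tok()` and `parse()` helpers, the array-based nodes, and the correction labels are hypothetical, and the real grammar (groups, unary operators, precedence) is not modeled.

```php
<?php
// Simplified shift-reduce loop over WORD and AND tokens only.
// Token shape, node shape and correction labels are hypothetical.
function tok(string $type, string $lexeme): array
{
    return ['type' => $type, 'lexeme' => $lexeme];
}

function parse(array $tokens): array
{
    $stack = [];
    $corrections = [];

    foreach ($tokens as $token) {
        if ($token['type'] === 'AND') {
            $top = end($stack);
            if ($top === false || $top['kind'] !== 'node') {
                // Nothing usable to the left: ignore the operator and record why.
                $corrections[] = ['type' => 'missing_left_operand', 'tokens' => [$token]];
                continue;
            }
            // Shift: push the operator onto the stack.
            $stack[] = ['kind' => 'operator', 'token' => $token];
            continue;
        }

        // Shift: convert the WORD token into a term node and push it.
        $stack[] = ['kind' => 'node', 'node' => ['term' => $token['lexeme']]];

        // Reduce: node, operator, node on top of the stack become one AND node.
        $count = count($stack);
        if ($count >= 3 && $stack[$count - 2]['kind'] === 'operator') {
            $right = array_pop($stack)['node'];
            array_pop($stack); // discard the operator entry
            $left = array_pop($stack)['node'];
            $stack[] = ['kind' => 'node', 'node' => ['and' => [$left, $right]]];
        }
    }

    // A trailing operator has no right operand: drop it and record a correction.
    $top = end($stack);
    if ($top !== false && $top['kind'] === 'operator') {
        $corrections[] = ['type' => 'missing_right_operand', 'tokens' => [array_pop($stack)['token']]];
    }

    return [
        'root' => ['query' => array_column($stack, 'node')],
        'tokens' => $tokens,
        'corrections' => $corrections,
    ];
}
```

The returned array mirrors the three SyntaxTree components listed above: a root, the original tokens, and the corrections. Malformed input such as a dangling `AND` still yields a usable tree.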

Phase 3: Generation

The generation phase traverses the SyntaxTree to produce backend-specific output. Three generators are provided out of the box.

All generators use the Visitor pattern to traverse the tree. Each Node type has corresponding visitor implementations that handle conversion to the target format. Generators can reuse common visitor components while providing backend-specific escaping and field mapping.

Sources: lib/Languages/Galach/README.md:86-111, lib/Languages/Galach/README.md:437-473
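The Visitor-based traversal can be sketched with a few stand-in classes. Everything below is hypothetical and exists only for illustration: the node classes (`Term`, `LogicalAnd`, `Query`), the `Visitor` interface, the `TreeGenerator` dispatcher, and the colon-escaping rule are assumptions, not the library's generator classes.

```php
<?php
// Stand-in node types; the library's Node classes are richer than this.
final class Term
{
    public $word;
    public function __construct(string $word) { $this->word = $word; }
}

final class LogicalAnd
{
    public $left;
    public $right;
    public function __construct(object $left, object $right)
    {
        $this->left = $left;
        $this->right = $right;
    }
}

final class Query
{
    public $nodes;
    public function __construct(array $nodes) { $this->nodes = $nodes; }
}

// Hypothetical visitor contract: each visitor handles one node type.
interface Visitor
{
    public function accept(object $node): bool;
    public function visit(object $node, callable $descend): string;
}

final class TermVisitor implements Visitor
{
    public function accept(object $node): bool { return $node instanceof Term; }
    public function visit(object $node, callable $descend): string
    {
        // Backend-specific escaping lives here; escaping ':' is only an example.
        return str_replace(':', '\\:', $node->word);
    }
}

final class AndVisitor implements Visitor
{
    public function accept(object $node): bool { return $node instanceof LogicalAnd; }
    public function visit(object $node, callable $descend): string
    {
        return $descend($node->left) . ' AND ' . $descend($node->right);
    }
}

final class QueryVisitor implements Visitor
{
    public function accept(object $node): bool { return $node instanceof Query; }
    public function visit(object $node, callable $descend): string
    {
        return implode(' ', array_map($descend, $node->nodes));
    }
}

// Dispatcher: finds the first visitor that accepts a node and delegates to it.
final class TreeGenerator
{
    private $visitors;
    public function __construct(array $visitors) { $this->visitors = $visitors; }

    public function generate(object $node): string
    {
        foreach ($this->visitors as $visitor) {
            if ($visitor->accept($node)) {
                return $visitor->visit($node, [$this, 'generate']);
            }
        }
        throw new LogicException('No visitor accepts ' . get_class($node));
    }
}
```

In this arrangement, targeting a different backend means swapping only the visitors whose output differs (here, `TermVisitor` for escaping) while the structural visitors are reused, which is the kind of reuse the paragraph above describes.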

Complete Flow Example

The table below traces the query string 'one AND two' through the complete pipeline and shows the data at each stage:

| Stage | Type | Content Example |
|---|---|---|
| Input | `string` | `'one AND two'` |
| After Tokenization | `TokenSequence` | `Token[WordToken('one'), Token(LOGICAL_AND), WordToken('two')]` |
| After Parsing | `SyntaxTree` | `Query[LogicalAnd(Term(WordToken('one')), Term(WordToken('two')))]` |
| After Generation | `string` | `'one AND two'` (or backend-specific format) |

Sources: lib/Languages/Galach/README.md:53-112, tests/Galach/IntegrationTest.php:69-85

Error Handling Philosophy

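The pipeline implements a "no input is invalid" philosophy: every query string, regardless of syntax errors, produces valid output. The tokenizer never fails, because unrecognized characters become bailout tokens, and the parser repairs malformed token sequences while recording each repair as a Correction.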

Correction Types:

The parser defines 10 correction type constants that document how malformed input is handled:

| Constant | Example Input | Corrected Result |
|---|---|---|
| `CORRECTION_ADJACENT_UNARY_OPERATOR_PRECEDING_OPERATOR_IGNORED` | `++one` | `+one` |
| `CORRECTION_UNARY_OPERATOR_MISSING_OPERAND_IGNORED` | `one NOT` | `one` |
| `CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED` | `AND two` | `two` |
| `CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED` | `one AND` | `one` |
| `CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNORED` | `one AND OR two` | `one two` |
| `CORRECTION_LOGICAL_NOT_OPERATORS_PRECEDING_PREFERENCE_IGNORED` | `NOT +one` | `+one` |
| `CORRECTION_EMPTY_GROUP_IGNORED` | `one AND ()` | `one` |
| `CORRECTION_UNMATCHED_GROUP_LEFT_DELIMITER_IGNORED` | `one ( AND two` | `one AND two` |
| `CORRECTION_UNMATCHED_GROUP_RIGHT_DELIMITER_IGNORED` | `one AND ) two` | `one AND two` |
| `CORRECTION_BAILOUT_TOKEN_IGNORED` | `one " two` | `one two` |

Each Correction object contains:

- type: The correction type constant
- tokens: Array of tokens that were affected by the correction

This information enables UI features like syntax highlighting, error indicators, and user feedback without blocking query execution.

Sources: lib/Languages/Galach/Parser.php:26-76, lib/Languages/Galach/README.md:114-240, lib/Values/Correction.php:1-38
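As an illustration of the UI use case, the sketch below underlines the parts of a query that a correction touched. It assumes each affected token exposes its text and its offset in the source string; the array shape with `lexeme` and `position` keys, the correction label, and the `highlight()` function are hypothetical and are not taken from the library.

```php
<?php
// Marks every character covered by a corrected token with '^'.
// The correction and token arrays are hypothetical stand-ins; single-byte
// input is assumed so that offsets and string indexes line up.
function highlight(string $source, array $corrections): string
{
    $marks = str_repeat(' ', strlen($source));

    foreach ($corrections as $correction) {
        foreach ($correction['tokens'] as $token) {
            $end = $token['position'] + strlen($token['lexeme']);
            for ($i = $token['position']; $i < $end; $i++) {
                $marks[$i] = '^';
            }
        }
    }

    return $source . "\n" . $marks;
}

// Example: the trailing AND in 'one AND' was ignored by the parser.
$corrections = [
    [
        'type' => 'binary_operator_missing_right_operand_ignored',
        'tokens' => [['lexeme' => 'AND', 'position' => 4]],
    ],
];

echo highlight('one AND', $corrections), "\n";
```

The query itself is still executed as `one`; the marker line is purely feedback for the user.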

Data Flow and State Management

The pipeline maintains clear separation between phases by passing immutable value objects from one phase to the next.

Key Characteristics:

- Immutability: TokenSequence, Token, Node, and SyntaxTree are value objects that don't change after creation
- Preservation: SyntaxTree maintains a reference to the original TokenSequence, allowing generators to access token details
- Error Recovery: Correction objects are collected during parsing but don't prevent SyntaxTree creation
- Single Direction: Data flows one way through the pipeline; no phase hands its results back to an earlier phase

Internal Parser State:

The parser maintains mutable state during parsing:

| State Component | Type | Purpose |
|---|---|---|
| `$tokens` | `Token[]` | Input tokens being processed (modified during parsing) |
| `$stack` | `SplStack` | Stack for shift-reduce operations |
| `$corrections` | `Correction[]` | Accumulated corrections |

These are initialized in Parser::init() (lib/Languages/Galach/Parser.php:453-459) and accessed through Parser::shift() (lib/Languages/Galach/Parser.php:176-182) and Parser::reduce() (lib/Languages/Galach/Parser.php:184-207).

Sources: lib/Values/TokenSequence.php:1-36, lib/Languages/Galach/Parser.php:138-174, lib/Values/Correction.php:1-38
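The value-object relationship described above can be modeled in a few lines. This sketch assumes PHP 8.1 or later and uses `readonly` properties to enforce immutability; the library itself may rely on convention rather than language-level enforcement, and the property names here are chosen for illustration only.

```php
<?php
// Hypothetical stand-ins for the pipeline's value objects (PHP 8.1+).
final class TokenSequence
{
    public function __construct(
        public readonly array $tokens,
        public readonly string $source,
    ) {
    }
}

final class SyntaxTree
{
    public function __construct(
        public readonly object $rootNode,
        public readonly TokenSequence $tokenSequence,
        public readonly array $corrections,
    ) {
    }
}

$sequence = new TokenSequence(['one', 'AND', 'two'], 'one AND two');
$tree = new SyntaxTree(new stdClass(), $sequence, []);

// A generator can reach the original source string through the tree.
echo $tree->tokenSequence->source, "\n";

// Reassigning a readonly property from outside the class throws an Error.
try {
    $sequence->source = 'changed';
} catch (Error $error) {
    echo 'Rejected: ', $error->getMessage(), "\n";
}
```

Note that `readonly` only prevents reassigning the property; objects stored inside a readonly array or property can still be mutated unless they are immutable themselves.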

Pipeline Extensibility

The pipeline provides extension points in the tokenization and generation phases, while the parser is fixed and the overall flow stays the same:

Phase 1 Customization: Extend the TokenExtractor abstract class to change how tokens are recognized. The Tokenizer remains unchanged.

Phase 2 Customization: The Parser class is marked final and not intended for extension. It processes any Token type based only on the token's type field, making it agnostic to custom token implementations.

Phase 3 Customization: Implement custom generators to produce new output formats. Use the Visitor pattern to traverse the SyntaxTree.

Sources: lib/Languages/Galach/README.md:241-427, lib/Languages/Galach/README.md:428-473


URL: https://deepwiki.com/netgen/query-translator/2.2-query-processing-pipeline