URL: https://deepwiki.com/netgen/query-translator/2.2-query-processing-pipeline

Query Processing Pipeline

This document describes the complete query processing pipeline that transforms an input query string into backend-specific output. The pipeline follows a classic compiler architecture with three distinct phases: tokenization (lexical analysis), parsing (syntax analysis), and generation (code generation). Each phase produces an intermediate representation that serves as input to the next phase, ensuring clean separation of concerns.

For details on individual token types, see Token Types. For tokenization internals, see Tokenization Process. For parser algorithms, see Parser. For generation details, see Query Generation.

Sources: lib/Languages/Galach/README.md:32-50, lib/Languages/Galach/Parser.php:1-644

Pipeline Overview

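The query translation system processes a query through three sequential phases, each handing an intermediate representation to the next: the Tokenizer turns the query string into a TokenSequence, the Parser turns the TokenSequence into a SyntaxTree, and a Generator turns the SyntaxTree into the output string.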


Key Data Structures:

| Structure | Type | Purpose | Produced By |
|---|---|---|---|
| Query String | string | Raw user input | External input |
| TokenSequence | Value object | Array of tokens + source string | Tokenizer |
| SyntaxTree | Value object | Root node + corrections + token sequence | Parser |
| Output String | string | Backend-specific query | Generator |

Sources: lib/Languages/Galach/README.md:42-50, lib/Values/TokenSequence.php:1-36, tests/Galach/IntegrationTest.php:28-34

Phase 1: Tokenization

The tokenization phase converts the input query string into a TokenSequence, which contains an array of Token objects and preserves the original source string.


The Tokenizer uses a TokenExtractor to define lexical rules via regular expressions. Two implementations are provided:

  • Full TokenExtractor: Supports complete Galach syntax including tags, users, and domain prefixes
  • Text TokenExtractor: Supports a simplified subset for basic text search

The tokenization process is fault-tolerant: when no token can be extracted at the current position, a single character is read as a TOKEN_BAILOUT token. This ensures tokenization always completes successfully.
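The bailout rule can be illustrated with a minimal, self-contained sketch. It is not the library's Tokenizer: the constant names, the three regular expressions, and the `[type, lexeme, position]` token shape are all assumptions made for illustration. The point is only that a one-character fallback guarantees the loop advances on every iteration, so any input tokenizes to completion.

```php
<?php
// Hypothetical token type names; the real library defines its own constants.
const SKETCH_WHITESPACE = 'WHITESPACE';
const SKETCH_LOGICAL_AND = 'LOGICAL_AND';
const SKETCH_WORD = 'WORD';
const SKETCH_BAILOUT = 'BAILOUT';

/**
 * Split $input into [type, lexeme, position] triples.
 * Rules are tried in order at the current position; if none matches,
 * one character (byte) is consumed as a bailout token.
 */
function sketchTokenize(string $input): array
{
    // Order-sensitive rules, anchored at the current offset with \G
    // (assumed for illustration only).
    $rules = [
        '/\G\s+/'           => SKETCH_WHITESPACE,
        '/\GAND\b/'         => SKETCH_LOGICAL_AND,
        '/\G[A-Za-z0-9_]+/' => SKETCH_WORD,
    ];
    $tokens = [];
    $position = 0;
    $length = strlen($input);

    while ($position < $length) {
        // Fallback: a single character becomes a bailout token.
        $type = SKETCH_BAILOUT;
        $lexeme = $input[$position];

        foreach ($rules as $pattern => $ruleType) {
            if (preg_match($pattern, $input, $matches, 0, $position) === 1) {
                $type = $ruleType;
                $lexeme = $matches[0];
                break;
            }
        }

        $tokens[] = [$type, $lexeme, $position];
        $position += strlen($lexeme); // always advances by at least one byte
    }

    return $tokens;
}
```

With these rules, the stray quote in `one " two` matches nothing and is emitted as a single bailout token, while the words and whitespace around it tokenize normally.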

Sources: lib/Languages/Galach/README.md:66-73, lib/Values/TokenSequence.php:6-35, lib/Languages/Galach/README.md:121-125

Phase 2: Parsing

The parsing phase constructs a SyntaxTree from the TokenSequence. The parser implements a shift-reduce algorithm that processes tokens sequentially, building a hierarchical tree structure representing the query's logical structure.


The parser processes tokens through shift and reduce operations:

  • Shift: Read the next token and either push it onto the stack or convert it to a Node
  • Reduce: Combine stack elements into higher-level Node structures according to grammar rules
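The shift and reduce steps can be sketched on a toy grammar. The example below is an illustration under stated assumptions, not the library's Parser: tokens are plain strings, the only operator is `AND`, nodes are plain arrays, and dangling operators are silently dropped rather than recorded as corrections.

```php
<?php
/**
 * Toy shift-reduce parser for queries made of words and the AND operator.
 * Node shapes (assumed): ['term', word], ['and', left, right], ['query', children].
 */
function sketchParse(array $tokens): array
{
    $stack = [];

    foreach ($tokens as $token) {
        // Shift: operators go onto the stack as-is, words become term nodes.
        $stack[] = ($token === 'AND') ? 'AND' : ['term', $token];

        // Reduce: collapse "node AND node" on top of the stack into one node.
        while (count($stack) >= 3) {
            [$left, $operator, $right] = array_slice($stack, -3);
            if (!is_array($left) || $operator !== 'AND' || !is_array($right)) {
                break;
            }
            array_splice($stack, -3, 3, [['and', $left, $right]]);
        }
    }

    // Remaining nodes become children of the root; leftover operators are dropped.
    $children = array_values(array_filter($stack, 'is_array'));

    return ['query', $children];
}
```

Because the reduction runs immediately after each shift, `one AND two AND three` nests to the left: the first two terms are combined before the second `AND` is shifted.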

The output SyntaxTree contains three components:

  1. root: The top-level Query node containing the parse tree
  2. tokenSequence: The original TokenSequence (preserved for reference)
  3. corrections: An array of Correction objects documenting any syntax errors that were corrected

Sources: lib/Languages/Galach/Parser.php:159-174, lib/Languages/Galach/Parser.php:176-207, lib/Languages/Galach/README.md:75-84

Phase 3: Generation

The generation phase traverses the SyntaxTree to produce backend-specific output. Three generators are provided out of the box; see Query Generation for details.


All generators use the Visitor pattern to traverse the tree. Each Node type has corresponding visitor implementations that handle conversion to the target format. Generators can reuse common visitor components while providing backend-specific escaping and field mapping.
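A rough sketch of this visitor arrangement follows. The interface, class names, and array-based node shapes are hypothetical and chosen for brevity; they do not reproduce the library's visitor signatures. Each visitor declares which node type it accepts, and the generator dispatches every node to the first visitor that accepts it.

```php
<?php
// Node shapes assumed for this sketch:
// ['term', word], ['and', left, right], ['query', children].

interface SketchNodeVisitor
{
    public function accept(array $node): bool;

    // $visitChild renders a child node by dispatching back through the generator.
    public function visit(array $node, callable $visitChild): string;
}

final class SketchTermVisitor implements SketchNodeVisitor
{
    public function accept(array $node): bool { return $node[0] === 'term'; }

    public function visit(array $node, callable $visitChild): string
    {
        // Backend-specific escaping belongs here; this sketch escapes colons only.
        return str_replace(':', '\\:', $node[1]);
    }
}

final class SketchAndVisitor implements SketchNodeVisitor
{
    public function accept(array $node): bool { return $node[0] === 'and'; }

    public function visit(array $node, callable $visitChild): string
    {
        return $visitChild($node[1]) . ' AND ' . $visitChild($node[2]);
    }
}

final class SketchQueryVisitor implements SketchNodeVisitor
{
    public function accept(array $node): bool { return $node[0] === 'query'; }

    public function visit(array $node, callable $visitChild): string
    {
        return implode(' ', array_map($visitChild, $node[1]));
    }
}

final class SketchGenerator
{
    private $visitors;

    public function __construct(array $visitors) { $this->visitors = $visitors; }

    public function generate(array $node): string
    {
        foreach ($this->visitors as $visitor) {
            if ($visitor->accept($node)) {
                return $visitor->visit($node, [$this, 'generate']);
            }
        }
        throw new LogicException('No visitor accepts node type ' . $node[0]);
    }
}
```

In this arrangement, targeting a different backend means swapping in a different term visitor while the structural visitors are reused unchanged.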

Sources: lib/Languages/Galach/README.md:86-111, lib/Languages/Galach/README.md:437-473

Complete Flow Example

A complete run passes the query string through the Tokenizer, the Parser, and a Generator, in that order.


Data at each stage:

| Stage | Type | Content Example |
|---|---|---|
| Input | string | 'one AND two' |
| After Tokenization | TokenSequence | Token[WordToken('one'), Token(LOGICAL_AND), WordToken('two')] |
| After Parsing | SyntaxTree | Query[LogicalAnd(Term(WordToken('one')), Term(WordToken('two')))] |
| After Generation | string | 'one AND two' (or backend-specific format) |
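The stages in the table can be traced with a compact stand-in pipeline. None of the functions below belong to the library; they are deliberately tiny substitutes that handle only whitespace-separated words and the `AND` keyword, and the array keys (`tokens`, `source`, `root`, `tokenSequence`, `corrections`) are assumptions that mirror the structures described in this document.

```php
<?php
function flowTokenize(string $query): array
{
    $lexemes = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_map(function (string $lexeme): array {
        return ['type' => $lexeme === 'AND' ? 'LOGICAL_AND' : 'WORD', 'lexeme' => $lexeme];
    }, $lexemes);

    // Token sequence: the tokens plus the original source string.
    return ['tokens' => $tokens, 'source' => $query];
}

function flowParse(array $tokenSequence): array
{
    // Fold "word AND word" into a left-nested tree; bare words stay siblings.
    $children = [];
    $pendingAnd = false;
    foreach ($tokenSequence['tokens'] as $token) {
        if ($token['type'] === 'LOGICAL_AND') {
            $pendingAnd = count($children) > 0; // an AND with no left operand is ignored
            continue;
        }
        $term = ['term', $token['lexeme']];
        if ($pendingAnd) {
            $children[] = ['and', array_pop($children), $term];
            $pendingAnd = false;
        } else {
            $children[] = $term;
        }
    }

    // Syntax tree: root node, the token sequence it came from, and corrections
    // (this stand-in never records any).
    return ['root' => ['query', $children], 'tokenSequence' => $tokenSequence, 'corrections' => []];
}

function flowGenerate(array $syntaxTree): string
{
    $render = function (array $node) use (&$render): string {
        switch ($node[0]) {
            case 'term':
                return $node[1];
            case 'and':
                return $render($node[1]) . ' AND ' . $render($node[2]);
            default: // 'query'
                return implode(' ', array_map($render, $node[1]));
        }
    };

    return $render($syntaxTree['root']);
}

// Each phase consumes exactly what the previous phase produced.
$output = flowGenerate(flowParse(flowTokenize('one AND two')));
```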

Sources: lib/Languages/Galach/README.md:53-112, tests/Galach/IntegrationTest.php:69-85

Error Handling Philosophy

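The pipeline implements a "no input is invalid" philosophy: every query string, regardless of syntax errors, produces valid output. The tokenizer falls back to bailout tokens for text it cannot recognize, and the parser repairs malformed token sequences and records each repair as a Correction.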


Correction Types:

The parser defines 10 correction type constants that document how malformed input is handled:

| Constant | Example Input | Corrected Result |
|---|---|---|
| CORRECTION_ADJACENT_UNARY_OPERATOR_PRECEDING_OPERATOR_IGNORED | ++one | +one |
| CORRECTION_UNARY_OPERATOR_MISSING_OPERAND_IGNORED | one NOT | one |
| CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED | AND two | two |
| CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED | one AND | one |
| CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNORED | one AND OR two | one two |
| CORRECTION_LOGICAL_NOT_OPERATORS_PRECEDING_PREFERENCE_IGNORED | NOT +one | +one |
| CORRECTION_EMPTY_GROUP_IGNORED | one AND () | one |
| CORRECTION_UNMATCHED_GROUP_LEFT_DELIMITER_IGNORED | one ( AND two | one AND two |
| CORRECTION_UNMATCHED_GROUP_RIGHT_DELIMITER_IGNORED | one AND ) two | one AND two |
| CORRECTION_BAILOUT_TOKEN_IGNORED | one " two | one two |

Each Correction object contains:

  • type: The correction type constant
  • tokens: Array of tokens that were affected by the correction

This information enables UI features like syntax highlighting, error indicators, and user feedback without blocking query execution.
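As a hedged illustration of correcting and recording rather than rejecting, the sketch below handles just two of the cases from the table above: a binary operator with no left operand and one with no right operand. The function, the constant names, and the array shape standing in for a Correction object are all hypothetical.

```php
<?php
// Hypothetical correction labels, loosely modelled on the parser's constants.
const SKETCH_MISSING_LEFT = 'BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED';
const SKETCH_MISSING_RIGHT = 'BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED';

/**
 * Drop AND tokens that lack an operand at either end of the token list and
 * record one correction per dropped token.
 *
 * @return array [cleanedTokens, corrections]
 */
function sketchCorrect(array $tokens): array
{
    $corrections = [];

    // Leading operators have no left operand.
    while ($tokens !== [] && $tokens[0] === 'AND') {
        $corrections[] = ['type' => SKETCH_MISSING_LEFT, 'tokens' => [array_shift($tokens)]];
    }

    // Trailing operators have no right operand.
    while ($tokens !== [] && end($tokens) === 'AND') {
        $corrections[] = ['type' => SKETCH_MISSING_RIGHT, 'tokens' => [array_pop($tokens)]];
    }

    return [$tokens, $corrections];
}

// The query still yields usable tokens, and the caller learns what was ignored.
[$clean, $corrections] = sketchCorrect(['AND', 'two']);
```

A caller could use the `tokens` entry of each correction to underline the ignored operator in a search box while still running the cleaned query.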

Sources: lib/Languages/Galach/Parser.php:26-76, lib/Languages/Galach/README.md:114-240, lib/Values/Correction.php:1-38

Data Flow and State Management

The pipeline maintains clear separation between phases through immutable value objects.


Key Characteristics:

  1. Immutability: TokenSequence, Token, Node, and SyntaxTree are value objects that don't change after creation
  2. Preservation: SyntaxTree maintains a reference to the original TokenSequence, allowing generators to access token details
  3. Error Recovery: Correction objects are collected during parsing but don't prevent SyntaxTree creation
  4. Single Direction: Data flows in one direction through the pipeline - there's no backtracking or iteration
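The immutability and preservation characteristics can be sketched as follows. These classes are illustrative only: the class, property, and accessor names are assumptions, and the library's own value objects may expose their data differently.

```php
<?php
// Illustrative value objects: state is set once in the constructor and only read afterwards.
final class SketchTokenSequence
{
    private $tokens;
    private $source;

    public function __construct(array $tokens, string $source)
    {
        $this->tokens = $tokens;
        $this->source = $source;
    }

    public function tokens(): array { return $this->tokens; }
    public function source(): string { return $this->source; }
}

final class SketchSyntaxTree
{
    private $root;
    private $tokenSequence;
    private $corrections;

    public function __construct(array $root, SketchTokenSequence $tokenSequence, array $corrections)
    {
        $this->root = $root;
        $this->tokenSequence = $tokenSequence;
        $this->corrections = $corrections;
    }

    public function root(): array { return $this->root; }
    public function tokenSequence(): SketchTokenSequence { return $this->tokenSequence; }
    public function corrections(): array { return $this->corrections; }
}

$sequence = new SketchTokenSequence(['one', 'AND', 'two'], 'one AND two');
$tree = new SketchSyntaxTree(['query', []], $sequence, []);

// PHP arrays have value semantics, so changing a returned array
// leaves the sequence held by the tree untouched.
$copy = $sequence->tokens();
$copy[] = 'three';
```

Because the tree keeps a reference to the sequence it was built from, a later phase can still reach the original source string through `$tree->tokenSequence()->source()`.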

Internal Parser State:

The parser maintains mutable state during parsing:

| State Component | Type | Purpose |
|---|---|---|
| $tokens | Token[] | Input tokens being processed (modified during parsing) |
| $stack | SplStack | Stack for shift-reduce operations |
| $corrections | Correction[] | Accumulated corrections |

These are initialized in Parser::init() (lib/Languages/Galach/Parser.php:453-459) and accessed through Parser::shift() (lib/Languages/Galach/Parser.php:176-182) and Parser::reduce() (lib/Languages/Galach/Parser.php:184-207).

Sources: lib/Values/TokenSequence.php:1-36, lib/Languages/Galach/Parser.php:138-174, lib/Values/Correction.php:1-38

Pipeline Extensibility

The pipeline provides extension points at each phase while maintaining the overall flow:


Phase 1 Customization: Extend the abstract TokenExtractor class to change how tokens are recognized. The Tokenizer remains unchanged.

Phase 2 Customization: The Parser class is marked final and not intended for extension. It processes any Token type based only on the token's type field, making it agnostic to custom token implementations.

Phase 3 Customization: Implement custom generators to produce new output formats. Use the Visitor pattern to traverse the SyntaxTree.
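The Phase 1 extension point might look roughly like the sketch below. It assumes a base class whose subclasses supply an ordered map of regular expressions to token types; the class names, the `expressionTypeMap()` method, and the `[type, lexeme]` return shape are hypothetical and are not the library's actual TokenExtractor contract.

```php
<?php
abstract class SketchExtractor
{
    /** @return array<string, string> anchored regex => token type, tried in order */
    abstract protected function expressionTypeMap(): array;

    /**
     * Return [type, lexeme] for the token starting at $position.
     * The caller must ensure $position is inside the string.
     */
    final public function extract(string $input, int $position): array
    {
        foreach ($this->expressionTypeMap() as $pattern => $type) {
            if (preg_match($pattern, $input, $matches, 0, $position) === 1) {
                return [$type, $matches[0]];
            }
        }

        // Nothing matched: fall back to a one-character bailout token.
        return ['BAILOUT', $input[$position]];
    }
}

// Simplified rules: whitespace and words only.
class SketchTextExtractor extends SketchExtractor
{
    protected function expressionTypeMap(): array
    {
        return ['/\G\s+/' => 'WHITESPACE', '/\G\w+/' => 'WORD'];
    }
}

// Adds hashtag-style tags ahead of the text rules.
final class SketchTagExtractor extends SketchTextExtractor
{
    protected function expressionTypeMap(): array
    {
        // Array union keeps the left operand's entries first, so TAG is tried first.
        return ['/\G#\w+/' => 'TAG'] + parent::expressionTypeMap();
    }
}
```

Only the rule map changes between the two subclasses; the extraction loop, and any tokenizer driving it, stays the same, which is the separation this extension point is meant to provide.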

Sources: lib/Languages/Galach/README.md:241-427, lib/Languages/Galach/README.md:428-473