URL: https://deepwiki.com/netgen/query-translator/2-architecture

Architecture

Purpose and Scope

This document describes the overall system architecture of the Query Translator library, which follows a classic compiler design pattern. The architecture is organized into three distinct phases: lexical analysis, syntax analysis, and code generation. At the center of this design is the SyntaxTree, which serves as an intermediate representation enabling translation to multiple backend formats.

Compiler Architecture Pattern

The Query Translator follows a textbook compiler architecture with clear separation of concerns across three phases:


Phase 1: Lexical Analysis

  • Converts raw query string into a sequence of tokens
  • Implemented by Tokenizer class using TokenExtractor patterns
  • Produces TokenSequence containing Token objects

Phase 2: Syntax Analysis

  • Parses token sequence into hierarchical structure
  • Implemented by Parser class using shift-reduce algorithm
  • Produces SyntaxTree containing Node hierarchy
  • Applies corrections for invalid input

Phase 3: Code Generation

  • Traverses syntax tree to generate backend-specific output
  • Implemented by generator classes using visitor pattern
  • Produces query strings for Solr, Elasticsearch, or native format

Sources: README.md:10-15, lib/Languages/Galach/README.md:33-50

System Layers

Layer Overview


Sources: README.md:10-23, lib/Languages/Galach/README.md:33-50

Lexical Analysis Layer

The lexical analysis layer transforms raw input strings into structured token sequences.

| Component | Type | Role |
| --- | --- | --- |
| TokenExtractor | Abstract Class | Defines regex patterns for token recognition |
| TokenExtractor\Full | Concrete Implementation | Full Galach syntax support (tags, users, domains) |
| TokenExtractor\Text | Concrete Implementation | Simplified text-only syntax subset |
| Tokenizer | Final Class | Executes tokenization using extractor patterns |
| TokenSequence | Value Object | Holds extracted tokens and source string |
| Token | Value Object | Represents smallest syntactic unit |

The Tokenizer class (lib/Languages/Galach/Tokenizer.php) is marked final and relies on a TokenExtractor for customization. Token types are defined as bitmask constants:

  • TOKEN_TERM - Word, Phrase, Tag, User tokens
  • TOKEN_WHITESPACE - Whitespace characters
  • TOKEN_LOGICAL_AND, TOKEN_LOGICAL_OR, TOKEN_LOGICAL_NOT - Logical operators (AND and OR are binary, NOT is unary)
  • TOKEN_MANDATORY, TOKEN_PROHIBITED - Unary operators
  • TOKEN_GROUP_BEGIN, TOKEN_GROUP_END - Grouping delimiters
  • TOKEN_BAILOUT - Unrecognized character sequences

Sources: lib/Languages/Galach/Tokenizer.php, lib/Values/TokenSequence.php:1-36, lib/Languages/Galach/README.md:255-260
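
The idea of bitmask token types combined with pattern-driven tokenization can be illustrated with a small sketch. It is written in Python rather than the library's PHP, and every name, bit value, and regular expression in it is an assumption made for illustration; it does not reproduce the library's Tokenizer or TokenExtractor.

```python
import re
from dataclasses import dataclass
from enum import IntFlag


class TokenType(IntFlag):
    # Hypothetical bit values: one bit per type, so types can be OR-ed into masks.
    WHITESPACE = 1
    LOGICAL_AND = 2
    LOGICAL_OR = 4
    LOGICAL_NOT = 8
    MANDATORY = 16
    PROHIBITED = 32
    GROUP_BEGIN = 64
    GROUP_END = 128
    TERM = 256
    BAILOUT = 512


@dataclass(frozen=True)
class Token:
    type: TokenType
    lexeme: str
    position: int


# Ordered (type, pattern) pairs; the first pattern matching at the cursor wins.
# These simplified patterns are assumptions, not the library's expressions.
PATTERNS = [
    (TokenType.WHITESPACE, re.compile(r"\s+")),
    (TokenType.LOGICAL_AND, re.compile(r"AND\b")),
    (TokenType.LOGICAL_OR, re.compile(r"OR\b")),
    (TokenType.LOGICAL_NOT, re.compile(r"NOT\b")),
    (TokenType.MANDATORY, re.compile(r"\+")),
    (TokenType.PROHIBITED, re.compile(r"-")),
    (TokenType.GROUP_BEGIN, re.compile(r"\(")),
    (TokenType.GROUP_END, re.compile(r"\)")),
    (TokenType.TERM, re.compile(r'"[^"]*"|[^\s()"]+')),  # phrase or word
]

BINARY_OPERATORS = TokenType.LOGICAL_AND | TokenType.LOGICAL_OR


def tokenize(source):
    """Split source into tokens; unmatched characters become bailout tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        for token_type, pattern in PATTERNS:
            match = pattern.match(source, pos)
            if match:
                tokens.append(Token(token_type, match.group(), pos))
                pos = match.end()
                break
        else:
            # Nothing matched here: emit a one-character bailout token and move on.
            tokens.append(Token(TokenType.BAILOUT, source[pos], pos))
            pos += 1
    return tokens


def is_binary_operator(token):
    """Test membership in a group of types with a single bitwise AND."""
    return bool(token.type & BINARY_OPERATORS)
```

Because each type occupies its own bit, a check such as `is_binary_operator` needs only one mask comparison instead of a chain of equality tests.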

Syntax Analysis Layer

The syntax analysis layer constructs a hierarchical representation from the flat token sequence.


Parser Implementation: The Parser class (lib/Languages/Galach/Parser.php:24) implements a shift-reduce parsing algorithm:

  1. Shift Phase: Reads tokens from the TokenSequence and either pushes them onto the stack or wraps them in nodes
  2. Reduce Phase: Combines stack elements into higher-level nodes based on reduction rules
  3. Correction Phase: Applies corrections for malformed input while preserving intent

The parser keeps its internal state in an SplStack (lib/Languages/Galach/Parser.php:150) and never rejects input: every correction it applies is recorded in the resulting SyntaxTree.

Node Hierarchy: The syntax tree contains typed nodes representing query structure:

| Node Type | Purpose |
| --- | --- |
| Query | Root node containing top-level elements |
| Term | Leaf node wrapping a term token |
| Group | Grouped expression with left/right delimiters |
| LogicalAnd | Binary AND operator with left/right operands |
| LogicalOr | Binary OR operator with left/right operands |
| LogicalNot | Unary NOT operator with operand |
| Mandatory | Unary + operator with operand |
| Prohibited | Unary - operator with operand |

Sources: lib/Languages/Galach/Parser.php:1-644, lib/Languages/Galach/README.md:114-120
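
The shift, reduce, and correct cycle can be sketched in a heavily reduced form. The Python below handles only terms and the AND operator, takes tokens as plain strings, and reuses three correction names from this page as string labels. The function, its data classes, and its exact correction behavior are assumptions for illustration and are not the library's Parser.

```python
from dataclasses import dataclass

AND = "AND"  # the only operator this sketch understands

MISSING_LEFT = "CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED"
MISSING_RIGHT = "CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED"
FOLLOWING = "CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNORED"


@dataclass
class Term:
    word: str


@dataclass
class LogicalAnd:
    left: object
    right: object


def parse(tokens):
    """Return (top_level_nodes, corrections) for a list of token strings."""
    stack, corrections = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] != AND:
            node = Term(tokens[i])          # shift: wrap the term in a node
            if stack and stack[-1] == AND:  # reduce: left AND right
                stack.pop()
                node = LogicalAnd(stack.pop(), node)
            stack.append(node)
            i += 1
            continue
        # Find the end of the run of consecutive operators starting at i.
        end = i
        while end < len(tokens) and tokens[end] == AND:
            end += 1
        if end - i > 1:
            corrections.append(FOLLOWING)      # the whole run is ignored
        elif not stack:
            corrections.append(MISSING_LEFT)   # nothing to the left
        elif end == len(tokens):
            corrections.append(MISSING_RIGHT)  # nothing to the right
        else:
            stack.append(AND)                  # shift the operator
        i = end
    return stack, corrections
```

For example, `parse(["AND", "two"])` drops the dangling operator, returns the single `Term("two")` node, and reports the missing-left-operand correction instead of raising an error.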

Code Generation Layer

The code generation layer traverses the syntax tree to produce backend-specific query strings.


The visitor pattern enables polymorphic processing of different node types. Each generator composes an Aggregate visitor (lib/Languages/Galach/Generators/Common/Aggregate.php) that dispatches to specialized visitors based on node class and token type.

Generator Classes: Native (Galach's own format), ExtendedDisMax (Solr), and QueryString (Elasticsearch).

Sources: lib/Languages/Galach/README.md:437-473, lib/Languages/Galach/README.md:92-103
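
A minimal sketch of an aggregate visitor that dispatches by node class is given below, in Python. The `accept`/`visit` method names, the node classes, and the output format are assumptions chosen for illustration; the library's own visitor interface may differ.

```python
from dataclasses import dataclass


@dataclass
class Term:
    word: str


@dataclass
class LogicalAnd:
    left: object
    right: object


class TermVisitor:
    def accept(self, node):
        return isinstance(node, Term)

    def visit(self, node, dispatcher):
        return node.word


class LogicalAndVisitor:
    def accept(self, node):
        return isinstance(node, LogicalAnd)

    def visit(self, node, dispatcher):
        # Child nodes go back through the dispatcher, so any visitor may handle them.
        left = dispatcher.visit(node.left)
        right = dispatcher.visit(node.right)
        return f"({left} AND {right})"


class Aggregate:
    """Hands each node to the first registered visitor that accepts it."""

    def __init__(self, visitors):
        self.visitors = list(visitors)

    def visit(self, node):
        for visitor in self.visitors:
            if visitor.accept(node):
                return visitor.visit(node, self)
        raise LookupError(f"no visitor accepts {type(node).__name__}")
```

Swapping one visitor in the list passed to `Aggregate` changes how that node type is rendered without touching the others, which is what makes the visitors mixable across generators.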

Central Intermediate Representation

SyntaxTree Structure

The SyntaxTree serves as the pivot point between parsing and generation, decoupling input processing from output formatting.


Key Properties:

| Property | Type | Purpose |
| --- | --- | --- |
| rootNode | Node | Root of the hierarchical node tree |
| tokenSequence | TokenSequence | Reference to original tokens and source string |
| corrections | Correction[] | Array of applied corrections during parsing |

The SyntaxTree maintains a reference to the original TokenSequence, allowing generators to access:

  • Original source string for position information
  • Token lexemes for escaping decisions
  • Token metadata (domain prefixes, quote types)

Sources: lib/Languages/Galach/Parser.php:173, lib/Values/Correction.php:1-38
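
The three properties can be modeled as a small immutable structure. The Python sketch below mirrors the property names from the table in snake_case; the field types, the `position` attribute, and the `source_span` helper are assumptions introduced to show why keeping the token sequence on the tree is useful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lexeme: str
    position: int  # offset of the lexeme in the source string (assumed field)


@dataclass(frozen=True)
class TokenSequence:
    tokens: tuple
    source: str


@dataclass(frozen=True)
class Term:
    token: Token


@dataclass(frozen=True)
class SyntaxTree:
    root_node: object
    token_sequence: TokenSequence
    corrections: tuple = ()


def source_span(tree, token):
    """Slice the part of the original source string that a token was read from."""
    start = token.position
    return tree.token_sequence.source[start:start + len(token.lexeme)]


# Build a one-term tree by hand and recover the term's text from the source.
word = Token("one", 1)
tree = SyntaxTree(Term(word), TokenSequence((word,), " one"))
span = source_span(tree, word)
```

Since the tree carries the source string, a generator holding only the tree can still recover positions and original lexemes.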

Error Handling Through Corrections

The correction system ensures all input produces valid output while preserving error information.


Correction types are defined as Parser class constants (lib/Languages/Galach/Parser.php:27-76):

| Constant | Scenario (input → corrected) |
| --- | --- |
| CORRECTION_ADJACENT_UNARY_OPERATOR_PRECEDING_OPERATOR_IGNORED | ++one → +one |
| CORRECTION_UNARY_OPERATOR_MISSING_OPERAND_IGNORED | one NOT → one |
| CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED | AND two → two |
| CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED | one AND → one |
| CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNORED | one AND OR two → one two |
| CORRECTION_LOGICAL_NOT_OPERATORS_PRECEDING_PREFERENCE_IGNORED | NOT +one → +one |
| CORRECTION_EMPTY_GROUP_IGNORED | one AND () → one |
| CORRECTION_UNMATCHED_GROUP_LEFT_DELIMITER_IGNORED | one ( AND two → one AND two |
| CORRECTION_UNMATCHED_GROUP_RIGHT_DELIMITER_IGNORED | one ) AND two → one AND two |
| CORRECTION_BAILOUT_TOKEN_IGNORED | one " two → one two |

Each Correction instance (lib/Values/Correction.php:11) contains:

  • type: Integer constant identifying the correction type
  • tokens: Array of tokens affected by the correction

This enables UI features like syntax highlighting, error feedback, and input cleanup suggestions.

Sources: lib/Languages/Galach/Parser.php:27-76, lib/Languages/Galach/README.md:114-240, lib/Values/Correction.php:1-38
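
As an illustration of the error-feedback use case, the sketch below marks the characters of every token named in a correction. It is Python, not the library's PHP; the `position` and `lexeme` fields on the token and the integer used as the correction type are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lexeme: str
    position: int  # offset in the source string (assumed field)


@dataclass(frozen=True)
class Correction:
    type: int      # stands in for one of the CORRECTION_* constants
    tokens: tuple  # tokens affected by the correction


def underline(source, corrections):
    """Return a marker line with '^' under every character of an affected token."""
    marks = [" "] * len(source)
    for correction in corrections:
        for token in correction.tokens:
            for offset in range(len(token.lexeme)):
                marks[token.position + offset] = "^"
    return "".join(marks)


source = "one AND"
corrections = [Correction(3, (Token("AND", 4),))]  # 3 is a placeholder type value
marker_line = underline(source, corrections)
```

Printing `source` followed by `marker_line` places carets under the ignored `AND`, which is the kind of hint a search box could show next to the cleaned-up query.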

Multi-Backend Support Architecture

One Tree, Multiple Outputs

The architecture's key strength is generating multiple backend formats from a single parsed representation.


This design provides several benefits:

  1. Parse Once: Expensive parsing happens once regardless of output targets
  2. Consistent Semantics: All backends receive identical query semantics
  3. Correction Reuse: Error corrections apply uniformly across backends
  4. Query Analysis: Applications can inspect SyntaxTree before generation

Sources: README.md:10-15, lib/Languages/Galach/README.md:37-50
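
The "parse once" and "query analysis" points can be shown with a hand-built tree that is first inspected and then rendered, without re-parsing in between. This Python sketch uses hypothetical node classes and helper functions; it is not the library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class LogicalAnd:
    left: object
    right: object


def collect_terms(node):
    """Analysis pass: list every term in the tree, left to right."""
    if isinstance(node, Term):
        return [node.word]
    return collect_terms(node.left) + collect_terms(node.right)


def generate(node):
    """Generation pass: render the same tree as a query string."""
    if isinstance(node, Term):
        return node.word
    return f"({generate(node.left)} AND {generate(node.right)})"


# Stands in for the result of a single parse.
tree = LogicalAnd(Term("one"), LogicalAnd(Term("two"), Term("three")))
terms = collect_terms(tree)
query = generate(tree)
```

Both passes read the same immutable tree, so an application could reject or rewrite a query based on `terms` before any backend string is produced.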

Backend-Specific Customization

While the syntax tree is shared, generators customize output through:


Visitor Reuse: The ExtendedDisMax and QueryString generators share most visitor implementations through the Lucene\Common namespace, differing only in:

  • Word token escaping rules
  • Field mapping configuration
  • Special character handling

Escaping Strategies: Each generator implements backend-specific character escaping:

  • Native: Escapes Galach syntax characters (\, +, -, !, (, ), ", #, @, :, space)
  • ExtendedDisMax: Escapes Solr special characters (\, +, -, *, ?)
  • QueryString: Escapes Elasticsearch special characters (the ExtendedDisMax set plus =, >, <)

Field Mapping: Lucene-based generators accept optional field mapping arrays to translate domain prefixes to backend field names.

Sources: lib/Languages/Galach/README.md:437-458
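
Escaping and field mapping can be sketched together. The Python below takes the two Lucene-style character sets from the list above; the function names, the backslash-prefix strategy, and the choice to pass unmapped domains through unchanged are assumptions, not confirmed library behavior.

```python
# Character sets taken from the list above.
EXTENDED_DISMAX_CHARS = "\\+-*?"
QUERY_STRING_CHARS = EXTENDED_DISMAX_CHARS + "=><"


def escape(word, special_chars):
    """Prefix every special character in word with a backslash."""
    return "".join("\\" + ch if ch in special_chars else ch for ch in word)


def render_term(domain, word, field_map, special_chars):
    """Render a term, translating its domain prefix to a backend field name.

    Assumption: a domain missing from field_map is used as the field name as-is.
    """
    escaped = escape(word, special_chars)
    if domain is None:
        return escaped
    return f"{field_map.get(domain, domain)}:{escaped}"


field_map = {"title": "title_t"}  # hypothetical mapping
solr_term = render_term("title", "c++", field_map, EXTENDED_DISMAX_CHARS)
elastic_term = render_term(None, "a=b", field_map, QUERY_STRING_CHARS)
```

Passing the character set and the field map as arguments keeps one rendering function usable for both backends, which reflects the configuration-over-subclassing idea described above.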

Data Flow Through the System

Complete Processing Pipeline


Step-by-Step Flow:

  1. Input: User provides query string
  2. Token Extraction: TokenExtractor provides regex patterns to Tokenizer
  3. Tokenization: Tokenizer processes string, creates Token objects, produces TokenSequence
  4. Parsing: Parser reads tokens, performs shift-reduce operations, applies corrections
  5. Tree Construction: Parser builds hierarchical Node structure, creates SyntaxTree
  6. Generation: Generator traverses tree, dispatches to visitors, builds output string
  7. Output: Formatted query string ready for backend consumption

Sources: lib/Languages/Galach/README.md:40-50, lib/Languages/Galach/README.md:92-112

Token and Node Type Mapping

The transformation from tokens to nodes follows specific patterns:

| Token Type | Node Type | Transformation |
| --- | --- | --- |
| TOKEN_TERM | Term | Wrapped directly by shiftTerm() (lib/Languages/Galach/Parser.php:285-288) |
| TOKEN_MANDATORY | Mandatory | Applied to operand by reducePreference() (lib/Languages/Galach/Parser.php:307-320) |
| TOKEN_PROHIBITED | Prohibited | Applied to operand by reducePreference() (lib/Languages/Galach/Parser.php:307-320) |
| TOKEN_LOGICAL_NOT | LogicalNot | Applied to operand by reduceLogicalNot() (lib/Languages/Galach/Parser.php:322-335) |
| TOKEN_LOGICAL_AND | LogicalAnd | Combines operands by reduceLogicalAnd() (lib/Languages/Galach/Parser.php:349-359) |
| TOKEN_LOGICAL_OR | LogicalOr | Combines operands by reduceLogicalOr() (lib/Languages/Galach/Parser.php:369-391) |
| TOKEN_GROUP_BEGIN + TOKEN_GROUP_END | Group | Created by shiftGroupEnd(), populated by reduceGroup() (lib/Languages/Galach/Parser.php:295-415) |

The parser uses named shift methods mapped to token types (lib/Languages/Galach/Parser.php:89-101) and reduction methods organized by node type (lib/Languages/Galach/Parser.php:103-136).

Sources: lib/Languages/Galach/Parser.php:89-136, lib/Languages/Galach/Parser.php:176-415
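
Mapping token types to named shift methods amounts to a dispatch table. The Python sketch below shows the idea with three token types; the constant values, the snake_case method names, and the tuples pushed onto the stack are assumptions and do not mirror the library's internals.

```python
# Hypothetical bit values for three of the token types.
TOKEN_TERM = 1
TOKEN_LOGICAL_AND = 2
TOKEN_GROUP_BEGIN = 4


class MiniParser:
    # Token type -> name of the method that shifts it.
    SHIFTS = {
        TOKEN_TERM: "shift_term",
        TOKEN_LOGICAL_AND: "shift_binary_operator",
        TOKEN_GROUP_BEGIN: "shift_group_begin",
    }

    def __init__(self):
        self.stack = []

    def shift(self, token_type, lexeme):
        """Look up the handler by token type and call it."""
        handler = getattr(self, self.SHIFTS[token_type])
        handler(lexeme)

    def shift_term(self, lexeme):
        self.stack.append(("Term", lexeme))

    def shift_binary_operator(self, lexeme):
        self.stack.append(("Operator", lexeme))

    def shift_group_begin(self, lexeme):
        self.stack.append(("GroupBegin", lexeme))
```

A table like `SHIFTS` keeps the main loop free of per-type branching: supporting another token type means adding one entry and one method.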

Extension Points

The architecture provides four primary extension points for customization:


TokenExtractor: Extend to define custom token recognition patterns, special characters, or language subsets. See Customization and Extension for details.

Visitor Pattern: Implement custom visitors to control node processing and output formatting. Visitors can be mixed and matched using the Aggregate dispatcher.

Field Mapping: Configure domain-to-field translation for Lucene-based generators without subclassing.

Generator: Implement complete custom generators for new backends or output formats using existing visitors or custom implementations.

Sources: lib/Languages/Galach/README.md:241-427, README.md:31-45