Last indexed: 13 February 2026 (50f4d4)

Architecture

Purpose and Scope

This document describes the overall system architecture of the Query Translator library, which follows a classic compiler design pattern. The architecture is organized into three distinct phases: lexical analysis, syntax analysis, and code generation. At the center of this design is the SyntaxTree, which serves as an intermediate representation enabling translation to multiple backend formats.

For details on specific components, see:

Core building blocks: Core Components
Complete processing flow: Query Processing Pipeline
The Galach language implementation: The Galach Language
Code generation mechanisms: Query Generation

Compiler Architecture Pattern

The Query Translator follows a textbook compiler architecture with clear separation of concerns across three phases:

Phase 1: Lexical Analysis

Converts the raw query string into a sequence of tokens
Implemented by the Tokenizer class, using patterns supplied by a TokenExtractor
Produces a TokenSequence containing Token objects

Phase 2: Syntax Analysis

Parses the token sequence into a hierarchical structure
Implemented by the Parser class, using a shift-reduce algorithm
Produces a SyntaxTree containing a Node hierarchy
Applies corrections to invalid input instead of rejecting it

Phase 3: Code Generation

Traverses the syntax tree to generate backend-specific output
Implemented by generator classes, using the visitor pattern
Produces query strings for Solr, Elasticsearch, or the native Galach format

Sources: README.md10-15 lib/Languages/Galach/README.md33-50

System Layers

Layer Overview

Sources: README.md10-23 lib/Languages/Galach/README.md33-50

Lexical Analysis Layer

The lexical analysis layer transforms raw input strings into structured token sequences.

Component	Type	Role
`TokenExtractor`	Abstract Class	Defines regex patterns for token recognition
`TokenExtractor\Full`	Concrete Implementation	Full Galach syntax support (tags, users, domains)
`TokenExtractor\Text`	Concrete Implementation	Simplified text-only syntax subset
`Tokenizer`	Final Class	Executes tokenization using extractor patterns
`TokenSequence`	Value Object	Holds extracted tokens and source string
`Token`	Value Object	Represents smallest syntactic unit

The Tokenizer class lib/Languages/Galach/Tokenizer.php is marked final and depends on TokenExtractor for customization. Token types are defined as bitmask constants:

TOKEN_TERM - Word, Phrase, Tag, User tokens
TOKEN_WHITESPACE - Whitespace characters
TOKEN_LOGICAL_AND, TOKEN_LOGICAL_OR, TOKEN_LOGICAL_NOT - Logical operators (AND and OR are binary, NOT is unary)
TOKEN_MANDATORY, TOKEN_PROHIBITED - Unary preference operators (+ and -)
TOKEN_GROUP_BEGIN, TOKEN_GROUP_END - Grouping delimiters
TOKEN_BAILOUT - Unrecognized character sequences

Sources: lib/Languages/Galach/Tokenizer.php lib/Values/TokenSequence.php1-36 lib/Languages/Galach/README.md255-260
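
The lexical layer described above can be illustrated with a minimal Python sketch. It is not the library's PHP code: the `TokenType` values, the `PATTERNS` list (a hypothetical stand-in for a TokenExtractor), and the simplified regular expressions are all assumptions chosen for illustration. It shows ordered regex patterns driving a tokenizer, and bitmask token types that can be tested as a group with a single mask.

```python
import re
from dataclasses import dataclass
from enum import IntFlag, auto


class TokenType(IntFlag):
    # Illustrative values; only the bitmask idea comes from the text above.
    WHITESPACE = auto()
    LOGICAL_AND = auto()
    LOGICAL_OR = auto()
    LOGICAL_NOT = auto()
    MANDATORY = auto()
    PROHIBITED = auto()
    GROUP_BEGIN = auto()
    GROUP_END = auto()
    TERM = auto()
    BAILOUT = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    lexeme: str
    position: int


# Ordered (pattern, type) pairs: a hypothetical stand-in for a TokenExtractor.
# The first pattern that matches at the current position wins.
PATTERNS = [
    (re.compile(r"\s+"), TokenType.WHITESPACE),
    (re.compile(r"AND\b"), TokenType.LOGICAL_AND),
    (re.compile(r"OR\b"), TokenType.LOGICAL_OR),
    (re.compile(r"NOT\b"), TokenType.LOGICAL_NOT),
    (re.compile(r"\+"), TokenType.MANDATORY),
    (re.compile(r"-"), TokenType.PROHIBITED),
    (re.compile(r"\("), TokenType.GROUP_BEGIN),
    (re.compile(r"\)"), TokenType.GROUP_END),
    (re.compile(r'"[^"]*"|[^\s()"+\-]+'), TokenType.TERM),
    (re.compile(r"."), TokenType.BAILOUT),  # catch-all for anything else
]

# A mask groups several token types so they can be tested in one operation.
OPERATORS = TokenType.LOGICAL_AND | TokenType.LOGICAL_OR | TokenType.LOGICAL_NOT


def tokenize(source: str) -> list[Token]:
    """Split the source into tokens; every character ends up in some token."""
    tokens, position = [], 0
    while position < len(source):
        for pattern, token_type in PATTERNS:
            match = pattern.match(source, position)
            if match:
                tokens.append(Token(token_type, match.group(), position))
                position = match.end()
                break
    return tokens
```

Because the last pattern matches any single character, the sketch never fails on input: unrecognized characters simply become bailout tokens, which mirrors the role the text gives to TOKEN_BAILOUT.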

Syntax Analysis Layer

The syntax analysis layer constructs a hierarchical representation from the flat token sequence.

Parser Implementation: The Parser class lib/Languages/Galach/Parser.php24 implements a shift-reduce parsing algorithm:

Shift Phase: Reads tokens from the TokenSequence and either pushes them onto the stack or wraps them in nodes
Reduce Phase: Combines stack elements into higher-level nodes based on reduction rules
Correction Phase: Applies corrections for malformed input while preserving intent

The parser maintains its internal state in an SplStack lib/Languages/Galach/Parser.php150 and never rejects input: every correction it applies is recorded in the resulting SyntaxTree.

Node Hierarchy: The syntax tree contains typed nodes representing query structure:

Node Type	Purpose
`Query`	Root node containing top-level elements
`Term`	Leaf node wrapping a term token
`Group`	Grouped expression with left/right delimiters
`LogicalAnd`	Binary AND operator with left/right operands
`LogicalOr`	Binary OR operator with left/right operands
`LogicalNot`	Unary NOT operator with operand
`Mandatory`	Unary + operator with operand
`Prohibited`	Unary - operator with operand

Sources: lib/Languages/Galach/Parser.php1-644 lib/Languages/Galach/README.md114-120
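
To make the shift, reduce, and correction phases concrete, here is a toy shift-reduce parser in Python. It is a sketch under strong assumptions, not a port of Parser.php: it accepts a pre-split list of lexemes with whitespace already removed, supports only terms, the unary `+` and `-` operators, and the binary `AND` operator, and uses made-up correction names. Nodes are plain tuples and operators waiting on the stack are plain strings.

```python
UNARY = {"+": "mandatory", "-": "prohibited"}


def is_node(item):
    # Nodes are tuples; operators waiting on the stack are plain strings.
    return isinstance(item, tuple)


def parse(tokens):
    """Return (query_node, corrections) for a list of lexemes. Never raises."""
    stack, corrections = [], []

    def drop_dangling_unary():
        # A unary operator with no operand is ignored and recorded.
        while stack and stack[-1] in UNARY:
            corrections.append(("unary_operator_missing_operand", stack.pop()))

    def reduce():
        # Combine the node on top of the stack with the operator below it.
        while len(stack) >= 2 and is_node(stack[-1]):
            below = stack[-2]
            if below in UNARY:
                operand = stack.pop()
                stack.pop()
                stack.append((UNARY[below], operand))
            elif below == "AND" and len(stack) >= 3 and is_node(stack[-3]):
                right = stack.pop()
                stack.pop()
                left = stack.pop()
                stack.append(("and", left, right))
            else:
                break

    for token in tokens:
        if token in UNARY:
            if stack and stack[-1] in UNARY:
                # "++one": the preceding unary operator is ignored.
                corrections.append(("adjacent_unary_operator_ignored", stack.pop()))
            stack.append(token)
        elif token == "AND":
            drop_dangling_unary()
            if stack and is_node(stack[-1]):
                stack.append(token)
            else:
                # Nothing usable on the left: ignore the operator.
                corrections.append(("binary_operator_missing_left_operand", token))
        else:
            stack.append(("term", token))
            reduce()

    # End of input: clean up operators that never received an operand.
    drop_dangling_unary()
    if stack and stack[-1] == "AND":
        corrections.append(("binary_operator_missing_right_operand", stack.pop()))
    return ("query", *stack), corrections
```

For example, `parse(["one", "AND"])` keeps the term and records one correction for the dangling operator instead of raising an error, which is the same recover-and-record behavior the text attributes to the real parser.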

Code Generation Layer

The code generation layer traverses the syntax tree to produce backend-specific query strings.

The visitor pattern enables polymorphic processing of different node types. Each generator composes an Aggregate visitor lib/Languages/Galach/Generators/Common/Aggregate.php that dispatches to specialized visitors based on node class and token type.

Generator Classes:

Native lib/Languages/Galach/Generators/Native.php - Produces Galach format strings
ExtendedDisMax lib/Languages/Galach/Generators/ExtendedDisMax.php - Produces Solr Extended DisMax queries
QueryString lib/Languages/Galach/Generators/QueryString.php - Produces Elasticsearch QueryString queries

Sources: lib/Languages/Galach/README.md437-473 lib/Languages/Galach/README.md92-103
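
The dispatch idea behind the Aggregate visitor can be sketched in a few lines of Python. The class and method names below (`accept`, `visit`, the node dataclasses) are illustrative assumptions modeled loosely on the description above, not the library's actual interfaces. Each visitor declares which nodes it handles, and the aggregate passes itself down as the sub-visitor so that child nodes are dispatched the same way.

```python
from dataclasses import dataclass


@dataclass
class Term:
    word: str


@dataclass
class Mandatory:
    operand: object


@dataclass
class LogicalAnd:
    left: object
    right: object


class TermVisitor:
    def accept(self, node):
        return isinstance(node, Term)

    def visit(self, node, sub_visitor):
        return node.word


class MandatoryVisitor:
    def accept(self, node):
        return isinstance(node, Mandatory)

    def visit(self, node, sub_visitor):
        # Children go back through the aggregate for dispatch.
        return "+" + sub_visitor.visit(node.operand)


class LogicalAndVisitor:
    def accept(self, node):
        return isinstance(node, LogicalAnd)

    def visit(self, node, sub_visitor):
        left = sub_visitor.visit(node.left)
        right = sub_visitor.visit(node.right)
        return f"{left} AND {right}"


class Aggregate:
    """Dispatches each node to the first visitor that accepts it."""

    def __init__(self, visitors):
        self.visitors = list(visitors)

    def visit(self, node):
        for visitor in self.visitors:
            if visitor.accept(node):
                return visitor.visit(node, self)
        raise LookupError(f"No visitor accepts {type(node).__name__}")


generator = Aggregate([TermVisitor(), MandatoryVisitor(), LogicalAndVisitor()])
output = generator.visit(LogicalAnd(Term("one"), Mandatory(Term("two"))))
```

Swapping a single visitor in the list changes how one node type is rendered without touching the others, which is what makes it practical for backends to share most of their visitors.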

Central Intermediate Representation

SyntaxTree Structure

The SyntaxTree serves as the pivot point between parsing and generation, decoupling input processing from output formatting.

Key Properties:

Property	Type	Purpose
`rootNode`	`Node`	Root of the hierarchical node tree
`tokenSequence`	`TokenSequence`	Reference to original tokens and source string
`corrections`	`Correction[]`	Array of applied corrections during parsing

The SyntaxTree maintains a reference to the original TokenSequence, allowing generators to access:

Original source string for position information
Token lexemes for escaping decisions
Token metadata (domain prefixes, quote types)

Sources: lib/Languages/Galach/Parser.php173 lib/Values/Correction.php1-38

Error Handling Through Corrections

The correction system ensures all input produces valid output while preserving error information.

Correction types are defined as constants on the Parser class lib/Languages/Galach/Parser.php27-76:

Constant	Scenario
`CORRECTION_ADJACENT_UNARY_OPERATOR_PRECEDING_OPERATOR_IGNORED`	`++one` → `+one`
`CORRECTION_UNARY_OPERATOR_MISSING_OPERAND_IGNORED`	`one NOT` → `one`
`CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNORED`	`AND two` → `two`
`CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNORED`	`one AND` → `one`
`CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNORED`	`one AND OR two` → `one two`
`CORRECTION_LOGICAL_NOT_OPERATORS_PRECEDING_PREFERENCE_IGNORED`	`NOT +one` → `+one`
`CORRECTION_EMPTY_GROUP_IGNORED`	`one AND ()` → `one`
`CORRECTION_UNMATCHED_GROUP_LEFT_DELIMITER_IGNORED`	`one ( AND two` → `one AND two`
`CORRECTION_UNMATCHED_GROUP_RIGHT_DELIMITER_IGNORED`	`one ) AND two` → `one AND two`
`CORRECTION_BAILOUT_TOKEN_IGNORED`	`one " two` → `one two`

Each Correction instance lib/Values/Correction.php11 contains:

type: Integer constant identifying the correction type
tokens: Array of tokens affected by the correction

This enables UI features like syntax highlighting, error feedback, and input cleanup suggestions.

Sources: lib/Languages/Galach/Parser.php27-76 lib/Languages/Galach/README.md114-240 lib/Values/Correction.php1-38
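
As an illustration of how an application might consume corrections, the Python sketch below models the data described above and derives highlight ranges and a cleaned-up query from them. The dataclasses are simplified stand-ins: the `position` field on `Token`, the flat `source` field on `SyntaxTree`, and the numeric correction type are assumptions made for the example, not the library's definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    lexeme: str
    position: int  # offset into the source string (assumed field)


@dataclass(frozen=True)
class Correction:
    type: int      # identifies the kind of correction
    tokens: tuple  # tokens affected by the correction


@dataclass
class SyntaxTree:
    root_node: object
    source: str    # stands in for the TokenSequence's source string
    corrections: list = field(default_factory=list)


def ignored_spans(tree):
    """Return sorted (start, end, correction_type) ranges for highlighting."""
    spans = []
    for correction in tree.corrections:
        for token in correction.tokens:
            end = token.position + len(token.lexeme)
            spans.append((token.position, end, correction.type))
    return sorted(spans)


def cleaned_source(tree):
    """Rebuild the query with ignored tokens removed, as a cleanup suggestion."""
    parts, cursor = [], 0
    for start, end, _ in ignored_spans(tree):
        parts.append(tree.source[cursor:start])
        cursor = end
    parts.append(tree.source[cursor:])
    # Collapse the whitespace left behind by removed tokens.
    return " ".join("".join(parts).split())
```

A UI could use `ignored_spans` to underline the ignored part of the input and `cleaned_source` to offer a "did you mean" suggestion.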

Multi-Backend Support Architecture

One Tree, Multiple Outputs

The architecture's key strength is generating multiple backend formats from a single parsed representation.

This design provides several benefits:

Parse Once: Expensive parsing happens once regardless of output targets
Consistent Semantics: All backends receive identical query semantics
Correction Reuse: Error corrections apply uniformly across backends
Query Analysis: Applications can inspect SyntaxTree before generation

Sources: README.md10-15 lib/Languages/Galach/README.md37-50

Backend-Specific Customization

While the syntax tree is shared, generators customize output through:

Visitor Reuse: The ExtendedDisMax and QueryString generators share most visitor implementations through the Lucene\Common namespace, differing only in:

Word token escaping rules
Field mapping configuration
Special character handling

Escaping Strategies: Each generator implements backend-specific character escaping:

Native: Escapes Galach syntax characters (\, +, -, !, (, ), ", #, @, :, space)
ExtendedDisMax: Escapes Solr special characters (\, +, -, *, ?)
QueryString: Escapes Elasticsearch special characters (the ExtendedDisMax set plus =, >, <)

Field Mapping: Lucene-based generators accept optional field mapping arrays to translate domain prefixes to backend field names.

Sources: lib/Languages/Galach/README.md437-458
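
The escaping and field mapping customizations can be sketched as follows. This is a hedged Python illustration rather than the generators' real code: the character sets are copied from the list above, while `make_escaper`, `render_term`, and the choice to pass unmapped domains through unchanged are assumptions made for the example.

```python
import re


def make_escaper(special_chars):
    """Build a function that backslash-escapes each of the given characters."""
    pattern = re.compile("[" + re.escape(special_chars) + "]")
    return lambda word: pattern.sub(lambda m: "\\" + m.group(), word)


# Character sets taken from the list above.
escape_edismax = make_escaper("\\+-*?")
escape_query_string = make_escaper("\\+-*?=><")


def render_term(domain, word, escape, field_map=None):
    """Render one term, translating its domain prefix to a backend field.

    Assumption for this sketch: a domain missing from the map is used as-is.
    """
    escaped = escape(word)
    if not domain:
        return escaped
    field = (field_map or {}).get(domain, domain)
    return f"{field}:{escaped}"


solr_term = render_term("title", "c++", escape_edismax, {"title": "title_t"})
```

Keeping the escaper and the field map as parameters means the same rendering function serves both Lucene-based backends, with only the configuration differing.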

Data Flow Through the System

Complete Processing Pipeline

Step-by-Step Flow:

Input: User provides query string
Token Extraction: TokenExtractor provides regex patterns to Tokenizer
Tokenization: Tokenizer processes string, creates Token objects, produces TokenSequence
Parsing: Parser reads tokens, performs shift-reduce operations, applies corrections
Tree Construction: Parser builds hierarchical Node structure, creates SyntaxTree
Generation: Generator traverses tree, dispatches to visitors, builds output string
Output: Formatted query string ready for backend consumption

Sources: lib/Languages/Galach/README.md40-50 lib/Languages/Galach/README.md92-112

Token and Node Type Mapping

The transformation from tokens to nodes follows specific patterns:

Token Type	Node Type	Transformation
`TOKEN_TERM`	`Term`	Wrapped directly by `shiftTerm()` lib/Languages/Galach/Parser.php285-288
`TOKEN_MANDATORY`	`Mandatory`	Applied to operand by `reducePreference()` lib/Languages/Galach/Parser.php307-320
`TOKEN_PROHIBITED`	`Prohibited`	Applied to operand by `reducePreference()` lib/Languages/Galach/Parser.php307-320
`TOKEN_LOGICAL_NOT`	`LogicalNot`	Applied to operand by `reduceLogicalNot()` lib/Languages/Galach/Parser.php322-335
`TOKEN_LOGICAL_AND`	`LogicalAnd`	Combines operands by `reduceLogicalAnd()` lib/Languages/Galach/Parser.php349-359
`TOKEN_LOGICAL_OR`	`LogicalOr`	Combines operands by `reduceLogicalOr()` lib/Languages/Galach/Parser.php369-391
`TOKEN_GROUP_BEGIN` + `TOKEN_GROUP_END`	`Group`	Created by `shiftGroupEnd()`, populated by `reduceGroup()` lib/Languages/Galach/Parser.php295-415

The parser uses named shift methods mapped to token types lib/Languages/Galach/Parser.php89-101 and reduction methods organized by node type lib/Languages/Galach/Parser.php103-136.

Sources: lib/Languages/Galach/Parser.php89-136 lib/Languages/Galach/Parser.php176-415
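
The table-driven mapping from token types to shift methods can be illustrated with a small Python sketch. The method names echo those in the table above, but the class itself, the string token types, and the tuple-based nodes are hypothetical and cover only terms and the two preference operators.

```python
class DispatchingParser:
    # Token type -> name of the shift method that handles it (illustrative).
    SHIFTS = {
        "TERM": "shift_term",
        "MANDATORY": "shift_preference",
        "PROHIBITED": "shift_preference",
    }

    def __init__(self):
        self.stack = []

    def shift(self, token_type, lexeme):
        # Look up the handler by token type instead of using a long if/elif chain.
        handler = getattr(self, self.SHIFTS[token_type])
        handler(token_type, lexeme)

    def shift_term(self, token_type, lexeme):
        # A term token is wrapped directly in a Term node, then reduced.
        self.stack.append(("Term", lexeme))
        self.reduce_preference()

    def shift_preference(self, token_type, lexeme):
        # The operator waits on the stack until its operand arrives.
        self.stack.append(token_type)

    def reduce_preference(self):
        # Apply a waiting + or - operator to the node on top of the stack.
        if len(self.stack) >= 2 and self.stack[-2] in ("MANDATORY", "PROHIBITED"):
            operand = self.stack.pop()
            operator = self.stack.pop()
            self.stack.append((operator.capitalize(), operand))


parser = DispatchingParser()
parser.shift("PROHIBITED", "-")
parser.shift("TERM", "one")
```

Adding support for a new token type in this style means adding one table entry and one method, leaving the dispatch loop untouched.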

Extension Points

The architecture provides four primary extension points for customization:

TokenExtractor: Extend to define custom token recognition patterns, special characters, or language subsets. See Customization and Extension for details.

Visitor Pattern: Implement custom visitors to control node processing and output formatting. Visitors can be mixed and matched using the Aggregate dispatcher.

Field Mapping: Configure domain-to-field translation for Lucene-based generators without subclassing.

Generator: Implement complete custom generators for new backends or output formats using existing visitors or custom implementations.

Sources: lib/Languages/Galach/README.md241-427 README.md31-45


URL: https://deepwiki.com/netgen/query-translator/2-architecture
