URL: https://deepwiki.com/netgen/query-translator/2.1-core-components

Core Components

This page documents the fundamental building blocks of the Query Translator system. These components form the foundation upon which all tokenization, parsing, and code generation functionality is built. The core components include:

  • Token value objects - Immutable data structures representing lexemes extracted from input strings
  • Node value objects - Immutable data structures representing syntax tree elements
  • TokenExtractor abstract class - Primary extension point for customizing lexical analysis rules
  • Parsing interface - Contract for implementing parsers that convert token sequences into syntax trees
  • Visitor pattern base classes - Infrastructure for traversing and processing syntax trees

For details on how these components are used in the complete translation flow, see Query Processing Pipeline. For information on specific token types and extraction strategies, see Token Types and Token Extractors. For visitor implementations, see Visitor Pattern Implementation.


Token Value Objects

Tokens are immutable value objects that represent recognized lexemes in the input query string. All tokens share a common base structure and extend it with type-specific properties.

Base Token Structure

The base Token class (lib/Values/Token.php) provides the fundamental properties shared by all tokens:

Property | Type | Description
type | int | Token type constant (e.g., TOKEN_TERM, TOKEN_WHITESPACE)
lexeme | string | The actual text matched from the input string
position | int | Character position in the input string where the token starts
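
As a rough illustration of the three shared properties, the following is a minimal sketch of an immutable token value object. The class name SketchToken, the accessor methods, and the constant value are assumptions made for this example and are not taken from lib/Values/Token.php.

```php
<?php
// Hypothetical sketch, not the library's source: a token whose state is set
// once in the constructor and can only be read afterwards.
final class SketchToken
{
    private int $type;
    private string $lexeme;
    private int $position;

    public function __construct(int $type, string $lexeme, int $position)
    {
        $this->type = $type;
        $this->lexeme = $lexeme;
        $this->position = $position;
    }

    public function type(): int { return $this->type; }
    public function lexeme(): string { return $this->lexeme; }
    public function position(): int { return $this->position; }
}

// Assumed value, for illustration only.
const SKETCH_TOKEN_TERM = 2;

// A term token for the text "title:hello" that starts at character 4.
$token = new SketchToken(SKETCH_TOKEN_TERM, 'title:hello', 4);
```

Private properties with read-only accessors are one way to get immutability in PHP; the library may use a different mechanism.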

Diagram: Token Class Hierarchy


Sources: lib/Values/Token.php, lib/Languages/Galach/Values/Token/Word.php, lib/Languages/Galach/Values/Token/Phrase.php, lib/Languages/Galach/Values/Token/GroupBegin.php

Specialized Token Types

Each specialized token extends the base Token class with additional properties specific to its semantic meaning:

Word Token

Represents a simple word term with an optional domain prefix (lib/Languages/Galach/Values/Token/Word.php:1-41):

domain:word → Word token: domain="domain", word="word"
word → Word token: domain="", word="word"

Properties:

  • domain (string) - Field or domain prefix (empty string if not present)
  • word (string) - The word text itself
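
The split between domain and word can be sketched with a PCRE pattern that uses named capture groups. The pattern and the helper sketchMatchWord() below are illustrative assumptions; the expressions actually used by the library may differ.

```php
<?php
// Illustrative pattern only: an optional "domain:" prefix followed by a word.
// "A" anchors the match at the start position, "u" enables UTF-8 mode.
$pattern = '/(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_]*):)?(?<word>[^\s:()"]+))/Au';

// Hypothetical helper: returns the domain and word, or null if nothing matches.
function sketchMatchWord(string $pattern, string $input): ?array
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }

    // An absent domain is reported as an empty string, as described above.
    return ['domain' => $m['domain'] ?? '', 'word' => $m['word']];
}

$withDomain = sketchMatchWord($pattern, 'title:hello');
$withoutDomain = sketchMatchWord($pattern, 'hello');
```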

Phrase Token

Represents a quoted phrase with an optional domain prefix (lib/Languages/Galach/Values/Token/Phrase.php:1-48):

domain:"phrase text" → Phrase token: domain="domain", quote='"', phrase="phrase text"
"phrase text" → Phrase token: domain="", quote='"', phrase="phrase text"

Properties:

  • domain (string) - Field or domain prefix (empty string if not present)
  • quote (string) - The quote character used (" or ')
  • phrase (string) - The phrase content without quotes

GroupBegin Token

Represents the opening of a group, which may include a domain prefix (lib/Languages/Galach/Values/Token/GroupBegin.php:1-41):

domain:( → GroupBegin token: domain="domain", delimiter="("
( → GroupBegin token: domain="", delimiter="("

Properties:

  • delimiter (string) - The left side delimiter (typically "(")
  • domain (string) - Domain prefix (empty string if not present)

Tag and User Tokens

Specialized tokens for social media-style syntax (lib/Languages/Galach/Values/Token/Tag.php, lib/Languages/Galach/Values/Token/User.php):

#tag → Tag token: marker="#", tag="tag"
@user → User token: marker="@", user="user"

Sources: lib/Languages/Galach/Values/Token/Word.php:1-41, lib/Languages/Galach/Values/Token/Phrase.php:1-48, lib/Languages/Galach/Values/Token/GroupBegin.php:1-41

Token Type Constants

Token types are defined as constants in the Tokenizer class (lib/Languages/Galach/Tokenizer.php):

Constant | Description
TOKEN_WHITESPACE | Whitespace between tokens
TOKEN_TERM | Term tokens (Word, Phrase, Tag, User)
TOKEN_LOGICAL_AND | AND operator
TOKEN_LOGICAL_OR | OR operator
TOKEN_LOGICAL_NOT | NOT operator
TOKEN_MANDATORY | Mandatory operator (+)
TOKEN_PROHIBITED | Prohibited operator (-)
TOKEN_GROUP_BEGIN | Group opening delimiter
TOKEN_GROUP_END | Group closing delimiter
TOKEN_BAILOUT | Unrecognized input

Sources: lib/Languages/Galach/Tokenizer.php


Node Value Objects

Nodes are immutable value objects that form the syntax tree (AST) representation of a parsed query. Each node type represents a different syntactic construct.

Diagram: Node Type Hierarchy


Sources: lib/Values/Node.php, lib/Languages/Galach/Values/Node/Query.php, lib/Languages/Galach/Values/Node/Term.php, lib/Languages/Galach/Values/Node/Group.php

Base Node Structure

The base Node class (lib/Values/Node.php) provides:

  • tokenType (string) - Categorizes the node by token origin
  • semanticType (string) - Categorizes the node by semantic role in the query

Key Node Types

Query Node

The root node containing the top-level sequence of nodes (lib/Languages/Galach/Values/Node/Query.php):

  • nodes (array) - Ordered array of child Node objects

Term Node

Represents a terminal search term (lib/Languages/Galach/Values/Node/Term.php):

  • token (Token) - The source Token object (Word, Phrase, Tag, or User)
  • domain (string) - Field or domain prefix
  • isQuoted (bool) - Whether the term was quoted in input

Group Node

Represents a parenthesized grouping (lib/Languages/Galach/Values/Node/Group.php):

  • tokenLeft (Token) - The opening GroupBegin token
  • tokenRight (Token|null) - The closing token (may be null if missing)
  • domain (string) - Domain prefix
  • nodes (array) - Child nodes within the group

Logical Operator Nodes

Binary operators (lib/Languages/Galach/Values/Node/LogicalAnd.php, lib/Languages/Galach/Values/Node/LogicalOr.php):

  • leftOperand (Node) - Left side operand
  • rightOperand (Node) - Right side operand

Unary operators (lib/Languages/Galach/Values/Node/LogicalNot.php, lib/Languages/Galach/Values/Node/Mandatory.php, lib/Languages/Galach/Values/Node/Prohibited.php):

  • operand (Node) - The single operand
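
To show how binary and unary operator nodes compose into a tree, here is a minimal sketch with hypothetical stand-in classes (SketchNode, SketchTerm, SketchLogicalAnd, SketchLogicalNot). Only the property names leftOperand, rightOperand and operand come from the lists above; everything else is assumed for illustration.

```php
<?php
// Hypothetical minimal node classes; the real nodes carry tokens and more data.
abstract class SketchNode {}

final class SketchTerm extends SketchNode
{
    public string $word;
    public function __construct(string $word) { $this->word = $word; }
}

final class SketchLogicalAnd extends SketchNode
{
    public SketchNode $leftOperand;
    public SketchNode $rightOperand;
    public function __construct(SketchNode $left, SketchNode $right)
    {
        $this->leftOperand = $left;
        $this->rightOperand = $right;
    }
}

final class SketchLogicalNot extends SketchNode
{
    public SketchNode $operand;
    public function __construct(SketchNode $operand) { $this->operand = $operand; }
}

// Walk the tree depth-first and collect the words of all terms, left to right.
function sketchCollectWords(SketchNode $node): array
{
    if ($node instanceof SketchTerm) {
        return [$node->word];
    }
    if ($node instanceof SketchLogicalAnd) {
        return array_merge(
            sketchCollectWords($node->leftOperand),
            sketchCollectWords($node->rightOperand)
        );
    }
    if ($node instanceof SketchLogicalNot) {
        return sketchCollectWords($node->operand);
    }
    return [];
}

// Tree for the query: one AND NOT two
$tree = new SketchLogicalAnd(
    new SketchTerm('one'),
    new SketchLogicalNot(new SketchTerm('two'))
);
$words = sketchCollectWords($tree);
```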

Sources: lib/Languages/Galach/Values/Node/Query.php, lib/Languages/Galach/Values/Node/Term.php, lib/Languages/Galach/Values/Node/Group.php, lib/Languages/Galach/Values/Node/LogicalAnd.php, lib/Languages/Galach/Values/Node/LogicalOr.php


TokenExtractor Abstract Class

The TokenExtractor abstract class (lib/Languages/Galach/TokenExtractor.php:1-124) is the primary extension point for customizing lexical analysis. It defines how raw input strings are broken into tokens.

Diagram: TokenExtractor Architecture


Sources: lib/Languages/Galach/TokenExtractor.php:1-124

Key Methods

extract() - Public API

The extract() method (lib/Languages/Galach/TokenExtractor.php:26-49) is the main entry point.


Process:

  1. Convert the character position to a byte offset for multi-byte support (lib/Languages/Galach/TokenExtractor.php:28)
  2. Iterate through the regex patterns from getExpressionTypeMap() (lib/Languages/Galach/TokenExtractor.php:30)
  3. Attempt to match each pattern starting at the byte offset (lib/Languages/Galach/TokenExtractor.php:31)
  4. On a match, create the appropriate token via createToken() (lib/Languages/Galach/TokenExtractor.php:41)
  5. If no pattern matches, return a BAILOUT token (lib/Languages/Galach/TokenExtractor.php:44-48)

The method is declared final to ensure consistent behavior across all implementations.
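
The five steps above can be sketched as a single function. This is not the library's code: the function name sketchExtract(), the constant values, the three patterns, and the array-shaped token are all assumptions chosen to keep the example short, and the mbstring extension is assumed to be available.

```php
<?php
// Assumed constant values, for illustration only.
const SKETCH_TOKEN_WHITESPACE = 1;
const SKETCH_TOKEN_LOGICAL_AND = 2;
const SKETCH_TOKEN_TERM = 3;
const SKETCH_TOKEN_BAILOUT = 4;

function sketchExtract(string $input, int $position): array
{
    // Step 1: convert the character position to a byte offset.
    $byteOffset = strlen(mb_substr($input, 0, $position, 'UTF-8'));

    // Step 2: an ordered map of expression => token type (hypothetical patterns).
    $map = [
        '/(?<lexeme>\s+)/Au' => SKETCH_TOKEN_WHITESPACE,
        '/(?<lexeme>AND\b)/Au' => SKETCH_TOKEN_LOGICAL_AND,
        '/(?<lexeme>[^\s()]+)/Au' => SKETCH_TOKEN_TERM,
    ];

    foreach ($map as $expression => $type) {
        // Step 3: the "A" modifier anchors the match at the given byte offset.
        if (preg_match($expression, $input, $matches, 0, $byteOffset) === 1) {
            // Step 4: build a token from the match (a plain array in this sketch).
            return ['type' => $type, 'lexeme' => $matches['lexeme'], 'position' => $position];
        }
    }

    // Step 5: nothing matched, so emit a bailout token for a single character.
    return [
        'type' => SKETCH_TOKEN_BAILOUT,
        'lexeme' => mb_substr($input, $position, 1, 'UTF-8'),
        'position' => $position,
    ];
}

// "č" occupies two bytes, so character 4 corresponds to byte 5.
$and = sketchExtract('čaj AND x', 4);
$bailout = sketchExtract('(', 0);
```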

getExpressionTypeMap() - Extension Point

Abstract method that subclasses must implement (lib/Languages/Galach/TokenExtractor.php:61).


Returns an associative array where:

  • Keys are PCRE regular expressions with a named capture group lexeme
  • Values are token type constants (e.g., Tokenizer::TOKEN_TERM)

Example structure:

[
 '/(?<lexeme>AND)/A' => Tokenizer::TOKEN_LOGICAL_AND,
 '/(?<lexeme>OR)/A' => Tokenizer::TOKEN_LOGICAL_OR,
 // ... more patterns
]

The patterns are tried in order, so more specific patterns should appear before more general ones.
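
The effect of pattern order can be seen in a small sketch. The helper sketchFirstMatch() and both maps are hypothetical, and string labels stand in for the token type constants.

```php
<?php
// Hypothetical helper: return the type of the first expression that matches.
function sketchFirstMatch(array $map, string $input): ?string
{
    foreach ($map as $expression => $type) {
        if (preg_match($expression, $input) === 1) {
            return $type;
        }
    }
    return null;
}

// Specific operator pattern first, general word pattern second.
$specificFirst = [
    '/(?<lexeme>AND\b)/A' => 'LOGICAL_AND',
    '/(?<lexeme>\w+)/A' => 'TERM',
];

// Same entries in the opposite order: the general pattern now wins.
$generalFirst = array_reverse($specificFirst, true);

$operator = sketchFirstMatch($specificFirst, 'AND');
$shadowed = sketchFirstMatch($generalFirst, 'AND');
```

With the general pattern first, the input AND is classified as a plain term because the word pattern matches before the operator pattern is ever tried. The word boundary (\b) in the operator pattern also keeps it from matching the first three letters of a longer word such as ANDROID.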

createTermToken() - Extension Point

Abstract method for creating term tokens (lib/Languages/Galach/TokenExtractor.php:73).


Subclasses implement this to create specialized term tokens (Word, Phrase, Tag, User) based on the regex match data. The $data array contains named capture groups from the matched pattern.
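
One way such an implementation could decide which term token to build is to check which named capture group is present in the match data. The sketch below is an assumption about the general shape, not the library's code: it returns plain arrays instead of Word, Phrase, Tag or User objects, and the function name and the tag pattern are made up for the example.

```php
<?php
// Hypothetical sketch: pick the token kind from the named groups in $data.
function sketchCreateTermToken(int $position, array $data): array
{
    $base = ['lexeme' => $data['lexeme'], 'position' => $position];

    if (isset($data['word'])) {
        return $base + ['kind' => 'Word', 'domain' => $data['domain'] ?? '', 'word' => $data['word']];
    }
    if (isset($data['phrase'])) {
        return $base + [
            'kind' => 'Phrase',
            'domain' => $data['domain'] ?? '',
            'quote' => $data['quote'],
            'phrase' => $data['phrase'],
        ];
    }
    if (isset($data['tag'])) {
        return $base + ['kind' => 'Tag', 'marker' => $data['marker'], 'tag' => $data['tag']];
    }
    if (isset($data['user'])) {
        return $base + ['kind' => 'User', 'marker' => $data['marker'], 'user' => $data['user']];
    }

    throw new RuntimeException('Could not create a term token from the given match data');
}

// Illustrative tag pattern; preg_match() fills $data with the named groups.
preg_match('/(?<lexeme>(?<marker>#)(?<tag>\w+))/A', '#php rocks', $data);
$token = sketchCreateTermToken(0, $data);
```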

createGroupBeginToken() - Protected Implementation

Creates GroupBegin tokens (lib/Languages/Galach/TokenExtractor.php:105-108).


Expects the match data to contain:

  • lexeme - Full matched text
  • delimiter - Opening delimiter character
  • domain - Domain prefix (may be empty string)

getByteOffset() - Internal Utility

Converts a character position to a byte offset for preg_match() (lib/Languages/Galach/TokenExtractor.php:120-123).


This is necessary because preg_match()'s offset parameter expects bytes, not characters, which matters for multi-byte UTF-8 strings.
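
A common way to perform this conversion is to take the byte length of the first N characters. The sketch below assumes the mbstring extension is available; the function name sketchByteOffset() is hypothetical and the library's own implementation may differ.

```php
<?php
// Hypothetical helper: byte length of the first $position characters.
function sketchByteOffset(string $string, int $position): int
{
    return strlen(mb_substr($string, 0, $position, 'UTF-8'));
}

// "ž" occupies two bytes in UTF-8, so character 5 ("A") sits at byte 6.
$input = 'žuto AND x';
$byteOffset = sketchByteOffset($input, 5);

// Passing the byte offset finds "AND"; passing the character position
// would start the anchored match one byte too early, on the space.
$matchedWithBytes = preg_match('/AND/A', $input, $m, 0, $byteOffset);
$matchedWithChars = preg_match('/AND/A', $input, $m, 0, 5);
```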

Sources: lib/Languages/Galach/TokenExtractor.php:26-123

Token Creation Flow

Diagram: Token Creation Decision Flow


Sources: lib/Languages/Galach/TokenExtractor.php:26-95


Parsing Interface

The Parsing interface (lib/Parsing.php:1-21) defines the contract for all parser implementations.


Diagram: Parsing Interface in Context


Sources: lib/Parsing.php:1-21

Method Specification

parse()


Input: TokenSequence - An immutable sequence of Token objects (lib/Values/TokenSequence.php)

Output: SyntaxTree - An immutable tree structure containing the root Query node and correction information (lib/Values/SyntaxTree.php)

Contract guarantees:

  • Always returns a valid SyntaxTree, even for malformed input
  • May include correction information documenting syntax errors
  • Should be deterministic (same input produces same output)
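
The contract can be illustrated with a deliberately simplified sketch. All types here (SketchTokenSequence, SketchSyntaxTree, SketchParsing, SketchLenientParser) are hypothetical stand-ins: tokens are reduced to lexeme strings and the tree to a flat list of terms, which is far less than what the real TokenSequence and SyntaxTree hold.

```php
<?php
// Hypothetical stand-ins for the real value objects.
final class SketchTokenSequence
{
    /** @var string[] */
    public array $lexemes;
    public function __construct(array $lexemes) { $this->lexemes = $lexemes; }
}

final class SketchSyntaxTree
{
    public array $terms;
    public array $corrections;
    public function __construct(array $terms, array $corrections)
    {
        $this->terms = $terms;
        $this->corrections = $corrections;
    }
}

interface SketchParsing
{
    public function parse(SketchTokenSequence $tokenSequence): SketchSyntaxTree;
}

// A toy parser that never fails: malformed input is repaired and the
// repair is recorded as a correction instead of raising an error.
final class SketchLenientParser implements SketchParsing
{
    public function parse(SketchTokenSequence $tokenSequence): SketchSyntaxTree
    {
        $terms = [];
        $corrections = [];
        $depth = 0;

        foreach ($tokenSequence->lexemes as $index => $lexeme) {
            if ($lexeme === '(') {
                $depth++;
            } elseif ($lexeme === ')') {
                if ($depth === 0) {
                    $corrections[] = "ignored unmatched ')' at token $index";
                } else {
                    $depth--;
                }
            } else {
                $terms[] = $lexeme;
            }
        }

        if ($depth > 0) {
            $corrections[] = "closed $depth unterminated group(s)";
        }

        return new SketchSyntaxTree($terms, $corrections);
    }
}

$parser = new SketchLenientParser();
$tree = $parser->parse(new SketchTokenSequence(['(', 'one', 'two', ')', ')']));
```

Because the parser holds no state between calls, parsing the same sequence twice yields equal trees, which is the determinism the contract asks for.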

For details on the Galach language parser implementation, see Parser. For information on how corrections work, see Error Handling and Corrections.

Sources: lib/Parsing.php:7-19


Visitor Pattern Base Classes

The Visitor pattern enables polymorphic traversal and processing of the syntax tree. The system provides base interfaces and utility classes for implementing visitors.

Diagram: Visitor Pattern Architecture


Sources: lib/Visitor.php, lib/Visitor/Aggregate.php

Visitor Interface

The Visitor interface (lib/Visitor.php) defines two methods:

accept()


Determines whether this visitor can process the given node. Returns true if it can and false otherwise.

visit()


Processes the node and returns a result. Additional arguments can be passed through for context-specific processing.
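
A minimal sketch of a visitor with these two methods follows. The interface shape, the class names, and the defaultField option are assumptions for illustration; the signatures in lib/Visitor.php may differ.

```php
<?php
// Hypothetical node types for the example.
abstract class SketchNode {}

final class SketchTerm extends SketchNode
{
    public string $word;
    public function __construct(string $word) { $this->word = $word; }
}

final class SketchGroup extends SketchNode {}

// Assumed interface shape: a capability check plus the processing method.
interface SketchVisitor
{
    public function accept(SketchNode $node): bool;
    public function visit(SketchNode $node, array $options = []): string;
}

final class SketchTermVisitor implements SketchVisitor
{
    public function accept(SketchNode $node): bool
    {
        // This visitor only knows how to handle term nodes.
        return $node instanceof SketchTerm;
    }

    public function visit(SketchNode $node, array $options = []): string
    {
        if (!$this->accept($node)) {
            throw new LogicException('Unsupported node: ' . get_class($node));
        }

        // Extra context arrives through $options (a made-up option here).
        $field = $options['defaultField'] ?? 'text';

        return $field . ':' . $node->word;
    }
}

$visitor = new SketchTermVisitor();
$default = $visitor->visit(new SketchTerm('php'));
$custom = $visitor->visit(new SketchTerm('php'), ['defaultField' => 'title']);
```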

Aggregate Visitor

The Aggregate visitor (lib/Visitor/Aggregate.php) implements a dispatcher pattern:

Functionality:

  1. Holds a map of node types to specialized visitor instances
  2. When visiting a node, looks up the appropriate visitor for that node type
  3. Delegates processing to the specialized visitor
  4. Enables composition of visitors without complex inheritance hierarchies

Usage pattern:

$aggregate = new Aggregate([
 'LogicalAnd' => new AndVisitor(),
 'LogicalOr' => new OrVisitor(),
 'Term' => new TermVisitor(),
 // ...
]);

$result = $aggregate->visit($syntaxTree->rootNode);

This allows:

  • Visitor reuse across different generator implementations
  • Modular visitor design with each visitor handling one node type
  • Runtime visitor composition by configuring the aggregate with different visitor sets
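
The dispatch described above can be sketched as a lookup keyed by node class name. This follows the map-based description on this page; all class names are hypothetical and the actual Aggregate in the library may select visitors differently.

```php
<?php
abstract class SketchNode {}

final class SketchTerm extends SketchNode
{
    public string $word;
    public function __construct(string $word) { $this->word = $word; }
}

final class SketchLogicalAnd extends SketchNode
{
    public SketchNode $leftOperand;
    public SketchNode $rightOperand;
    public function __construct(SketchNode $left, SketchNode $right)
    {
        $this->leftOperand = $left;
        $this->rightOperand = $right;
    }
}

// Each specialized visitor receives the aggregate so it can recurse into children.
interface SketchNodeVisitor
{
    public function visit(SketchNode $node, SketchAggregate $aggregate): string;
}

final class SketchAggregate
{
    /** @var array<string, SketchNodeVisitor> node class name => visitor */
    private array $visitors;

    public function __construct(array $visitors) { $this->visitors = $visitors; }

    public function visit(SketchNode $node): string
    {
        $type = get_class($node);
        if (!isset($this->visitors[$type])) {
            throw new RuntimeException("No visitor registered for $type");
        }

        // Delegate to the visitor registered for this node type.
        return $this->visitors[$type]->visit($node, $this);
    }
}

final class SketchTermVisitor implements SketchNodeVisitor
{
    public function visit(SketchNode $node, SketchAggregate $aggregate): string
    {
        return $node->word;
    }
}

final class SketchAndVisitor implements SketchNodeVisitor
{
    public function visit(SketchNode $node, SketchAggregate $aggregate): string
    {
        // Children are sent back through the aggregate for dispatch.
        return '(' . $aggregate->visit($node->leftOperand)
            . ' AND ' . $aggregate->visit($node->rightOperand) . ')';
    }
}

$aggregate = new SketchAggregate([
    SketchTerm::class => new SketchTermVisitor(),
    SketchLogicalAnd::class => new SketchAndVisitor(),
]);
$query = $aggregate->visit(new SketchLogicalAnd(new SketchTerm('one'), new SketchTerm('two')));
```

Swapping in a different visitor for one node type changes the output for that construct only, without touching the other visitors.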

For details on how visitors are used in code generation, see Visitor Pattern Implementation and Lucene Generators Common Components.

Sources: lib/Visitor.php, lib/Visitor/Aggregate.php


Component Interaction Summary

Diagram: Core Component Relationships


Sources: lib/Values/Token.php, lib/Values/TokenSequence.php, lib/Values/Node.php, lib/Values/SyntaxTree.php, lib/Languages/Galach/TokenExtractor.php, lib/Parsing.php, lib/Visitor.php

Design Principles

The core components embody several key design principles:

Principle | Implementation
Immutability | Token and Node objects are immutable value objects
Extension over modification | TokenExtractor and Visitor are designed as extension points
Interface segregation | Parsing interface is minimal with single responsibility
Composition over inheritance | Aggregate visitor enables visitor composition
Type safety | Concrete token and node types provide type-specific properties

These components work together to enable the three-phase translation pipeline: tokenization (using TokenExtractor), parsing (using Parsing implementations), and generation (using Visitor implementations).

Sources: lib/Languages/Galach/TokenExtractor.php:1-124, lib/Parsing.php:1-21, lib/Visitor.php