URL: https://deepwiki.com/netgen/query-translator/6-customization-and-extension

Customization and Extension

This document describes the four primary extension points in the Query Translator library and provides guidance for implementing custom behavior. The system is designed with extensibility in mind, allowing customization of lexical rules, code generation, field mapping, and syntax features without modifying core code.

For information about the built-in Galach language features, see The Galach Language. For details about the tokenization process, see Tokenization. For information about built-in generators, see Query Generation.

Sources: README.md31-35 lib/Languages/Galach/README.md241-474


Extension Points Overview

The Query Translator provides four primary extension points that enable customization at different stages of the query processing pipeline:

Extension Point | Purpose | Implementation Mechanism | Key Use Cases
TokenExtractor | Customize lexical analysis rules | Extend abstract class | Change special characters, select language features, implement custom term tokens
Visitor | Customize code generation logic | Implement visitor interface | Support new search backends, customize output format
Field Mapping | Translate domain prefixes to backend fields | Configuration array | Map Galach domains to Solr/Elasticsearch field names
Domain Prefix | Enable field-scoped queries | Captured in Token data | Allow users to target specific fields in queries

Sources: README.md31-50 lib/Languages/Galach/README.md241-253

Extension Points Architecture


Sources: lib/Languages/Galach/TokenExtractor.php1-124 lib/Languages/Galach/README.md241-427


TokenExtractor Customization

The TokenExtractor abstract class is the primary extension point for customizing lexical analysis. By extending this class, you can control which tokens are recognized, change special characters used in syntax, select language features, and implement custom term tokens.

Sources: lib/Languages/Galach/TokenExtractor.php9-14 lib/Languages/Galach/README.md255-260

Implementation Requirements

When extending TokenExtractor, you must implement two abstract methods:

Method | Purpose | Parameters | Return Type
getExpressionTypeMap() | Define regex patterns for token recognition | None | array mapping regex to token type constants
createTermToken() | Create term token from matched data | $position (int), $data (array) | Token instance

Optionally, you can override:

  • createGroupBeginToken() - Customize TOKEN_GROUP_BEGIN token creation

Sources: lib/Languages/Galach/TokenExtractor.php51-73 lib/Languages/Galach/README.md392-420

TokenExtractor Implementation Flow


Sources: lib/Languages/Galach/TokenExtractor.php26-49 lib/Languages/Galach/README.md392-420

Regular Expression Requirements

The getExpressionTypeMap() method must return an array where:

  • Keys are regular expressions (PCRE format)
  • Values are token type constants from Tokenizer::TOKEN_*
  • Each regex must define a named capturing group 'lexeme' that identifies the recognized token

The tokenizer will attempt to match regexes in the order provided. The first successful match determines the token type.

Example regex pattern structure:
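
A minimal sketch of such a map, written against plain PHP rather than the library. The string type names are hypothetical stand-ins for the Tokenizer::TOKEN_* constants, the patterns are illustrative and not the library's own, and matchAt() is a helper invented here to show the first-match-wins rule. It assumes PHP 7.1 or later.

```php
<?php
// Illustrative pattern map: keys are PCRE patterns, values are token types.
// The string values are hypothetical stand-ins for Tokenizer::TOKEN_* constants.
$expressionTypeMap = [
    '/\G(?<lexeme>\s+)/u' => 'whitespace',
    '/\G(?<lexeme>AND)\b/u' => 'logical_and',
    '/\G(?<lexeme>[^\s()]+)/u' => 'term',
];

// Try each pattern in order at the given byte offset; the first match wins.
function matchAt(array $map, string $input, int $offset): ?array
{
    foreach ($map as $pattern => $type) {
        if (preg_match($pattern, $input, $matches, 0, $offset) === 1) {
            return ['type' => $type, 'lexeme' => $matches['lexeme']];
        }
    }
    return null; // nothing matched at this position
}

$first = matchAt($expressionTypeMap, 'one AND two', 0);
$second = matchAt($expressionTypeMap, 'one AND two', 4);
```

Because the 'logical_and' pattern is listed before the catch-all 'term' pattern, the text at offset 4 is classified as an operator rather than as a term.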


Where:

  • \G anchors the match to the current position
  • (?<lexeme>...) defines the named capturing group
  • /u enables UTF-8 mode

Sources: lib/Languages/Galach/TokenExtractor.php51-61 lib/Languages/Galach/README.md389-398

Reference Implementations

The library provides two TokenExtractor implementations that serve as both production code and reference examples:

Full TokenExtractor

Located at lib/Languages/Galach/TokenExtractor/Full.php this implementation supports the complete Galach syntax including:

  • All term types: Word, Phrase, Tag, User
  • All operators: AND, &&, OR, ||, NOT, !, +, -
  • Grouping with parentheses
  • Domain prefixes on terms and groups
  • Whitespace handling

Sources: lib/Languages/Galach/README.md422-426

Text TokenExtractor

Located at lib/Languages/Galach/TokenExtractor/Text.php this implementation provides a simplified subset:

  • Only Word and Phrase terms
  • No Tag or User support
  • No domain prefix support
  • Basic operators and grouping

This extractor is useful for simple text search use cases where the full syntax is not needed.

Sources: lib/Languages/Galach/README.md422-426

Full vs Text TokenExtractor Comparison

Feature | Full TokenExtractor | Text TokenExtractor
Word terms | Yes | Yes
Phrase terms | Yes | Yes
Tag terms (#tag) | Yes | No
User terms (@user) | Yes | No
Domain prefixes (field:term) | Yes | No
Logical operators (AND, OR, NOT) | Yes | Yes
Unary operators (+, -, !) | Yes | Yes
Grouping ((, )) | Yes | Yes
Use case | Complete query language | Simple text search

Sources: lib/Languages/Galach/README.md422-426

Creating Custom Term Tokens

The createTermToken() method receives data extracted through regex matching and must return a Token instance. You can create custom Token subclasses to represent specialized term types.

Process:

  1. Use named capturing groups in your regex to extract semantic data
  2. In createTermToken(), parse the $data array to access captured groups
  3. Create and return an instance of your custom Token subclass

Example of token creation pattern:
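
The steps above can be sketched as follows. SketchToken and RangeToken are hypothetical stand-ins (the library's real Token class and its constructor are not reproduced here), the "10..20" range syntax is invented for illustration, and createTermToken() is written as a free function where the library expects a method on your TokenExtractor subclass.

```php
<?php
// Hypothetical stand-in for the library's Token value object.
class SketchToken
{
    public $type;
    public $lexeme;
    public $position;

    public function __construct(string $type, string $lexeme, int $position)
    {
        $this->type = $type;
        $this->lexeme = $lexeme;
        $this->position = $position;
    }
}

// Hypothetical custom term token for a numeric range such as "10..20".
class RangeToken extends SketchToken
{
    public $from;
    public $to;

    public function __construct(string $lexeme, int $position, int $from, int $to)
    {
        parent::__construct('term', $lexeme, $position);
        $this->from = $from;
        $this->to = $to;
    }
}

// Mirrors the role of createTermToken(): build a token from regex captures.
function createTermToken(int $position, array $data): SketchToken
{
    if (isset($data['from'], $data['to'])) {
        return new RangeToken($data['lexeme'], $position, (int) $data['from'], (int) $data['to']);
    }
    return new SketchToken('term', $data['lexeme'], $position);
}

// Named groups carry the semantic parts of the match into $data.
preg_match('/\G(?<lexeme>(?<from>\d+)\.\.(?<to>\d+))/u', 'price 10..20', $data, 0, 6);
$token = createTermToken(6, $data);
```

The custom token keeps the generic term type, so only code that knows about RangeToken needs to look at the extra from and to properties.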


Sources: lib/Languages/Galach/TokenExtractor.php63-73 lib/Languages/Galach/README.md399-408

Customizing Syntax Elements

By modifying the regex patterns returned by getExpressionTypeMap(), you can:

Change Special Characters

Modify the characters used for operators, delimiters, and markers:

  • Operators: AND, &&, OR, ||, NOT, !, +, -
  • Delimiters: (, ), "
  • Markers: @ (user), # (tag)
  • Domain separator: :

Example: Use ~ instead of ! for shorthand logical NOT.

Sources: lib/Languages/Galach/README.md244-250

Select Language Features

Omit regex patterns for token types you don't want to support:

  • Remove TOKEN_TAG pattern to disable tag support
  • Remove TOKEN_USER pattern to disable user support
  • Remove TOKEN_LOGICAL_AND pattern to disable AND operator
  • Remove domain capture groups to disable domain prefixes

This allows you to create a restricted subset of the Galach language tailored to your use case.
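
Both kinds of customization come down to editing the pattern map. The sketch below starts from a small hypothetical base map (the patterns and the string type names are stand-ins, not the library's), swaps "!" for "~", and drops the tag pattern.

```php
<?php
// Hypothetical base map; type names stand in for Tokenizer::TOKEN_* constants.
$baseMap = [
    '/\G(?<lexeme>\s+)/u' => 'whitespace',
    '/\G(?<lexeme>!)/u' => 'logical_not_shorthand',
    '/\G(?<lexeme>#(?<tag>\w+))/u' => 'tag',
    '/\G(?<lexeme>[^\s!#~]+)/u' => 'word',
];

// Derive a custom map: swap "!" for "~" and drop the tag pattern entirely.
function customize(array $map): array
{
    $custom = [];
    foreach ($map as $pattern => $type) {
        if ($type === 'tag') {
            continue; // feature removed: "#name" no longer gets its own token
        }
        if ($type === 'logical_not_shorthand') {
            $pattern = '/\G(?<lexeme>~)/u';
        }
        $custom[$pattern] = $type; // original order is preserved
    }
    return $custom;
}

$customMap = customize($baseMap);
```

In a real extractor you would more likely write the customized map out directly in getExpressionTypeMap(); deriving it in a loop just makes the two changes easy to see.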

Sources: lib/Languages/Galach/README.md250-253

Handle Unrecognized Input

When no regex matches at a position, the TokenExtractor automatically creates a TOKEN_BAILOUT token containing a single character. The parser will later ignore these tokens and report them as corrections.

Because every input position yields some token, tokenization does not fail on malformed input, and the rest of the pipeline can still produce valid output.
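
The bailout behavior can be sketched with a small standalone loop. This is an illustration of the idea, not the library's tokenizer: tokenizeWithBailout() and the string type names are hypothetical, tokens are plain arrays, and the input is assumed to be valid UTF-8.

```php
<?php
// Tokenize a whole string with a pattern map, emitting a one-character
// "bailout" token wherever no pattern matches. Type names are stand-ins.
function tokenizeWithBailout(array $map, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $token = null;
        foreach ($map as $pattern => $type) {
            $matched = preg_match($pattern, $input, $m, 0, $offset) === 1;
            if ($matched && $m['lexeme'] !== '') {
                $token = ['type' => $type, 'lexeme' => $m['lexeme']];
                break;
            }
        }
        if ($token === null) {
            // No pattern matched here: consume exactly one UTF-8 character.
            preg_match('/\G./su', $input, $m, 0, $offset);
            $token = ['type' => 'bailout', 'lexeme' => $m[0]];
        }
        $tokens[] = $token;
        $offset += strlen($token['lexeme']); // offsets are in bytes
    }
    return $tokens;
}

$map = [
    '/\G(?<lexeme>\s+)/u' => 'whitespace',
    '/\G(?<lexeme>[a-z]+)/u' => 'word',
];
$tokens = tokenizeWithBailout($map, 'one $ two');
```

The "$" character matches neither pattern, so it becomes a single bailout token and scanning resumes at the next character.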

Sources: lib/Languages/Galach/TokenExtractor.php44-49 lib/Languages/Galach/README.md122-126


Visitor and Generator Customization

The Visitor pattern is used to traverse the SyntaxTree and generate backend-specific output. You customize generation by implementing visitor classes for the node types you need and composing them with the Aggregate dispatcher.

Sources: lib/Languages/Galach/README.md437-473

Visitor Pattern Architecture


Sources: lib/Languages/Galach/README.md459-468

Creating Custom Generators

To create a custom generator for a new search backend:

  1. Create visitor classes for each node type you need to handle
  2. Instantiate the Aggregate visitor with your collection of visitors
  3. Create a generator class that uses the Aggregate visitor
  4. Implement backend-specific escaping in your visitors

Example generator instantiation pattern:
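
A self-contained sketch of the composition described in the steps above. Every class here (SketchNode, SketchVisitor, SketchAggregate, SketchGenerator and the two leaf visitors) is a hypothetical stand-in that mirrors the roles named in the text; none of them is the library's own class, and the real constructors may differ.

```php
<?php
// All classes below are hypothetical stand-ins, not the library's classes.
abstract class SketchNode {}
class WordNode extends SketchNode
{
    public $word;
    public function __construct(string $word) { $this->word = $word; }
}
class PhraseNode extends SketchNode
{
    public $phrase;
    public function __construct(string $phrase) { $this->phrase = $phrase; }
}

abstract class SketchVisitor
{
    abstract public function accept(SketchNode $node): bool;
    abstract public function visit(SketchNode $node, ?SketchVisitor $subVisitor = null, $options = null);
}

class WordVisitor extends SketchVisitor
{
    public function accept(SketchNode $node): bool { return $node instanceof WordNode; }
    public function visit(SketchNode $node, ?SketchVisitor $subVisitor = null, $options = null)
    {
        return $node->word;
    }
}

class PhraseVisitor extends SketchVisitor
{
    public function accept(SketchNode $node): bool { return $node instanceof PhraseNode; }
    public function visit(SketchNode $node, ?SketchVisitor $subVisitor = null, $options = null)
    {
        return '"' . $node->phrase . '"';
    }
}

// Dispatcher: hands each node to the first visitor that accepts it.
class SketchAggregate extends SketchVisitor
{
    private $visitors;
    public function __construct(array $visitors) { $this->visitors = $visitors; }
    public function accept(SketchNode $node): bool { return true; }
    public function visit(SketchNode $node, ?SketchVisitor $subVisitor = null, $options = null)
    {
        foreach ($this->visitors as $visitor) {
            if ($visitor->accept($node)) {
                return $visitor->visit($node, $this, $options);
            }
        }
        throw new RuntimeException('No visitor accepts ' . get_class($node));
    }
}

// Generator: a thin wrapper that starts the traversal at the root node.
class SketchGenerator
{
    private $visitor;
    public function __construct(SketchVisitor $visitor) { $this->visitor = $visitor; }
    public function generate(SketchNode $root, $options = null)
    {
        return $this->visitor->visit($root, null, $options);
    }
}

$generator = new SketchGenerator(
    new SketchAggregate([new WordVisitor(), new PhraseVisitor()])
);
$output = $generator->generate(new PhraseNode('hello world'));
```

Adding support for another node type then means writing one more visitor and adding it to the array passed to the aggregate.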


Sources: lib/Languages/Galach/README.md86-112 lib/Languages/Galach/README.md459-473

Visitor Method Signatures

Each visitor must implement two methods from the Visitor interface:

accept(Node $node): bool

Determines whether this visitor can handle the given node.

  • Parameter: $node - The node to check
  • Returns: bool - true if this visitor handles the node, false otherwise

visit(Node $node, Visitor $subVisitor = null, $options = null): mixed

Processes the node and generates output.

  • Parameters:
    • $node - The node to process
    • $subVisitor - Optional visitor for processing child nodes
    • $options - Optional parameters to control generation behavior
  • Returns: mixed - Generated output (typically a string)

Sources: lib/Languages/Galach/README.md459-465

Aggregate Visitor Dispatcher

The Aggregate visitor dispatches each node to a concrete visitor. It follows the composite pattern: it checks each registered visitor's accept() method in turn and delegates to the first one that returns true.

Located at lib/Languages/Galach/Generators/Common/Aggregate.php this dispatcher enables:

  • Visitor composition - Combine multiple visitors into a single entry point
  • Polymorphic dispatch - Route nodes to appropriate handlers automatically
  • Visitor reuse - Share common visitors across different generators

Sources: lib/Languages/Galach/README.md459-465

Term Visitor Dispatch Strategy

Term nodes require special handling because a single Term node can represent different token types (Word, Phrase, Tag, User, or custom types). Visitors should dispatch based on both:

  1. Node type - Check $node instanceof Term
  2. Token type - Check the token aggregated by the term node

Example dispatch pattern:
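
A minimal sketch of the two-level check. TermStub and the token stubs are hypothetical stand-ins for the library's Term node and token classes, and the "tags:" output format is invented for illustration.

```php
<?php
// Hypothetical stand-ins: a Term node that aggregates a token, plus two token types.
class TokenStub
{
    public $lexeme;
    public function __construct(string $lexeme) { $this->lexeme = $lexeme; }
}
class WordTokenStub extends TokenStub {}
class TagTokenStub extends TokenStub {}

class TermStub
{
    public $token;
    public function __construct(TokenStub $token) { $this->token = $token; }
}

// Accepts only Term nodes whose token is a tag; other terms go to other visitors.
class TagTermVisitor
{
    public function accept($node): bool
    {
        return $node instanceof TermStub && $node->token instanceof TagTokenStub;
    }

    public function visit($node)
    {
        // Drop the "#" marker and emit a field-scoped value (format is illustrative).
        return 'tags:' . ltrim($node->token->lexeme, '#');
    }
}

$visitor = new TagTermVisitor();
$tagTerm = new TermStub(new TagTokenStub('#php'));
$wordTerm = new TermStub(new WordTokenStub('php'));
```

Here $visitor->accept($tagTerm) is true and $visitor->accept($wordTerm) is false, so a separate word visitor can claim the second node.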


Sources: lib/Languages/Galach/README.md461-463

Field Mapping Configuration

Field mapping translates Galach domain prefixes to backend-specific field names. This is particularly important for the Lucene-based generators (ExtendedDisMax and QueryString).

Configuration structure:
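
A field map can be pictured as a plain associative array from domain to field name. In the sketch below the field names are made up, and resolveField() is a hypothetical helper showing the lookup with a default and with pass-through; it is not a function provided by the library.

```php
<?php
// Illustrative map from Galach domain prefixes to backend field names.
// The field names are made up for this sketch.
$domainFieldMap = [
    'title' => 'title_t',
    'author' => 'author_name_s',
];

// Resolve a domain to a field: mapped name, else the default, else the domain itself.
function resolveField(string $domain, array $map, ?string $default = null): string
{
    if (isset($map[$domain])) {
        return $map[$domain];
    }
    return $default ?? $domain;
}

$mapped = resolveField('title', $domainFieldMap);
$fallback = resolveField('body', $domainFieldMap, 'text_t');
$passedThrough = resolveField('body', $domainFieldMap);
```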


Usage in visitors:

  • Pass the field map to visitor constructors
  • Visitors consult the map when processing domain prefixes
  • Unmapped domains can fall back to a default field or be passed through unchanged

Sources: lib/Languages/Galach/README.md449-457

Backend-Specific Escaping

Different search backends require different characters to be escaped. Implement escaping logic in your visitor's visit() method:

BackendCharacters to Escape
Native Galach\, +, -, !, (, ), :, ", @, #, (space)
ExtendedDisMax\, +, -, *, ?
QueryString\, +, -, =, >, <, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, /

Escaping pattern:
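
One way to implement this is a single-pass replacement that prefixes each special character with a backslash. The helper name escapeCharacters() and the character list are assumptions chosen for illustration; substitute the set your backend actually requires.

```php
<?php
// Prefix each listed character with a backslash. strtr() with an array works
// in a single pass, so backslashes it inserts are not escaped a second time.
function escapeCharacters(string $text, array $characters): string
{
    $replacements = [];
    foreach ($characters as $character) {
        $replacements[$character] = '\\' . $character;
    }
    return strtr($text, $replacements);
}

// Character list chosen for illustration; use the set your backend requires.
$special = ['\\', '+', '-', '*', '?'];
$escaped = escapeCharacters('c++ or c-like?', $special);
```

Including the backslash itself in the list matters: without it, a backslash typed by the user would be read by the backend as the start of an escape sequence.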


Sources: lib/Languages/Galach/README.md449-457

Reusing Lucene Common Components

The library provides shared visitor components for Lucene-based backends at lib/Languages/Galach/Generators/Lucene/Common/:

  • BinaryOperator - Handles LogicalAnd and LogicalOr nodes
  • Group - Handles Group nodes
  • Phrase - Handles phrase terms
  • Query - Handles root Query node
  • Tag - Handles tag terms
  • UnaryOperator - Handles LogicalNot, Mandatory, Prohibited nodes
  • User - Handles user terms

These can be reused in custom generators targeting Lucene-based backends, so that only the word visitor and the field mapping need customization.

Sources: lib/Languages/Galach/README.md449-457

Visitor Options Parameter

The $options parameter in visit() allows external control of generation behavior. Uses include:

  • Field mapping - Pass field map to control domain translation
  • Generation flags - Enable/disable features like phrase slop or fuzzy matching
  • Context information - Pass parent node context for conditional logic

Options are propagated through the visitor tree, allowing child visitors to access them.

Sources: lib/Languages/Galach/README.md464-465


Custom Generator Implementation Example
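
The following compact sketch ties the pieces on this page together: node-type dispatch, recursion through the sub-visitor, escaping, and a field map passed through $options. All Demo* classes are hypothetical stand-ins for the roles described here, the 'fieldMap' option key is an assumption, and the output format is only loosely Lucene-like.

```php
<?php
// Every class here is a hypothetical stand-in, not the library's implementation.
abstract class DemoNode {}
class DemoTerm extends DemoNode
{
    public $domain;
    public $word;
    public function __construct(string $word, string $domain = '')
    {
        $this->word = $word;
        $this->domain = $domain;
    }
}
class DemoAnd extends DemoNode
{
    public $left;
    public $right;
    public function __construct(DemoNode $left, DemoNode $right)
    {
        $this->left = $left;
        $this->right = $right;
    }
}

abstract class DemoVisitor
{
    abstract public function accept(DemoNode $node): bool;
    abstract public function visit(DemoNode $node, ?DemoVisitor $subVisitor = null, $options = null);
}

class DemoTermVisitor extends DemoVisitor
{
    public function accept(DemoNode $node): bool { return $node instanceof DemoTerm; }
    public function visit(DemoNode $node, ?DemoVisitor $subVisitor = null, $options = null)
    {
        // Escape a few characters, then apply the field map passed via $options.
        $word = strtr($node->word, ['\\' => '\\\\', ':' => '\\:', '"' => '\\"']);
        if ($node->domain === '') {
            return $word;
        }
        $map = $options['fieldMap'] ?? [];
        return ($map[$node->domain] ?? $node->domain) . ':' . $word;
    }
}

class DemoAndVisitor extends DemoVisitor
{
    public function accept(DemoNode $node): bool { return $node instanceof DemoAnd; }
    public function visit(DemoNode $node, ?DemoVisitor $subVisitor = null, $options = null)
    {
        // Children are rendered through the sub-visitor, with options passed along.
        return '(' . $subVisitor->visit($node->left, $subVisitor, $options)
            . ' AND ' . $subVisitor->visit($node->right, $subVisitor, $options) . ')';
    }
}

class DemoAggregate extends DemoVisitor
{
    private $visitors;
    public function __construct(array $visitors) { $this->visitors = $visitors; }
    public function accept(DemoNode $node): bool { return true; }
    public function visit(DemoNode $node, ?DemoVisitor $subVisitor = null, $options = null)
    {
        foreach ($this->visitors as $visitor) {
            if ($visitor->accept($node)) {
                return $visitor->visit($node, $this, $options);
            }
        }
        throw new RuntimeException('No visitor accepts ' . get_class($node));
    }
}

$aggregate = new DemoAggregate([new DemoTermVisitor(), new DemoAndVisitor()]);
$tree = new DemoAnd(new DemoTerm('php', 'title'), new DemoTerm('a:b'));
$query = $aggregate->visit($tree, null, ['fieldMap' => ['title' => 'title_t']]);
```

The operator visitor never inspects its children directly. It hands them back to the aggregate, which is what lets new node types be added without touching existing visitors.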


Sources: lib/Languages/Galach/README.md86-112 lib/Languages/Galach/README.md437-473


Parser Customization Limitations

The Parser class at lib/Languages/Galach/Parser.php is marked as final and is not intended for extension. It only examines token types (via Tokenizer::TOKEN_* constants) and ignores token subtypes and custom data.

This design ensures:

  • TokenExtractor customizations are transparent - Custom tokens with type TOKEN_TERM are processed like any other term
  • Syntax subset selection works automatically - Parser handles whatever tokens the tokenizer produces
  • Core parsing logic remains stable - No need to modify parser for common customizations

If you need fundamentally different parsing behavior, you should implement a new language rather than attempt to customize Galach.

Sources: lib/Languages/Galach/README.md428-435 lib/Parsing.php1-20


Extension Strategy Summary

Choose your extension approach based on requirements:

Requirement | Extension Point | Implementation
Change operator symbols | TokenExtractor | Override regex patterns in getExpressionTypeMap()
Add custom term types | TokenExtractor | Implement createTermToken() with custom Token subclass
Support subset of syntax | TokenExtractor | Omit unwanted token patterns from getExpressionTypeMap()
Target new search backend | Visitor | Create visitor set and generator class
Customize field translation | Configuration | Pass field map to visitors via options or constructor
Modify output format | Visitor | Implement custom visit() methods with different output
Reuse existing visitors | Composition | Use Lucene common visitors with custom word visitor

Sources: lib/Languages/Galach/README.md241-473
