URL: https://deepwiki.com/netgen/query-translator/4.2-token-extractors

Token Extractors

Purpose and Scope

This document explains the TokenExtractor abstract class, which serves as the primary extension point for customizing lexical rules in the Query Translator system. Token extractors define the regular expression patterns that recognize different token types in input query strings and create appropriate Token objects from matched text.

This page covers the abstract TokenExtractor base class, the two provided implementations (Full and Text), and guidance for implementing custom extractors. For information about the token types themselves, see Token Types. For details on how extractors are used in the tokenization process, see Tokenization Process.

Sources: lib/Languages/Galach/TokenExtractor.php:1-124


TokenExtractor Abstract Class

The TokenExtractor abstract class is located at lib/Languages/Galach/TokenExtractor.php and provides the foundation for all token extraction implementations. It defines the contract for extracting tokens from input strings and handles the common logic of regex matching and byte offset calculation.

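Class Hierarchy

TokenExtractor is abstract and is extended by the two provided concrete classes, Full and Text.

Sources: lib/Languages/Galach/TokenExtractor.php:14-124, lib/Languages/Galach/TokenExtractor/Full.php:18, lib/Languages/Galach/TokenExtractor/Text.php:17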

Extraction Workflow

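The extract() method at lib/Languages/Galach/TokenExtractor.php:26-49 orchestrates the token extraction process. It converts the character position to a byte offset, tries each expression returned by getExpressionTypeMap() in array order, and builds a token from the first expression that matches. A BAILOUT token is created for unrecognized input.

Sources: lib/Languages/Galach/TokenExtractor.php:26-49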
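
The extraction flow can be sketched as follows. This is a Python approximation of the control flow described in this section, not the library's PHP code. The token type names, the two sample patterns, the Token tuple, and the one-character BAILOUT lexeme are all illustrative assumptions. Python's Pattern.match(text, pos) stands in for preg_match() with the /A flag.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    lexeme: str
    position: int


# Hypothetical expression-to-type map; dict insertion order is the matching order.
EXPRESSION_TYPE_MAP = {
    r'(?P<lexeme>\s+)': "WHITESPACE",
    r'(?P<lexeme>[^\s"()]+)': "TERM",
}


def extract(text: str, position: int) -> Token:
    """Return the token starting at `position`, or a BAILOUT token."""
    for expression, kind in EXPRESSION_TYPE_MAP.items():
        # match() with a start position only succeeds exactly at that position.
        match = re.compile(expression).match(text, position)
        if match is not None:
            return Token(kind, match.group("lexeme"), position)
    # Assumption: unrecognized input is consumed one character at a time.
    return Token("BAILOUT", text[position], position)


print(extract("foo bar", 0))
print(extract("foo bar", 3))
print(extract("(foo", 0))
```

Because the first matching expression wins, the order of the map entries is part of the lexical rules, not just a detail of the data structure.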

Key Methods

Method                             Visibility          Required       Purpose
extract(string, int)               public final        Inherited      Main entry point; extracts the token at the given position
getExpressionTypeMap()             protected abstract  Must override  Returns an array mapping regex patterns to token types
createTermToken(int, array)        protected abstract  Must override  Creates term-type tokens (Word, Phrase, Tag, User)
createGroupBeginToken(int, array)  protected           Optional       Creates GroupBegin tokens; can be overridden
getByteOffset(string, int)         private             Inherited      Converts a character position to a byte offset for preg_match()

Sources: lib/Languages/Galach/TokenExtractor.php:26-123

Expression Type Map Format

The getExpressionTypeMap() method must return an array where:

  • Keys are PCRE regular expressions with the /Au flags
  • Values are token type constants from the Tokenizer class
  • Each regex must define a named capture group (?<lexeme>...) identifying the matched token text
  • Additional named capture groups can extract token-specific data

The /Au flags are critical:

  • /A anchors matching at the current byte offset
  • /u enables UTF-8 mode for proper multi-byte character handling

Sources: lib/Languages/Galach/TokenExtractor.php:51-60
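
To make the role of the two flags concrete, here is a small Python sketch (the library itself is PHP). byte_offset is a hypothetical stand-in for what getByteOffset() is described as doing, and Pattern.match(text, pos) is used as the closest Python analogue of /A anchoring. Python's re module indexes by character, so it does not need the byte conversion that preg_match() requires.

```python
import re


def byte_offset(text: str, position: int) -> int:
    """Number of UTF-8 bytes occupied by the first `position` characters."""
    return len(text[:position].encode("utf-8"))


query = "čaj AND kava"

# 'č' takes two bytes in UTF-8, so the character at index 4 starts at byte 5.
print(byte_offset(query, 4))

# Python requires (?P<name>...) where PCRE also accepts (?<name>...).
pattern = re.compile(r"(?P<lexeme>AND)")

# Anchored matching: the pattern must match exactly at the given position.
print(pattern.match(query, 0))                    # no match at position 0
print(pattern.match(query, 4).group("lexeme"))    # match at position 4
```

Without anchoring, a pattern could skip ahead and match later in the string, which would silently drop the input in between.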


Full Implementation

The Full implementation at lib/Languages/Galach/TokenExtractor/Full.php supports all features of the Galach language, including domain prefixes, tags, and users.

Supported Token Types and Patterns

The Full extractor defines expressions for all Galach token types at lib/Languages/Galach/TokenExtractor/Full.php:25-39.

Sources: lib/Languages/Galach/TokenExtractor/Full.php:25-39

Domain Support

The Full extractor recognizes optional domain prefixes in three contexts:

  1. GROUP_BEGIN at lib/Languages/Galach/TokenExtractor/Full.php:34:

    /(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<delimiter>\())/Au
    

    Matches: domain:( or just (

  2. PHRASE at lib/Languages/Galach/TokenExtractor/Full.php:37:

    /(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<quote>(?<!\\\\)["])(?<phrase>.*?)(?:(?<!\\\\)(?P=quote)))/Aus
    

    Matches: domain:"phrase" or "phrase"

  3. WORD at lib/Languages/Galach/TokenExtractor/Full.php:38:

    /(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<word>(?:\\\\\\\\|\\\\ |\\\\\(|\\\\\)|\\\\"|[^"()\s])+?))/Au
    

    Matches: domain:word or word

Sources: lib/Languages/Galach/TokenExtractor/Full.php:34-38
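
The optional domain prefix can be tried out with Python translations of the GROUP_BEGIN and PHRASE patterns. The translation rests on a few assumptions: (?<name>...) is rewritten as (?P<name>...), the four backslashes shown in the patterns are read as PHP string escaping for a single literal backslash in the regex, and the /s flag is mapped to re.DOTALL. The describe helper is hypothetical.

```python
import re

# Shared optional domain prefix, e.g. "title:".
DOMAIN = r"(?:(?P<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?"

GROUP_BEGIN = re.compile(r"(?P<lexeme>" + DOMAIN + r"(?P<delimiter>\())")

PHRASE = re.compile(
    r"(?P<lexeme>" + DOMAIN
    + r'(?P<quote>(?<!\\)["])(?P<phrase>.*?)(?:(?<!\\)(?P=quote)))',
    re.DOTALL,
)


def describe(pattern, text):
    """Return the named groups of a match at the start of `text`, or None."""
    match = pattern.match(text)
    return match.groupdict() if match else None


print(describe(GROUP_BEGIN, "title:(foo"))
print(describe(GROUP_BEGIN, "(foo"))
print(describe(PHRASE, 'title:"hello world" rest'))
```

When no prefix is present, the domain group does not participate in the match. Python reports it as None; how the PHP side represents it is not shown here.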

Term Token Creation

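The createTermToken() method at lib/Languages/Galach/TokenExtractor/Full.php:46-87 creates one of four term token types (Word, Phrase, Tag, or User), depending on which named capture group is set in the match.

Special character unescaping: Full recognizes backslash escapes for \ + - ! ( ) " : # @ and space (see Feature Comparison).

Sources: lib/Languages/Galach/TokenExtractor/Full.php:46-87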
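
Unescaping can be illustrated with a short Python sketch. It is not the library's implementation. The two character sets are assumptions taken from the lists on this page, read so that the counts of 10 and 7 in the comparison table exclude the backslash itself. The unescape function and its behavior of dropping the backslash in front of a special character are likewise assumed.

```python
import re

# Assumed escapable character sets (the backslash itself is handled separately).
FULL_SPECIAL = '"+-!():#@ '   # 10 characters
TEXT_SPECIAL = '"+-!() '      # 7 characters


def unescape(word: str, special: str) -> str:
    """Drop the backslash in front of a special character or another backslash."""
    char_class = re.escape("\\" + special)
    return re.sub(r"\\([" + char_class + r"])", r"\1", word)


print(unescape(r"foo\:bar", FULL_SPECIAL))   # colon is special in Full
print(unescape(r"foo\:bar", TEXT_SPECIAL))   # colon is not special in Text
print(unescape(r"\(x\)", TEXT_SPECIAL))
```

A backslash in front of a character outside the set is left untouched, so the same input can unescape differently under the two extractors.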


Text Implementation

The Text implementation at lib/Languages/Galach/TokenExtractor/Text.php provides a simplified subset of Galach features, supporting only basic text search functionality without domains, tags, or users.

Supported Token Types

The Text extractor defines a reduced set of expressions at lib/Languages/Galach/TokenExtractor/Text.php:24-36. The patterns below are simplified for readability:

Priority  Token Type     Pattern      Description
1         WHITESPACE     /[\s]+/      One or more whitespace characters
2         MANDATORY      /\+/         Plus sign operator
3         PROHIBITED     /-/          Minus sign operator
4         LOGICAL_NOT_2  /!/          Exclamation mark operator
5         GROUP_END      /\)/         Closing parenthesis
6         LOGICAL_NOT    /NOT/        NOT keyword
7         LOGICAL_AND    /AND|&&/     AND keyword or &&
8         LOGICAL_OR     /OR|\|\|/    OR keyword or ||
9         GROUP_BEGIN    /\(/         Opening parenthesis (no domain)
10        TERM (phrase)  /"..."/      Quoted phrase (no domain)
11        TERM (word)    /[term]/     Word term (no domain)

Sources: lib/Languages/Galach/TokenExtractor/Text.php:24-36

Differences from Full


Key differences:

  1. No domain support: the GROUP_BEGIN, PHRASE, and WORD patterns do not capture domain prefixes
  2. No Tag tokens: the pattern for #identifier is not included
  3. No User tokens: the pattern for @identifier is not included
  4. Simplified special character escaping: only \ + - ! ( ) " and space can be escaped (lib/Languages/Galach/TokenExtractor/Text.php:54)
  5. Overridden createGroupBeginToken(): a simplified version at lib/Languages/Galach/TokenExtractor/Text.php:72-75 always sets the domain to an empty string

Sources: lib/Languages/Galach/TokenExtractor/Text.php:24-76, lib/Languages/Galach/TokenExtractor/Full.php:25-88


Feature Comparison

Feature               Full      Text     Notes
Whitespace            Yes       Yes      Identical pattern
Mandatory (+)         Yes       Yes      Identical pattern
Prohibited (-)        Yes       Yes      Identical pattern
Logical NOT (!, NOT)  Yes       Yes      Identical patterns
Logical AND           Yes       Yes      Identical pattern
Logical OR            Yes       Yes      Identical pattern
Group Begin           Yes       Yes      Full supports domain prefix
Group End             Yes       Yes      Identical pattern
Word terms            Yes       Yes      Full supports domain prefix
Phrase terms          Yes       Yes      Full supports domain prefix
Tag terms (#tag)      Yes       No       Full only
User terms (@user)    Yes       No       Full only
Domain prefixes       Yes       No       Full only
Escaped characters    10 chars  7 chars  Full adds :, #, @

Use case guidance:

  • Use Full: When you need complete Galach syntax including field-scoped queries, tag filtering, or user mentions
  • Use Text: For simple text search interfaces where only words, phrases, and basic operators are needed

Sources: lib/Languages/Galach/TokenExtractor/Full.php:25-88, lib/Languages/Galach/TokenExtractor/Text.php:24-76


Implementing Custom Extractors

Custom token extractors let you extend or modify the lexical rules of Galach, or define an entirely new query language.

Required Implementation

To create a custom extractor, extend TokenExtractor and implement its two abstract methods, getExpressionTypeMap() and createTermToken().

Sources: lib/Languages/Galach/TokenExtractor.php:61-73

Example: Minimal Custom Extractor

Here's a conceptual outline of a minimal custom extractor that only supports words and whitespace:

Step 1: Define the expression type map

Return an array mapping regex patterns to token types. Each regex must have a (?<lexeme>...) named capture group.
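
As a language-neutral illustration, a words-and-whitespace map might look like this in Python. The type names are placeholders for the Tokenizer constants, the word group name is an assumption, and patterns_missing_lexeme is a hypothetical helper that checks the lexeme requirement. A real extractor would return the equivalent PHP array from getExpressionTypeMap().

```python
import re

# Hypothetical minimal map: whitespace first, then words.
# Python needs (?P<name>...) where PCRE also accepts (?<name>...).
EXPRESSION_TYPE_MAP = {
    r"(?P<lexeme>\s+)": "TOKEN_WHITESPACE",
    r"(?P<lexeme>(?P<word>\S+))": "TOKEN_TERM",
}


def patterns_missing_lexeme(expression_map):
    """Return the expressions that do not define a 'lexeme' named group."""
    return [
        expression
        for expression in expression_map
        if "lexeme" not in re.compile(expression).groupindex
    ]


print(patterns_missing_lexeme(EXPRESSION_TYPE_MAP))  # prints []
```

A check like this is cheap to run in a unit test and catches a missing lexeme group before it surfaces during tokenization.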


Step 2: Implement term token creation

Check which named capture groups are present and create the appropriate token.
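
A Python sketch of that dispatch, for an extractor whose only term type is a word. The Word class and its fields are hypothetical stand-ins for the library's token value objects, and RuntimeError stands in for PHP's RuntimeException.

```python
from dataclasses import dataclass


@dataclass
class Word:
    lexeme: str
    position: int
    word: str


def create_term_token(position: int, data: dict) -> Word:
    """Build a term token from the named groups of a successful match."""
    # A group that did not take part in the match is missing or None.
    if data.get("word") is not None:
        return Word(data["lexeme"], position, data["word"])
    # No known term group was set: the match data is malformed.
    raise RuntimeError("Could not extract a term token from the given data")


print(create_term_token(4, {"lexeme": "foo", "word": "foo"}))
```

An extractor with more term types would add one branch per named group, checked in a fixed order.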


Sources: lib/Languages/Galach/TokenExtractor/Full.php:41-44, lib/Languages/Galach/TokenExtractor/Full.php:46-87

Regex Pattern Guidelines

When defining custom patterns:

  1. Always use /Au flags:

    • /A anchors the match at the current position
    • /u enables UTF-8 mode
  2. Define a (?<lexeme>...) capture group: required for all patterns (lib/Languages/Galach/TokenExtractor.php:56-57)

  3. Order matters: Expressions are tried in array order. More specific patterns should come before general ones.

  4. Use negative lookbehind for escaping: Prevent matching escaped characters:

    (?<!\\\\)["] // Match quote not preceded by backslash
    
  5. Use lookahead/lookbehind for word boundaries: prevent operators from matching inside words (lib/Languages/Galach/TokenExtractor/Full.php:31-33)

  6. Handle multi-byte characters: the getByteOffset() method at lib/Languages/Galach/TokenExtractor.php:120-123 converts character positions to byte offsets for proper UTF-8 handling

Sources: lib/Languages/Galach/TokenExtractor.php:26-49, lib/Languages/Galach/TokenExtractor/Full.php:25-39
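
Guidelines 3 and 5 interact, which a small Python sketch can show. The two patterns are hypothetical and are not copied from Full.php: the operator pattern uses a lookahead so that it only matches when followed by a delimiter or the end of input, and the general term pattern comes last.

```python
import re

# Hypothetical patterns: the operator must be followed by whitespace,
# a quote, a parenthesis, or the end of the input.
OPERATOR = (re.compile(r'(?P<lexeme>AND)(?=[\s"()]|$)'), "LOGICAL_AND")
TERM = (re.compile(r'(?P<lexeme>[^\s"()]+)'), "TERM")


def classify(text, ordered_patterns):
    """Return (type, lexeme) for the first pattern matching at the start of text."""
    for pattern, kind in ordered_patterns:
        match = pattern.match(text)
        if match:
            return kind, match.group("lexeme")
    return None


specific_first = [OPERATOR, TERM]
general_first = [TERM, OPERATOR]

print(classify("AND foo", specific_first))   # operator recognized
print(classify("ANDROID", specific_first))   # lookahead rejects the operator
print(classify("AND foo", general_first))    # general pattern shadows the operator
```

With the general pattern first, the operator is never reached, which is why more specific expressions belong earlier in the map.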

Testing Custom Extractors

The test at tests/Galach/Tokenizer/TokenExtractorTest.php:17-36 demonstrates testing for PCRE errors. Tests for custom extractors should cover:

  • Valid token extraction for all supported patterns
  • Proper handling of escaped characters
  • BAILOUT token creation for unrecognized input
  • RuntimeException throwing for malformed data
  • Byte offset handling for multi-byte characters

Sources: tests/Galach/Tokenizer/TokenExtractorTest.php:1-75


Integration with Tokenizer

The TokenExtractor is used by the Tokenizer class (see Tokenization Process), which:

  1. Calls extract() at each position in the input string
  2. Collects returned tokens into a TokenSequence
  3. Handles BAILOUT tokens for unrecognized input
  4. Maintains position tracking as extraction progresses

The extractor is passed to the Tokenizer constructor, making it easy to swap implementations or use custom extractors.
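
The interplay between the two classes can be sketched in Python. This is an assumption-laden outline rather than the library's Tokenizer: simple_extract is a hypothetical stand-in for an extractor, tokens are plain tuples instead of Token objects in a TokenSequence, and the position is assumed to advance by the length of each lexeme.

```python
import re

# Stand-in extractor: one alternation whose group names double as token types.
_PATTERN = re.compile(r"(?P<WHITESPACE>\s+)|(?P<TERM>\w+)")


def simple_extract(text, position):
    """Return (type, lexeme, position) for the token starting at `position`."""
    match = _PATTERN.match(text, position)
    if match:
        return match.lastgroup, match.group(), position
    # Unrecognized input becomes a one-character BAILOUT token.
    return "BAILOUT", text[position], position


def tokenize(text, extract):
    """Call `extract` repeatedly, advancing by the length of each lexeme."""
    tokens = []
    position = 0
    while position < len(text):
        token = extract(text, position)
        tokens.append(token)
        position += len(token[1])
    return tokens


for token in tokenize("foo (bar", simple_extract):
    print(token)
```

Passing the extractor in as an argument mirrors the constructor injection described above: the loop stays the same while the lexical rules are swapped.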

Sources: lib/Languages/Galach/TokenExtractor.php:26-49