Last indexed: 13 February 2026 (50f4d4)

Token Extractors

Purpose and Scope

This document explains the TokenExtractor abstract class, which serves as the primary extension point for customizing lexical rules in the Query Translator system. Token extractors define the regular expression patterns that recognize different token types in input query strings and create appropriate Token objects from matched text.

This page covers the abstract TokenExtractor base class, the two provided implementations (Full and Text), and guidance for implementing custom extractors. For information about the token types themselves, see Token Types. For details on how extractors are used in the tokenization process, see Tokenization Process.

Sources: lib/Languages/Galach/TokenExtractor.php1-124

TokenExtractor Abstract Class

The TokenExtractor abstract class is located at lib/Languages/Galach/TokenExtractor.php and provides the foundation for all token extraction implementations. It defines the contract for extracting tokens from input strings and handles the common logic of regex matching and byte offset calculation.

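Class Hierarchy

Full and Text both extend TokenExtractor. Each implements getExpressionTypeMap() and createTermToken(); Text additionally overrides createGroupBeginToken().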

Sources: lib/Languages/Galach/TokenExtractor.php14-124 lib/Languages/Galach/TokenExtractor/Full.php18 lib/Languages/Galach/TokenExtractor/Text.php17

Extraction Workflow

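The extract() method at lib/Languages/Galach/TokenExtractor.php26-49 orchestrates token extraction. It converts the character position to a byte offset, tries each expression from getExpressionTypeMap() in array order at that offset, and builds a token from the first expression that matches. Input that no expression recognizes yields a BAILOUT token.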

Sources: lib/Languages/Galach/TokenExtractor.php26-49
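
The workflow can be approximated by a standalone function. This is a sketch under stated assumptions rather than the library's code: extractSketch() and charToByteOffset() are hypothetical names, tokens are plain arrays, string labels stand in for the Tokenizer constants, and the BAILOUT token is assumed to carry the single unrecognized character.

```php
<?php

// Hypothetical helper: byte length of the first $position characters.
// Uses PCRE in UTF-8 mode so the sketch does not need the mbstring extension.
function charToByteOffset(string $string, int $position): int
{
    if ($position <= 0) {
        return 0;
    }
    if (preg_match('/^.{' . $position . '}/us', $string, $m) === 1) {
        return strlen($m[0]);
    }
    return strlen($string); // position is past the end of the input
}

// Hypothetical stand-in for extract(): tries each expression in order at the
// given character position and returns the first match as a plain array.
function extractSketch(array $expressionTypeMap, string $string, int $position): array
{
    $byteOffset = charToByteOffset($string, $position);

    foreach ($expressionTypeMap as $expression => $type) {
        $result = preg_match($expression, $string, $matches, 0, $byteOffset);

        if ($result === false) {
            throw new RuntimeException('PCRE error code: ' . preg_last_error());
        }
        if ($result === 1) {
            return ['type' => $type, 'lexeme' => $matches['lexeme'], 'position' => $position];
        }
    }

    // Nothing matched: emit a one-character BAILOUT token (assumed shape).
    preg_match('/./Aus', $string, $char, 0, $byteOffset);
    return ['type' => 'BAILOUT', 'lexeme' => $char[0] ?? '', 'position' => $position];
}

$map = [
    '/(?<lexeme>\s+)/Au' => 'WHITESPACE',
    '/(?<lexeme>[^\s"()]+)/Au' => 'WORD',
];

// Character position 5 is "p"; "ž" takes two bytes, so the byte offset is 6.
$token = extractSketch($map, 'žuti pas', 5);
```

Keeping positions in characters and converting only at the preg_match() call means token positions stay meaningful to callers even when the input contains multi-byte text.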

Key Methods

Method	Visibility	Required	Purpose
`extract(string, int)`	`public final`	Inherited	Main entry point - extracts token at given position
`getExpressionTypeMap()`	`protected abstract`	Must override	Returns array mapping regex patterns to token types
`createTermToken(int, array)`	`protected abstract`	Must override	Creates term-type tokens (Word, Phrase, Tag, User)
`createGroupBeginToken(int, array)`	`protected`	Optional	Creates GroupBegin tokens, can be overridden
`getByteOffset(string, int)`	`private`	Inherited	Converts character position to byte offset for `preg_match()`

Sources: lib/Languages/Galach/TokenExtractor.php26-123

Expression Type Map Format

The getExpressionTypeMap() method must return an array where:

Keys are PCRE regular expressions with the /Au flags (the PHRASE pattern in Full additionally uses /s)
Values are token type constants from the Tokenizer class
Each regex must define a named capture group (?<lexeme>...) identifying the matched token text
Additional named capture groups can extract token-specific data

The /Au flags are critical:

/A anchors matching at the current byte offset
/u enables UTF-8 mode for proper multi-byte character handling

Sources: lib/Languages/Galach/TokenExtractor.php51-60
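
The effect of the /A flag and of named capture groups can be seen with plain preg_match() calls. The sketch below is illustrative only: the ISSUE_REFERENCE type and the digits group are made up for the example, and string labels stand in for the Tokenizer constants.

```php
<?php

// Hypothetical expression with the required "lexeme" group plus one extra
// named group ("digits") that carries token-specific data.
$issueExpression = '/(?<lexeme>\#(?<digits>[0-9]+))/Au';

$expressionTypeMap = [
    '/(?<lexeme>\s+)/Au' => 'WHITESPACE',
    $issueExpression => 'ISSUE_REFERENCE',
];

$input = 'see #42';

// With /A the match must start exactly at the given offset (byte 4 is "#").
$anchoredAtHash = preg_match($issueExpression, $input, $hit, 0, 4);

// At offset 0 the same expression fails instead of scanning ahead to "#42".
$anchoredAtStart = preg_match($issueExpression, $input, $miss, 0, 0);

// Without /A, preg_match() searches forward and finds "#42" from offset 0.
$unanchored = preg_match('/(?<lexeme>\#(?<digits>[0-9]+))/u', $input, $found, 0, 0);
```

Without the anchor, an extractor could return a token that starts later than the requested position, silently skipping input.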

Full Implementation

The Full implementation at lib/Languages/Galach/TokenExtractor/Full.php supports all features of the Galach language, including domain prefixes, tags, and users.

Supported Token Types and Patterns

The Full extractor defines expressions for all Galach tokens (whitespace, operators, grouping, and the four term types) at lib/Languages/Galach/TokenExtractor/Full.php25-39.

Sources: lib/Languages/Galach/TokenExtractor/Full.php25-39

Domain Support

The Full extractor recognizes optional domain prefixes in three contexts:

GROUP_BEGIN at lib/Languages/Galach/TokenExtractor/Full.php34:
```
/(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<delimiter>\())/Au
```
Matches: domain:( or just (

PHRASE at lib/Languages/Galach/TokenExtractor/Full.php37:
```
/(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<quote>(?<!\\\\)["])(?<phrase>.*?)(?:(?<!\\\\)(?P=quote)))/Aus
```
Matches: domain:"phrase" or "phrase". The additional s flag lets . match newlines, so a phrase may span multiple lines.

WORD at lib/Languages/Galach/TokenExtractor/Full.php38:
```
/(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<word>(?:\\\\\\\\|\\\\ |\\\\\(|\\\\\)|\\\\"|[^"()\s])+?))/Au
```
Matches: domain:word or word

Sources: lib/Languages/Galach/TokenExtractor/Full.php34-38
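
To see what the domain group captures, the GROUP_BEGIN expression quoted above can be run directly with preg_match(). The input strings are invented for illustration; only the pattern comes from this page.

```php
<?php

// GROUP_BEGIN expression as quoted above, written as a single-quoted PHP string.
$groupBegin = '/(?<lexeme>(?:(?<domain>[a-zA-Z_][a-zA-Z0-9_\-.]*):)?(?<delimiter>\())/Au';

// With a domain prefix: lexeme is "title:(", domain is "title".
preg_match($groupBegin, 'title:(foo bar)', $withDomain);

// Without a prefix: lexeme is "(". The optional domain group did not
// participate in the match, so PHP reports it as an empty string.
preg_match($groupBegin, '(foo bar)', $withoutDomain);
```

Because an absent domain shows up as an empty string rather than a missing key, token-creation code can read the domain entry unconditionally.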

Term Token Creation

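The createTermToken() method at lib/Languages/Galach/TokenExtractor/Full.php46-87 creates one of four term tokens (Word, Phrase, Tag, or User) based on which named capture group is set in the match data. For example, a match that sets the word group produces a Word token, and one that sets the phrase group produces a Phrase token.

Special character unescaping:

Word tokens: Removes the escaping backslash before \ + - ! ( ) : # @ and space at lib/Languages/Galach/TokenExtractor/Full.php57
Phrase tokens: Removes the escaping backslash before the quote character at lib/Languages/Galach/TokenExtractor/Full.php68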

Sources: lib/Languages/Galach/TokenExtractor/Full.php46-87
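
Unescaping of this kind can be done with a single preg_replace() call. The helpers below are a hedged sketch: unescapeWord() and unescapePhrase() are hypothetical names, the character set is taken from the word-token list above, and the regular expressions are written for this example rather than copied from Full.php.

```php
<?php

// Hypothetical helper: drops the backslash in front of each escapable character.
// Regex (after PHP string parsing): \\([\\+\-!():#@ ])
function unescapeWord(string $word): string
{
    return preg_replace('/\\\\([\\\\+\-!():#@ ])/', '$1', $word) ?? $word;
}

// Hypothetical helper for phrases: only an escaped double quote is unescaped.
function unescapePhrase(string $phrase): string
{
    return preg_replace('/\\\\(")/', '$1', $phrase) ?? $phrase;
}

$word = unescapeWord('C\+\+');           // backslashes before "+" are removed
$phrase = unescapePhrase('say \"hi\"');  // backslashes before quotes are removed
```

A single left-to-right pass matters here: an escaped backslash followed by a special character is consumed as one unit, so the character after it is not unescaped a second time.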

Text Implementation

The Text implementation at lib/Languages/Galach/TokenExtractor/Text.php provides a simplified subset of Galach features, supporting only basic text search functionality without domains, tags, or users.

Supported Token Types

The Text extractor defines a reduced set of expressions at lib/Languages/Galach/TokenExtractor/Text.php24-36. The patterns below are abbreviated (the (?<lexeme>...) wrapper and the /Au flags are omitted):

Priority	Token Type	Pattern	Description
1	`WHITESPACE`	`/[\s]+/`	One or more whitespace characters
2	`MANDATORY`	`/\+/`	Plus sign operator
3	`PROHIBITED`	`/-/`	Minus sign operator
4	`LOGICAL_NOT_2`	`/!/`	Exclamation mark operator
5	`GROUP_END`	`/\)/`	Closing parenthesis
6	`LOGICAL_NOT`	`/NOT/`	NOT keyword
7	`LOGICAL_AND`	`/AND|&&/`	AND keyword or &&
8	`LOGICAL_OR`	`/OR|\|\|/`	OR keyword or ||
9	`GROUP_BEGIN`	`/\(/`	Opening parenthesis (no domain)
10	`TERM (phrase)`	`/"..."/`	Quoted phrase (no domain)
11	`TERM (word)`	`/[term]/`	Word term (no domain)

Sources: lib/Languages/Galach/TokenExtractor/Text.php24-36

Differences from Full

Key differences:

No domain support: GROUP_BEGIN, PHRASE, and WORD patterns do not capture domain prefixes
No Tag tokens: Pattern for #identifier is not included
No User tokens: Pattern for @identifier is not included
Simplified special character unescaping: Only \ + - ! ( ) " and space are unescaped at lib/Languages/Galach/TokenExtractor/Text.php54
Overridden createGroupBeginToken(): Provides a simplified version at lib/Languages/Galach/TokenExtractor/Text.php72-75 that always sets the domain to an empty string

Sources: lib/Languages/Galach/TokenExtractor/Text.php24-76 lib/Languages/Galach/TokenExtractor/Full.php25-88

Feature Comparison

Feature	Full	Text	Notes
Whitespace	✓	✓	Identical pattern
Mandatory (`+`)	✓	✓	Identical pattern
Prohibited (`-`)	✓	✓	Identical pattern
Logical NOT (`!`, `NOT`)	✓	✓	Identical patterns
Logical AND	✓	✓	Identical pattern
Logical OR	✓	✓	Identical pattern
Group Begin	✓	✓	Full supports domain prefix
Group End	✓	✓	Identical pattern
Word terms	✓	✓	Full supports domain prefix
Phrase terms	✓	✓	Full supports domain prefix
Tag terms (`#tag`)	✓	✗	Full only
User terms (`@user`)	✓	✗	Full only
Domain prefixes	✓	✗	Full only
Escaped characters	✓	✓	Full additionally unescapes `:`, `#`, `@`

Use case guidance:

Use Full: When you need complete Galach syntax including field-scoped queries, tag filtering, or user mentions
Use Text: For simple text search interfaces where only words, phrases, and basic operators are needed

Sources: lib/Languages/Galach/TokenExtractor/Full.php25-88 lib/Languages/Galach/TokenExtractor/Text.php24-76

Implementing Custom Extractors

Custom token extractors enable extending or modifying the lexical rules of Galach or creating entirely new query languages.

Required Implementation

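To create a custom extractor, extend TokenExtractor and implement its two abstract methods: getExpressionTypeMap(), which returns the array mapping regex patterns to token types, and createTermToken(), which builds term tokens from the match data.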

Sources: lib/Languages/Galach/TokenExtractor.php61-73

Example: Minimal Custom Extractor

A minimal custom extractor that supports only words and whitespace takes two steps.

Step 1: Define the expression type map

Return an array mapping regex patterns to token types. Each regex must have a (?<lexeme>...) named capture group, and a term pattern needs an extra named group (for example (?<word>...)) so that term creation can identify it.

Step 2: Implement term token creation

Check which named capture groups are present in the match data and create the corresponding token.

Sources: lib/Languages/Galach/TokenExtractor/Full.php41-44 lib/Languages/Galach/TokenExtractor/Full.php46-87
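
The two steps might look like the following self-contained sketch. It does not use the library: SketchTokenExtractor is a simplified stand-in for the TokenExtractor base class, WordsOnlyExtractor is a hypothetical name, tokens are plain arrays, string labels replace the Tokenizer constants, and positions are byte offsets for brevity.

```php
<?php

// Simplified stand-in for the abstract base class so the sketch runs on its
// own; it omits character-to-byte conversion and GroupBegin handling.
abstract class SketchTokenExtractor
{
    final public function extract(string $string, int $byteOffset): array
    {
        foreach ($this->getExpressionTypeMap() as $expression => $type) {
            if (preg_match($expression, $string, $matches, 0, $byteOffset) !== 1) {
                continue;
            }
            return $type === 'TERM'
                ? $this->createTermToken($byteOffset, $matches)
                : ['type' => $type, 'lexeme' => $matches['lexeme'], 'position' => $byteOffset];
        }
        return ['type' => 'BAILOUT', 'lexeme' => substr($string, $byteOffset, 1), 'position' => $byteOffset];
    }

    abstract protected function getExpressionTypeMap(): array;

    abstract protected function createTermToken(int $position, array $data): array;
}

final class WordsOnlyExtractor extends SketchTokenExtractor
{
    // Step 1: every expression has a (?<lexeme>...) group and the /Au flags.
    protected function getExpressionTypeMap(): array
    {
        return [
            '/(?<lexeme>\s+)/Au' => 'WHITESPACE',
            '/(?<lexeme>(?<word>[^\s"()]+))/Au' => 'TERM',
        ];
    }

    // Step 2: decide which term token to build from the named groups present.
    protected function createTermToken(int $position, array $data): array
    {
        if (isset($data['word'])) {
            return ['type' => 'WORD', 'lexeme' => $data['lexeme'], 'word' => $data['word'], 'position' => $position];
        }
        throw new RuntimeException('Could not extract term token from the given data');
    }
}

$extractor = new WordsOnlyExtractor();
$first = $extractor->extract('hello world', 0);  // word token for "hello"
$second = $extractor->extract('hello world', 5); // whitespace token
```

A real extractor would extend the library's TokenExtractor and return its Token objects instead of arrays; the control flow of the two overridden methods stays the same.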

Regex Pattern Guidelines

When defining custom patterns:

Always use /Au flags:
- /A anchors match at current position
- /u enables UTF-8 mode
Define (?<lexeme>...) capture group: Required for all patterns at lib/Languages/Galach/TokenExtractor.php56-57
Order matters: Expressions are tried in array order. More specific patterns should come before general ones.
Use negative lookbehind for escaping: Prevent matching escaped characters:
```
(?<!\\\\)["] // Match quote not preceded by backslash
```
Use lookahead/lookbehind for word boundaries: Prevent operators from matching inside words at lib/Languages/Galach/TokenExtractor/Full.php31-33
Handle multi-byte characters: The getByteOffset() method at lib/Languages/Galach/TokenExtractor.php120-123 converts character positions to byte offsets for proper UTF-8 handling

Sources: lib/Languages/Galach/TokenExtractor.php26-49 lib/Languages/Galach/TokenExtractor/Full.php25-39
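
The ordering and lookahead guidelines can be checked with a small experiment. This sketch uses simplified patterns written for the example (not the ones in Full.php), a hypothetical helper firstMatchingType(), and string labels in place of token type constants.

```php
<?php

// Returns the type label of the first expression that matches at offset 0.
function firstMatchingType(array $expressionTypeMap, string $input): ?string
{
    foreach ($expressionTypeMap as $expression => $type) {
        if (preg_match($expression, $input) === 1) {
            return $type;
        }
    }
    return null;
}

$word = '/(?<lexeme>[^\s"()]+)/Au';

// The lookahead requires whitespace or end of input after the keyword,
// so "AND" at the start of a longer word is left for the word expression.
$and = '/(?<lexeme>AND)(?=\s|$)/Au';

$specificFirst = [$and => 'LOGICAL_AND', $word => 'WORD'];
$generalFirst = [$word => 'WORD', $and => 'LOGICAL_AND'];

$operator = firstMatchingType($specificFirst, 'AND b');    // keyword wins
$insideWord = firstMatchingType($specificFirst, 'ANDROID'); // lookahead rejects keyword
$shadowed = firstMatchingType($generalFirst, 'AND b');      // word pattern shadows keyword
```

With the general word pattern listed first, the operator expression is never reached, which is why specific patterns belong earlier in the map.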

Testing Custom Extractors

The test at tests/Galach/Tokenizer/TokenExtractorTest.php17-36 demonstrates testing for PCRE errors. Custom extractors should test:

Valid token extraction for all supported patterns
Proper handling of escaped characters
BAILOUT token creation for unrecognized input
RuntimeException throwing for malformed data
Byte offset handling for multi-byte characters

Sources: tests/Galach/Tokenizer/TokenExtractorTest.php1-75
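
A PCRE failure is easy to provoke in a test by feeding invalid UTF-8 to an expression that uses the /u flag. The sketch below shows the idea in plain PHP; matchOrFail() is a hypothetical wrapper, and in a real suite the same checks would sit inside test-framework assertions against the extractor itself.

```php
<?php

// Hypothetical wrapper mirroring the failure mode described above:
// preg_match() returns false on a PCRE error, which becomes an exception.
function matchOrFail(string $expression, string $subject, int $byteOffset = 0): array
{
    $result = preg_match($expression, $subject, $matches, 0, $byteOffset);
    if ($result === false) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches;
}

$thrown = false;
try {
    // "\xFF" is not valid UTF-8, so matching in /u mode fails.
    matchOrFail('/(?<lexeme>.)/Au', "\xFF");
} catch (RuntimeException $e) {
    $thrown = true;
}

// Valid multi-byte input at a correct byte offset matches normally:
// "ž" occupies bytes 0-1, so "a" starts at byte 2.
$ok = matchOrFail('/(?<lexeme>.)/Au', 'ža', 2);
```

Pairing one failing and one succeeding multi-byte case covers both the error path and the byte offset handling listed above.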

Integration with Tokenizer

The TokenExtractor is used by the Tokenizer class (see Tokenization Process) which:

Calls extract() at each position in the input string
Collects returned tokens into a TokenSequence
Handles BAILOUT tokens for unrecognized input
Maintains position tracking as extraction progresses

The extractor is passed to the Tokenizer constructor, making it easy to swap implementations or use custom extractors.

Sources: lib/Languages/Galach/TokenExtractor.php26-49
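
The loop the Tokenizer runs around the extractor can be pictured as follows. This is a simplified sketch, not the Tokenizer's implementation: tokenizeSketch() is a hypothetical name, the extractor is any callable returning an array, the result is a plain list instead of a TokenSequence, and positions are byte offsets over ASCII input.

```php
<?php

// Hypothetical driver loop: $extract takes (string, int) and returns
// ['type' => ..., 'lexeme' => ...], standing in for TokenExtractor::extract().
function tokenizeSketch(callable $extract, string $input): array
{
    $tokens = [];
    $position = 0;
    $length = strlen($input);

    while ($position < $length) {
        $token = $extract($input, $position);
        $tokens[] = $token;
        // Advance past the lexeme; the max() guards against an empty lexeme.
        $position += max(1, strlen($token['lexeme']));
    }
    return $tokens;
}

// Toy extractor: whitespace and lowercase words, anything else is BAILOUT.
$extract = function (string $input, int $position): array {
    $map = ['/(?<lexeme>\s+)/A' => 'WHITESPACE', '/(?<lexeme>[a-z]+)/A' => 'WORD'];
    foreach ($map as $expression => $type) {
        if (preg_match($expression, $input, $m, 0, $position) === 1) {
            return ['type' => $type, 'lexeme' => $m['lexeme']];
        }
    }
    return ['type' => 'BAILOUT', 'lexeme' => substr($input, $position, 1)];
};

$types = array_column(tokenizeSketch($extract, 'one two ?'), 'type');
```

Because the loop depends only on the extractor's call signature, swapping in a different extractor changes the lexical rules without touching the loop.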

URL: https://deepwiki.com/netgen/query-translator/4.2-token-extractors