Voozh

Have you ever wondered what happens when you type SELECT * FROM employees WHERE salary > 80000? How does a database turn that string of text into actual results?

I built QueryForge — a complete SQL query engine from scratch in Python — to find out. No SQLite, no DuckDB, no database dependencies at all. Just pure Python implementing every stage of the SQL pipeline.

GitHub: github.com/hajirufai/queryforge ⭐

The Architecture

Every SQL query passes through five stages:

SQL String → Lexer → Parser → Planner → Optimizer → Executor → Results

┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│ Lexer │───▶│ Parser │───▶│ Planner │───▶│ Optimizer │───▶│ Executor │
│ (tokens)│ │ (AST) │ │ (plan) │ │(optimized)│ │ (results)│
└─────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘

Let me walk through each stage with real code.

Stage 1: The Lexer — Turning Text into Tokens

The lexer is the simplest piece conceptually, but getting it right matters. It scans the SQL string character-by-character and produces a stream of tokens.

from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
 SELECT = auto()
 FROM = auto()
 WHERE = auto()
 INTEGER = auto()
 FLOAT = auto()
 STRING = auto()
 IDENTIFIER = auto()
 STAR = auto()
 EQUALS = auto()
 # ... 60+ token types total

The lexer handles some tricky cases:

Keywords are case-insensitive — SELECT, select, and SeLeCt are identical
String escaping — 'it''s' becomes it's
Comments — both -- line and /* block */ styles
Quoted identifiers — "column with spaces" is a valid column name

class Lexer:
 def __init__(self, source: str):
 self.source = source
 self.pos = 0
 self.tokens: list[Token] = []

 def tokenize(self) -> list[Token]:
 while self.pos < len(self.source):
 ch = self.source[self.pos]
 if ch.isspace():
 self.pos += 1
 elif ch.isdigit():
 self._read_number()
 elif ch == "'":
 self._read_string()
 elif ch.isalpha() or ch == '_':
 self._read_identifier_or_keyword()
 # ... operators, punctuation, etc.
 self.tokens.append(Token(TokenType.EOF, None))
 return self.tokens

For SELECT name, salary FROM employees WHERE salary > 80000, the lexer produces:

[SELECT, IDENTIFIER("name"), COMMA, IDENTIFIER("salary"), FROM,
 IDENTIFIER("employees"), WHERE, IDENTIFIER("salary"), GREATER,
 INTEGER(80000), EOF]

Stage 2: The Parser — Building an Abstract Syntax Tree

This is where it gets interesting. I used a recursive-descent parser — each SQL clause is handled by a dedicated method that can call other parsing methods.

The AST node for a SELECT statement captures the full query structure:

@dataclass
class SelectStatement:
 columns: list[AliasedExpr] # SELECT ...
 from_table: TableRef | None # FROM ...
 joins: list[JoinClause] # JOIN ...
 where: Expression | None # WHERE ...
 group_by: list[Expression] # GROUP BY ...
 having: Expression | None # HAVING ...
 order_by: list[OrderByItem] # ORDER BY ...
 limit: int | None # LIMIT ...
 offset: int | None # OFFSET ...
 distinct: bool = False # DISTINCT

Expression parsing uses precedence climbing. The key insight: OR has lower precedence than AND, which has lower precedence than comparisons:

def _parse_expression(self) -> Expression:
 return self._parse_or()

def _parse_or(self):
 left = self._parse_and()
 while self._match(TokenType.OR):
 right = self._parse_and()
 left = BinaryOp(left, "OR", right)
 return left

def _parse_and(self):
 left = self._parse_not()
 while self._match(TokenType.AND):
 right = self._parse_not()
 left = BinaryOp(left, "AND", right)
 return left

The parser handles the full SQL grammar including subqueries, CASE expressions, and DML/DDL:

-- All of these parse correctly:
SELECT e.name, d.name AS dept
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > (SELECT AVG(salary) FROM employees)
 AND e.dept_id IN (SELECT id FROM departments WHERE budget > 300000)
ORDER BY e.salary DESC
LIMIT 10;

Stage 3: The Planner — From AST to Logical Plan

The planner converts the AST into a tree of plan nodes. This is where relational algebra meets code:

@dataclass
class ScanNode:
 table_name: str
 alias: str | None = None

@dataclass
class FilterNode:
 child: PlanNode
 predicate: Expression

@dataclass
class JoinNode:
 left: PlanNode
 right: PlanNode
 join_type: str # INNER, LEFT, RIGHT, CROSS
 condition: Expression | None

@dataclass
class AggregateNode:
 child: PlanNode
 group_by: list[Expression]
 aggregates: list[AliasedExpr]

For a query like:

SELECT dept_id, COUNT(*) AS cnt
FROM employees
WHERE salary > 50000
GROUP BY dept_id
ORDER BY cnt DESC
LIMIT 5

The planner builds:

Limit(5)
 └── Sort(cnt DESC)
 └── Aggregate(GROUP BY dept_id)
 └── Filter(salary > 50000)
 └── Scan(employees)

Stage 4: The Optimizer — Predicate Pushdown

Even a simple optimizer can make a huge difference. QueryForge implements predicate pushdown — the most impactful optimization in query engines.

The idea: push filter conditions as close to the data source as possible. If you're joining two tables but only need rows where salary > 80000, filter before the join instead of after.

class Optimizer:
 def optimize(self, plan: PlanNode) -> PlanNode:
 plan = self._push_predicates(plan)
 plan = self._eliminate_redundant_projects(plan)
 return plan

 def _push_predicates(self, node: PlanNode) -> PlanNode:
 if isinstance(node, FilterNode):
 if isinstance(node.child, JoinNode):
 join = node.child
 # If predicate only references left table, push it down
 if self._references_only(node.predicate, join.left):
 join.left = FilterNode(join.left, node.predicate)
 return join
 return node

Before optimization:

Filter(e.salary > 80000)
 └── Join(e.dept_id = d.id)
 ├── Scan(employees AS e)
 └── Scan(departments AS d)

After optimization:

Join(e.dept_id = d.id)
 ├── Filter(e.salary > 80000)
 │ └── Scan(employees AS e)
 └── Scan(departments AS d)

Stage 5: The Executor — Running the Plan

The executor walks the plan tree recursively. Each node type has its own execution logic:

class Executor:
 def execute(self, node: PlanNode) -> tuple[list[str], list[Row]]:
 if isinstance(node, ScanNode):
 return self._exec_scan(node)
 elif isinstance(node, FilterNode):
 return self._exec_filter(node)
 elif isinstance(node, JoinNode):
 return self._exec_join(node)
 elif isinstance(node, AggregateNode):
 return self._exec_aggregate(node)
 # ... etc.

JOIN execution uses nested-loop joins (simple but correct):

def _exec_join(self, node: JoinNode):
 left_cols, left_rows = self.execute(node.left)
 right_cols, right_rows = self.execute(node.right)

 result = []
 for l_row in left_rows:
 matched = False
 for r_row in right_rows:
 combined = {**l_row, **r_row}
 if node.condition is None or self._eval_expr(node.condition, combined):
 result.append(combined)
 matched = True

 # LEFT JOIN: include unmatched left rows with NULLs
 if not matched and node.join_type == "LEFT":
 null_right = {c: None for c in right_cols}
 result.append({**l_row, **null_right})

 return left_cols + right_cols, result

The executor also includes 20+ scalar functions (UPPER, LOWER, COALESCE, ABS, ROUND, SUBSTR, etc.), LIKE pattern matching with wildcards, and full NULL handling.

What You Can Do with It

QueryForge supports real-world SQL:

from queryforge import Engine

engine = Engine()
engine.load_csv("employees", "employees.csv")
engine.load_csv("departments", "departments.csv")

# Complex analytical query
result = engine.execute("""
 SELECT d.name AS department,
 COUNT(*) AS headcount,
 AVG(e.salary) AS avg_salary,
 CASE WHEN AVG(e.salary) > 85000 THEN 'High'
 WHEN AVG(e.salary) > 70000 THEN 'Mid'
 ELSE 'Low' END AS pay_band
 FROM employees e
 JOIN departments d ON e.department_id = d.id
 WHERE e.is_active = TRUE
 GROUP BY d.name
 HAVING COUNT(*) >= 2
 ORDER BY avg_salary DESC
""")

print(result.to_table())

It also includes:

Rich CLI with interactive REPL
FastAPI REST API for querying over HTTP
Docker support for deployment
157 passing tests across all components

What I Learned

Building a SQL engine taught me more about databases than years of using them:

Parsing SQL is harder than it looks — the grammar is context-sensitive (is IN a keyword or identifier?) and full of edge cases
The optimizer matters enormously — even simple predicate pushdown can be the difference between scanning 1,000 rows vs. 1,000,000
NULL handling is tricky — three-valued logic means NULL = NULL is NULL, not TRUE
Aggregate vs. scalar context — knowing when you're inside a GROUP BY vs. a regular SELECT changes everything

Try It

git clone https://github.com/hajirufai/queryforge.git
cd queryforge
pip install -e ".[dev]"

# Interactive REPL
queryforge repl data/employees.csv -t employees

# One-shot query
queryforge query "SELECT * FROM employees ORDER BY salary DESC LIMIT 5" data/employees.csv -t employees

The full source is ~4,400 lines of Python with 157 tests. Every stage is documented and tested independently.

GitHub: github.com/hajirufai/queryforge

This is part of my "Building from Scratch" series where I implement complex systems from first principles. Previously: VectorLite (vector search engine) and TaskFlow (DAG workflow orchestration).

URL: https://dev.to/hajirufai/building-a-sql-query-engine-from-scratch-in-python-lexer-parser-optimizer-executor-1n58

⇱ Building a SQL Query Engine from Scratch in Python: Lexer Parser Optimizer Executor - DEV Community