VOOZH about

URL: https://dev.to/sendotltd/writing-a-sql-formatter-with-a-handwritten-tokenizer-5c66

⇱ Writing a SQL Formatter With a Handwritten Tokenizer - DEV Community


Writing a SQL Formatter With a Handwritten Tokenizer

A SQL formatter is a tokenizer plus a pretty-printer. The tokenizer recognizes keywords, identifiers, strings, numbers, operators, comments. The pretty-printer walks tokens and emits them with newlines before major clauses (SELECT, FROM, WHERE, JOIN) and indentation rules for nested parens. No parser needed — tokens are enough.

SQL formatting is a solved problem (sql-formatter, Prettier's SQL plugin, IDEs). But understanding how it works is valuable, and a minimal implementation fits in about 500 lines of vanilla JS.

🔗 Live demo: https://sen.ltd/portfolio/sql-formatter/
📦 GitHub: https://github.com/sen-ltd/sql-formatter

👁 Screenshot

Features:

  • Handwritten tokenizer (keywords, strings, numbers, operators, comments)
  • Pretty-print with configurable indent
  • Keyword case conversion (UPPER / lower / Capitalize)
  • Minify (single-line output)
  • Syntax highlighting
  • Example queries
  • Japanese / English UI
  • Dark / light theme
  • Zero dependencies, 57 tests

The tokenizer

SQL tokens are easier to identify than you'd think:

export function tokenize(sql) {
 const tokens = [];
 let i = 0;
 while (i < sql.length) {
 const c = sql[i];

 // Whitespace
 if (/\s/.test(c)) {
 let j = i;
 while (j < sql.length && /\s/.test(sql[j])) j++;
 tokens.push({ type: 'whitespace', value: sql.slice(i, j) });
 i = j;
 continue;
 }

 // Line comment
 if (c === '-' && sql[i + 1] === '-') {
 let j = i + 2;
 while (j < sql.length && sql[j] !== '\n') j++;
 tokens.push({ type: 'comment', value: sql.slice(i, j) });
 i = j;
 continue;
 }

 // Block comment
 if (c === '/' && sql[i + 1] === '*') {
 const end = sql.indexOf('*/', i + 2);
 const j = end === -1 ? sql.length : end + 2;
 tokens.push({ type: 'comment', value: sql.slice(i, j) });
 i = j;
 continue;
 }

 // String literal
 if (c === "'") {
 let j = i + 1;
 while (j < sql.length) {
 if (sql[j] === "'" && sql[j + 1] === "'") j += 2;
 else if (sql[j] === "'") { j++; break; }
 else j++;
 }
 tokens.push({ type: 'string', value: sql.slice(i, j) });
 i = j;
 continue;
 }

 // Identifier or keyword
 if (/[a-zA-Z_]/.test(c)) {
 let j = i;
 while (j < sql.length && /[a-zA-Z0-9_]/.test(sql[j])) j++;
 const word = sql.slice(i, j);
 const type = KEYWORDS.has(word.toUpperCase()) ? 'keyword' : 'identifier';
 tokens.push({ type, value: word });
 i = j;
 continue;
 }

 // Number
 if (/[0-9]/.test(c)) {
 let j = i;
 while (j < sql.length && /[0-9.eE+\-]/.test(sql[j])) j++;
 tokens.push({ type: 'number', value: sql.slice(i, j) });
 i = j;
 continue;
 }

 // Operators and punctuation
 const twoChar = sql.slice(i, i + 2);
 if (['<>', '<=', '>=', '!='].includes(twoChar)) {
 tokens.push({ type: 'operator', value: twoChar });
 i += 2;
 continue;
 }

 tokens.push({ type: c.match(/[(),;]/) ? 'punctuation' : 'operator', value: c });
 i++;
 }
 return tokens;
}

The non-obvious parts:

  • SQL string escapes are doubled quotes ('O''Brien'), not backslash. Most programming languages get this wrong on their first attempt.
  • Two-character operators (<>, <=, >=, !=) are matched before single characters.
  • Keywords are recognized by looking up the uppercased form in a Set — so select and SELECT both become keyword tokens.

Pretty-printing with clause breaks

The formatter walks the tokens and decides where to insert newlines:

const CLAUSE_STARTERS = new Set([
 'SELECT', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT',
 'UNION', 'INSERT', 'UPDATE', 'SET', 'DELETE', 'VALUES',
]);

const JOIN_KEYWORDS = new Set(['JOIN', 'LEFT', 'RIGHT', 'INNER', 'OUTER', 'CROSS']);

function format(tokens, options) {
 const out = [];
 let depth = 0;
 for (const tok of tokens) {
 if (tok.type === 'whitespace') continue; // strip, we rebuild

 if (tok.type === 'keyword') {
 const upper = tok.value.toUpperCase();
 if (CLAUSE_STARTERS.has(upper) || JOIN_KEYWORDS.has(upper)) {
 out.push('\n' + indent(depth));
 }
 }

 if (tok.value === '(') depth++;
 out.push(formatToken(tok, options));
 if (tok.value === ')') depth--;
 }
 return out.join('');
}

The pattern: strip original whitespace, rebuild from scratch. This means formatter output is deterministic — running it twice produces the same result regardless of input formatting.

Why handwritten instead of parser generator

A real SQL parser is complicated — SQL has dozens of dialects, hundreds of keywords, and context-sensitive grammar. But a formatter doesn't need full parsing. You just need to know where clause boundaries are, and those can be detected at the token level.

The tokenizer is ~150 lines. The formatter is ~100 lines. Together they handle 90% of real-world SQL queries well. The remaining 10% (weird vendor extensions, CTE formatting, window functions with complex OVER clauses) would need a real parser, but that's scope creep for most use cases.

Series

This is entry #79 in my 100+ public portfolio series.