What the Question Parser Extracts from a User String: Keywords, Scope, Shape, Decomposition, Clarification

Enterprise Document Intelligence [Vol.1 #6b] – The five field families the parser reads straight from the user’s question, with the code that fills each one

angela shi

Jun 17, 2026

31 min read

👁 Image

Photo by Merve Bayar, via Pexels.

part of the question-parsing brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Article 6_a (thesis) made the case for parsing the question and showed the two consumer briefs the parsed row splits into. This article walks what the parser extracts from a user string: keywords, the expected answer shape and type, scope hints, decomposition for compound questions, and the clarification field for inputs too vague to act on. Article 6c (dispatch) covers what the parser then decides on top of those fields, using the document’s profile.

👁 Image

where this article sits in the series: Article 6 (question parsing), the extraction half, inside Part II (the four bricks) – Image by author

A user types one string. “What is the maximum coverage amount? Don’t confuse it with the deductible, they’re often listed together.” The parser turns it into a row of typed columns: a topic, an expected answer shape (an amount), a scope hint (this contract), a negative cue (not the deductible) routed to the generation brief, and a layout hint (often listed together) that retrieval can use. Each piece becomes its own column on question_df. This article walks the five field families one at a time, with the code that fills each one and the typed schema that holds it.

👁 Image

Question parsing produces one row in question_df plus satellite tables, with two derived views feeding retrieval and generation – Image by author

1. The five field families the parser fills

A question is more than its words. It also tells you what shape the answer should take, where to look in the document, whether it’s compound or too vague to act on. The parser captures each of these and writes it as a column on question_df. Read the rest of this article as a menu of what’s available, not as a checklist.

The columns fall into two groups.

What the parser reads from the question itself.

Keywords: Tokens to feed retrieval. Several sources combine: explicit (the user named them), direct (extracted from the question), LLM rewrites, an expert concept dictionary, and high-signal regex anchors like L131-1.
Answer shape and answer type: Two orthogonal axes: the expected cardinality (single, listing, table, tree, nested_json) and the value type (text, amount, date, iban, address, …).
Scope: Where in the document to look: a page, a chapter, a section, a layout (table / image), a date range, a jurisdiction.
Decomposition: Sub-questions when the question is compound.
Clarification: A short follow-up question when the input is too vague to act on.

What the parser then decides (using the document’s profile on top of the above).

Dispatch: How much surrounding context to read and return, which chunk strategy to use, which model to call. All cascaded from the answer type, the matched concept, and the project’s defaults.
Activations: Which bricks to run (TOC navigation, embeddings, cross-references, …), downgraded by what the document supports.

Each category becomes one or more columns on question_df. Projects pick what they need, skip the rest, and add new columns as failure modes show up: a policy_number for an insurance broker, a patient_id for medical RAG, a regulation_year for legal. The sub-sections walk each one.

1.1 Keywords

Retrieval needs words to search the document with. The parser picks them out of the question and hands them over. The user’s wording almost never matches the document’s wording on the first try, so the parser collects from several sources at once.

Here is the minimal schema we’ll grow as the section goes:

class ParsedQuestion(BaseModel):
 original_question: str
 keywords: list[str]

For “What is the maximum coverage amount?”, the parser produces:

ParsedQuestion(
 original_question="What is the maximum coverage amount?",
 keywords=["maximum", "coverage", "amount"],
)

Keywords inherit whatever typos the question carries. Pull tokens straight from “How does multi-head atention compare to self-atention?” and retrieval searches for atention, a string the document never contains. Zero hits, the system returns nothing, the user concludes the topic isn’t covered. The fix is a cheap pre-step that runs before keyword extraction: one LLM call that corrects typos and grammar without changing meaning, so the keywords come out clean.

def correct_spelling(question: str) -> str:
 """Fix typos and grammar without changing meaning."""
 prompt = f"""Fix any typos and grammar mistakes in the question below.
Do not change the meaning. Do not add or remove information. Return only
the corrected question, nothing else.
Question: {question}"""
 resp = client.responses.create(model="gpt-4.1-mini", input=prompt)
 return resp.output_text.strip()

In production, this is cached (the same question typed by multiple users gets corrected once) and skipped when the input is clean.

Let users name the keywords themselves. Some users (analysts, paralegals, anyone fluent in the document’s vocabulary) already know exactly which terms they want matched. A UI hint, “List exact terms to search for, separated by commas”, opens the highest-precision retrieval path the system has. The user’s tokens go in verbatim: weight 1.0, source direct, no LLM, no synonym expansion (unless they opt in). For “Please find ‘force majeure’, ‘rescission’, ‘event of default’ in this contract”, the parser pulls the three quoted phrases as-is. Faster, cheaper, more accurate than any LLM rewrite when the user can name the terms. The product side matters too: a “search terms” field next to the question box, or a system prompt instruction (“include the exact terms you want matched”), moves a measurable share of queries onto this path.

When the user doesn’t name terms explicitly, three parser-side sources fill in: LLM rewrites, an expert concept dictionary, and anchor regex.

Vocabulary mismatch is the first thing to break. The user asks about “the cap on what the insurer will pay” and the document says “limit of indemnity per occurrence.” The gap shows up everywhere in enterprise:

Insurance: “the cap on what the insurer will pay” → “limit of indemnity per occurrence”.
Legal: “what happens if we exit early” → “early termination provisions” or “rights of rescission”.
Finance: “how much we’ll get paid back” → “principal repayment schedule” or “redemption terms”.
Medical: “side effects” → “adverse events” or “contraindications”.

The keyword column has the wrong tokens; the search misses everything. Three sources combine to fill it with terms the document uses.

Source A: LLM rewrites: Reformulate the question into 3-5 phrasings that match how the document is likely to phrase the answer. The trick is to generate the language that surrounds the answer, not the answer itself. (This is the idea behind HyDE, Hypothetical Document Embeddings: generating a plausible passage, then embedding it, rather than embedding the question directly.)

# src/question/rewrite.py
def rewrite_query(question: str, domain_hint: str = "") -> list[str]:
 """Rewrite a user question into 3-5 queries phrased the way the
 relevant passage is likely to appear in the document."""
 prompt = f"""You are translating a user question into search queries that match
how the answer would be phrased in a {domain_hint or 'professional'} document.
Return 3 to 5 alternative phrasings. Use vocabulary the document is likely to use,
not the user's casual phrasing. Output one phrasing per line, no numbering.
User question: {question}"""
 resp = client.responses.create(model="gpt-4.1-mini", input=prompt)
 return [line.strip() for line in resp.output_text.splitlines() if line.strip()]

For “what happens if we exit early?” with domain_hint="commercial contract", the LLM rewrites the query into the five phrasings the document is likely to use: early termination provisions, conditions for exiting the agreement before the end of the term, termination for convenience, exit fees and penalties for early termination, rights of rescission and notice requirements.

In our deployments, a small domain_hint-driven rewrite consistently moves more on the per-failure-mode evaluation than picking the next embedding model up the leaderboard. Cheap to add, easy to roll back if it hurts.

Source B: the expert dictionary, our first satellite tables. LLM rewrites handle standard vocabulary. They don’t handle “premium” → “prime” (French insurance), “cotisation” (mutual societies), or “DDPE” (a specific insurance product code), because the LLM has rarely seen these mappings in its training data. Domain experts know the meaningful synonyms in their field. They maintain them in two satellite tables that together form the project’s keyword dictionary. This is where the system amplifies the expert at scale: year after year, the synonyms they collect accumulate into a relational asset the project owns, can grow, and can audit at any time.

The first holds the concepts themselves: one row per concept, with its definition and the document family it belongs to.

👁 Image

Concepts plus per-concept dispatch defaults that override answer-type defaults – Image by author

The second holds the keyword variants: one row per (concept, language, keyword), joined to concepts_df on the concept column.

👁 Image

One row per (concept, language, keyword) – Image by author

The language column makes the dictionary work on mixed-language corpora. Think of a French insurance group whose contracts arrive in French, English, sometimes Spanish. Without the language column the parser would pull the wrong variants. The keyword_priority separates strong matches (primary) from weaker ones (secondary); weight is the numeric companion used directly in the lexical score. Splitting concepts and keywords this way keeps each concept’s metadata (definition, document type) in one place instead of repeating it on every keyword row.

How the parser uses the two tables: When a keyword in the question matches a row in concept_keywords_df, the parser looks up the concept and pulls every variant (all languages, all priorities). Retrieval then searches for all of them at once. If a keyword could fit several concepts (a prime could be an insurance premium or a bonus or a primary number), the parser asks the LLM to pick, passing the definition column from concepts_df for each candidate:

def disambiguate_concept(
 question: str,
 candidates: pd.DataFrame,
 *,
 system_prompt: str = (
 "Pick the concept that best fits the user's question. "
 "Reply with the concept name only, no explanation."
 ),
) -> str:
 """`candidates`: rows from concepts_df that share a matched keyword."""
 options = "\n".join(
 f"- {r['concept']}: {r['definition']}" for _, r in candidates.iterrows()
 )
 user_msg = f"Question: {question}\n\nCandidates:\n{options}"
 resp = client.responses.create(
 model="gpt-4.1-mini",
 input=[{"role": "system", "content": system_prompt},
 {"role": "user", "content": user_msg}],
 )
 return resp.output_text.strip()

The LLM picks the one that fits the user’s question and the rest of the row resolves from there.

The two tables grow with the project: every missed retrieval that traces back to a vocabulary mismatch becomes a new row.

Embeddings can help discover candidates: embed every distinct noun phrase in the corpus, cluster (HDBSCAN, ~5-min cluster size), find clusters that overlap a known concept’s existing keywords, and let the expert decide which new variants to admit. What retrieval reads is the validated table, not the raw embeddings. The expert validates each entry before it lands in the dictionary. That’s what makes the system fit the domain reliably.

Standard RAG tutorials assume embedding similarity will handle synonyms automatically. For some domains it does. For specialized enterprise vocabulary it doesn’t.

Here is where the series’s editorial position on embeddings shows up. The usual path is to pick the “best” embedding model by measuring recall@k on a labeled query/document set, then lean on that model for synonym matching. We do the opposite: solve synonyms with the validated dictionary, keep embeddings as a fallback for cases the dictionary doesn’t cover yet, and use them upstream as a discovery tool rather than the primary retrieval signal. The retrieval brick details where exactly embeddings sit in the funnel.

When the value set is closed, enumerate it. Some concepts have a finite, known list of values: country names, currency codes, US state names plus abbreviations, insurance product codes, drug names from the company formulary, vehicle makes and models the broker sells. For these, the dictionary stops growing organically and becomes a one-shot bulk insert: list every value, every common spelling, every translation, every abbreviation, every variant that turns up in real documents.

Take countries. If a user asks “What is the coverage in Germany?”, the document almost certainly contains Germany, Allemagne, Deutschland, DE, or DEU as a literal token. There is no pattern to detect, no regex that captures “country-ness” across the 195 of them. The fix is to load all of them (or the subset the corpus uses) into concept_keywords_df with concept = "country", one row per (language, spelling). Any match in the question tells the parser the user is asking about a country, and retrieval scopes accordingly.

The contrast with answer_types_df is sharp. Amounts, dates, IBANs, percentages all share structural patterns that regex catches. Country names share no structure. The two satellite tables solve two different kinds of problem: one for things with patterns, the other for closed sets you can list end to end.

Source C: anchor keywords: Sometimes the user gives you tokens that are too important to risk losing in the embedding average: internal product codes, regulatory references, clause numbers, identifiers.

“Does article L131-1 of the insurance code apply here?”

The token “L131-1” is the entire query. If you embed the whole sentence, that token gets diluted with “article”, “insurance code”, “apply here”. Extract the high-signal tokens and route them to a lexical index: BM25, the classical keyword-scoring algorithm that weights rare terms more heavily: alongside the embedding query.

# src/question/keywords.py
import re
ANCHOR_PATTERNS = [
 r"\b[A-Z]+\d+(?:[-/]\d+)*\b", # L131-1, ISO-9001, RC-2024 (any number of leading caps)
 r"\b[A-Z]{2,}\.[A-Z]{2,}(?:-\d+)?\b", # NIST-style codes: ID.AM, PR.AC-1
 r"\b[A-Z]{3,}\b", # GDPR, RCP, SLA
 r"\b\d{4,}\b", # year, identifier numbers
]
def extract_anchor_keywords(question: str) -> list[str]:
 """Extract high-signal tokens for lexical retrieval."""
 found: list[str] = []
 for pattern in ANCHOR_PATTERNS:
 found.extend(re.findall(pattern, question))
 return list(dict.fromkeys(found)) # de-dup, preserve order

The three sources combine in the parsed output:

class Keyword(BaseModel):
 text: str
 weight: float = 1.0
 source: Literal["direct", "llm_expansion", "expert_dictionary", "anchor"]
 semantic_group: str | None = None
 is_regex: bool = False
class ParsedQuestion(BaseModel):
 original_question: str
 keywords: list[Keyword] # now structured

Why HyDE works, and why explicit keywords capture the same gain. A diagnostic on a real example, in Article 2 (embeddings’ failure modes), section 3.2, showed the raw query “how do I cancel my policy” losing to a lexical decoy. The HyDE rewrite injected rescission, terminate, written notice, renewal, and the target won by a margin of 0.169. The mechanism behind that effect is exactly what the satellite tables above already do, without the embedding round-trip.

Three mechanisms run at the same time when HyDE works:

Keyword and synonym expansion: The LLM, generating a hypothetical answer, naturally uses the domain vocabulary the question lacks. “How do I exit a contract early?” yields “early termination, notice period, exit fees, written notice.” These are the words real passages contain. The HyDE embedding captures them, retrieval finds the matches.
Register matching: The hypothetical answer adopts the document’s register (formal, technical, domain-specific). The conversational register of the question is replaced with the register of the answer. The distance shrinks in the embedding space. Real, but smaller than mechanism 1 on enterprise corpora with a bounded vocabulary.
Latent semantic associations: The LLM activates trained associations that are not lexical. The most-cited reason in the literature for HyDE’s gain, but the smallest contributor in a bounded domain.

In enterprise contexts (one bounded domain: insurance, legal, medical) mechanism 1 explains most of the HyDE gain. The keywords the LLM produces in the hypothetical answer are exactly the ones the document uses. The fictitious text and the document share the same vocabulary. The embedding step captures that shared vocabulary, nothing more.

If mechanism 1 dominates, extracting the keywords directly captures the same benefit at lower cost. The round-trip embedding disappears. Retrieval becomes auditable (the expert sees which keywords matched). What the project builds is a permanent asset (the dictionary in concept_keywords_df) instead of regenerating the hypothesis at every query.

That is why explicit keyword extraction wins over HyDE in the contexts the series targets: bounded domain, expert available, auditable retrieval required. In open-domain consumer search without those constraints, HyDE keeps an edge because mechanisms 2 and 3 weigh more.

In practice the comparison is simple. HyDE asks for a rewrite, embeds it, runs cosine, one round-trip per query, no audit trail. The approach above makes one LLM call to extract the keywords once, looks them up in the dictionary, and the audit trail is the matched rows themselves. The dictionary is reusable across queries and grows with the project.

1.2 Tagging the answer shape and the answer type

Even with rich keywords, retrieval can return passages that match on the words but contain no actual answer. “What is the annual premium?” might match a sentence about premium quality with no monetary amount in sight. Or it matches a passage about an unrelated prime (a bonus, a primary). The system has no way to tell the right hit from the wrong one because it doesn’t know what kind of answer it’s looking for.

The fix is to tag every question on two independent axes: answer_shape (how the answer is laid out: one value? a list? a table?) and answer_type (what each value contains: text? amount? date?). “What is the annual premium?” is (single, amount). “List the annual premiums by year” is (listing, amount). “List the exclusions of the contract” is (listing, text). The same type can travel under any shape, and the same shape can carry any type. Keeping the two separate lets the parser tag each one on its own.

Take amounts: A question of type amount triggers a regex pass alongside the keyword search, scanning for numeric tokens with currency symbols (\d[\d\s.,]*\s*(?:EUR|€|USD|\$)). “What is the annual premium?” now matches the line “Prime annuelle: 125 000 €” on two signals at once: the keyword prime is there AND the amount regex matches 125 000 € on the same line. Two-signal match is much stronger than keyword alone. Conversely, if retrieval finds keyword matches but no monetary amount anywhere in the candidate passages, the answer is probably not in the document; the system can return “no amount found” with confidence rather than guessing.

Same logic for dates: “When does coverage start?” expects a date. Retrieval scans for date patterns (ISO, locale, written-out) alongside keywords like date d’effet, commencement, start. If the keyword zone contains a clean parseable date, that’s almost certainly the answer; if not, the field is missing from the contract.

The type axis is open: Anything you can name and write a regex (or LLM check) for, you can register: text, amount, date, boolean, email, iban, policy_number, siren, percentage, duration, address, … The registry lives in answer_types_df, one row per registered type:

👁 Image

Per-type registry on the value axis – Image by author

The retrieval_patterns column (kept in the live DataFrame, omitted from the image for width) is what retrieval uses to confirm the type. The output_schema_ref column points to the Pydantic class generation renders into; the generation brick owns that side. The default_model column is the model the parser falls back to when no concept-level override applies: small types (amount, date, iban) land on a nano model, free-form text on mini. Adding a new type to the project is a single insert. Anything the parser can’t classify falls back to text and skips the regex confirmation.

The shape axis is closed and tiny: Five values cover what we’ve seen across real corpora: single (one value, the default), listing (a flat enumeration), table (rows × columns), tree (nested hierarchy), nested_json (a structured object with named sub-fields, e.g. an address as {street, city, zip}). The registry is so small it lives in a sibling satellite with two columns of defaults:

👁 Image

Per-shape defaults on the cardinality axis – Image by author

Single facts almost always live on one line in the top-ranked chunk, so sequential saves ⅔ of the tokens at k=3; listings, tables and trees need to be synthesised across passages, so combined is the safer default. The split is what lets “What is the annual premium?” (a (single, amount) question) and “List the annual premiums by year” (a (listing, amount) question) share the same type (amount, same regex, same value parsing) while routing differently at generation time.

class ParsedQuestion(BaseModel):
 original_question: str
 keywords: list[Keyword]
 answer_shape: Literal["single", "listing", "table", "tree", "nested_json"] = "single"
 answer_type: str = "text" # FK into answer_types_df

The label is a property of the question, not of the document. “What is the premium?” is a (single, amount) question whether the contract is two pages or two hundred. The two fields are independent: “List the exclusions” is (listing, text), “List the annual premiums of the contract” is (listing, amount).

Shape is a closed enum (five values, fixed for the project); type is open (one row per registered type in answer_types_df). The classification itself is folded into the consolidated parse_question call covered in Article 6_c (dispatch), with both registries injected into the LLM prompt so adding a new type or a new shape is a row insert, not a code change.

1.3 Scope: where to look in the document

A question often names where in the document to look. The parser captures these hints in two typed fields, both applied before keyword retrieval runs. StructuralHints holds the structural hints (page, TOC section, layout). ScopeFilters holds the corpus-level filters: the application layer can pass them in (when it knows the user’s jurisdiction, date range, etc.), or the parser can pull them out of the question when it names one explicitly.

Pages: “Show me page 3”, “Summarize pages 5 to 7”, “Compare page 2 and page 9”. Single page, range, and explicit list all collapse to a flat list of integers. pages_hint = [3], pages_hint = [5, 6, 7], pages_hint = [2, 9]. Retrieval then filters with one expression (page_df[page_df.page_num.isin(pages_hint)]) and does not branch on the shape of the hint. Hinted pages are kept even when no keyword matches them: the user pinned them explicitly, that is the answer surface.
Chapter / section: “What are the exclusions of this contract?”. Names a TOC entry. toc_section_hint = "Exclusions". Retrieval matches against the document’s actual TOC.
Layout: “the schedule table at the end”. The answer lives in a table, an image, or a header. layout_hint = "table". Tells retrieval to look at structured zones, not narrative text.
Date range / parties / jurisdiction: “What did we sign with Acme between 2022 and 2024?”. Corpus-level filters that trim candidate documents before retrieval runs (handled by a corpus-level index ahead of the document-scoped pipeline). ScopeFilters(date_range=..., parties=["Acme"]).

The same convention extends to other formats. sheets_hint: list[str] carries Excel sheet names pinned in the question (“on the Pricing sheet”); slides_hint: list[int] carries PowerPoint slide numbers (“on slide 4”, “slides 7 to 9”). Volume 2 picks up both. The unifying idea is that the user phrasing controls the scope. A consequence worth noting: on short documents (CV, single-page invoice, 1-2 page memo), the operator pins the only page (page 1) inside the question and the pipeline runs unchanged. No “short doc” mode, no chunk-strategy switch, no retrieval bypass. The same code path that handles a 1000-page corpus query happens to scope to a one-page document. Article 8 (generation) develops this convention from the dispatcher’s side.

class ScopeFilters(BaseModel):
 sections: list[str] = Field(default_factory=list)
 date_range: tuple[str, str] | None = None
 parties: list[str] = Field(default_factory=list)
 jurisdictions: list[str] = Field(default_factory=list)
 page_range: tuple[int, int] | None = None
 custom: dict = Field(default_factory=dict)
class StructuralHints(BaseModel):
 # WHERE the answer lives
 toc_section_hint: str | None = None
 # The likely TOC section or chapter ("Exclusions", "Schedule A").
 # Retrieval matches against the document's actual TOC.
 pages_hint: list[int] | None = None
 # Pages pinned by the question. Single ("page 3" -> [3]),
 # range ("pages 5 to 7" -> [5, 6, 7]), or list ("page 2 and 9"
 # -> [2, 9]) all collapse to a flat list at parse time.
 sheets_hint: list[str] | None = None # XLSX, Volume 2
 slides_hint: list[int] | None = None # PPTX, Volume 2
 layout_hint: Literal["text", "table", "image", "header"] | None = None
 document_version: str | None = None
 # HOW MUCH context to read and return (retrieval consumes these)
 detection_context: Literal["line", "sentence", "paragraph"] = "line"
 # Granularity of the regex confirmation zone (section 2.2).
 # "line" for amount/date, "paragraph" for narrative.
 answer_context: Literal["line", "paragraph", "page", "section", "chapter", "document"] = "paragraph"
 # How much surrounding text the generator receives.
 needs_summary: bool = False
 # True when the answer spans more than fits in a verbatim quote.

chunk_strategy and suggested_model are also per-question dispatch decisions, but they’re not structural (they don’t describe the document, they describe how the pipeline calls the LLM). They live at the top level of ParsedQuestion, not on StructuralHints, and the dispatch companion (Article 6_c) walks the cascade that fills them. The same is true for three fields kept on StructuralHints (detection_context, answer_context, needs_summary), which describe how much text the regex pass and the generator read, not where the answer lives. Article 6_c covers their defaults too.

A few examples of what the parser produces:

👁 Image

Five sample questions and the context columns the parser fills – Image by author

Detection is one LLM call with structured Pydantic output. Regex was tempting (catch “page 3” with r"page\s+(\d+)") and works for a handful of trivial cases, but breaks on the rest: “in the warranty section” (modifier before noun), “the recap table at the end” (no number), “the chapter on liability” (synonym), “the appendix” (no keyword). The LLM covers all of them in one round-trip, returns clean Pydantic, and stays maintainable.

# src/question/hints.py
HINTS_PROMPT = (
 "Read the user's question and extract structural hints about WHERE the answer "
 "lives in the document AND HOW MUCH context the answer needs.\n\n"
 "- toc_section_hint: the section or chapter the user pointed at, matched against "
 "typical document TOC entries (e.g. 'Exclusions', 'Limits', 'Schedule A'). null if "
 "no section is implied.\n"
 "- pages_hint: flat list of page numbers the user pinned. Single ('page 3' -> [3]), range ('pages 5 to 7' -> [5, 6, 7]) and list ('page 2 and 9' -> [2, 9]) all collapse to a list. null otherwise.\n"
 "- layout_hint: 'table' / 'image' / 'header' if the question implies a layout.\n"
 "- detection_context: granularity of the regex confirmation zone. 'line' for a "
 "single fact, 'sentence' for short prose, 'paragraph' for narrative.\n"
 "- answer_context: how much surrounding text the generator receives. 'line' for "
 "a single value, 'paragraph' for an explanation, 'page' for a recap, 'section' "
 "for a topic, 'chapter' or 'document' for a broad summary.\n"
 "- needs_summary: True if the answer spans more than fits in a verbatim quote."
)
def extract_hints(question: str, *, system_prompt: str = HINTS_PROMPT) -> StructuralHints:
 resp = client.responses.parse(
 model="gpt-4.1-mini",
 input=[
 {"role": "system", "content": system_prompt},
 {"role": "user", "content": question},
 ],
 text_format=StructuralHints,
 )
 return StructuralHints.model_validate_json(resp.output_text)

Layout hints matter more than they look. If the user says “it’s usually in an image”, that’s a big clue. Most parsers strip images or replace them with placeholders. Knowing the answer lives in an image tells you to look at the OCR output of figures, or to flag the question for vision-language processing. Without the hint, the pipeline searches text-only and finds nothing.

One LLM call per concern, or one LLM call total? The article shows each concern separately so you can understand (and test) each piece on its own. That’s also how you’d build the pipeline: one helper at a time, validating each column before adding the next. In production, once you’re confident the schema is right, you fold everything into one consolidated call: one round-trip, one prompt, one place where the LLM has full context. Article 6_c (dispatch), section 3.1, shows that consolidated parse_question end to end.

At single-document scale, scope filters constrain where in the document retrieval looks. At corpus scale (Part IV), they become SQL clauses on the corpus index. Same idea, different machinery.

1.4 Compound questions

Some questions can’t be answered by retrieving any single passage, no matter how well you phrase the query. They contain multiple sub-questions packed into one.

“Are the indemnification and liability caps consistent in this contract?”

A comparison. Retrieve the indemnification clause, retrieve the liability cap clause, then compare them.

“Does this contract include a non-compete clause, and if so, for how long?”

A conditional question. Step one: is there a non-compete? Step two (only if yes): what’s the duration?

“What is the annual premium and what are the main exclusions?”

Two unrelated facts joined by “and”. Different passages, independent answers.

Four patterns come up often enough to name.

Independent: Two unrelated facts joined by “and”: “What is the premium and what are the exclusions?” The orchestrator runs pdf_qa twice in parallel; merging is just {"sub_questions": [{"q": ..., "answer": ...}, ...]} keyed by the parsed sub-question.

Sequential: The second part depends on the first: “Who is the insured party, and what is their address?” The address is of the insured party: you have to identify the party first, then look up their address. The orchestrator runs the sub-questions in order and substitutes the previous answer into the next sub-question’s keywords.

Unified: Two terms that refer to the same concept, not two questions: “What are the exclusions and limitations?” In most policy documents, exclusions and limitations appear together in the same section. Decomposing this into two sub-questions duplicates work. Keep it as one question with both terms boosted in keyword retrieval.

Conditional: A condition narrows the scope: “If the policy is for commercial property, what is the fire coverage limit?” The condition becomes a scope filter; the actual question is “what is the fire coverage limit”, run on the subset of the document that matches.

A cheap rule of thumb for picking the pattern is the “and” test: replace “and” with “; also”. If it still reads naturally, the parts are independent. If it reads awkwardly, they’re unified. For the harder cases, an LLM classifies. has_compound_indicators is the cheap pre-filter (a regex looking for \band\b, \bor\b, ?...?, multiple imperatives); llm_classify_decomposition is one structured-output call returning a Decomposition:

class Decomposition(BaseModel):
 pattern: Literal['single', 'independent', 'sequential',
 'unified', 'conditional'] = 'single'
 sub_questions: list[str] = Field(default_factory=list)
 conditional_filter: dict | None = None

def decompose(question: str) -> Decomposition:
 if not has_compound_indicators(question):
 return Decomposition(pattern='single')
 return llm_classify_decomposition(question)

Decomposition adds latency and cost. Detect compound structure first; only decompose when the pattern warrants it.

In one production deployment, ~30% of user questions in the first month of beta were compound, and the pipeline was returning incomplete answers on most of them. Adding compound decomposition at the parsing layer raised user satisfaction sharply with no other change in the pipeline.

1.5 Clarification: when the system asks back

Some questions are too vague to act on at all, and no amount of parsing makes them actionable.

“What’s the cap?” Cap on what? Liability? Indemnification? Damages? Premium? “Show me the latest version.” Latest version of what document? Latest as of when? “Compare with last year’s.” Last year’s what?

If the system charges ahead and guesses, it produces subtly wrong answers users either don’t catch (worse) or do catch and stop trusting the system (also worse). The fix is the cheapest possible: detect that the question can’t be acted on, and ask the user back instead of running the pipeline.

The parsed question carries this case in two fields:

class ParsedQuestion(BaseModel):
 # ... other fields
 suggested_clarification: str | None = None
 ambiguity_reason: str | None = None

When suggested_clarification is set, the orchestrator returns it before running the pipeline:

“Several aspects of this question are ambiguous. Could you specify: which limit (coverage, deductible, sublimit), and which policy if you have multiple?”

A simple rule: if the question uses a word that points back to something else (this, that, the latest, last year’s, the cap) and the context isn’t available from conversation history or scope, ask. Detection runs inside the consolidated parse_question call covered in the dispatch companion (Article 6_c), as one sub-task of the shared parse prompt: “return a short follow-up question if the input is too vague, null otherwise”. The interface should make answering cheap: a short follow-up, two or three suggested options, done.

Few production systems do this. The default tends to be: charge ahead and answer something. The result is a steady drip of subtly wrong answers, and users who slowly lose trust in the system. Catching ambiguity at parsing time is cheaper than catching it after retrieval has returned the wrong passages.

2. Conclusion

Question parsing turns one noisy user string into a typed, relational brief. Five families of columns on question_df carry what the parser reads straight from the user’s question:

Keywords (with expert-dictionary expansion) anchor the retrieval search.
Answer shape and answer type tell generation what schema to return.
Scope hints filter where in the document to look.
Compound decomposition breaks the question into sub-questions when needed.
Clarification asks the user back when the question is too ambiguous.

Each column is a place where the project’s expert vocabulary, the document’s structure, and the user’s intent meet. Adding a parsing capability means adding a column, not a new function.

Two more families sit on top of these, decided by the parser after it sees the document profile: the dispatch decisions (chunk strategy, model, answer-context window) and the activation flags that turn bricks off when they do not fit. Both belong to the dispatch companion (Article 6_c), which also covers the routing of each column to the brick that consumes it.

Sources and further reading

The article reframes the HyDE technique from Gao et al. (HyDE, ACL 2023) as offline keyword work: the load-bearing piece is the keywords the hypothetical answer contains, not the embedding step itself. The query-rewriting line (Ma et al., RRR, EMNLP 2023) is the closest published precedent for the corrected_question + rewrites part of the structured plan. The four compound-question patterns map onto Self-Ask (Press et al., Self-Ask, EMNLP Findings 2023) plus IRCoT (Trivedi et al., IRCoT, ACL 2023). The closest lineage to “question becomes a typed object that downstream code consumes” is text-to-SQL semantic parsing (Yu et al., Spider, EMNLP 2018). Volume 3 (Agentic Bricks) returns to runtime tool-picking on top of the structured plan defined here.

Same direction as the article:

Gao, Ma, Lin, Callan, Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), ACL 2023 (arXiv:2212.10496). The HyDE technique the article reframes: offline keyword work, not online hypothetical-document embedding.
Ma, Gong, He, Zhao, Duan, Query Rewriting for Retrieval-Augmented Large Language Models (RRR), EMNLP 2023 (arXiv:2305.14283). Closest published precedent for the corrected_question + rewrites part of the structured plan.
Press et al., Measuring and Narrowing the Compositionality Gap in Language Models (Self-Ask), EMNLP Findings 2023 (arXiv:2210.03350). Compound-question decomposition pattern that the four patterns in this article extend.
Trivedi et al., Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT), ACL 2023 (arXiv:2212.10509). Sequential sub-questions; complements Self-Ask for compound queries.
Yu et al., Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task, EMNLP 2018 (arXiv:1809.08887). Canonical text-to-SQL benchmark; the closest lineage to “question becomes a typed object that downstream code consumes”.

Different angle, different context:

Wang et al., Query2doc: Query Expansion with Large Language Models, EMNLP 2023 (arXiv:2303.07678). A single LLM-generated expansion of the user query is enough to lift retrieval. The context is in-domain open QA; this article handles enterprise corpora where a full structured plan (corrected question + suggested prompts + answer shape + generation brief) earns its keep.
Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023 (arXiv:2302.04761). The model decides when and which tool to call inline, with no upfront question parsing. Volume 3 (Agentic Bricks) develops this line on top of the structured plan defined here.

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.

Written By

angela shi

See all from angela shi

Llm Agent, Llm Applications, Parsing, Question Answering, Rag

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/what-the-question-parser-extracts-from-a-user-string-keywords-scope-shape-decomposition-clarification/