RecQuest: Towards Estimating User Domain Knowledge in Conversational Recommender Systems
Abstract.
The ideal conversational recommender system (CRS) acts like a savvy salesperson, adapting its language and suggestions to a user’s expertise level. However, most current systems treat all users as experts, leading to frustrating and inefficient interactions when users are unfamiliar with a domain. Systems that can adapt their conversational strategies to a user’s knowledge level stand to offer a much more natural and effective experience. To enable such adaptation, a CRS must first be able to estimate a user’s domain knowledge from interaction signals. Yet, accurately estimating knowledge typically requires tailored interactions to elicit those signals in the first place, creating a fundamental chicken-and-egg problem. In this work, we take a first step toward breaking this dependency by introducing a new task: estimating user domain knowledge directly from conversational transcripts. A key obstacle to such estimation is the lack of suitable data; to our knowledge, no existing dataset captures the conversational behaviors of users with varying levels of domain knowledge. Furthermore, in most dialogue collection protocols, users are free to express their own preferences, which tends to concentrate on popular items and well-known features, offering little insight into how novices explore or learn about unfamiliar features. To address this, we design RecQuest, a game-with-a-purpose data collection protocol that elicits varied expressions of knowledge while using a target-aware CRS to guide interactions, release the resulting dataset, and provide baseline methods and analyses to support future work on user-knowledge-aware CRS.
1. Introduction
Conversational recommender systems (CRS) address the challenge of helping users find relevant items in vast catalogs by engaging them in interactive dialogues, typically involving preference elicitation based on item attributes (Jannach et al., 2022). Their effectiveness relies heavily on accurately capturing user preferences and interpreting their expressions about desired attributes (Pramod and Bafna, 2022). However, most current CRS approaches assume that users understand item attributes clearly and can directly map them to their preferences (Zhao et al., 2022; Lin et al., 2023; Zhang et al., 2025). In contrast, Kostric et al. (2024) propose a usage-oriented preference elicitation strategy that specifically targets users with low domain knowledge. In isolation, both approaches can lead to limited and inefficient interactions, as users differ significantly in domain knowledge, vocabulary, and preferred interaction style. As illustrated in Figure 1, a novice might express their needs differently than an expert when buying a digital camera. As a result, systems that rely on a single interaction strategy force users to adapt to the system’s language. Recent work shows, however, that task performance and user satisfaction improve significantly when a CRS instead adapts its interaction style to the user’s domain knowledge (Kostric et al., 2025).
To bridge this gap and enable the creation of more adaptive systems, we introduce the novel task of estimating user domain knowledge from conversations, allowing a CRS to tailor its interaction strategy—from preference elicitation to the presentation of recommendations and explanations. While empirical studies in information retrieval and recommender systems have shown that domain expertise shapes user interactions (White et al., 2009; Mao et al., 2018), estimating this knowledge from multi-turn dialogues remains a novel problem in the CRS setting.
A major obstacle in advancing the task of user domain knowledge estimation is the lack of datasets that capture such domain knowledge within a conversational context. Existing CRS datasets rely on open-ended protocols in which participants discuss only the attributes they are familiar with (Joko et al., 2024; Radlinski et al., 2019; Li et al., 2018). This approach leads to conversations focused primarily on well-known aspects of the domain, providing limited evidence about how users express needs, ask questions, or refine preferences when encountering unfamiliar aspects. Given these challenges, a critical question is: (RQ1) How can we design a data collection protocol that elicits signals of domain knowledge through engaging and natural conversations? We address this by developing a game-with-a-purpose (GWAP) protocol (Von Ahn, 2006; Balayn et al., 2022) that combines open-ended interaction with a target-oriented objective. Here, “game” means an incentivized task with goals and rules, designed to elicit useful data from participants, rather than a game-theoretic model of strategic interaction. Specifically, participants are assigned a hidden target item, presented only as a concise background narrative. During the interaction, they must interact with the CRS to identify the item that best matches the narrative by providing preference statements, responding to elicitation prompts, and discussing the system’s recommendations. This design is distinct from both known-item search (Lee et al., 2006) and open-ended preference elicitation, while combining elements of each: participants are looking for a particular item without knowing its exact identity, while the open-ended protocol encourages the articulation, refinement, and clarification of preferences that expose signals of domain knowledge.
An inherent challenge in estimating user domain knowledge is determining the ground truth. This leads us to ask: (RQ2) How do objective knowledge scores compare to user self-assessments, and which provides a more reliable ground truth? We address this by contrasting participants’ self-assessed confidence ratings with their performance on a domain-specific technical questionnaire. We then determine the more robust ground truth by analyzing which measure better predicts actual task success, thereby identifying the calibration biases that undermine self-reported expertise.
With the ground truth established, we next examine which observable signals distinguish users with varying domain knowledge. Specifically, we ask: (RQ3) What identifiable traces differentiate expert behavior from novice behavior, and to what extent do these patterns make users with different knowledge levels distinguishable in practice? To answer this, we analyze the collected dialogues for measurable linguistic and interactional markers, including utterance length, vocabulary, and the frequency and specificity of the phrases used. Based on prior work (White et al., 2009; Mao et al., 2018), we expect that knowledgeable users articulate their needs through more precise, attribute-oriented language, whereas users with limited domain knowledge rely on high-level or usage-based descriptions and exhibit more frequent clarification behavior.
Finally, we investigate the feasibility of automatic knowledge estimation from dialogue. Specifically, (RQ4) How effectively can large language models (LLMs) estimate users’ domain knowledge from conversational data? To answer this, we examine whether these models can infer knowledge levels directly from raw dialogue transcripts. This involves assessing how well-separated the knowledge categories are, and whether users at different expertise levels exhibit sufficiently distinct linguistic, behavioral, or interactional markers to enable robust classification.
In summary, our contributions are as follows:
- We introduce the task of estimating user domain knowledge in conversational recommenders directly from dialogues.
- We design RecQuest, a game-with-a-purpose data collection protocol that uses background narratives and attribute clues to elicit natural conversations rich in domain knowledge signals.
- We collect and release a novel CRS dataset capturing these domain-specific interactions. We note that this dataset is intended primarily to validate the proposed collection protocol and support initial feasibility analyses, rather than to serve as a large-scale supervised training benchmark.
- We conduct a detailed analysis of the collected dialogues to uncover the behavioral and linguistic patterns that distinguish users with different knowledge levels.
- We establish LLM-based baselines for estimating user domain knowledge from conversational data, and analyze the separability of knowledge levels on this challenging task.
The developed CRS, study protocols, and collected dialogues are made available at: https://github.com/iai-group/crs-knowledge.
2. Related Work
User domain knowledge has been previously investigated in the context of web search (White et al., 2009; Tabatabai and Shore, 2005; Mao et al., 2018). Prior work differs mainly in how expertise is determined and modeled. The two most common approaches for determining user knowledge are subjective self-assessments and objective knowledge tests. Self-assessment studies ask users to report their familiarity or perceived expertise, most often distinguishing between two and five categories (Noh et al., 2023; Ferrod et al., 2021; McAuley and Leskovec, 2013). Objective approaches, which are also popular in the ‘search as learning’ realm, instead rely on domain quizzes or knowledge tests whose scores define knowledge levels (Kiseleva et al., 2015; Yu et al., 2018; Mao et al., 2018; Zhang et al., 2011, 2015; Collins-Thompson et al., 2016; Gadiraju et al., 2018). Research based on both approaches consistently finds that experts issue more structured and technical queries, use more precise and attribute-oriented vocabulary, and reformulate less, whereas novices rely more on general, case-oriented descriptions and ask more clarification questions (Tabatabai and Shore, 2005; Noh et al., 2023). Prior work on adaptive preference elicitation further shows that user characteristics and domain familiarity shape interaction behavior and satisfaction, with more knowledgeable users responding better to structured, attribute-focused elicitation, and less knowledgeable users benefiting more from guided and example-based approaches (Knijnenburg and Willemsen, 2009).
Our work is situated at the intersection of two key research areas. We first review the main paradigms of CRS, as they provide the architectural context for the system we developed for our data collection. We then critically examine existing dialogue collection methods, highlighting the gap our protocol directly addresses.
Conversational Recommender Systems. Research on CRS is broadly categorized into two main groups. Attribute-based CRS cast the dialogue as a slot-filling exercise (Lei et al., 2020; Christakopoulou et al., 2016; Bernard et al., 2024), eliciting preferences from users about items or item attributes with fixed templates. Generation-based CRS, in contrast, generate free-form responses while recommending items, generally utilizing an end-to-end architecture that integrates the conversation and the recommendation components (Zhou et al., 2020). Our work is relevant to both paradigms, as estimating user knowledge can, among other things, help tailor the elicitation strategy and the explanatory language. Our work employs a hybrid CRS that combines attribute-based preference elicitation with natural language generation for recommendations and explanations. The system maintains an internal representation of user preferences and supports interactions entirely in natural language.
Dialogue Collection Methods. Significant effort has been dedicated to collecting datasets that capture how users interact with conversational recommenders (Li et al., 2018; Hayati et al., 2020; Shah et al., 2018; Radlinski et al., 2019; Joko et al., 2024; Bernard and Balog, 2023; Budzianowski et al., 2018; Rastogi et al., 2020; Kang et al., 2019). Influential examples like ReDial (Li et al., 2018) and INSPIRED (Hayati et al., 2020) are created as a conversation between two participants, where one user (the “seeker”) states open-ended preferences and another (the “recommender”) suggests items that fit those preferences. However, these open-ended protocols often yield shallow conversations, with seekers frequently accepting recommendations uncritically. More importantly for our work, these protocols do not explicitly control for or capture user domain knowledge, making it difficult to study its effects. While other datasets use coached human-human protocols to emphasize naturalness (e.g., CCPE-M (Radlinski et al., 2019), MG-ShopDial (Bernard and Balog, 2023)) or focus on human-machine settings (e.g., Schema-Guided Dialogue (Rastogi et al., 2020), MultiWOZ (Budzianowski et al., 2018)), none explicitly incorporates user domain knowledge or controls for expertise.
3. Problem Statement
The realization of a truly adaptive conversational recommender system presents a fundamental “chicken and egg” problem. Ideally, a CRS should act like a savvy salesperson, adapting its language and suggestions to each user’s level of expertise. However, to effectively tailor these interactions, the system must first estimate the user’s domain knowledge; yet, accurate estimation often requires tailored interactions—such as specific elicitation strategies—to surface those knowledge signals in the first place. We aim to break this co-dependency by establishing the necessary groundwork for knowledge estimation independent of complex adaptation strategies. In this work, we focus on the first half of this challenge: designing a protocol to capture natural variations in domain knowledge and introducing baseline methods to estimate it. By isolating the estimation task, we provide the resources required to build systems that can eventually close the loop and personalize the experience dynamically.
Following prior work on modeling knowledge, we adopt a discretized representation, where users are grouped into a small number of categories (Brusilovsky and Millán, 2007). Specifically, we categorize users into three relative levels of knowledge: low, medium, and high. Low domain knowledge is characterized by infrequent and superficial interactions with the domain, where users have limited exposure and may lack a precise technical vocabulary or jargon. In contrast, high domain knowledge corresponds to frequent, hobby-like engagement, where users are more familiar with detailed attributes and exhibit a refined understanding. Medium knowledge falls between these extremes, indicating occasional but shallow interactions.
4. The RecQuest Data Collection Protocol
To study knowledge estimation, we require data that captures the nuanced conversational behaviors of users with varying levels of expertise—a resource absent in existing open-ended datasets. Current data collection protocols typically elicit unconstrained preferences, which tend to focus on popular, widely known items and allow users to accept recommendations with minimal scrutiny. In such settings, successful interaction does not require users to articulate domain-specific constraints, use lesser-known domain terminology, or explore trade-offs, making domain expertise largely irrelevant and consequently difficult to observe. To address this, we introduce RecQuest, a game-with-a-purpose protocol that incentivizes users to explore domain boundaries by assigning them a specific target item hidden within a background narrative. This section details the game mechanics, the item collection process across five distinct domains (bicycles, laptops, digital cameras, running shoes, and smartwatches), and the CRS architecture designed to support these goal-directed interactions.
4.1. Game Mechanics
Game Start. At the start, participants see a chat interface and a brief background story setting up a use scenario (Fig. 2). The story is generated from a specific item in the curated domain collection to anchor the conversation context without revealing the item itself. Each target item has two background-story variants: a concise version (5 feature constraints, 80–100 words) and a more detailed version (10 feature constraints, 150–200 words); see Section 5.3 for details. This enables us to investigate whether the richness of contextual cues affects how participants formulate preferences or describe product features. Participants are instructed to find the item that best fits the scenario by engaging in a turn-based dialogue with the CRS.
Participant Turns. Participant turns are open-ended, allowing for various conversational strategies. Participants may state preferences, answer/ask questions, or critique recommendations. The ability to ask about an item’s properties allows them to probe which constraints or features are satisfied—much like in real-world interactions between customers and salespeople. Feedback on recommendations is optional. Each recommendation includes an image and a brief system-generated explanation of its fit to the participant’s stated needs. Participants may ask follow-up questions, refine their needs, or ignore the suggestion, using their own vocabulary.
A valid session requires the CRS to have produced at least two recommendations, which remain visible and can be selected at any point. Selecting an item triggers a confirmation prompt, asking whether they are certain of their choice, followed by a request for a brief feature-based justification. Participants are incentivized to identify the correct item by a monetary reward, fostering realistic behavior and increasing the external validity of the collected data.
CRS Turns. On each system turn, the CRS performs one of three possible actions: asking an elicitation question, answering a participant’s question, or providing a recommendation. All actions are generated by a target-aware conversational agent guided by a mix of rule-based heuristics and LLM output.
- Elicitation. When a participant expresses several preferences at once, the CRS restricts them to at most three constraints. If more are listed, the system asks which ones are most important before continuing. This keeps early turns focused and prevents participants from giving the entire need description in a single turn. Elicitation questions are open-ended but specific to attributes present in the domain’s schema (e.g., “Is weight or range more important for your use?”). Thus, they maintain engagement through incremental exchanges.
- Answering questions. When the participant asks about item attributes or domain facts, the CRS responds directly. All factual answers are LLM-generated and framed relative to the target item when applicable, so that information indirectly guides the user toward it without revealing the item’s identity.
- Recommendation. A recommendation turn presents a single item from the domain’s curated collection accompanied by an image and a short natural-language explanation. The explanation is generated through a two-step LLM procedure: (i) the system identifies key differences between the candidate item and the hidden target, and (ii) produces an explanation that highlights how those differences relate to the stated preferences (example in Fig. 2). The system may suggest a previously recommended item if new preferences align with it.
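As an illustration of the elicitation rule above (at most three constraints per participant turn), the following minimal sketch shows one way such a cap could be enforced. It assumes the constraints have already been extracted from the utterance; the function name `handle_turn_constraints`, the returned dictionary layout, and the prompt wording are hypothetical and not taken from the released system.

```python
MAX_CONSTRAINTS_PER_TURN = 3  # cap stated in the protocol description


def handle_turn_constraints(constraints):
    """Accept a turn's constraints, or ask the participant to prioritize.

    `constraints` is a list of already-extracted preference strings
    (the extraction step itself is out of scope for this sketch).
    """
    if len(constraints) > MAX_CONSTRAINTS_PER_TURN:
        # Too many constraints at once: accept none yet and ask which matter most.
        return {
            "action": "ask_priority",
            "accepted": [],
            "message": (
                f"You mentioned {len(constraints)} requirements. "
                f"Which {MAX_CONSTRAINTS_PER_TURN} matter most to you?"
            ),
        }
    return {"action": "accept", "accepted": list(constraints), "message": None}


if __name__ == "__main__":
    print(handle_turn_constraints(["lightweight", "long range"]))
    print(handle_turn_constraints(["lightweight", "long range", "cheap", "disc brakes"]))
```

Deferring acceptance until the participant prioritizes is one possible design; a system could equally keep the first three constraints and ask about the rest.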
Game End. A session ends when the participant confirms a selection. The session length is a configurable parameter (limited to 15 minutes in our work). If the time expires, a soft grace period begins, allowing participants to finalize a selection from the recommendations already shown. On confirmation, participants provide a brief, feature-based justification, after which the dialogue concludes.
4.2. Item Collection
We curate item collections for each of the five domains used in RecQuest (bicycles, laptops, digital cameras, running shoes, and smartwatches) using data from the Amazon Review Dataset (Hou et al., 2026). Each domain represents a semantically coherent item category with its own attribute structure, user intent patterns, and relevance criteria. This multi-domain setup allows us to evaluate knowledge estimation performance both within and across heterogeneous product types, providing a controlled yet diverse testbed for RecQuest. For every domain, we extract metadata, structured attributes, and item descriptions to create representations of available items. To ensure meaningful interaction, duplicates and highly similar items are removed using similarity-based filtering over item descriptions, resulting in diverse collections of distinguishable items.
4.3. CRS Architecture
Figure 3 shows the CRS architecture and data flow between its core components. At each participant’s turn, the system processes the input, updates its internal state, and determines the next intent before producing a response.
Preference Summarizer. The CRS maintains a structured JSON-like representation of atomic preference statements. After each turn, an LLM-based preference summarizer updates this record by merging newly mentioned constraints with existing ones, ensuring all active needs are captured concisely.
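To make the idea of a preference record concrete, the sketch below shows a deterministic stand-in for the merge step. In the actual system the merging is performed by an LLM; here we simply assume each atomic statement is a dictionary with hypothetical `attribute` and `value` keys, and that a newer statement about the same attribute supersedes the older one.

```python
import json


def merge_preferences(record, new_statements):
    """Merge newly mentioned constraints into the existing preference record.

    Both arguments are lists of dicts with 'attribute' and 'value' keys
    (an assumed schema). A newer statement about an attribute that is
    already present replaces the old one; new attributes are appended.
    """
    merged = {pref["attribute"]: pref for pref in record}
    for statement in new_statements:
        merged[statement["attribute"]] = statement
    return list(merged.values())


if __name__ == "__main__":
    record = [
        {"attribute": "budget", "value": "under 500"},
        {"attribute": "weight", "value": "lightweight"},
    ]
    update = [
        {"attribute": "budget", "value": "under 700"},
        {"attribute": "battery", "value": "long-lasting"},
    ]
    print(json.dumps(merge_preferences(record, update), indent=2))
```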
Item Retrieval. A retrieval step is triggered when the preference record contains at least two statements or when new ones are added. Dense vector representations of items are pre-computed and held in memory. Current preferences are encoded to retrieve relevant candidates. If more than three preferences appear in a single turn, the system defers retrieval and asks the participant to prioritize which constraints matter most before continuing. This mechanism is designed to promote multi-turn interaction, encouraging the gradual refinement of needs through dialogue rather than attempting to complete the task in a single turn.
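The retrieval step can be sketched as a nearest-neighbor search over the pre-computed item vectors. The example below assumes cosine similarity as the scoring function and takes the encoded preference vector as given; the paper does not specify the similarity measure or the encoder, so both are assumptions, and the tiny two-dimensional vectors are purely illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, item_vecs, k=10):
    """Return the ids of the k items most similar to the encoded preferences.

    `item_vecs` maps item id -> pre-computed dense vector held in memory.
    """
    ranked = sorted(
        item_vecs.items(),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


if __name__ == "__main__":
    items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(retrieve([1.0, 0.1], items, k=2))
```

With only 50 items per domain, an exhaustive scan like this is sufficient; an approximate index would only matter for much larger collections.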
Decision Module. After preference updating and retrieval, the Decision Module applies a hybrid LLM and rule-based policy. It selects the next conversational intent based on the preference list, available recommendations, and the last message. It chooses one of four intents: redirect, answer, recommend, or elicit. Redirect is triggered when the dialogue drifts outside the designated domain. Answer is selected when the user asks factual or conversational questions. Recommendation is selected only if the LLM predicts recommendation and the following rule-based constraints are satisfied: retrieved candidates are available, the preference state contains new information, and at least three preferences have been stated; otherwise, the system falls back to Elicitation to gather additional constraints or clarify vague statements. Once the intent is selected, the system generates a response through specialized LLM modules.
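The rule-based part of this policy can be summarized in a few lines. The sketch below assumes the LLM proposes one of the four intents as a string and that the three rule inputs are available as simple values; the function and parameter names are hypothetical.

```python
MIN_PREFERENCES_FOR_RECOMMENDATION = 3  # threshold stated in the text


def choose_intent(llm_intent, has_candidates, has_new_info, num_preferences):
    """Combine the LLM's proposed intent with the rule-based constraints.

    `llm_intent` is one of 'redirect', 'answer', 'recommend', 'elicit'.
    A proposed recommendation is only honored when all rules hold;
    otherwise the policy falls back to elicitation.
    """
    if llm_intent in ("redirect", "answer"):
        return llm_intent
    rules_ok = (
        has_candidates
        and has_new_info
        and num_preferences >= MIN_PREFERENCES_FOR_RECOMMENDATION
    )
    if llm_intent == "recommend" and rules_ok:
        return "recommend"
    return "elicit"


if __name__ == "__main__":
    print(choose_intent("recommend", True, True, 3))
    print(choose_intent("recommend", True, True, 2))
```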
Response Generation. Recommendation turns involve a multi-step process: an LLM selects the best candidate from the top-10 retrieved items given the current preferences and the latest exchange. A second model performs target grounding by comparing the selected candidate to the hidden target item, identifying which differences are relevant to the participant’s stated needs. This ensures that explanations focus on meaningful contrasts. A third model generates the final natural language response that presents the recommended item and integrates the contrastive explanation. The Answering intent also uses target grounding, particularly for follow-up questions about recommended items. Redirection and Elicitation use a single generation step to produce a contextually appropriate response.
5. Data Collection
Using the RecQuest protocol, we conducted a comprehensive study to collect a dataset of dialogues rich in domain knowledge signals. Crucially, our goal was not to construct a large-scale training resource, but rather to validate the protocol and determine whether it effectively elicits observable signals of domain knowledge in conversation. The study followed a four-stage flow designed to balance participant expertise across domains. This section presents a descriptive analysis of the collected data as well as an assessment of the participants’ subjective experiences to validate the protocol’s engagement and naturalness.
| Domain | #Dial | #Utt | #Turns | #Prefs | #Recs | Success |
|---|---|---|---|---|---|---|
| Bicycle | 79 | 1521 | | | | 67.9% |
| Digital Camera | 79 | 1687 | | | | 29.1% |
| Laptop | 98 | 2179 | | | | 29.6% |
| Running Shoes | 179 | 3636 | | | | 37.1% |
| Smartwatch | 80 | 1665 | | | | 37.5% |
| Total | 515 | 10688 | 10.21 ± 4.81 | 10.10 ± 5.03 | 3.46 ± 1.41 | 39.3% |
5.1. Study Flow
Our study consists of four sequential stages: a screening questionnaire, a domain knowledge assessment, the main task, and an exit questionnaire. This flow is designed to balance participants across domains and expertise levels while collecting both self-reported and measured indicators of domain knowledge. While the self-reported knowledge categories used in our study offer structure, we acknowledge that the boundaries between them are soft, with many participants falling in the middle of the spectrum.
Screening. Participants first complete a brief self-assessment of their expertise corresponding to each of the five domains used in this work: bicycles, laptops, digital cameras, running shoes, and smartwatches. For each domain, they classify themselves into one of three categories: novice, intermediate, or expert. After the screening, they are assigned to the domain with the fewest participants for their expertise level at the time of assignment. Thus, the self-assessment provides a coarse estimate of knowledge, which is used to balance participation, ensuring that each expertise level is represented approximately equally across domains.
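The balancing rule described above can be sketched as follows. The example assumes a running count per (domain, self-assessed level) cell and breaks ties by the fixed domain order; the tie-breaking rule and all names are assumptions made for illustration.

```python
from collections import defaultdict

DOMAINS = ["bicycles", "laptops", "digital cameras", "running shoes", "smartwatches"]


def assign_domain(self_assessment, counts):
    """Assign a participant to the least-filled (domain, level) cell.

    `self_assessment` maps each domain to 'novice', 'intermediate', or 'expert'.
    `counts` maps (domain, level) to the number of participants assigned so far
    and is updated in place. Ties go to the earliest domain in DOMAINS.
    """
    chosen = min(DOMAINS, key=lambda d: counts[(d, self_assessment[d])])
    counts[(chosen, self_assessment[chosen])] += 1
    return chosen


if __name__ == "__main__":
    counts = defaultdict(int)
    all_novice = {d: "novice" for d in DOMAINS}
    print(assign_domain(all_novice, counts))  # first participant
    print(assign_domain(all_novice, counts))  # second participant goes elsewhere
```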
Domain Knowledge Questionnaire. Following the self-assessment, participants complete a short factual knowledge test in the domain to which they are assigned. This step provides us with the objective estimate of their true expertise. For each domain, we constructed a set of 30 factual statements using ChatGPT 5.1, guided by principles from Item Response Theory (Baker and Kim, 2004; Cai et al., 2016). In particular, the questions were designed to span a range of difficulties and to test both surface-level familiarity and deeper domain-specific knowledge. The questionnaire adopts the certainty-in-knowledge format proposed by Vidigal (2025). Each item presents a factual statement about the domain, and participants rate its correctness using one of five options: Definitely True, Probably True, I don’t know, Probably False, or Definitely False. This captures both accuracy and confidence, and is more informative than binary correctness. To provide an additional validity check, human domain experts verified the suitability of the ChatGPT-generated questions.
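To illustrate how a five-option certainty format can capture both accuracy and confidence, the sketch below scores responses on a symmetric scale. The specific point values (+2, +1, 0, -1, -2) are a hypothetical choice made for this example; the scoring scheme actually used in the study is not specified here.

```python
# Hypothetical signed confidence attached to each response option:
# positive means the participant leans "true", negative means "false".
LEAN = {
    "Definitely True": 2,
    "Probably True": 1,
    "I don't know": 0,
    "Probably False": -1,
    "Definitely False": -2,
}


def score_response(response, statement_is_true):
    """Score one item: confident correct answers earn most, confident errors lose most."""
    lean = LEAN[response]
    return lean if statement_is_true else -lean


def score_questionnaire(responses, answer_key):
    """Sum the item scores over a questionnaire (lists aligned by position)."""
    return sum(score_response(r, truth) for r, truth in zip(responses, answer_key))


if __name__ == "__main__":
    responses = ["Definitely True", "Probably False", "I don't know"]
    answer_key = [True, False, True]
    print(score_questionnaire(responses, answer_key))
```

Under such a scheme, a participant who answers "I don't know" throughout scores zero, while overconfident guessing is penalized, which is one way to separate calibrated knowledge from lucky guesses.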
Main Task. Participants are then presented with the instructions for RecQuest and perform the main task presented in Section 4.1. The assigned domain determines the item pool from which each participant is randomly assigned one of three target items and its corresponding background story. Each session lasts up to fifteen minutes, with the interaction process and system behavior governed by the game mechanics and dialogue flow previously outlined.
Exit questionnaire. After completing the main task, participants fill out an exit questionnaire in which they rate the quality of the interaction and provide open-ended feedback on their experience. These responses offer a subjective account of interaction satisfaction and help identify areas for improvement in both the system and the overall study design.
5.2. Participants
Participants were recruited from the Prolific platform (https://www.prolific.com/). They resided in the US or Europe, had English as their first or primary language, and met minimum requirements on approval rate and number of previous submissions. Each participant could take part in only one task to avoid repeated exposure to the experimental setup. Participants completed the study fully online. After reading the task instructions, they proceeded through the screening questionnaire, domain knowledge test, main task, and exit questionnaire. Sessions lasted 18 minutes on average. Participants received an hourly wage of £8, in line with the platform’s fair-wage guidelines, with a £1 bonus awarded on successful identification of the target item.
5.3. Generating Game States
Item Collection. To operationalize the RecQuest protocol, we instantiated item collections and target scenarios. From the Amazon dataset, we extracted all items belonging to categories related to the five domains. The raw collections ranged from 1,000 to 200,000 items per domain and included many near-duplicates differing only in brand or minor variations. To clean this data and maintain sufficient item differentiation, we narrowed each domain to a curated collection of 50 items. We achieved this by embedding item descriptions and applying k-means clustering to the embeddings, selecting the items nearest to the cluster centroids. The same embedding representations were later used for retrieval during gameplay.
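The curation step (cluster the description embeddings, then keep the item nearest to each centroid) can be sketched in plain Python as below. This is a minimal illustration with a simple Lloyd-style k-means and toy two-dimensional vectors; the embedding model, the k-means implementation, and the assumption that the number of clusters equals the target collection size are not taken from the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, k, iters=50, seed=0):
    """Basic Lloyd's k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def select_representatives(item_vecs, k, seed=0):
    """Pick, for each centroid, the not-yet-chosen item closest to it."""
    ids = list(item_vecs)
    centroids = kmeans([item_vecs[i] for i in ids], k, seed=seed)
    chosen = []
    for centroid in centroids:
        remaining = [i for i in ids if i not in chosen]
        chosen.append(min(remaining, key=lambda i: sq_dist(item_vecs[i], centroid)))
    return chosen


if __name__ == "__main__":
    toy = {
        "a": [0.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.4],
        "x": [10.0, 10.0], "y": [10.0, 11.0], "z": [10.0, 10.4],
    }
    print(sorted(select_representatives(toy, k=2)))
```

Selecting actual items near the centroids, rather than the centroids themselves, keeps every curated entry a real product while spreading the collection across the embedding space.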
Target Items. To create the interactive scenarios, we prepared a total of 30 background stories. For each domain, we selected three unique items from the curated collection and generated two versions of a story for each item: a long version and a short version. Each story was designed to contain a fixed number of implicit feature constraints—ten for the long version and five for the short—sufficient to uniquely identify the target item. Using an LLM, we converted these structured constraints into natural narratives without explicitly mentioning the corresponding item features.
LLM Usage. The CRS pipeline relies on multiple calls to an LLM, with different prompts and inputs for each component of the system. For all LLM calls in this study, we employed gpt-4.1-mini-2025-04-14, which early tests indicated offered a good balance between speed and output quality. All prompts used in the LLM-based components of the system can be found in the GitHub repository.
5.4. Analysis
We now address RQ1, assessing whether the RecQuest protocol achieves its primary goal: eliciting rich signals of domain knowledge through natural, engaging conversation. To answer this, we evaluate the RecQuest protocol through two complementary lenses: a quantitative analysis of dialogue statistics to verify that the protocol successfully induces the iterative constraint refinement necessary to surface knowledge signals, and a qualitative analysis of participant feedback to validate that the resulting interactions were perceived as natural and engaging.
Quantitative Analysis: Interaction Depth. The aggregate dialogue statistics provide strong evidence that the protocol elicits rich interaction signals. We collected 515 high-quality conversations totaling over 10,000 utterances across five domains. A statistical overview of the collected data is presented in Table 1. Crucially, dialogues average 10.21 turns and contain 10.10 preference statements. This density of preference expression, combined with an average of 3.46 recommendations per session, indicates that participants did not merely search for items, but engaged in iterative refinement over multiple exchanges. This confirms that the protocol successfully drives the articulation of detailed constraints, which serve as the proxies for domain knowledge in our analysis.
Qualitative Analysis: User Experience. To validate the “engaging and natural” aspect of RQ1, we examine the feedback from the post-task questionnaire. Responses were overall positive (mean rating of 3.89/5.0), with comments highlighting the educational and conversational nature of the system: P13: “I appreciated how the chatbot patiently asked clarifying questions to understand my priorities, which made the recommendations feel personalized and relevant. The detailed comparisons between different e-bikes, including specific features like battery capacity, motor type, braking system, and handling, helped me make an informed decision,” and P66: “The chatbot is very friendly and human like. I enjoyed every bit.”
Negative comments primarily focused on the study constraints, such as the limited collection (e.g., P396: “It could do with offering more [laptop] models.”) or the interaction constraint that prohibits information dumps (e.g., P502: “was quite frustrating because it didn’t take into account all of my wants all at once.”). However, this latter frustration indirectly validates the protocol design: it confirms that the system successfully prevented keyword-search behaviors, forcing users to explore the domain through dialogue. Overall, the positive feedback on personalization and the fluid interaction confirms our protocol’s effectiveness in creating an engaging experience at scale.
6. Estimating User Domain Knowledge
In this section, we begin by comparing participants’ self‑reported expertise with their actual performance on the domain‑knowledge questionnaires (RQ2). To move past the well-documented unreliability of self-assessments (Dang et al., 2020; Cole et al., 2010), we investigate multiple strategies for grouping participants by objective scores on the knowledge questionnaires. We then examine behavioral patterns that distinguish experts from novices in real interaction settings (RQ3). Building on these validated ground‑truth labels, we finally evaluate whether LLMs can infer user knowledge directly from dialogue transcripts (RQ4). Our approach uses both zero‑shot and few‑shot prompting to estimate overall knowledge levels, circumventing dependence on large training datasets while leveraging the natural language understanding capabilities of modern pre-trained models.
6.1. Methods
We estimate user domain knowledge directly from conversation transcripts using LLMs in zero-shot and few-shot settings. To ensure consistency across settings, all methods share a unified prompt structure, shown in Fig. 4, comprising a fixed task description, a three-level expertise definition, placeholders for the domain, conversation history, and optional few-shot examples, and a constrained JSON output format. We evaluate three prompt variants. The holistic variant conditions on the full conversation history and produces a single overall knowledge estimate based on all user turns. The evidence-based variant also uses the full conversation but first instructs the model to extract domain-specific phrases introduced by the user and to ground its prediction explicitly in this extracted evidence. The incremental variant operates turn by turn: given the conversation history, a previous estimate, and the latest user utterance, the model updates the predicted knowledge level only when new evidence appears, with the restriction that the estimate may increase or remain stable but not decrease, a pragmatic design choice that prioritizes stability over perfect Bayesian updating. Each variant is evaluated in both zero-shot and few-shot settings, with few-shot prompts providing three domain-matched examples (one for each label). We additionally train supervised fine-tuned models using the same prompt structure and annotated conversations.
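The non-decreasing constraint of the incremental variant can be illustrated with a minimal sketch. It assumes the three labels are ordered novice < intermediate < expert and that some per-turn classifier (in the paper, an LLM prompt) proposes a level after each user utterance; the names `LEVELS`, `update_estimate`, and `run_incremental` are hypothetical and not taken from the released code.

```python
from typing import Iterable, List, Optional

# Assumed ordering of the three expertise labels (illustrative only).
LEVELS = ("novice", "intermediate", "expert")


def update_estimate(previous: Optional[str], proposed: str) -> str:
    """Return the higher of the previous and proposed levels (never decrease)."""
    if proposed not in LEVELS:
        raise ValueError(f"unknown level: {proposed!r}")
    if previous is None:
        return proposed
    return max(previous, proposed, key=LEVELS.index)


def run_incremental(per_turn_predictions: Iterable[str]) -> List[str]:
    """Fold per-turn proposals into a non-decreasing sequence of estimates."""
    estimate: Optional[str] = None
    trajectory: List[str] = []
    for proposed in per_turn_predictions:
        estimate = update_estimate(estimate, proposed)
        trajectory.append(estimate)
    return trajectory


# Stand-in for per-turn model outputs; a real system would query the LLM here.
trajectory = run_incremental(
    ["novice", "intermediate", "novice", "expert", "intermediate"]
)
print(trajectory)
# ['novice', 'intermediate', 'intermediate', 'expert', 'expert']
```

Under this constraint a single high proposal is never revised downward, which trades sensitivity to over-estimates for stability across turns.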
6.2. Experimental Details
We evaluate two off-the-shelf LLMs for zero-shot and few-shot prompting: gpt-4.1-mini-2025-04-14, which was also used during data collection, and Meta-Llama-3.3-70B, an open-weight model. Supervised fine-tuning is performed on two models: a 4-bit quantized Meta-Llama-3-8B-Instruct model with QLoRA adapters and an encoder-only ModernBERT-large model suited to classification. Both are trained on 310 labeled conversations, with 205 held out for evaluation, for 10 epochs with the same learning rate. All models are evaluated using identical prompts and expertise definitions.
7. Experimental Results
Here, we present our findings corresponding to RQ2, RQ3, and RQ4.
7.1. RQ2: Establishing Ground Truth
To characterize user behavior accurately, we first require a reliable method for labeling expertise. We compare the reliability of subjective self-assessments against objective knowledge scores.
Figure 5 presents the distribution of item difficulties per domain, measured by the proportion of participants answering each question correctly. The spread is well‑balanced: no item was prohibitively difficult (each was solved by at least 15% of participants), very few were trivial (solved by more than 90%), and most fell between 40% and 80%. This distribution indicates that the questionnaire is reasonably well‑suited for distinguishing between different knowledge levels in our study setting.
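As an illustration of how item difficulty can be computed, the following sketch derives the proportion of correct answers per question from a binary response matrix and flags items outside the 15% and 90% bounds mentioned above. The toy `responses` data and the function name `item_difficulty` are hypothetical; the actual questionnaire data are not reproduced here.

```python
def item_difficulty(responses):
    """Proportion of participants answering each item correctly.

    responses: one list per participant, holding 0/1 correctness flags per item.
    """
    n_participants = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_participants
        for i in range(n_items)
    ]


# Toy data: 4 participants x 4 questions (illustrative only).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
difficulty = item_difficulty(responses)

# Flag items that nearly nobody or nearly everybody solved.
too_hard = [i for i, p in enumerate(difficulty) if p < 0.15]
trivial = [i for i, p in enumerate(difficulty) if p > 0.90]
print(difficulty, too_hard, trivial)
# [1.0, 0.75, 0.25, 0.75] [] [0]
```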
We then examined how participants' self-reported expertise relates to their objective knowledge scores (Figure 6). The correlation is weak, with notable discrepancies, particularly among participants who identified as Intermediate or Expert yet performed poorly on the objective test. To determine empirically which measure serves as a more reliable ground truth, we compared task success rates under both labeling schemes. Intuitively, higher domain knowledge should facilitate the search process and lead to better outcomes. Indeed, under objective scoring, Experts outperformed Novices (40.4% vs. 35.9%). When users are grouped by self-assessment, however, the trend reverses: self-identified Novices achieved a higher success rate (44.4%) than self-identified Experts (36.4%). This inversion, in which perceived expertise is negatively associated with actual task performance, is consistent with the Dunning–Kruger effect (Kruger and Dunning, 1999), where overconfidence masks a lack of competence.
This indicates that self-assessment is a noisy and misleading signal. In contrast, objective scores offer a more granular and verifiable basis for labeling. Consequently, for the remainder of our analysis (RQ3 and RQ4), we use objective score percentiles as an operationalized ground-truth proxy, defining Novices (≤ 20th percentile) and Experts (≥ 80th percentile).
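A percentile-based labeling of this kind might be sketched as follows. Several details are assumptions made for illustration: the inclusive interpolation method of `statistics.quantiles`, the treatment of scores between the two cut points as Intermediate, and the handling of ties at the boundaries. The function name `label_by_percentile` and the toy scores are hypothetical.

```python
import statistics


def label_by_percentile(scores, low=20, high=80):
    """Map participant -> label using percentile cut points of the score distribution.

    scores: dict of participant id -> objective questionnaire score.
    """
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(scores.values(), n=100, method="inclusive")
    low_cut, high_cut = cuts[low - 1], cuts[high - 1]
    labels = {}
    for participant, score in scores.items():
        if score <= low_cut:
            labels[participant] = "novice"
        elif score >= high_cut:
            labels[participant] = "expert"
        else:
            labels[participant] = "intermediate"  # assumed middle band
    return labels


# Toy scores 0..10 for eleven hypothetical participants.
scores = {f"p{i}": i for i in range(11)}
labels = label_by_percentile(scores)
print(labels["p2"], labels["p3"], labels["p8"])
# novice intermediate expert
```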
7.2. RQ3: Behavioral and Linguistic Traces
Having established a robust ground truth, we examine how domain knowledge manifests within the conversation. Our goal is to determine whether experts and novices exhibit distinct behavioral patterns that are consistent enough to separate the two groups in practice. We analyze these interaction traces across three dimensions: dialogue flow, task outcomes, and linguistic specificity.
Dialogue Patterns and Task Outcomes. The dialogue intent progression (Fig. 7) shows that Recommendation dominates early interactions, peaking around the third turn before gradually declining, as Answer About Recommendation intents increase over time. Although all groups follow this general trajectory, we observe subtle differences. Experts receive slightly more recommendations, consistent with their more frequent preference updates and refinements. Novices, in contrast, ask more clarifying questions, indicating a greater need for guidance. Beyond the flow of conversation, we also analyze success rate: the percentage of participants who correctly identified the target item. Success varied substantially across domains, from 29.1% (running shoes) to 67.8% (bicycles), with an overall rate of 39.3%. Experts (40.4%) outperformed Novices (35.9%), suggesting that domain knowledge helps users steer the system more effectively.
Linguistic Markers of Expertise. To further answer RQ3, we first analyze basic linguistic features, such as the number of turns and utterance length. Experts engaged in slightly longer dialogues (10.68 messages on average) and used longer utterances (17.3 words) than novices (9.87 messages, 14.0 words).
Next, we compute domain-specific TF-IDF scores to quantify differences in keyword usage between the novice and expert groups. For each domain, we construct joint TF-IDF models over all documents to ensure a shared vocabulary and comparable scaling. Then, we average TF-IDF values within each group. The resulting normalized differences indicate how strongly each keyword is associated with Experts versus Novices (ranging from -1 to +1).
| Domain | Novice (≤ 20th percentile) | Expert (≥ 80th percentile) |
|---|---|---|
| Bicycle | recommend, local, new, need, stay | battery, range, assist, brake, gravel, trail, motor |
| Digital Camera | camera, bird, look, zoom, stability | lens, FujiFilm, autofocus, fast, travel |
| Laptop | best, game, work, time, speed, smooth | power, performance, light, windows, lenovo |
| Running Shoes | comfort, look, breathable | trail, Salomon, grip, protect |
| Smartwatch | smartwatch, good, brightness | battery life, charging, appearance |
Table 2 lists representative keywords highlighting the contrasts in keyword usage. Across domains, novices tend to use more general or evaluative terms (e.g., good, recommend, comfort), reflecting subjective impressions or purchase intentions. Experts, in contrast, rely more on technical or domain-specific vocabulary (e.g., brake, autofocus, performance), emphasizing specifications and functional attributes. These patterns suggest that Experts approach the CRS with a more analytical, comparison-oriented communication style, while novices rely on broader, experience-based descriptions.
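The TF-IDF comparison described above can be sketched with the standard library alone. The smoothed IDF formula, the length-normalized term frequency, and the normalization (E − N)/(E + N) for group means E and N are assumptions chosen so that the score falls in [−1, +1]; the paper does not specify these details. All function names and the toy utterances are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """One {term: tf-idf} dict per tokenized document, fitted jointly on all docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption), always >= 1 so every seen term has weight > 0.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def group_mean(vectors, vocab):
    """Average tf-idf per term over a group of documents (missing terms count as 0)."""
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in vocab}


def normalized_difference(expert_docs, novice_docs):
    """Score in [-1, +1]: +1 means expert-only usage, -1 means novice-only usage."""
    vectors = tfidf_vectors(expert_docs + novice_docs)  # joint model, shared vocabulary
    vocab = {t for v in vectors for t in v}
    expert = group_mean(vectors[: len(expert_docs)], vocab)
    novice = group_mean(vectors[len(expert_docs):], vocab)
    return {t: (expert[t] - novice[t]) / (expert[t] + novice[t]) for t in vocab}


# Toy tokenized conversations (illustrative only).
experts = [["disc", "brake", "battery", "range"], ["motor", "battery", "bike"]]
novices = [["good", "bike", "recommend"], ["need", "bike", "good"]]
diff = normalized_difference(experts, novices)
print(diff["battery"], diff["good"])
# 1.0 -1.0
```

Terms used by both groups, such as "bike" in the toy data, receive intermediate scores whose sign indicates the group that uses them more heavily.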
This TF-IDF analysis provides preliminary evidence for a clear hypothesis: Experts tend to communicate through specific product attributes, while novices tend to communicate through intended usage or scenarios. To test this hypothesis systematically, we designed an LLM-based annotation pipeline. We few-shot-prompted gpt-4.1-mini-2025-04-14 to analyze each user utterance and extract phrases corresponding to two distinct categories: (i) Specific Attributes: Mentions of technical features, components, or specifications (e.g., “disc brakes”). (ii) Intended Usage: Descriptions of goals, contexts, or scenarios (e.g., “for daily rides to work”). This annotation moves beyond simple keyword counts and enables per-utterance quantification of communication style. The results strongly support our hypothesis, particularly at the conversation level. We found that Experts mentioned significantly more specific attributes on average (9.01 per conversation) than Novices (6.80). While Experts also provided more usage-based context (5.38 vs. 4.59), their overall linguistic preference still leaned more heavily toward attributes (62.6% of expressions) compared to Novices (59.8%).
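As a quick arithmetic check on these shares, the sketch below assumes the attribute share is the number of attribute mentions divided by the sum of attribute and usage mentions; this formula and the helper name `attribute_share` are assumptions made for illustration. Applied to the reported per-conversation means, it gives roughly 62.6% for Experts and 59.7% for Novices. The small gap to the reported 59.8% would be expected if the share is averaged per conversation rather than computed from group means.

```python
def attribute_share(attribute_mentions, usage_mentions):
    """Fraction of annotated expressions that are specific attributes."""
    total = attribute_mentions + usage_mentions
    return attribute_mentions / total if total else 0.0


# Per-conversation means reported in the text.
expert_share = attribute_share(9.01, 5.38)
novice_share = attribute_share(6.80, 4.59)
print(round(100 * expert_share, 1), round(100 * novice_share, 1))
# 62.6 59.7
```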
7.3. RQ4: Automatic Knowledge Estimation
| Strategy | Model | Accuracy | Macro-F1 |
|---|---|---|---|
| Holistic | GPT-4.1 (zero-shot) | 0.305 | 0.296 |
| Holistic | GPT-4.1 (few-shot) | 0.378 | 0.363 |
| Holistic | LLaMA 3.3 (zero-shot) | 0.354 | 0.354 |
| Holistic | LLaMA 3.3 (few-shot) | 0.490 | 0.357 |
| Incremental | GPT-4.1 (zero-shot) | 0.359 | 0.332 |
| Incremental | GPT-4.1 (few-shot) | 0.349 | 0.347 |
| Incremental | LLaMA 3.3 (zero-shot) | 0.364 | 0.347 |
| Incremental | LLaMA 3.3 (few-shot) | 0.490 | 0.356 |
| Evidence-based | GPT-4.1 (zero-shot) | 0.339 | 0.321 |
| Evidence-based | GPT-4.1 (few-shot) | 0.417 | 0.379 |
| Evidence-based | LLaMA 3.3 (zero-shot) | 0.417 | 0.382 |
| Evidence-based | LLaMA 3.3 (few-shot) | 0.213 | 0.154 |
| Supervised | Llama 3-8B | 0.519 | 0.286 |
| Supervised | ModernBERT | 0.524 | 0.353 |
In this section, we address RQ4 by examining how effectively large language models can estimate user domain knowledge from conversational interactions.
Table 3 reports classification performance using accuracy and macro-averaged F1 to capture global correctness and class-level balance, respectively. Zero-shot LLaMA 3.3 yields comparatively strong macro-F1 scores under all three strategies (0.347 to 0.382), indicating balanced performance across expertise levels. Few-shot effects differ by model: adding in-context examples raises GPT-4.1's macro-F1 under every strategy, whereas for LLaMA 3.3 it leaves macro-F1 nearly unchanged in the holistic and incremental settings and degrades performance substantially in the evidence-based setting (macro-F1 drops from 0.382 to 0.154). Among strategies, incremental estimation is the most stable across models and settings (macro-F1 between 0.332 and 0.356), while the evidence-based variant attains both the highest macro-F1 (0.382) and the lowest (0.154). While the fine-tuned ModernBERT achieves the highest overall accuracy (0.524), it struggles with underrepresented categories, and its macro-F1 (0.353) does not exceed that of the best prompted LLMs. Ultimately, these moderate results highlight the inherent difficulty of the task: expertise cues are rarely explicit in single utterances, but rather distributed indirectly across multiple conversational turns.
To further understand the challenges in automatic knowledge estimation, Figure 9 contrasts confusion matrices for two of the best performing methods: zero-shot LLaMA and supervised ModernBERT. LLaMA shows distributed predictions with errors mostly confined to adjacent classes. Conversely, ModernBERT strongly favors the Intermediate class, maximizing overall accuracy at the expense of recall for Novices and Experts. This distinction explains the observed trade-off: ModernBERT over-centralizes predictions to optimize accuracy, whereas LLaMA maintains a better balance across the spectrum, resulting in higher macro-F1.
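The trade-off between accuracy and macro-F1 can be reproduced on toy data. The sketch below, which uses hypothetical labels and predictions rather than the study's outputs, compares a classifier that always predicts the majority Intermediate class with one that spreads its predictions across classes; `accuracy` and `macro_f1` are plain reimplementations of the standard definitions.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 (classes with no hits score 0)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["novice", "intermediate", "expert"]

# Toy gold labels with a dominant middle class (illustrative only).
gold = ["novice"] * 2 + ["intermediate"] * 6 + ["expert"] * 2

# Classifier A: always predicts the majority class.
centralized = ["intermediate"] * 10

# Classifier B: spreads predictions, making more errors overall.
spread = [
    "novice", "intermediate",                        # gold: novice
    "novice", "intermediate", "intermediate",
    "intermediate", "expert", "expert",              # gold: intermediate
    "expert", "intermediate",                        # gold: expert
]

print(accuracy(gold, centralized), macro_f1(gold, centralized, LABELS))
# 0.6 0.25
print(accuracy(gold, spread), round(macro_f1(gold, spread, LABELS), 3))
# 0.5 0.482
```

In this toy setting the centralized classifier wins on accuracy but loses on macro-F1, mirroring the pattern described for ModernBERT and zero-shot LLaMA.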
8. Conclusion and Future Directions
In this work, we introduce the novel task of estimating user domain knowledge from dialogue interactions, a crucial first step that enables the creation of adaptive conversational recommender systems. Our primary contribution is the design and validation of RecQuest, a gamified data collection protocol that successfully elicits natural, knowledge-rich interactions. By systematically creating information gaps between users and the system, RecQuest provokes the linguistic and behavioral traces necessary for studying domain knowledge in a CRS context. A key finding of our study is the necessity of objective knowledge assessment. Our comparison of self-reports versus objective scores revealed significant calibration biases, indicating that self-assessment is a noisy signal of expertise in this setting. Future research in this area must therefore rely on verified competence rather than user confidence. Finally, we established baselines for the automatic estimation of user knowledge. Our initial LLM-based experiments indicate that while broad distinctions between novices and experts are possible, fine-grained estimation remains an inherently difficult task. The subtle and evolving nature of knowledge signals in short conversational turns poses a significant challenge for current models, and the current results should thus be interpreted as initial evidence of task feasibility rather than robust practical performance.
Our protocol and dataset provide a foundation for studying this challenge and supporting future CRS that adapt elicitation and explanation strategies to users with different levels of understanding. More broadly, our results suggest that robust estimation will likely require additional data, and that the protocol may need refinement depending on the target application domain and conversational context. Finally, an inherent limitation of our current study is the assumption that differences in conversational behavior stem primarily from domain expertise. We recognize that real-world interactions are heavily influenced by confounding factors like individual verbosity, communication style, language proficiency, and personality traits. Isolating true knowledge signals from these individual characteristics remains an important challenge for future work.
Acknowledgements.
An unrestricted gift from Google partially supported this research.
References
- F. B. Baker and S. Kim (2004) Item response theory: parameter estimation techniques. CRC press. Cited by: §5.1.
- A. Balayn, G. He, A. Hu, J. Yang, and U. Gadiraju (2022) Ready player one! eliciting diverse knowledge using a configurable game. In Proceedings of the ACM Web Conference 2022, WWW ’22, pp. 1709–1719. Cited by: §1.
- N. Bernard and K. Balog (2023) MG-shopdial: a multi-goal conversational dataset for e-commerce. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pp. 2775–2785. Cited by: §2.
- N. Bernard, I. Kostric, and K. Balog (2024) IAI moviebot 2.0: an enhanced research platform with trainable neural components and transparent user modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM ’24, pp. 1042–1045. Cited by: §2.
- P. Brusilovsky and E. Millán (2007) User models for adaptive hypermedia and adaptive educational systems. In The Adaptive Web: Methods and Strategies of Web Personalization, pp. 3–53. Cited by: §3.
- P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pp. 5016–5026. Cited by: §2.
- L. Cai, K. Choi, M. Hansen, and L. Harrell (2016) Item response theory. Annual Review of Statistics and Its Application 3 (1), pp. 297–321. Cited by: §5.1.
- K. Christakopoulou, F. Radlinski, and K. Hofmann (2016) Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16, pp. 815–824. Cited by: §2.
- M. J. Cole, X. Zhang, J. Liu, C. Liu, N. J. Belkin, R. Bierig, and J. Gwizdka (2010) Are self-assessments reliable indicators of topic knowledge?. Proceedings of the American Society for Information Science and Technology 47 (1), pp. 1–10. Cited by: §6.
- K. Collins-Thompson, S. Y. Rieh, C. C. Haynes, and R. Syed (2016) Assessing learning outcomes in web search: a comparison of tasks and query strategies. In Proceedings of the 2016 ACM on conference on human information interaction and retrieval, CHIIR ’16, pp. 163–172. Cited by: §2.
- J. Dang, K. M. King, and M. Inzlicht (2020) Why are self-report and behavioral measures weakly correlated?. Trends in cognitive sciences 24 (4), pp. 267–269. Cited by: §6.
- R. Ferrod, F. Cena, L. Di Caro, D. Mana, and R. G. Simeoni (2021) Identifying users’ domain expertise from dialogues. In Adjunct Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’21, pp. 29–34. Cited by: §2.
- U. Gadiraju, R. Yu, S. Dietze, and P. Holtz (2018) Analyzing knowledge gain of users in informational search sessions on the web. In Proceedings of the 2018 conference on human information interaction & retrieval, CHIIR ’18, pp. 2–11. Cited by: §2.
- S. A. Hayati, D. Kang, Q. Zhu, W. Shi, and Z. Yu (2020) INSPIRED: toward sociable recommendation dialog systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP ’20, pp. 8142–8152. Cited by: §2.
- Y. Hou, J. Li, X. Fu, Z. He, A. Yan, X. Chen, and J. McAuley (2026) Bridging language and items for retrieval and recommendation: benchmarking LLMs as semantic encoders. arXiv:2403.03952. Cited by: §4.2.
- D. Jannach, A. Manzoor, W. Cai, and L. Chen (2022) A survey on conversational recommender systems. ACM Computing Surveys 54 (5), pp. 1–36. Cited by: §1.
- H. Joko, S. Chatterjee, A. Ramsay, A. P. De Vries, J. Dalton, and F. Hasibi (2024) Doing personal laps: llm-augmented dialogue construction for personalized multi-session conversational search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, pp. 796–806. Cited by: §1, §2.
- D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y. Boureau, and J. Weston (2019) Recommendation as a communication game: self-supervised bot-play for goal-oriented dialogue. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP ’19, pp. 1951–1961. Cited by: §2.
- J. Kiseleva, A. M. García, J. Kamps, and N. Spirin (2015) The impact of technical domain expertise on search behavior and task outcome. arXiv:1512.07051. Cited by: §2.
- B. P. Knijnenburg and M. C. Willemsen (2009) Understanding the effect of adaptive preference elicitation methods on user satisfaction of a recommender system. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09, pp. 381–384. Cited by: §2.
- I. Kostric, K. Balog, and U. Gadiraju (2025) Should we tailor the talk? understanding the impact of conversational styles on preference elicitation in conversational recommender systems. In Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’25, pp. 164–173. Cited by: §1.
- I. Kostric, K. Balog, and F. Radlinski (2024) Generating usage-related questions for preference elicitation in conversational recommender systems. ACM Transactions on Recommender Systems 2 (2), pp. 1–24. Cited by: §1.
- J. Kruger and D. Dunning (1999) Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology 77 (6), pp. 1121–1134. Cited by: §7.1.
- J. H. Lee, A. Renear, and L. C. Smith (2006) Known-item search: variations on a concept. Proceedings of the American Society for Information Science and Technology 43 (1), pp. 1–17. Cited by: §1.
- W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M. Kan, and T. Chua (2020) Estimation-action-reflection: towards deep interaction between conversational and recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, pp. 304–312. Cited by: §2.
- R. Li, S. Ebrahimi Kahou, H. Schulz, V. Michalski, L. Charlin, and C. Pal (2018) Towards deep conversational recommendations. In Advances in Neural Information Processing Systems, NIPS ’18, Vol. 31, pp. 9748–9758. Cited by: §1, §2.
- A. Lin, Z. Zhu, J. Wang, and J. Caverlee (2023) Enhancing user personalization in conversational recommenders. In Proceedings of the ACM Web Conference 2023, WWW ’23, pp. 770–778. Cited by: §1.
- J. Mao, Y. Liu, N. Kando, M. Zhang, and S. Ma (2018) How does domain expertise affect users’ search interaction and outcome in exploratory search?. ACM Transactions on Information Systems 36 (4), pp. 1–30. Cited by: §1, §1, §2.
- J. J. McAuley and J. Leskovec (2013) From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web, WWW ’13, pp. 897–908. Cited by: §2.
- T. Noh, H. Yeo, M. Kim, and K. Han (2023) A study on user perception and experience differences in recommendation results by domain expertise: the case of fashion domains. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA ’23, pp. 1–7. Cited by: §2.
- D. Pramod and P. Bafna (2022) Conversational recommender systems techniques, tools, acceptance, and adoption: a state of the art review. Expert Systems with Applications 203, pp. 117539. Cited by: §1.
- F. Radlinski, K. Balog, B. Byrne, and K. Krishnamoorthi (2019) Coached conversational preference elicitation: a case study in understanding movie preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, SIGDIAL ’19, pp. 353–360. Cited by: §1, §2.
- A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. In Proceedings of the AAAI conference on artificial intelligence, AAAI ’20, pp. 8689–8696. Cited by: §2.
- P. Shah, D. Hakkani-Tür, B. Liu, and G. Tür (2018) Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), NAACL ’18, pp. 41–51. Cited by: §2.
- D. Tabatabai and B. M. Shore (2005) How experts and novices search the web. Library & Information Science Research 27 (2), pp. 222–248. Cited by: §2.
- R. Vidigal (2025) Measuring belief certainty in political knowledge. Political Behavior 47 (2), pp. 529–551. Cited by: §5.1.
- L. Von Ahn (2006) Games with a purpose. Computer 39 (6), pp. 92–94. Cited by: §1.
- R. W. White, S. Dumais, and J. Teevan (2009) Characterizing the influence of domain expertise on web search behavior. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pp. 132–141. Cited by: §1, §1, §2.
- R. Yu, U. Gadiraju, P. Holtz, M. Rokicki, P. Kemkes, and S. Dietze (2018) Predicting user knowledge gain in informational search sessions. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pp. 75–84. Cited by: §2.
- G. Zhang, C. Gao, W. Lei, X. Guo, S. Li, H. Chen, Z. Ding, S. Xu, and L. Wu (2025) Vague preference policy learning for conversational recommendation. ACM Transactions on Information Systems 43 (3), pp. 1–27. Cited by: §1.
- X. Zhang, M. Cole, and N. Belkin (2011) Predicting users’ domain knowledge from search behaviors. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR ’11, pp. 1225–1226. Cited by: §2.
- X. Zhang, J. Liu, M. Cole, and N. Belkin (2015) Predicting users’ domain knowledge in information retrieval using multiple regression analysis of search behaviors. Journal of the Association for Information Science and Technology 66 (5), pp. 980–1000. Cited by: §2.
- C. Zhao, T. Yu, Z. Xie, and S. Li (2022) Knowledge-aware conversational preference elicitation with bandit feedback. In Proceedings of the ACM Web Conference 2022, WWW ’22, pp. 483–492. Cited by: §1.
- K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, and J. Wen (2020) Towards topic-guided conversational recommender system. In Proceedings of the 28th International Conference on Computational Linguistics, COLING ’20, pp. 4128–4139. Cited by: §2.
