paper stringlengths 10 10 | relevant_tables listlengths 1 3 | tables listlengths 3 55 | fulltext stringlengths 27.1k 359k | question stringlengths 36 601 | answer stringlengths 1 126 | plan stringlengths 76 1.47k ⌀ |
|---|---|---|---|---|---|---|
2401.06769 | [
[
"\\begin{table*}[h!]\n",
"\\centering\n",
"\\begin{tabularx}{\\textwidth}{@{}Xrrrrrrrrr@{}}\n",
"\\toprule\n",
"& \\multicolumn{3}{c}{M2M-100-418M} & \\multicolumn{3}{c}{SMaLL-100} & \\multicolumn{3}{c}{NLLB-200-1.3B} \\\\\n",
"\\cmidrule(lr){2-4} \\cmidrule(lr){5-7} \\cmidrule(lr){8-10}\n... | [
[
"\\begin{figure}\n",
" \\centering\n",
" \\includegraphics[width=\\columnwidth, trim=0 0.15cm 0 0, clip]{images/figure1}\n",
" \\caption{\n",
" NMT models can be used for inferring the likely original translation direction of parallel text.\n",
" In this example, the NMT... | % This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
\documentclass[11pt]{article}
% Remove the "review" option to generate the final version.
%\usepackage[review]{acl... | Which model has the biggest difference in translation quality when translating into English versus from English, and what is the value of that difference? | NLLB-200-1.3B. 64.71 | SELECT all models
LOOP for each mode
SELECT all language pair containing en(English)
LOOP for each language pair containing en (English)
COMPUTE diff = abs(score translating into English − score translating from English)
COMPUTE max diff for the model
COMPUTE argmax max diff across all models
RETURN... |
2401.06769 | [
[
"\\begin{table}\n",
"\\centering\n",
"\\begin{tabularx}{\\columnwidth}{@{}Xrrr@{}}\n",
"\\toprule\n",
"Language Pair & \\(\\rightarrow\\) & \\(\\leftarrow\\) & Avg. \\\\\n",
"\\midrule\n",
"HT~~en\\biarrow cs & 88.24 & 80.62 & 84.43 \\\\\n",
"HT~~en\\biarrow de & 70.40 & 88.43 & ... | [
[
"\\begin{figure}\n",
" \\centering\n",
" \\includegraphics[width=\\columnwidth, trim=0 0.15cm 0 0, clip]{images/figure1}\n",
" \\caption{\n",
" NMT models can be used for inferring the likely original translation direction of parallel text.\n",
" In this example, the NMT... | % This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
\documentclass[11pt]{article}
% Remove the "review" option to generate the final version.
%\usepackage[review]{acl... | Can we detect the translation direction for Czech-English better for human translation or neural machine translation? | neural translation | SELECT avg detection scores for (en-cs) for human translations from Table 1
SELECT avg detection scores for (en-cs) for neural machine translations from Table 2
IF human detection score > NMT detection score
RETURN human translation
ELSE
RETURN neural machine translation
|
2410.21272 | [["\\begin{table}[h]\n"," \\centering\n"," \\caption{Accuracy of the analyzed models on arithm(...TRUNCATED) | [["\\begin{figure}[t]\n"," \\centering\n"," \\includegraphics[width=0.96\\textwidth]{figures/o(...TRUNCATED) | "\n\\documentclass{article} %\n\\usepackage{arxiv_preprint,times}\n\n\n\\usepackage{amsmath,amsfonts(...TRUNCATED) | Calculate the average accuracy for addition and division operations for each model. | Llama3-8B: 0.945, Llama3-70B: 0.85, Pythia-6.9B: 0.525, GPT-J: 0.435 | "SELECT all models\nLOOP for each model\n COMPUTE average_accuracy = (accuracy for + operation + (...TRUNCATED) |
2410.21272 | [["\\begin{table}[h]\n"," \\centering\n"," \\caption{Accuracy of the analyzed models on arithm(...TRUNCATED) | [["\\begin{figure}[t]\n"," \\centering\n"," \\includegraphics[width=0.96\\textwidth]{figures/o(...TRUNCATED) | "\n\\documentclass{article} %\n\\usepackage{arxiv_preprint,times}\n\n\n\\usepackage{amsmath,amsfonts(...TRUNCATED) | Which operation reduced the average accuracy of Llama3-70B model? | divison | "SELECT Llama3-70B model\nCOMPUTE argmin accuracy across all operations\nRETURN operation with lowes(...TRUNCATED) |
2410.21272 | [["\\begin{table}[h]\n"," \\centering\n"," \\caption{Accuracy of the analyzed models on arithm(...TRUNCATED) | [["\\begin{figure}[t]\n"," \\centering\n"," \\includegraphics[width=0.96\\textwidth]{figures/o(...TRUNCATED) | "\n\\documentclass{article} %\n\\usepackage{arxiv_preprint,times}\n\n\n\\usepackage{amsmath,amsfonts(...TRUNCATED) | Which model has the highest average of the multiplication and division operations? | Llama3-8B | "SELECT all models\nLOOP for each model\n COMPUTE average_accuracy = (accuracy for × operation +(...TRUNCATED) |
2205.15544 | [["\\begin{table}[t]\n"," \\begin{center}\n"," \\caption{Comparison of BLEU scores for differe(...TRUNCATED) | [["\\begin{figure}\n"," \\begin{center}\n"," \\centerline{\\includegraphics[width=0.9\\textwid(...TRUNCATED) | "\\documentclass{article}\n\n\n% if you need to pass options to natbib, use, e.g.:\n\\PassOptionsToP(...TRUNCATED) | What is the average Nepali translation BLEU score for each method? | 7.2, 9.05, 10.0, 13.6 | "SELECT language pairs containing Ne (Nepali)\nLOOP for each method\n COMPUTE average_Nepali_BLEU(...TRUNCATED) |
2205.15544 | [["\\begin{table}[t]\n"," \\begin{center}\n"," \\caption{Comparison of BLEU scores for differe(...TRUNCATED) | [["\\begin{figure}\n"," \\begin{center}\n"," \\centerline{\\includegraphics[width=0.9\\textwid(...TRUNCATED) | "\\documentclass{article}\n\n\n% if you need to pass options to natbib, use, e.g.:\n\\PassOptionsToP(...TRUNCATED) | What are the languages mentioned in the table? | English, Nepali, Sinhala, Hindi, Gujarati, Finnish, Estonian, Latvian, Kazakh | SELECT all unique languages mentioned in the table
RETURN list of languages
|
2205.15544 | [["\\begin{table}[t]\n"," \\begin{center}\n"," \\caption{Comparison of BLEU scores for differe(...TRUNCATED) | [["\\begin{figure}\n"," \\begin{center}\n"," \\centerline{\\includegraphics[width=0.9\\textwid(...TRUNCATED) | "\\documentclass{article}\n\n\n% if you need to pass options to natbib, use, e.g.:\n\\PassOptionsToP(...TRUNCATED) | Which language family has the highest average BLEU score using our method? | Uralic | "SELECT Ours method\n\n\nLOOP for each language family\n COMPUTE average BLEU score across all la(...TRUNCATED) |
1903.00089 | [["\\begin{table*}[!ht]\n","\\begin{center}\n","\\setlength\\tabcolsep{4.9pt}\n","\\begin{tabular}{l(...TRUNCATED) | [["\\begin{table}[!ht]\n","\\begin{center}\n","\\begin{small}\n","\\setlength\\tabcolsep{3.8pt}\n","(...TRUNCATED) | " %\n% File naacl2019.tex\n%\n%% Based on the style files for ACL 2018 and NAACL 2018, which were\n%(...TRUNCATED) | Which translation direction has a higher BLEU score for Italian? | X→En | "SELECT X to EN BLEU scores for Italian from Table 1\nSELECT EN to X BLEU scores for Italian from Ta(...TRUNCATED) |
1911.02782 | [["\\begin{table*}[t!]\n"," \\centering\n"," \\scalebox{0.81}{\n"," \\begin{tabular}{p{(...TRUNCATED) | [["\\begin{figure}[t!]\n"," \\centering\n"," \\includegraphics[width=\\columnwidth]{gorc_links(...TRUNCATED) | "%\n% File acl2020.tex\n%\n%% Based on the style files for ACL 2020, which were\n%% Based on the sty(...TRUNCATED) | In which domain does S2ORC outperform SCIBERT in most of the task? | Biomed | "SELECT S2ORC and SCIBERT scores\n\n\nLOOP for each domain\n LOOP for each dataset in the domain\(...TRUNCATED) |
Dataset Card for SciTaRC
Dataset Summary
SciTaRC (Scientific Table Reasoning and Computation) is an expert-authored benchmark designed to evaluate Large Language Models (LLMs) on complex question-answering tasks over real-world scientific tables.
Unlike existing benchmarks that focus on simple table-text integration or single-step operations, SciTaRC focuses on composite reasoning—requiring models to execute interdependent operations such as descriptive analysis, complex arithmetic, and ranking across detailed scientific tables. To facilitate granular diagnosis of model failures, every instance includes an expert-annotated pseudo-code plan that explicitly outlines the algorithmic reasoning steps required to reach the correct answer.
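To give a sense of how a pseudo-code plan maps onto executable steps, here is a minimal sketch of the SELECT / LOOP / COMPUTE / argmax pattern used by the first preview row's plan (largest gap between translating into and from English). The score dictionary, model names, and the function `largest_direction_gap` are all hypothetical and invented for illustration; they are not values from the dataset or code from the benchmark.

```python
# Toy scores, invented for illustration only (NOT taken from any paper):
# model -> language pair -> (score into English, score from English)
toy_scores = {
    "model_a": {"en-xx": (70.0, 60.0), "en-yy": (55.0, 50.0)},
    "model_b": {"en-xx": (80.0, 45.0), "en-yy": (62.0, 61.0)},
}


def largest_direction_gap(scores):
    """Follow the plan: per-model max absolute gap, then argmax over models."""
    max_diff = {}
    # LOOP for each model
    for model, pairs in scores.items():
        # LOOP for each language pair containing en: COMPUTE diff = abs(into - from)
        diffs = [abs(into_en - from_en) for into_en, from_en in pairs.values()]
        # COMPUTE max diff for the model
        max_diff[model] = max(diffs)
    # COMPUTE argmax of max diff across all models
    best_model = max(max_diff, key=max_diff.get)
    # RETURN the model and the value of its largest gap
    return best_model, max_diff[best_model]


best_model, gap = largest_direction_gap(toy_scores)
print(best_model, gap)
```

Keeping one Python statement per plan step makes it easier to see which step a model's reasoning diverges from when its answer is wrong.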
Dataset Structure
The dataset is provided as a single test split containing 371 expert-annotated instances.
Data Instances
A typical instance contains the question, the ground truth answer, the expert-authored pseudo-code plan, the LaTeX representations of the relevant tables, and the full text of the source paper.
Data Fields
Each JSON object in the dataset contains the following fields:
- paper (string): The arXiv ID of the source scientific paper (e.g., "2401.06769").
- question (string): The complex, multi-step question asked about the tabular data.
- answer (string): The ground-truth answer.
- plan (string): The expert-authored pseudo-code blueprint outlining logical and mathematical operations (e.g., SELECT, LOOP, COMPUTE).
- relevant_tables (list of lists of strings): The exact LaTeX source code for the specific table(s) required to answer the question.
- tables (list of lists of strings): The LaTeX source code for all tables and figures extracted from the paper.
- fulltext (string): The complete LaTeX source text of the original scientific paper, providing full context.
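Because each table is stored as a list of LaTeX source lines, a consumer usually has to split rows into cells before doing any arithmetic. The following is a rough sketch of that step, assuming simple `&`-separated rows terminated by `\\`; the helper `parse_rows` is hypothetical and not part of the dataset. The three lines in `relevant_table` are copied from the second preview row; the rest of that record is omitted.

```python
# Fragment of one inner list from `relevant_tables` (lines copied from the preview).
relevant_table = [
    "Language Pair & \\(\\rightarrow\\) & \\(\\leftarrow\\) & Avg. \\\\\n",
    "\\midrule\n",
    "HT~~en\\biarrow cs & 88.24 & 80.62 & 84.43 \\\\\n",
]


def parse_rows(table_lines):
    """Return {row label: [float, ...]} for lines that look like numeric rows."""
    rows = {}
    for line in table_lines:
        # Drop the LaTeX row terminator, then split the row into cells.
        cells = [cell.strip() for cell in line.replace("\\\\", "").split("&")]
        if len(cells) < 2:
            continue  # rules such as \midrule have no cell separators
        try:
            values = [float(cell) for cell in cells[1:]]
        except ValueError:
            continue  # header rows and other non-numeric rows are skipped
        rows[cells[0]] = values
    return rows


rows = parse_rows(relevant_table)
forward, backward, reported_avg = rows["HT~~en\\biarrow cs"]
# Recompute the average of the two directions and compare with the Avg. column.
print(round((forward + backward) / 2, 2), reported_avg)
```

Real tables in the benchmark can contain `\multicolumn`, bold markup, or nested environments, so this naive splitter should be treated as a starting point rather than a general LaTeX parser.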
Citation
If you use this dataset, please cite our paper:
