URL: https://huggingface.co/datasets/next-tat/TAT-QA

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset that aims to stimulate progress in QA research over more complex and realistic tabular and textual data, especially data that requires numerical reasoning.

The unique features of TAT-QA include:

  • The context given is hybrid, comprising a semi-structured table and at least two relevant paragraphs that describe, analyze or complement the table;
  • The questions are written by people with rich financial knowledge, and most are practical;
  • The answer forms are diverse, including single span, multiple spans and free-form;
  • To answer the questions, various numerical reasoning capabilities are usually required, including addition (+), subtraction (-), multiplication (x), division (/), counting, comparison, sorting, and their compositions;
  • In addition to the ground-truth answers, the corresponding derivation and scale are also provided where applicable.
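The derivations mentioned in the last item are arithmetic expressions, such as "44.1 - 56.7" in the Data Format example. A minimal sketch of checking one against its answer follows. The helper name `evaluate_derivation` is hypothetical, and the sketch assumes a derivation uses only numbers, parentheses and the operators `+`, `-`, `*`, `/`; derivations in the released files may use other notation that would need preprocessing first.

```python
import ast
import operator

# Binary operators allowed in a derivation; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_derivation(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.USub, ast.UAdd)):
            value = walk(node.operand)
            return -value if isinstance(node.op, ast.USub) else value
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr.strip(), mode="eval"))


# Derivation taken from the Data Format example; rounding absorbs float error.
result = evaluate_derivation("44.1 - 56.7")
print(round(result, 2))  # -12.6
```

Walking the parsed syntax tree rather than calling `eval` keeps the check restricted to arithmetic, which matters when the strings come from a downloaded file.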

In total, TAT-QA contains 16,552 questions associated with 2,757 hybrid contexts from real-world financial reports.

For more details, please refer to the project page: https://nextplusplus.github.io/TAT-QA/

Data Format

{
 "table": { # The tabular data in a hybrid context
 "uid": "3ffd9053-a45d-491c-957a-1b2fa0af0570", # The unique id of a table
 "table": [ # The table content as a 2D array
 [
 "",
 "2019",
 "2018",
 "2017"
 ],
 [
 "Fixed Price",
 "$ 1,452.4",
 "$ 1,146.2",
 "$ 1,036.9"
 ],
 ...
 ]
 },
 "paragraphs": [ # The textual data in a hybrid context, comprising at least two paragraphs associated with the table
 {
 "uid": "f4ac7069-10a2-47e9-995c-3903293b3d47", # The unique id of a paragraph
 "order": 1, # The order of the paragraph in all associated paragraphs, starting from 1
 "text": "Sales by Contract Type: Substantially all of our contracts are fixed-price type contracts. Sales included in Other contract types represent cost plus and time and material type contracts." # The content of the paragraph
 },
 ...
 ],
 "questions": [ # The questions associated with the hybrid context
 {
 "uid": "eb787966-fa02-401f-bfaf-ccabf3828b23", # The unique id of a question
 "order": 2, # The order of the question in all questions, starting from 1
 "question": "What is the change in Other in 2019 from 2018?", # The question itself
 "answer": -12.6, # The ground-truth answer
 "derivation": "44.1 - 56.7", # The derivation that can be executed to arrive at the ground-truth answer
 "answer_type": "arithmetic", # The answer type including `span`, `spans`, `arithmetic` and `counting`.
 "answer_from": "table-text", # The source of the answer including `table`, `text` and `table-text`
 "rel_paragraphs": [ # The orders of the paragraphs that are relied on to infer the answer, if any
 "2"
 ],
 "req_comparison": false, # A flag indicating if `comparison/sorting` is needed to answer the question whose answer is a single span or multiple spans
 "scale": "million" # The scale of the answer including `None`, `thousand`, `million`, `billion` and `percent`
 }
 ]
}
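
As a rough illustration of working with a record in this format, the sketch below converts table cells to numbers and looks up the paragraphs a question relies on. The helper names `parse_cell` and `relevant_paragraphs` are hypothetical, the sample record is abridged from the example above with a placeholder standing in for the elided second paragraph, and the sketch assumes `rel_paragraphs` entries are the string form of a paragraph's `order`, as the example suggests.

```python
from typing import Optional


def parse_cell(cell: str) -> Optional[float]:
    """Convert a cell such as "$ 1,452.4" to 1452.4; return None if not numeric."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relevant_paragraphs(context: dict, question: dict) -> list:
    """Return the text of each paragraph listed in the question's `rel_paragraphs`."""
    by_order = {str(p["order"]): p["text"] for p in context["paragraphs"]}
    return [
        by_order[str(order)]
        for order in question.get("rel_paragraphs", [])
        if str(order) in by_order
    ]


# Abridged record; the second paragraph text is a placeholder, not real data.
context = {
    "table": {
        "table": [
            ["", "2019", "2018", "2017"],
            ["Fixed Price", "$ 1,452.4", "$ 1,146.2", "$ 1,036.9"],
        ]
    },
    "paragraphs": [
        {"order": 1, "text": "Sales by Contract Type: ..."},
        {"order": 2, "text": "<second paragraph, elided in the example>"},
    ],
}
question = {"rel_paragraphs": ["2"]}

numeric_rows = [[parse_cell(c) for c in row] for row in context["table"]["table"]]
print(numeric_rows[1])  # [None, 1452.4, 1146.2, 1036.9]
print(relevant_paragraphs(context, question))
```

Header cells such as "2019" also parse as numbers here, so callers would need to treat the first row and first column as labels. Other number formats that financial tables may use, such as parenthesized negatives, are not handled by this sketch.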

Citation

@inproceedings{zhu2021tat,
 title = "{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance",
 author = "Zhu, Fengbin and
 Lei, Wenqiang and
 Huang, Youcheng and
 Wang, Chao and
 Zhang, Shuo and
 Lv, Jiancheng and
 Feng, Fuli and
 Chua, Tat-Seng",
 booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
 month = aug,
 year = "2021",
 address = "Online",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2021.acl-long.254",
 doi = "10.18653/v1/2021.acl-long.254",
 pages = "3277--3287"
}