
Dataset Card for LEXTREME: A Multilingual Legal Benchmark for Natural Language Understanding

Dataset Summary

The dataset consists of 12 diverse multilingual legal NLU datasets. Six datasets have a single configuration, five have two or three configurations, and one has three temporal-epoch configurations. This yields a total of 21 tasks: 11 single-label text classification tasks, 5 multi-label text classification tasks, and 5 token classification tasks.

Use the dataset like this:

from datasets import load_dataset
dataset = load_dataset("joelito/lextreme", "swiss_judgment_prediction")
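
Loading a configuration returns a mapping from split name to dataset. As a minimal sketch, the split sizes can be inspected as shown here. The helper names `split_sizes` and `load_split_sizes` are illustrative and not part of the benchmark, and depending on the installed `datasets` version, script-based datasets may need extra arguments such as `trust_remote_code=True`.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split of a loaded dataset."""
    return {split: len(examples) for split, examples in dataset.items()}


def load_split_sizes(config_name: str, repo: str = "joelito/lextreme") -> Dict[str, int]:
    """Download one configuration from the hub and report its split sizes."""
    # Imported lazily so that split_sizes can be used without the library installed.
    from datasets import load_dataset

    return split_sizes(load_dataset(repo, config_name))


# Offline stand-in with the same mapping shape as a loaded DatasetDict.
toy = {"train": [0, 1, 2], "validation": [3], "test": [4, 5]}
print(split_sizes(toy))
```

Calling `load_split_sizes("swiss_judgment_prediction")` would download the data and return the real sizes; the toy mapping only demonstrates the shape of the result.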

Supported Tasks and Leaderboards

The dataset supports text classification and token classification. In detail, we support the following tasks and configurations:

task	task type	configurations	link
Brazilian Court Decisions	Judgment Prediction	(judgment, unanimity)	joelito/brazilian_court_decisions
Swiss Judgment Prediction	Judgment Prediction	default	joelito/swiss_judgment_prediction
German Argument Mining	Argument Mining	default	joelito/german_argument_mining
Greek Legal Code	Topic Classification	(volume, chapter, subject)	greek_legal_code
Online Terms of Service	Unfairness Classification	(unfairness level, clause topic)	online_terms_of_service
Covid 19 Emergency Event	Event Classification	default	covid19_emergency_event
MultiEURLEX	Topic Classification	(level 1, level 2, level 3)	multi_eurlex
LeNER BR	Named Entity Recognition	default	lener_br
LegalNERo	Named Entity Recognition	default	legalnero
Greek Legal NER	Named Entity Recognition	default	greek_legal_ner
MAPA	Named Entity Recognition	(coarse, fine)	mapa
Ukrainian Court Decisions	Judgment Prediction	(pre_war, hybrid_war, full_scale)	overthelex/ukrainian-court-decisions
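
The task total stated in the summary can be checked against the table. The sketch below transcribes the configurations column as plain labels (these are the table entries, not necessarily the exact configuration names accepted by `load_dataset`) and counts them.

```python
# Configurations per dataset, transcribed from the table above.
# "default" marks a dataset with a single configuration.
CONFIGURATIONS = {
    "Brazilian Court Decisions": ["judgment", "unanimity"],
    "Swiss Judgment Prediction": ["default"],
    "German Argument Mining": ["default"],
    "Greek Legal Code": ["volume", "chapter", "subject"],
    "Online Terms of Service": ["unfairness level", "clause topic"],
    "Covid 19 Emergency Event": ["default"],
    "MultiEURLEX": ["level 1", "level 2", "level 3"],
    "LeNER BR": ["default"],
    "LegalNERo": ["default"],
    "Greek Legal NER": ["default"],
    "MAPA": ["coarse", "fine"],
    "Ukrainian Court Decisions": ["pre_war", "hybrid_war", "full_scale"],
}

num_datasets = len(CONFIGURATIONS)
num_tasks = sum(len(configs) for configs in CONFIGURATIONS.values())
single_config = sum(1 for configs in CONFIGURATIONS.values() if len(configs) == 1)

print(num_datasets, num_tasks, single_config)  # 12 21 6
```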

Languages

The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv, uk

Dataset Structure

Data Instances

The file format is jsonl and three data splits are present for each configuration (train, validation and test).
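
JSONL stores one JSON object per line, so a split file can be read with the standard library alone. The following is a minimal sketch; the field names `input` and `label` are placeholders chosen for illustration, since the data fields are not documented here.

```python
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records: the field names "input" and "label" are placeholders,
# not the documented schema of any LEXTREME configuration.
sample = [
    '{"input": "first example text", "label": 0}',
    "",
    '{"input": "second example text", "label": 1}',
]
records = list(read_jsonl(sample))
print(len(records), records[0]["label"])  # 2 0
```

Because `read_jsonl` accepts any iterable of strings, an open file handle can be passed in place of the list.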

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

How can I contribute a dataset to lextreme? Follow these steps:

1. Make sure your dataset is available on the Hugging Face Hub and has train, validation and test splits.
2. Create a pull request to the lextreme repository that adds the following to the lextreme.py file:
- Create a dict _{YOUR_DATASET_NAME} (similar to _BRAZILIAN_COURT_DECISIONS_JUDGMENT) containing all the necessary information about your dataset (task_type, input_col, label_col, etc.)
- Add your dataset to the BUILDER_CONFIGS list: LextremeConfig(name="{your_dataset_name}", **_{YOUR_DATASET_NAME})
- Test that it works correctly by loading your subset with load_dataset("lextreme", "{your_dataset_name}") and inspecting a few examples.
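
The steps above can be sketched as follows. This is an illustration under assumptions, not the code in lextreme.py: the `LextremeConfig` dataclass is a simplified stand-in for the real class, `missing_splits` is a hypothetical helper, and every value in `_MY_DATASET` is a placeholder. Only the keys `task_type`, `input_col` and `label_col` come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List

REQUIRED_SPLITS = ("train", "validation", "test")


def missing_splits(available: Iterable[str]) -> List[str]:
    """Return the required splits that the dataset does not provide."""
    present = set(available)
    return [split for split in REQUIRED_SPLITS if split not in present]


@dataclass
class LextremeConfig:
    """Stand-in for the class in lextreme.py; the real fields may differ."""

    name: str
    task_type: str
    input_col: str
    label_col: str


# Hypothetical dataset description; every value here is a placeholder.
_MY_DATASET = {
    "task_type": "single_label_text_classification",
    "input_col": "text",
    "label_col": "label",
}

BUILDER_CONFIGS = [LextremeConfig(name="my_dataset", **_MY_DATASET)]

print(missing_splits(["train", "test"]))  # ['validation']
print(BUILDER_CONFIGS[0].input_col)  # text
```

In the real file, the dict would likely carry further entries (the "etc." in the step above), and the config would be appended to the existing BUILDER_CONFIGS list rather than creating a new one.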

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@misc{niklaus2023lextreme,
 title={LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain},
 author={Joel Niklaus and Veton Matoshi and Pooja Rani and Andrea Galassi and Matthias Stürmer and Ilias Chalkidis},
 year={2023},
 eprint={2301.13126},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}