URL: https://towardsdatascience.com/training-t5-for-paraphrase-generation-ab3b5be151a2/

Training T5 for paraphrase generation | Towards Data Science


Image generated using Imgflip

In my previous blog post about TextGenie, I mentioned the issues I faced while collecting text data from scratch, and how I used paraphrases generated by T5 (Text-To-Text Transfer Transformer) as one of the methods to augment text data. Having seen the model in action, let’s get our hands dirty with the training process 😉

If you wish to follow along, you can find the training notebook here on my GitHub repo.

Tip: If you do not have a GPU, I suggest using Google Colaboratory for training the model.

Installing dependencies

Before proceeding, let’s get all the required packages handy using:

pip install simpletransformers datasets tqdm pandas scikit-learn

Dataset

We shall use the TaPaCo dataset for our task. The dataset consists of a total of 1.9 million sentences in 73 languages, from which we shall take only the English sentences.

Preprocessing the dataset (optional)

Before feeding the dataset to the model, it needs to be converted to pairs of input sentences and target sentences. The code for preprocessing can be found here as well as in the notebook.
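
As a rough illustration of what this conversion involves, here is a minimal sketch. It is not the preprocessing script linked above. It assumes each record carries a paraphrase-set identifier plus a sentence, and that every ordered pair of distinct sentences inside a set becomes one (input, target) row. The function name build_paraphrase_pairs and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_paraphrase_pairs(records):
    """Turn (set_id, sentence) records into (input_text, target_text) pairs.

    Assumption: sentences sharing a set_id are paraphrases of one another.
    """
    groups = defaultdict(list)
    for set_id, sentence in records:
        # Skip exact duplicates inside a set so no sentence is paired with itself.
        if sentence not in groups[set_id]:
            groups[set_id].append(sentence)

    pairs = []
    for sentences in groups.values():
        # Every ordered pair of distinct sentences becomes one training row.
        pairs.extend(permutations(sentences, 2))
    return pairs


# Hypothetical sample records, not taken from the real dataset.
sample_records = [
    (1, "I am hungry."),
    (1, "I'm starving."),
    (1, "I feel hungry."),
    (2, "Where do you live?"),
]
pairs = build_paraphrase_pairs(sample_records)
```

Sets with a single sentence, like set 2 in the sample, produce no pairs, since there is nothing to paraphrase them into.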

Downloading already preprocessed dataset

If you do not wish to preprocess the data, I’ve already done the task for you. You can directly download the preprocessed version of the dataset from here.

Loading the dataset

Once done, you can load the dataset as:

import pandas as pd
dataset_df = pd.read_csv("tapaco_paraphrases_dataset.csv", sep="\t")

Once loaded, the columns of the data need to be renamed. We also need to add a prefix to each sentence. The prefix is a short task label stored in its own column, with the same value in every row; it can be any text.

# Renaming the columns
dataset_df.columns = ["input_text","target_text"]
# Adding a prefix. Here we shall keep "paraphrase" as a prefix.
dataset_df["prefix"] = "paraphrase"
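
To make the role of the prefix concrete, the sketch below shows the string the model ends up seeing, assuming the prefix and the sentence are joined with a colon and a space (the same format used for prediction later in this post). The helper build_model_input is hypothetical and only illustrates the string format; it is not a function from simpletransformers.

```python
def build_model_input(prefix, input_text):
    """Hypothetical helper: join the task prefix and a sentence as 'prefix: sentence'."""
    return f"{prefix}: {input_text}"


example = build_model_input(
    "paraphrase", "The house will be cleaned by me every Saturday."
)
print(example)
```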

Splitting the dataset

We shall split the dataset into training and test sets in a 90%-10% ratio:

from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(dataset_df,test_size=0.1)

Training the model

The model's training behaviour is controlled by a dictionary of parameters, args, which is passed to the model when it is initialized and can be tweaked as needed.
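
For readers who want a starting point, here is a minimal sketch of what such a dictionary might look like. The keys are, to the best of my knowledge, standard simpletransformers model arguments, but every value below is an assumption to be tuned for your hardware and data, not the configuration behind the results in this post.

```python
# Illustrative training arguments; all values are assumptions, not the
# settings used for the results shown in this post.
args = {
    "overwrite_output_dir": True,       # reuse the outputs directory between runs
    "reprocess_input_data": True,       # rebuild cached features if the data changes
    "max_seq_length": 256,              # same value as used at prediction time
    "num_train_epochs": 4,
    "train_batch_size": 8,
    "learning_rate": 3e-4,
    "do_sample": True,                  # sample instead of greedy decoding
    "top_k": 50,
    "top_p": 0.95,
    "evaluate_during_training": False,
    "save_steps": -1,                   # skip intermediate step checkpoints
}
```

Smaller batch sizes or fewer epochs are reasonable if you are training on a free Colab GPU.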

Initializing the T5Model class object from simpletransformers:

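from simpletransformers.t5 import T5Model
import sklearn.metrics  # import the submodule explicitly so sklearn.metrics.accuracy_score resolves
# args is the dictionary of training parameters mentioned above
model = T5Model("t5", "t5-small", args=args)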

We shall go with the t5-small model for now. Let’s proceed with the training:

model.train_model(train_data, eval_data=test_data, use_cuda=True,acc=sklearn.metrics.accuracy_score)

Loading and predicting using the trained model

It might take a few hours for the model to train. Once the training is complete, you will find the final model in the outputs directory, from which it can be loaded as follows:

Loading the trained model

from simpletransformers.t5 import T5Model
import os

# The trained model is saved in the "outputs" directory of the working directory
root_dir = os.getcwd()
trained_model_path = os.path.join(root_dir, "outputs")

# Generation settings used when predicting
args = {
    "overwrite_output_dir": True,
    "max_seq_length": 256,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 5,
}
trained_model = T5Model("t5", trained_model_path, args=args)

Generating paraphrases using the trained model

Let’s see how the model performs with our custom input:

prefix = "paraphrase"
pred = trained_model.predict([f"{prefix}: The house will be cleaned by me every Saturday."])
print(pred)
# Output:
# [['My home will be cleaned on Saturdays.',
#   'I will clean the house every Saturday.',
#   'The house is going to be clean every Saturday.',
#   "I'll clean the house every Saturday.",
#   'I will clean the house every Saturday.']]

And it works!! Yay!
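
Note that the sampled output above contains the same sentence twice. If the paraphrases are meant for data augmentation, you may want to drop such repeats. The sketch below is one simple way to do that; the helper unique_paraphrases is hypothetical, and the pred list is copied from the output above so the snippet runs without the model.

```python
def unique_paraphrases(predictions):
    """Drop repeated sentences from each list of predictions, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so it acts as an ordered set.
    return [list(dict.fromkeys(candidates)) for candidates in predictions]


# Copied from the output shown above.
pred = [[
    "My home will be cleaned on Saturdays.",
    "I will clean the house every Saturday.",
    "The house is going to be clean every Saturday.",
    "I'll clean the house every Saturday.",
    "I will clean the house every Saturday.",
]]
unique = unique_paraphrases(pred)
```

This only removes exact duplicates; near-duplicates such as "I will" versus "I'll" are kept.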

That’s all for training the T5 model. I’ve open-sourced the pretrained models and preprocessed datasets for paraphrasing on my GitHub repo, if you wish to explore them.

Thank you for reading 😄


Written By

Het Pandya
