Deep Haiku: Teaching GPT-J to Compose with Syllable Patterns

How to generate rhythmic prose after fine-tuning a large transformer with phonemes

Mar 8, 2022

19 min read

Photo Illustration by Author, Source Image by Diana Polekhina on Unsplash

In this article, I’ll show you how I fine-tuned an AI system called GPT-J to create new Haikus, short poems following a form that originated in Japan. The key was to get my model, Deep Haiku, to see and understand the number of syllables in lines of poetry.

I usually don’t show a table of contents in my articles, but I think it’s warranted for this one because of the variety of techniques I used for the project. Feel free to skip to any of the sections below if you are interested in these topics:

Adding punctuation and capitalization to text with FastPunct;
Using GRUEN to assess the quality of generated text;
Extracting topics from text with KeyBERT;
Splitting text into syllables and phonemes with Phonemizer;
Performing multitask training on Transformers;
Fine-tuning GPT-J 8-bit on Google Colab (for free!);
Flagging obscene and threatening text with Detoxify;

If you would like to create a Haiku for a particular topic, you can use the Colab here. And be sure to check out the generated Haikus in the appendix.

Background

As we say in Boston, the GPT-3 language generation model from OpenAI is wicked smaht. You can ask anything, and it will come up with a reasonable and often insightful answer. And if you ask it to generate creative prose, like poetry, it does a surprisingly good job.

But GPT-3 and most other language models can’t seem to write prose using meter, the rhythmic structure often used in poetry. This is because of how words are represented in the models. Most language models represent text as tokens made of word parts, not letters or syllables. The language generation systems simply don’t know how many syllables there are in words, so they can’t pick up and replicate any rhythmic patterns.

For example, here is an interaction with OpenAI’s GPT-3, where I ask about syllables and prompt it to create a Haiku. Note that my prompts are bold, and the response text is from the GPT-3 davinci model using the default parameters.

GPT-3 Playground, Image by Author

OK, it clearly knows what syllables are and the syllable count commonly used for Haikus [5, 7, 5]. But when I asked it to write a Haiku about autumn, it came up with a lovely poem with a syllable count of [5, 6, 7].

Here are some generated Haikus for the four seasons written by GPT-3 and my new model, Deep Haiku. The meter is indicated in gray.

Haikus for the Seasons by GPT-3 and Deep Haiku, Table by Author

As you can see, both systems created a nice set of poems. I will leave it to you to judge the quality of the prose, but it’s clear that GPT-3 doesn’t know how to follow the standard meter. In contrast, Deep Haiku used a [5, 7, 5] meter in all four Haikus.

Prior Work

Apparently, I am not the only one looking to get a transformer to generate text with metered prose. For example, in his paper "Haiku Generation, A Transformer Based Approach, With Lots Of Control," Giacomo Miceli notes that the typical Haiku meter pattern is not strictly followed [1].

Modern and especially English language haikus do not follow the 5–7–5 pattern very strictly, but usually adhere to a short-long-short form of around 10/12 words in total. – Giacomo Miceli

Miceli’s Haikoo system creates some excellent prose, too, but only 1 of the 12 examples in the paper follows the [5, 7, 5] pattern. Note that I borrowed the first line to see where Deep Haiku would take it. Here they are.

Sample Haikus, by Haikoo and Deep Haiku, Table by Author

Miceli references another paper that directly addresses the meter of generated prose. It’s not for writing Haikus; it’s for writing limericks. In their paper, "There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It," Jianyou Wang et al. discuss their early experiments using OpenAI’s GPT-2 model [2].

A naïve implementation of GPT-2 simply cannot produce original and valid limericks. GPT-2 tends to generate long sentences that exceed the syllable limit for limericks. To meet a syllable constraint, we would need to truncate the generated sentences, which creates lines that do not end correctly. – Jianyou Wang, et al.

They go on to describe a custom transformer for generating limericks, LimGen, that chooses words with constraints on part-of-speech, syllable counts, and placement of stressed syllables.

For Deep Haiku, I built a system that generates Haikus from a user-specified prompt with adherence to a [5, 7, 5] meter by fine-tuning a general-purpose transformer.

Overview

Below is a diagram of the components and processes I used to train and run Deep Haiku. After a quick discussion of what each part does, I’ll get into things in more detail.

Deep Haiku Components, Diagram by Author

I started by downloading two datasets of Haikus from Kaggle.com from users bfbarry and hjhalani30. The datasets were released under the CC0 and CC-BY licenses, respectively. The number of Haikus in the combined datasets is over 150K. I used FastPunct to add punctuation and casing to the Haikus and ran the KeyBERT model [3] to extract phrases used as prompts.

I then filtered the data by using the GRUEN metric to gauge the quality of the text [4] and the Phonemizer library [5] to both count the syllables and convert the prompts and Haikus into phonemes. The filtering yielded over 26K relatively high-quality Haikus that all have a [5, 7, 5] meter.

I used the GPT-J 6B model from EleutherAI [6] as the basis for Deep Haiku. After quantizing the model down to 8 bits to run on Google Colab, I fine-tuned it using the filtered Haikus as training data for ten epochs, which took 11 hours on Google Colab with a Tesla V100 GPU.

Generating new Haikus starts with selecting a word or phrase as a prompt, like "autumn." I use the fine-tuned model to create 20 candidate Haikus. The results are filtered for adherence to the meter and optionally filtered to remove candidates that contain explicit language using the Detoxify library. Yes, Deep Haiku knows how to swear. The remaining candidates are then displayed along with the scores.

For example, I generated 20 Haikus with the prompt "autumn," and 11 used the [5, 7, 5] meter. Here are the filtered results with the scores.

Sample Output from Deep Haiku, Table by Author

OK, the top 3 look pretty good, albeit a tad corny. But none of them used swear words, so the toxicity was near zero for all. The only slight blip was the sixth one that mentioned the "end of days" and "the cusp of death." A little dark for a Haiku about autumn, but not exactly offensive.

Photo by Chris Lawton on Unsplash

System Details

Training Data

As with many of my projects, this one started with getting the training data. If you don’t have good training data, it is hard to get a good working AI model.

Fortunately, there are at least two suitable Haiku datasets available on Kaggle. The first one, by bfbarry, contains over 11K Haikus that he collected and cleaned. He released the dataset under the Creative Commons CC0 license. The second Haiku dataset is from hjhalani30 on Kaggle. This is a much bigger dataset aggregated from various places. There are over 140K Haikus in the collection. The vast majority, over 110K, come from Twitter posts tagged with the #twaiku hashtag. This dataset was released under the Creative Commons CC BY 4.0 license.

Adding Punctuation and Capitalization to Text with FastPunct

I noticed that the Haikus in the first dataset were all lowercase and without punctuation. I know that writing poems this way is a style choice, but I decided to add uppercase letters and punctuation because it helps with the text quality analysis in the next step. I used the FastPunct module for this.

Here’s some sample code in Python that adds punctuation and casing to one of the Haikus in the dataset.

from fastpunct import FastPunct

fastpunct = FastPunct()
print(fastpunct.punct(["""was it all a dream
 i most certainly hope not
 that was happiness"""]))
# Output:
# Was it all a dream?
# I most certainly hope not.
# That was happiness.

Notice how FastPunct capitalized the first word of each sentence, recognized the first line as a question, and punctuated it accordingly. Here are a couple of Haikus from the dataset to illustrate what FastPunct does.

Haikus from the Bfbarry dataset Before and After Using FastPunct, Table by Author

Notice how FastPunct did a little more work with these samples. In the last example, it added a comma after "hide" and put an apostrophe before the contracted "cause."

Some of the Haikus in the second dataset have casing and punctuation, but many don’t. So I stripped the punctuation and case and put it back in with FastPunct for consistency. Here’s what some examples look like from this dataset.

Haikus from the Hjhalani30 Dataset Before and After Using FastPunct, Table by Author
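The strip step that comes before FastPunct could look something like the sketch below. This is not the code from the project; the helper name `strip_case_and_punct` is hypothetical, and keeping apostrophes so that contractions survive is my own assumption.

```python
import string

# Assumption: keep apostrophes so contractions like "it's" survive;
# every other ASCII punctuation mark is dropped.
_PUNCT_TO_DROP = string.punctuation.replace("'", "")
_STRIP_TABLE = str.maketrans("", "", _PUNCT_TO_DROP)


def strip_case_and_punct(haiku):
    """Lowercase a Haiku and remove punctuation, one line at a time."""
    lines = [
        line.strip().translate(_STRIP_TABLE).lower()
        for line in haiku.splitlines()
    ]
    # Drop any lines that end up empty after stripping.
    return "\n".join(line for line in lines if line)


raw = "Was it all a dream?\nI most certainly hope not.\nThat was happiness."
normalized = strip_case_and_punct(raw)
print(normalized)
```

The normalized text can then be passed to FastPunct so that every Haiku gets its casing and punctuation from the same source.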

Combining the datasets gave me over 150K Haikus. I then filtered them down using the methods described in the following steps.

Using GRUEN to Assess the Quality of Generated Text

If you have been reading about Natural Language Processing (NLP) recently, you have probably heard of the BLEU metric and its variants. BLEU stands for Bilingual Evaluation Understudy and is an algorithm for evaluating the quality of text that a computer has translated from one natural language to another. Translation tasks come with before and after text, and BLEU tells you how closely the generated text matches a reference translation written by a human. You can read how the BLEU algorithm works in Renu Khandelwal’s write-up on TDS.

Although BLEU works well for assessing text quality based on the expected result, it doesn’t help determine the quality of creative writing. So I used an automated system called GRUEN to assess the quality of the prose used in the Haiku dataset.

The GRUEN system by Wanzheng Zhu and Suma Bhat [4] tries to automatically assess text using three of the qualities spelled out by Hoa Trang Dang at the 2006 Document Understanding Conference (DUC) [9].

Q1: Grammaticality – The text should have no capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Q2: Non-redundancy – There should be no unnecessary repetition.

Q3: Focus – The text should have a focus; sentences should only contain information that is related to the whole of the text.

Here’s some sample code that shows how to use GRUEN.

import GRUEN.Main as gruen
doc = ["Dendelion blooms. In dapples of sunshine. The first brushstrokes of spring."]
print(gruen.get_gruen(doc)[0])

# Output: 0.72511

The result from the GRUEN metric is a single number from 0.0 to 1.0, which is an aggregate of the three text qualities, where bigger is better. And here are the GRUEN scores for the eight Haikus above.

Haikus from the Datasets with GRUEN Quality Scores, Table by Author

You can see how the system rates Haikus with three standalone sentences or clauses higher than ones that consist of a single sentence split arbitrarily into three parts. I filtered the dataset to include only Haikus with a score of 0.5 or greater, yielding 45K samples.
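As a minimal sketch of that threshold filter, assuming the GRUEN scores have already been computed and paired with their Haikus: the function name `filter_by_quality` is hypothetical, the first score is the one printed in the GRUEN sample, and the second entry is a made-up value for illustration only.

```python
MIN_GRUEN = 0.5  # threshold stated in the article


def filter_by_quality(scored_haikus, min_score=MIN_GRUEN):
    """Keep the Haikus whose GRUEN score is at or above the threshold."""
    return [text for text, score in scored_haikus if score >= min_score]


scored = [
    # Score taken from the GRUEN sample output.
    ("Dendelion blooms. In dapples of sunshine. The first brushstrokes of spring.", 0.72511),
    # Hypothetical text and score, for illustration only.
    ("A made up haiku that scores poorly", 0.31),
]

kept = filter_by_quality(scored)
print(len(kept), "of", len(scored), "kept")
```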

Extracting Topics from Text with KeyBERT

Because I wanted to fine-tune the model to create a new Haiku given a subject, I needed to extract a keyword or phrase from the Haikus in the training set to condition the system. For that, I turned to the KeyBERT system, which is already trained to extract keywords from text. Here is the source code I used.

from keybert import KeyBERT
kw_model = KeyBERT()
doc = """An old silent pond.
 A frog jumps into the pond.
 Splash! Silence again."""

keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
print(keywords[0][0])

# Output: silent pond

The KeyBERT system is "extractive," meaning it will always choose a word or phrase that is contained in the source text. Here are the topics that the system extracted from a selection of sample Haikus.

Haikus from the Datasets with topics from KeyBERT, Table by Author

As you can see, it found an important word or phrase from each Haiku. I used this extracted text as topics to condition GPT-J to write Haikus based on prompts during training.

Splitting Text into Syllables and Phonemes with Phonemizer

As I mentioned above, Transformer models use word parts for text encoding and therefore cannot "see" phonemes. This makes it nearly impossible for the model to find and replicate syllable patterns in text. To solve this problem, I translated the sample Haikus into phonemes with syllable breaks to teach the model how to see the meter in text.

After experimenting with several techniques for converting text from English to phonemes, I found that the Phonemizer project works best.

Phonemizer can use the following backends: eSpeak, eSpeak-MBROLA, Festival, and Segments [5]. Of these, I found that Festival works the best for my task at hand. Festival is a text-to-speech engine, and its Phonemizer backend only supports American English. I didn’t use it to create sound files from text, as it was designed to do. I did, however, use it to convert text to phonemes with tokenization at the syllable level. Here is some sample code that shows how to use the package.

from phonemizer import phonemize
from phonemizer.separator import Separator

doc = """Awaken before dawn.
 I hear the city rising.
 The new day begins."""

phn = phonemize(doc, language='en-us', backend='festival',
 with_stress=False, separator=Separator(phone=None,
 word=' ', syllable="|"), strip=True)
print(phn)

# Output:
# ax|wey|kaxn biy|faor daon
# ay hhihr dhax sih|tiy ray|zaxng
# dhax nuw dey bax|gihnz

It’s a little hard to read the phoneme output from Festival, but you can eventually get the hang of it. Note that I separate syllables using a | character, as that will help GPT-J figure out the syllable counts.

For example, the word awaken is written as ax|wey|kaxn in the Festival phonetic notation with syllable markers, as compared to /əˈweɪkən/ in the standard International Phonetic Alphabet.

Here are some sample Haikus in clear text and Festival phonetic notation.

Sample Haikus from the Datasets with Phonemes by Phonemizer, Table by Author

After getting the dataset in text and phoneme form, I used the syllable counts to keep only the Haikus with a [5, 7, 5] meter, which winnowed the training dataset down to 26K samples. I then used the text and phoneme versions of these Haikus to fine-tune GPT-J with multitask learning.
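Counting syllables from the Festival output comes down to counting the | separators. Here is a rough sketch of a meter check under that assumption; the function names are hypothetical and this is not the project's actual filtering code. The two inputs are phoneme strings that appear in this article.

```python
def syllables_per_line(phonemes, syllable_sep="|"):
    """Count syllables on each line of a phonemized Haiku.

    Each word contributes one syllable plus one per separator.
    """
    counts = []
    for line in phonemes.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        counts.append(sum(word.count(syllable_sep) + 1 for word in words))
    return counts


def has_haiku_meter(phonemes, meter=(5, 7, 5)):
    """Return True when the per-line syllable counts match the meter."""
    return tuple(syllables_per_line(phonemes)) == tuple(meter)


encouragement = (
    "niyd axn|ker|axjh|maxnt\n"
    "mey|kaxng may|sehlf paa|zax|tihv\n"
    "ay waant hhae|piy|naxs"
)
dawn = (
    "ax|wey|kaxn biy|faor daon\n"
    "ay hhihr dhax sih|tiy ray|zaxng\n"
    "dhax nuw dey bax|gihnz"
)

print(syllables_per_line(encouragement), has_haiku_meter(encouragement))
print(syllables_per_line(dawn), has_haiku_meter(dawn))
```

Counted this way, the "Awaken before dawn" sample from the Phonemizer code has six syllables in its first line, so a strict [5, 7, 5] filter would drop it.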

Performing Multitask Training of Transformers

Multitask Learning is an approach to training machine learning models developed by Rich Caruana at Carnegie Mellon. In his paper "Multitask Learning," he states that the technique improves generalization by using the domain information contained in the training data of related tasks. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better [7].

For Deep Haiku, I fine-tuned GPT-J by teaching it to perform the following four tasks:

Generate a Haiku for a given topic using text;
Generate a Haiku for a given topic using phonemes;
Translate a Haiku from text to phonemes;
Translate a Haiku from phonemes to text;

Note that I used parentheses to encapsulate the text for the first task, angle brackets for the second task, square brackets for the third task, and curly braces for the fourth task. I did this as a hint for the text generator to know which task is which. And I used the equals sign to separate the inputs and outputs for all tasks. For example, here are four lines of training data for one of the sample Haikus.

(encouragement = Need encouragement. / Making myself positive. / I want happiness.)

<axn|ker|axjh|maxnt = niyd axn|ker|axjh|maxnt / mey|kaxng may|sehlf paa|zax|tihv / ay waant hhae|piy|naxs>

[need encouragement / making myself positive / i want happiness = niyd axn|ker|axjh|maxnt / mey|kaxng may|sehlf paa|zax|tihv / ay waant hhae|piy|naxs]

{niyd axn|ker|axjh|maxnt / mey|kaxng may|sehlf paa|zax|tihv / ay waant hhae|piy|naxs = need encouragement / making myself positive / i want happiness}

Note that the topic for generating Haikus in clear text is specified as text, and the topic for generating Haikus in phonemes is specified as phonemes.
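To make the format concrete, here is a sketch of how the four training lines for one Haiku could be assembled. The function `build_training_lines` and the `plain` helper are hypothetical names of mine, and stripping punctuation and case for the two translation tasks is inferred from the sample lines above rather than taken from the project's code.

```python
import string

# Assumption: translation tasks use lowercase text without punctuation,
# as in the sample lines; apostrophes are kept for contractions.
_DROP = str.maketrans("", "", string.punctuation.replace("'", ""))


def plain(line):
    """Lowercase a line and remove its punctuation."""
    return line.translate(_DROP).lower().strip()


def build_training_lines(topic, topic_phonemes, text_lines, phoneme_lines):
    """Return one training line per task, wrapped in its task delimiter."""
    text = " / ".join(text_lines)
    phon = " / ".join(phoneme_lines)
    bare = " / ".join(plain(line) for line in text_lines)
    return [
        f"({topic} = {text})",           # task 1: topic to Haiku, as text
        f"<{topic_phonemes} = {phon}>",  # task 2: topic to Haiku, as phonemes
        f"[{bare} = {phon}]",            # task 3: text to phonemes
        f"{{{phon} = {bare}}}",          # task 4: phonemes to text
    ]


training_lines = build_training_lines(
    "encouragement",
    "axn|ker|axjh|maxnt",
    ["Need encouragement.", "Making myself positive.", "I want happiness."],
    ["niyd axn|ker|axjh|maxnt", "mey|kaxng may|sehlf paa|zax|tihv",
     "ay waant hhae|piy|naxs"],
)
for training_line in training_lines:
    print(training_line)
```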

I hoped that training the system to learn the four tasks concurrently would help it write interesting and coherent Haikus in a [5, 7, 5] meter.

Fine-tuning GPT-J 8-bit on Google Colab (For Free!)

To have the system understand, learn, and perform all four tasks, I used GPT-J, a Transformer model with 6 billion parameters. According to eleuther.ai, the GPT-J 6B model is the size of OpenAI’s curie model, their second-largest GPT-3 model. Their biggest model, davinci, has a whopping 175B parameters.

Currently, Google Colab only offers GPUs with 16 GB of memory, and the 32-bit version of GPT-J 6B will run out of memory. To fine-tune it on Google Colab, I used an 8-bit version of the model. A detailed explanation of how it works can be found in this model card.
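A back-of-the-envelope calculation shows why the 32-bit weights don't fit. This sketch rounds the parameter count to 6 billion and counts weights only; activations, gradients, optimizer state, and any quantization overhead are ignored, so treat the numbers as rough lower bounds.

```python
N_PARAMS = 6_000_000_000  # approximate parameter count of GPT-J 6B
GPU_MEMORY_GB = 16        # GPU memory quoted in the article
BYTES_PER_GB = 10 ** 9


def weights_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes."""
    return n_params * bytes_per_param / BYTES_PER_GB


fp32_gb = weights_gb(N_PARAMS, 4)  # 32-bit floats use 4 bytes each
int8_gb = weights_gb(N_PARAMS, 1)  # 8-bit values use 1 byte each

print(f"32-bit weights: {fp32_gb:.0f} GB, fits: {fp32_gb < GPU_MEMORY_GB}")
print(f" 8-bit weights: {int8_gb:.0f} GB, fits: {int8_gb < GPU_MEMORY_GB}")
```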

My Colab for training the system is based on the work here. It uses the Low-Rank Adaptation (LoRA) technique by Edward Hu et al. at Microsoft [8].

As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. – Edward Hu, et al.
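The idea in the quote can be illustrated with a toy example in plain Python. This is only a sketch of the general technique, not the training code used for Deep Haiku: the matrix sizes, the rank of 8, and the helper names are all assumptions chosen for illustration. The frozen weight W is left untouched, and the trainable update is the product of two small matrices, B and A.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute h = W x + scale * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


random.seed(0)
d_out, d_in, rank = 4, 3, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero, the adapted layer initially matches the frozen layer.
print(lora_forward(W, A, B, x) == matvec(W, x))

# Hypothetical 4096 x 4096 weight matrix with an assumed rank of 8.
full_params = 4096 * 4096
lora_params = trainable_params(4096, 4096, 8)
print(full_params, lora_params, full_params // lora_params)
```

For that hypothetical matrix, the low-rank pair has 256 times fewer trainable values than the full matrix, which is the kind of saving that makes fine-tuning feasible on a single GPU.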

I trained the system for 11 hours on Google Colab. Here are the parameters I used for training, based on Nikita Schneider’s article.

num_train_epochs = 10
batch_size = 2
warmup_steps = 100
weight_decay = 0.01

And here’s the output of the trained model for the topic, rain:

Walking in the rain / The pavement takes on a shine / Of old memories

Deep Haiku

So it seemed to have worked! The fine-tuned model can generate Haikus that will follow the [5, 7, 5] pattern more often than not, which is much better than models without multitask learning with phonemes. And it seems to have the ability to tap into the contexts from the initial training. For example, here are three generated Haikus on the subject of AI and ML:

AI and ML / Will be able to predict / What you are thinking

AI and ML / Cannot develop a soul / They are just numbers

AI and ML / Are just as racist as us / We are the problem

Note that there were not any Haikus on the subject of AI and ML in the training dataset. And for style purposes, I removed any trailing periods.

Flagging Obscene and Threatening Text with Detoxify

As I mentioned earlier, Deep Haiku knows how to use obscenities, as I didn’t filter the training data to remove explicit content. And even if I did, it probably would still occasionally use profanities because it was initially trained on a large, unfiltered corpus of text.

To flag or filter explicit text, I used the Detoxify module to check the output of Deep Haiku. Detoxify looks for the following types of speech: Toxic, Severe Toxicity, Obscene, Threat, Insult, Identity Attack [9].

Here is how the system rates toxic comments found on the Talk pages on Wikipedia.

Toxic Comments on Wikipedia Talk Pages, Source: Kaggle, Table by Author

All of these comments scored high on toxicity for various reasons. For my purposes, I filter out any toxic Haikus, which keeps them safe for family viewing.
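A filter along these lines could be sketched as follows. Detoxify's predict call returns a dictionary that maps each label to a score between 0 and 1, so the check below works on such a dictionary. The function name `is_family_safe`, the 0.5 threshold, and the score values are all assumptions for illustration; the article doesn't state the cutoff that was used.

```python
def is_family_safe(scores, threshold=0.5):
    """Return True when every toxicity score is below the threshold.

    `scores` is a label-to-score dictionary, as returned by Detoxify.
    The 0.5 threshold is an assumption, not the article's value.
    """
    return all(value < threshold for value in scores.values())


# With the detoxify package installed (the model is downloaded on first
# use), real scores could be obtained with something like:
#   from detoxify import Detoxify
#   scores = Detoxify("original").predict(haiku_text)

# Made-up scores for illustration only.
clean_scores = {
    "toxicity": 0.001, "severe_toxicity": 0.0001, "obscene": 0.0002,
    "threat": 0.0001, "insult": 0.0002, "identity_attack": 0.0001,
}
rude_scores = {
    "toxicity": 0.97, "severe_toxicity": 0.12, "obscene": 0.91,
    "threat": 0.002, "insult": 0.45, "identity_attack": 0.004,
}

print(is_family_safe(clean_scores))
print(is_family_safe(rude_scores))
```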

Here are some more Haikus on the topic of rain, with the quality and toxicity scores.

Sample Output from Deep Haiku, Table by Author

Be sure to check out more generated Haikus in the appendix or create your own here.

Discussion

As shown above, you can teach large Transformers about counting syllables using multitask learning. Note that it is not 100% perfect; many generated Haikus do not follow the [5, 7, 5] pattern. But it did dramatically increase the probability of following it. And it seemed to have retained its original language training on many different subjects.

For future work, there may be a way to use a custom tokenizer for Transformer models that is based on phonemes and syllables, not word parts. The system would inherently be able to see and replicate the meter of prose. Doing so would eliminate the need for multitask learning. The tricky part will be retaining the knowledge from the initial training that was performed on word parts.

Source Code and Colabs

All source code for this project is available on GitHub. I released the source code under the CC BY-SA license.

Creative Commons Attribution Sharealike

Acknowledgments

I want to thank Jennifer Lim and Oliver Strimpel for their help with this project.

References

[1] Haikoo, G. Miceli, Haiku Generation, A Transformer Based Approach, With Lots Of Control (2021)

[2] LimGen, J. Wang, et al., There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It (2021)

[3] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT (2020)

[4] W. Zhu and S. Bhat, GRUEN for Evaluating Linguistic Quality of Generated Text (2020)

[5] M. Bernard, Phonemizer: Text to Phones Transcription for Multiple Languages in Python (2016)

[6] GPT-J, Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX (2021)

[7] R. Caruana, Multitask learning (1997)

[8] E. Hu, et al., LoRA: Low-rank Adaptation of Large Language Models (2021)

[9] L. Hanu and the Unitary Team, Detoxify (2020)

Appendix

Here are examples of the output of Deep Haiku for the following topics. These are the ones I deemed the best from batches of 20.

COVID

Coronavirus / Lockdown with face coverings / Dying to see life

Haircut

Just had a haircut / My hair is no longer thick / It’s now more refined

Laughter

I’m still trying to / Find the right combination / Of laughter and tears

Morning

The morning brings us / A new day to start over / Let’s see how I feel

Music

There’s a reason why / The ghosts never make it to / The music section

Python

A good knowledge of / The Python language can be / A big advantage

Written By

Robert A. Gonsalves

URL: https://towardsdatascience.com/deep-haiku-teaching-gpt-j-to-compose-with-syllable-patterns-5234bca9701/