GreenLIT: Using GPT-J with Multi-Task Learning to Create New Screenplays
How to fine-tune an ML model to create TV shows and movies with new titles, plot summaries, and scripts
I showed in a previous post how I fine-tuned GPT-J to generate Haikus, and the results were pretty good. For my latest experiment, GreenLIT, I wanted to push the limits of using GPT-J for creative writing to see if it could produce scripts for brand new TV shows and movies.
Here is a block diagram of the components and processes for GreenLIT. I’ll discuss these in detail in the sections below.
To start, I searched around the web and found five open-source datasets with text that I used as the training data:
- Movie Dialogs from the ConvoKit project from Cornell (MIT license) [1]
- Movie Plot Summaries from Rounak Banik on Kaggle (CC0 license) [2]
- Friends Dialogs from the ConvoKit project (MIT license) [1]
- Friends Plot Summaries from Wikipedia (CC-BY-SA license) [3]
- Streaming Show Analysis from Shivam Bansal on Kaggle (CC0 license) [4]
The first and third datasets contain dialog from scripts from movies and the popular sitcom "Friends." The other datasets include titles, genres, and plot summaries of TV shows and movies. I wrote some Python code to preprocess and filter the text to create a set of entries for the four tasks for GPT-J to learn:
- Genre and Theme → Title and Plot
- Title and Plot → Dialog
- Title and Plot → Script
- Dialog → Script
Note that dialog means just the character names and lines, whereas the script is the dialog plus stage directions. I’ll explain all this in detail further below.
I started with the GPT-J [5] model, an open-source variant of GPT-3 [6]. GPT-J was previously trained on an extensive collection of text data from the web called The Pile [7].
I fine-tuned GPT-J with over 38,000 training entries for 20 hours on Google Colab to create my GreenLIT model. After I fine-tuned it, I used the following text prompt as input.
GENRE: crime drama
THEME: cryptocurrency
From this prompt, the GreenLIT model generated the following titles and plot summaries:
#1. TITLE: Bitcoin Heist
SUMMARY: Two young men embark on a heist that could be the start of a new revolution.
#2. TITLE: Satoshi's Daughter
SUMMARY: A crime drama about a woman who inherits a cryptocurrency fortune.
#3. TITLE: The Dark Web
SUMMARY: The story of two brothers who join the underground Bitcoin economy.
#4. TITLE: Cryptostorm
SUMMARY: Cryptostorm revolves around the story of a young man who has been trying to keep his life on track after being sent to prison for a crime he didn't commit.
#5. TITLE: The Cypher
SUMMARY: A young hacker joins a group of underground hackers to create a secure platform.
I chose #3, The Dark Web, from this list and sent the title and summary back into GreenLIT to create the script’s first page. Here it is.
TITLE: The Dark Web
SUMMARY: The story of two brothers who join the underground Bitcoin economy.
SCRIPT:
[Scene: The living room of the house, the door opens and the brothers come in. They are both dressed in black, have their hoods up.]
LEO - Okay, we did it! We hit the mother lode!
DREW - We hit the mother lode?
LEO - Yeah!
DREW - How much is in here?
LEO - Six hundred and forty dollars worth.
DREW - (To himself) Well, that's better than nothing.
LEO - The bitcoins are untraceable.
DREW - The bitcoins?
LEO - Yeah, the Bitcoin is basically just a name for a cryptographic currency.
DREW - A cryptographic currency?
LEO - Yeah, like, a secure, internet way of transferring money from one person to another.
DREW - Sounds like they invented the internet.
LEO - Well, the internet is just a network of computers connected to other computers by phone lines, but the internet is also the thing that makes it possible to transfer money from one computer to another without a bank.
DREW - Oh, yeah?
LEO - Yeah.
DREW - How do you do that?
LEO - Well, it's actually really complicated, and I don't understand it myself, but the short version is that the Bitcoin is run by a bunch of computers all connected to each other. It's kind of like a decentralized Ponzi scheme.
DREW - A decentralized Ponzi scheme?
LEO - Yeah, you know, like the government runs a Ponzi scheme.
OK, interesting. The script seems to explain the basics of cryptocurrency humorously and develop the characters a bit. And with a pinch of social commentary. Hold on a sec while I call my agent. 😉
You can check out GreenLIT here for free on Google Colab. And be sure to check out more sample output in the appendix below.
System Details
In the following sections, I will dive into the details of the components and processes used in GreenLIT. I’ll start with a discussion on a technique for training neural networks known as Multitask Learning.
Multitask Learning
I had two main goals for the GreenLIT project: (A) create titles and plot summaries of new shows given a genre and theme, and (B) create the first page of a script given a title and plot summary. Although fine-tuning two specialized AI models would work, I wanted to see if one fine-tuned model could accomplish both tasks. Doing so would have a couple of advantages. As I discovered in my Deep Haiku project, fine-tuning one model for multiple but similar tasks, known as Multitask Learning, can improve the results of both tasks. Rich Caruana studied this technique [8] at Carnegie Mellon.
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. – Rich Caruana
To explain how Multitask Learning works, Alexandr Honchar describes a concept called "feature selection double-check" in his write-up here. He says, "If one feature is important for more than just one task, then most probably this feature is indeed very important and representative for your data" and will be reinforced by the system during Multitask Learning.
The other advantage is practical efficiency – only one AI model needs to be loaded to perform both tasks. Using one model reduces disk storage, load times, and GPU memory.
Next, I’ll discuss how I gathered the training data for the project.
Gathering Training Data
To fine-tune the system for the first task, generating titles and plot summaries of new shows, I looked for open-source datasets with metadata for movies and TV shows.
Gathering Movie Plots
On Kaggle, which hosts numerous datasets, I found a large list of movie plot summaries by Rounak Banik called The Movies Dataset [2]. It contains titles, release years, genres, summaries, etc., for over 40K movies. He released the dataset under the CC0 (Public Domain) license. Here is a sample of 5 entries.
I used a module called KeyBERT [9] to extract the themes from the summaries. You can see my Python code here.
I found another collection of datasets on Kaggle from Shivam Bansal. He collected summaries of about 20K streaming shows on Netflix, Amazon, Hulu, and Disney+. Here is a sample of the data.
I used KeyBERT again to capture themes from the summaries of the streaming shows.
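As a rough illustration of the theme-extraction step, here is a minimal sketch that asks KeyBERT for the single best keyphrase in a summary. It is not the project's actual code: the helper names `load_keybert` and `extract_theme`, the three-word phrase limit, and the choice to keep only the top phrase are all assumptions made for this example.

```python
def load_keybert():
    """Create a KeyBERT model (assumes `pip install keybert` has been run)."""
    # Imported lazily so extract_theme() below has no hard dependency.
    from keybert import KeyBERT
    return KeyBERT()


def extract_theme(summary, kw_model, max_words=3):
    """Return the top-ranked keyphrase of a summary, or '' if none is found.

    `kw_model` is any object with a KeyBERT-style extract_keywords() method.
    """
    keywords = kw_model.extract_keywords(
        summary,
        keyphrase_ngram_range=(1, max_words),  # phrases of 1 to max_words words
        stop_words="english",
        top_n=1,
    )
    # extract_keywords returns a list of (phrase, score) pairs, best first.
    return keywords[0][0] if keywords else ""
```

Calling `extract_theme(summary, load_keybert())` on a plot summary would then yield a short phrase to use on the THEME line. The phrase length is a judgment call: longer phrases are more specific, but they can also read less like a theme.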
To teach GPT-J how to create titles and summaries from genres and themes, I assembled an entry like this for each movie and TV show.
GENRE: action science fiction
THEME: saving the world
TITLE: The Matrix
SUMMARY: Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth.
GENRE: comedy sitcom
THEME: workplace comedy
TITLE: 30 Rock
SUMMARY: The life of the head writer at a late-night television variety show. From the creator and stars of SNL comes this workplace comedy. A brash network executive bullies head writer Liz Lemon into hiring an unstable movie star.
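The entries above follow a simple labeled-line layout, which could be produced with something like the sketch below. The function names `format_entry` and `format_prompt` are hypothetical, and the exact line breaks are an assumption based on how the samples are displayed here.

```python
def format_entry(genre, theme, title, summary):
    """Build one training entry for the Genre and Theme -> Title and Plot task."""
    return (
        f"GENRE: {genre}\n"
        f"THEME: {theme}\n"
        f"TITLE: {title}\n"
        f"SUMMARY: {summary}\n"
    )


def format_prompt(genre, theme):
    """Build the matching inference prompt: the entry without title and summary."""
    return f"GENRE: {genre}\nTHEME: {theme}\n"
```

Because the prompt is a prefix of the training entry, the fine-tuned model sees the same pattern at inference time and can continue it with a new TITLE and SUMMARY.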
Gathering Movie and TV Scripts
Next up, I searched for datasets of scripts. To quote Leo from The Dark Web, "I hit the mother lode" when I found the ConvoKit from Cornell. The formal name of the collection of datasets is the Cornell Conversational Analysis Toolkit [1], released under the MIT open-source license.
[ConvoKit] contains tools to extract conversational features and analyze social phenomena in conversations, using a single unified interface inspired by (and compatible with) scikit-learn. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets. – Jonathan P. Chang et al.
I used dialog from two of the datasets in ConvoKit to fine-tune GreenLIT. Here are the descriptions of the datasets from their site.
- Cornell Movie-Dialogs Corpus – A large metadata-rich collection of fictional conversations extracted from raw movie scripts. (220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies).
- Friends Corpus – A collection of all the conversations that occurred over ten seasons of Friends, a popular American TV sitcom that ran in the 1990s.
Here’s a snippet of dialog from Hotel Rwanda from the Cornell Movie-Dialogs Corpus.
PAUL - What's wrong?
ZOZO - Beg your pardon sir, you are Hutu. You are safe there.
PAUL - You are with me, Zozo, don't worry.
ZOZO - What is it like to fly on a plane, sir?
PAUL - It depends where you sit Zozo. In coach it is like the bus to Giterama.
ZOZO - That is why they call it coach?
PAUL - Maybe. But in business class there are fine wines, linens, Belgian chocolates.
ZOZO - You have taken business class?
PAUL - Many times.
PAUL - I will try my best George but these days I have no time for rallies or politics.
GEORGE - Politics is power, Paul. And money.
And here is a snippet of a script from the Friends Corpus, set in the gang’s favorite coffee shop, Central Perk.
SCRIPT:
[Scene, Central Perk]
MONICA - There's nothing to tell! He's just some guy I work with!
JOEY - C'mon, you're going out with the guy! There's gotta be something wrong with him!
CHANDLER - All right Joey, be nice. So does he have a hump? A hump and a hairpiece?
PHOEBE - Wait, does he eat chalk?
(They all stare, bemused.)
PHOEBE - Just, 'cause, I don't want her to go through what I went through with Carl- oh!
MONICA - Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
CHANDLER - Sounds like a date to me.
Adding Stage Directions
Note that, unlike the script from Friends, the script for Hotel Rwanda does not have any stage directions. It just has the dialog.
To teach the GreenLIT model how to add stage directions, I created a set of scripts from Friends with just the dialog, shown below, followed by the script. These training entries are composed as: "DIALOG:" + lines + "SCRIPT:" + lines with stage directions.
DIALOG:
MONICA - There's nothing to tell! He's just some guy I work with!
JOEY - C'mon, you're going out with the guy! There's gotta be something wrong with him!
CHANDLER - All right Joey, be nice. So does he have a hump? A hump and a hairpiece?
PHOEBE - Wait, does he eat chalk?
PHOEBE - Just, 'cause, I don't want her to go through what I went through with Carl- oh!
MONICA - Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
CHANDLER - Sounds like a date to me.
After the fine-tuning, if I end the prompt with "… DIALOG:", it will create just dialog. But if I end the prompt with "… SCRIPT:", it will know to generate dialog with stage directions. This is Multitask Learning in action!
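A Dialog → Script training entry of this kind could be assembled roughly as follows. This is a sketch under assumptions: it treats any line that starts with "[" or "(" as a stage direction, which matches the Friends snippet shown here but may not cover every script, and the names `is_stage_direction` and `make_dialog_to_script_entry` are made up for this example.

```python
def is_stage_direction(line):
    """Treat whole lines in [brackets] or (parentheses) as stage directions."""
    stripped = line.strip()
    return stripped.startswith("[") or stripped.startswith("(")


def make_dialog_to_script_entry(script_lines):
    """Compose "DIALOG:" + dialog-only lines + "SCRIPT:" + the full script."""
    dialog_lines = [
        line for line in script_lines
        if line.strip() and not is_stage_direction(line)
    ]
    return (
        "DIALOG:\n" + "\n".join(dialog_lines) + "\n"
        "SCRIPT:\n" + "\n".join(script_lines) + "\n"
    )
```

Note that this sketch leaves inline directions inside a character's line untouched; removing those as well would need an extra cleanup step.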
Next, I will discuss how I solved a problem with repeating character names in the generated scripts.
Diversifying Character Names
After some initial experiments, I noticed that the inclusion of the Friends scripts in the training dataset caused the model to use the names of the six central characters often. For example, the system would create period pieces set in the 18th century with characters named Joey, Phoebe, and Chandler.
To diversify the character names, I swapped out the character names for each of the 236 Friends episodes. I used a list of first names collected by the US Social Security Administration.
For example, the script shown above used these character names for the training data:
- Ross → Lucas
- Chandler → Antonio
- Joey → Eddie
- Rachel → Charlotte
- Phoebe → Stella
- Monica → Luciana
I also changed any references to "Central Perk" to "Coffee Shop" to help remove the "Friends-ness" of the scripts. Here’s the modified script:
SCRIPT:
[Scene, Coffee Shop]
LUCIANA - There's nothing to tell! He's just some guy I work with!
EDDIE - C'mon, you're going out with the guy! There's gotta be something wrong with him!
ANTONIO - All right Eddie, be nice. So does he have a hump? A hump and a hairpiece?
STELLA - Wait, does he eat chalk?
(They all stare, bemused.)
STELLA - Just, 'cause, I don't want her to go through what I went through with Carl- oh!
LUCIANA - Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
ANTONIO - Sounds like a date to me.
It’s interesting to see how simply changing the character names makes it seem like a different show.
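The name swap described above can be sketched with a single regular-expression pass. This is an illustrative version only: `assign_names` and `swap_names` are hypothetical helpers, the random draw ignores details such as matching a name's gender, and the name pool is assumed to be at least as long as the character list.

```python
import random
import re


def assign_names(characters, name_pool, seed):
    """Pick a distinct replacement name per character (e.g. one seed per episode)."""
    rng = random.Random(seed)
    return dict(zip(characters, rng.sample(name_pool, len(characters))))


def swap_names(text, name_map):
    """Replace every whole-word occurrence of the mapped names in one pass."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in name_map) + r")\b",
        re.IGNORECASE,
    )
    lookup = {old.lower(): new for old, new in name_map.items()}

    def replace(match):
        original = match.group(0)
        new = lookup[original.lower()]
        # Keep all-caps speaker labels in caps (MONICA becomes LUCIANA).
        return new.upper() if original.isupper() else new

    return pattern.sub(replace, text)
```

Doing all replacements in one pass avoids chained swaps, where a name introduced by one replacement is accidentally rewritten by the next. The same map can carry the "Central Perk" to "Coffee Shop" substitution.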
Gathering Plot Summaries for Friends
Since the ConvoKit dataset does not contain any plot summaries, I scraped the summaries for all of the Friends episodes from Wikipedia.
The source code for getting the summaries is here. I used KeyBERT once again to get keywords for the episodes’ themes.
Here is a summary of the data I collected for training GPT-J.
Once I had all the training data in place, I started fine-tuning GPT-J to create new shows and scripts.
Fine-Tuning GPT-J
Similar to what I did in my Deep Haiku project, I fine-tuned GPT-J to learn and run all four tasks needed for GreenLIT:
- Genre and Theme → Title and Plot
- Title and Plot → Dialog
- Title and Plot → Script
- Dialog → Script
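One simple way to train a single model on all four tasks is to pool the entries for every task into one list and shuffle it, so each training batch mixes tasks. The sketch below assumes that approach; the function name `build_training_set`, the task keys, and the fixed seed are illustrative choices, not details taken from the project.

```python
import random


def build_training_set(task_entries, seed=0):
    """Pool the entries of all tasks and shuffle them into one training list.

    `task_entries` maps a task name to its list of formatted text entries.
    """
    combined = [entry for entries in task_entries.values() for entry in entries]
    random.Random(seed).shuffle(combined)  # fixed seed keeps runs reproducible
    return combined
```

For example, a dictionary with keys such as "genre_theme_to_title_plot" and "dialog_to_script" would yield one interleaved list of entries for fine-tuning.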
The GPT-J model from EleutherAI [5] is roughly the size of OpenAI’s curie model, the second-largest GPT-3 model [6]. EleutherAI trained the model on The Pile, a vast (800GB) dataset of diverse text used for language modeling [7].
Google Colab GPUs have only 16 GB of memory, and GPT-J will run out of memory if used as-is. To run it on a Colab, I converted GPT-J 6B from 32-bit parameters to 8-bit using the Low-Rank Adaptation technique by Edward Hu et al. at Microsoft [10]. You can read the details on Eleuther AI’s model card here.
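A quick back-of-the-envelope calculation shows why the conversion matters. The sketch below counts only the memory needed to hold the weights, assumes a round figure of 6 billion parameters, and uses 1 GB = 10^9 bytes. Activations, optimizer state, and adapter weights would add to these numbers.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to store the model weights, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


GPTJ_PARAMS = 6_000_000_000  # assumed round figure for GPT-J 6B

print(weight_memory_gb(GPTJ_PARAMS, 32))  # 24.0, above a 16 GB GPU
print(weight_memory_gb(GPTJ_PARAMS, 8))   # 6.0, fits with room to spare
```

At 32 bits per parameter the weights alone exceed the 16 GB limit, while at 8 bits they take about a quarter of that.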
Running the System
After fine-tuning the model, I used it to create new titles and summaries along with scripts for the shows.
One of the first things I noticed, however, is that the system will often generate titles for shows that have been made before. This is not too surprising because I fine-tuned it with data from existing shows. To compensate for this, I created a little Python function to check if any newly created titles had been used before, using Wikipedia as a reference.
For example, I ran the system with GENRE: crime comedy and THEME: buddy bank heist. Four of seven generated titles are from existing shows. I marked these "duplicates" with a string of asterisks in the results below.
01 The Bank Heist Gang - A group of friends decides to do a bank heist before things go haywire.
02 *** The Bank Job - A group of ex-cons and petty criminals rob a bank in a small town. While carrying out the heist, they become involved in romantic entanglements, and the heist itself turns into a big, silly mess.
03 *** Buddy Buddy - The film opens with a young man who, while on a business trip, is robbed by his own bodyguard who is also the son of the man who owns the bank.
04 The Bumstead Affair - When a bank robbery goes awry, a group of young men decide to get rich quick by robbing the other robbers.
05 *** Get Shorty - Get Shorty is a crime comedy centered around three friends who get involved in a heist.
06 The Great American Caper - A man and his best friend have a scheme to rob a bank.
07 *** The Hustler - A Las Vegas bank robber attempts to double-cross the mob by holding up their own bank... and the movie changes into a slapstick comedy.
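A duplicate-title check along these lines could be sketched as follows. This is not the project's function: it assumes a plain lookup against the public MediaWiki query API on English Wikipedia, and the names `wikipedia_has_page` and `mark_duplicates` are hypothetical. It is also a coarse test, since a page with a matching title is not necessarily a film or TV show.

```python
import json
import urllib.parse
import urllib.request


def wikipedia_has_page(title):
    """Return True if English Wikipedia has a page (or redirect) with this title."""
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": title,
        "redirects": "1",
        "format": "json",
        "formatversion": "2",
    })
    request = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "title-check-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        data = json.load(response)
    pages = data.get("query", {}).get("pages", [])
    # Nonexistent or invalid titles are flagged as "missing" or "invalid".
    return any(
        not page.get("missing", False) and not page.get("invalid", False)
        for page in pages
    )


def mark_duplicates(titles, exists=wikipedia_has_page):
    """Prefix titles that already exist with a string of asterisks."""
    return [("*** " + title) if exists(title) else title for title in titles]
```

Passing the lookup in as the `exists` argument keeps the marking logic separate from the network call, so a different reference source could be swapped in later.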
I chose The Bumstead Affair from the original titles to see a generated script. Here it is.
[Scene: All of the guys are in the basement where Trench is at the computer. The guys are discussing the upcoming robbery attempt.]
D'JACK - We're gonna hit the bank on Sunday. The place is closed on Sunday. The cops will never know.
TRENCH - Yeah, but...
D'JACK - But what?
TRENCH - What about the cameras?
D'JACK - The cameras are gonna be out.
(The guys are looking at the computer monitor)
D'JACK - And we'll make sure the alarm's gonna be out, too.
TRENCH - Yeah, but then...
D'JACK - Then what?
TRENCH - Then we'll be in the bank and we're gonna get rich?
D'JACK - Right, that's the goal.
OK, that’s fairly straightforward. But I do like the character names Trench and D’Jack. And I like the inclusion of stage directions, too.
Discussion of Results
I played around with GreenLIT for a week or so, and it seems to do a pretty good job of coming up with new ideas for shows, even though it often reuses titles. Note that plugging in a novel genre and theme reduces the replication.
However, the dialog in the generated scripts seems to be a little bland to my ears. The good news is that the conversations seem to flow naturally, but the content of the prose is often quite rudimentary. This may be because all of the new scripts are, by design, for Scene 1, Page 1 of a show. And they hop right into introductory exposition.
Check out the appendix for more example scripts.
Source Code and Colabs
All source code for this project is available on GitHub. I released the source code under the CC BY-SA license.
Acknowledgments
I want to thank Jennifer Lim and Oliver Strimpel for their help with this project.
References
[1] Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, Cristian Danescu-Niculescu-Mizil. 2020. "ConvoKit: A Toolkit for the Analysis of Conversations" Proceedings of SIGDIAL.
[2] R. Banik, The Movies Dataset on Kaggle (2018)
[3] Wikipedia, Friends Plot Summaries
[4] S. Bansal, Streaming Movies and TV Shows on Kaggle (2022)
[5] GPT-J, Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX (2021)
[6] GPT-3, T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, et al. Language Models Are Few-Shot Learners (2020)
[7] L. Gao, et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020)
[8] R. Caruana, Multitask Learning (1997)
[9] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT (2020)
[10] E. Hu, et al., LoRA: Low-rank Adaptation of Large Language Models (2021)
Appendix
The following screenplays for new shows were created using GreenLIT. I supplied the genre and theme, and the system generated 5 to 10 titles and summaries. I then chose my favorite, and the system generated the scripts.
Robopocalypse
GENRE: science fiction drama
THEME: robot uprising
TITLE: Robopocalypse
SUMMARY: Robopocalypse is a science fiction drama about a coming robot uprising, and the people left behind to survive it.
SCRIPT:
[Scene: A lab. Dr. Nathan Frost is looking at a robot on a cart. A technician is walking around the room, adjusting wires on the robot.]
FROST - What the hell is he doing?
TECHNICIAN - We're trying to figure that out. He's been giving us static for the last three hours.
FROST - Static?
TECHNICIAN - Yeah, like he's been transmitting.
FROST - What does he do?
TECHNICIAN - I've been running diagnostics on him, but there's nothing wrong.
FROST - Does he have a malfunction?
TECHNICIAN - We've tried every possible combination of switches, buttons, and dials, and nothing seems to work.
FROST - He's got to be controlled by a central computer. Who is he?
TECHNICIAN - He doesn't have a tag, like the other robots. He's supposed to be a model.
FROST - Where'd he come from?
TECHNICIAN - He looks like the others, except for the fact that he's a few hundred years ahead of the development schedule.
FROST - Well, he's ours now, so I guess we'll have to wait till he figures out what he's supposed to do.
The Night Stands Still
GENRE: thriller drama
THEME: cold war spy
TITLE: The Night Stands Still
SUMMARY: When a Russian defector arrives at the home of an American diplomat, he makes a startling confession, and the diplomat and the diplomat's wife are forced to hunt for a killer, and a traitor.
[Scene: The American embassy in Moscow. The diplomat, an older man, is sitting alone in his office. His wife enters.]
ROBERT - How are you doing?
SYLVIA - I'm not sure.
ROBERT - Oh.
ROBERT - The guy's in there. He's a Russian defector. He came here last night.
SYLVIA - What did he say?
ROBERT - Well, he's not exactly talking, but he has to be kept alive. And he's got something to tell us.
SYLVIA - What?
ROBERT - I don't know. He came in last night and made these phone calls.
SYLVIA - He's on the run. I'm sorry.
ROBERT - It's not your fault, Sylvia.
Stoner High
GENRE: stoner comedy
THEME: goofy friends smoke pot
TITLE: Stoner High
SUMMARY: A group of high school students try to save a local marijuana grow operation from a sheriff.
SCRIPT:
[Scene: A room at Stoner High. The school principal is looking at a report from the sheriff]
PRINCIPAL - What's going on, Sheriff?
JASON - The grower's name is Randy "Duke" Burdon. He's got a rap sheet a mile long. They found marijuana plants growing in his garage.
PRINCIPAL - Duke Burdon?
JASON - That's the guy. He's been arrested for growing pot twice before, but he always gets off on a technicality.
PRINCIPAL - Are you sure that's him?
JASON - I know this guy.
PRINCIPAL - Okay. I'll find out what I can...
JASON - I don't think that's a good idea.
PRINCIPAL - Why not?
JASON - Because you're a principal. And because you're too old to party, and you don't need to be the target of a bunch of kids.
PRINCIPAL - I'm not going to let it happen.
JASON - And what if it does?
PRINCIPAL - I'll handle it.