URL: https://www.analyticsvidhya.com/blog/2022/12/chatgpt-unlocking-the-potential-of-artificial-intelligence-for-human-like-conversation/

Understanding ChatGPT and Model Training in Simple Terms



Ganeshi Last Updated : 17 Apr, 2023
6 min read

Introduction

‘Hey, Siri,’ ‘Hey, Google,’ and ‘Alexa’ are the wake phrases of some common voice assistants we use on an everyday basis. These fascinating conversational bots use Natural Language Understanding (NLU) to understand their inputs. NLU is a subset of Natural Language Processing (NLP) that enables a machine to understand natural language (text or audio). NLU is a critical component of most NLP applications, such as machine translation, speech recognition, and chatbots. The foundation of NLU is the language model.

In this article, we will discuss OpenAI's state-of-the-art language models, GPT and its variants, and how they led to the breakthrough of ChatGPT. Some of the points covered in this article include:

  • Learn about ChatGPT and its model training process.
  • Understand the brief history of GPT architectures – GPT 1, GPT 2, GPT 3 and InstructGPT.
  • In-depth understanding of Reinforcement Learning from Human Feedback (RLHF).

Let’s get started!

Overview of GPT Family

The state-of-the-art architecture for language models is the transformer. The working of a transformer is no less than magic. OpenAI came up with one such transformer, the Generative Pre-trained Transformer, popularly known as GPT.

GPT is trained in a self-supervised fashion. The model is trained on a massive dataset to predict the next word in a sequence. This is known as causal language modeling. The language model is then fine-tuned on a supervised dataset for downstream tasks.
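To make next-word prediction concrete, here is a minimal sketch using a toy bigram model. It is an illustration only, not how GPT is implemented: GPT uses a transformer that conditions on the whole preceding context, while this sketch looks only at the previous word. The corpus and the function names (`train_bigram_model`, `predict_next`, `generate`) are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Self-supervised: the "label" for each word is simply the next word.
        for current_word, next_word in zip(tokens, tokens[1:]):
            counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, start_word, max_words=5):
    """Generate text one word at a time, feeding each prediction back in."""
    words = [start_word]
    for _ in range(max_words):
        next_word = predict_next(counts, words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return words

# Toy corpus (an assumption for illustration, not real training data).
corpus = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the cat ran",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
print(" ".join(generate(model, "cat", max_words=3)))
```

No labels are written by hand here: the training signal comes from the text itself, which is what "self-supervised" means. The `generate` loop also shows the idea of producing text by repeatedly predicting from what has already been produced.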

[Image: GPT family]

OpenAI released three versions of GPT (GPT-1, GPT-2, and GPT-3) to generate human-like conversations. The three versions differ mainly in size: each new version was trained by scaling up the data and the number of parameters.

[Image: InstructGPT]

GPT-3 is an autoregressive model: it is trained to predict the next token by looking only at the tokens that came before it. GPT-3 can power large-scale applications such as search engines, content creation tools, and many more. But why did GPT-3 fail to achieve human-like conversations? Let’s find out.

Why InstructGPT?

There are two primary reasons why GPT-3 fell short.

The first problem is that the model's output is not aligned with the user's instructions or prompts. In short, GPT-3 often does not generate the response the user prefers.

For example, given the prompt “Explain the moon landing to a 6-year-old in a few sentences”, GPT-3 generated the unwanted response shown in the figure below. The main reason behind such responses is that the model is trained only to predict the next word in a sentence. GPT-3 is not trained to generate human-preferred responses.

[Image: InstructGPT]

The second problem is that GPT-3 can generate unsafe and harmful text, because it has no mechanism to control what it generates.

To resolve both of these problems, alignment and harmful output, a new language model was trained. We will learn more about it in the next section.

What is InstructGPT?

InstructGPT is a language model that generates user-preferred responses with the intent of safe communication. Hence, it is described as a language model aligned to follow instructions. It uses a learning algorithm called Reinforcement Learning from Human Feedback (RLHF) to generate safer responses.

Reinforcement Learning from Human Feedback is a deep reinforcement learning technique that takes human feedback into account during learning. Human experts guide the learning algorithm by picking, from a list of responses generated by the model, the ones a human would most likely prefer. This way, the agent learns to produce safe and truthful responses.

But why Reinforcement Learning from Human Feedback? Why not traditional Reinforcement Learning systems?

Traditional Reinforcement Learning systems require a predefined reward function that tells the agent whether it is moving in the right direction, and the agent aims to maximize the cumulative reward. But specifying such a reward function by hand is very challenging in modern Reinforcement Learning environments. Hence, instead of defining the reward function for the agent, we train a model to learn the reward function from human feedback. This way, the agent can learn the reward function and understand the environment’s complex behaviors.

In the next section, we will learn about one of the most trending topics in the field of AI – ChatGPT.

Introduction to ChatGPT

ChatGPT is creating a buzz in the data science field. ChatGPT is simply a chatbot that mimics human conversations. It can answer a wide range of questions and remembers what was said earlier in the conversation. For example, given the prompt ‘code for decision tree’, ChatGPT responded with an implementation of a decision tree in Python, as shown in the figure below. That’s the power of ChatGPT. We will look at more hilarious examples at the end.

[Image: ChatGPT]

According to OpenAI, ChatGPT is a sibling model to InstructGPT, which is trained to follow instructions in a prompt and provide a detailed response. It is a modified version of InstructGPT with a change in the model training process. It can remember what was said earlier in the conversation and respond accordingly.

Now let’s see how InstructGPT and ChatGPT differ. Even though it incorporates Reinforcement Learning from Human Feedback, InstructGPT is not fully aligned and can still produce toxic output. This led to the breakthrough of ChatGPT, which changes the data collection setup.

How is ChatGPT built?

ChatGPT is trained similarly to InstructGPT, with a change in the data collection. Training happens in three steps; let’s understand each one.

In the first step, we fine-tune a pretrained GPT model (GPT-3 for InstructGPT; a model from the GPT-3.5 series for ChatGPT, according to OpenAI) on a dataset of prompts paired with relevant answers. This is a supervised fine-tuning task. The relevant answers are written by expert labelers.
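As a rough illustration of the supervised fine-tuning step, the sketch below builds one training example from a prompt and a labeler-written answer, then computes the loss over the answer tokens only. Scoring only the answer tokens is one common convention and an assumption here; the article does not say how the loss is computed. The whitespace tokenizer, the per-token probabilities, and the names `build_example` and `masked_nll` are all hypothetical.

```python
import math

def build_example(prompt, answer):
    """Concatenate prompt and labeler answer; mark which tokens are scored."""
    prompt_tokens = prompt.split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # 0 = prompt token (context only), 1 = answer token (contributes to loss)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask

def masked_nll(token_probs, loss_mask):
    """Average negative log-likelihood over the answer tokens only."""
    scored = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m == 1]
    return sum(scored) / len(scored)

tokens, mask = build_example(
    "Explain the moon landing to a 6-year-old",
    "People flew a rocket to the moon",
)
# Hypothetical probabilities a model assigned to each correct token.
probs = [0.5] * len(tokens)
print(mask)
print(round(masked_nll(probs, mask), 4))
```

Fine-tuning would adjust the model so that the probabilities of the labeler's answer tokens go up, which drives this loss down.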

In the next step, we will learn the reward function that helps the agent to decide what is right and wrong and then move in the right direction of the goal. The reward function is learned through human feedback, thus ensuring the model’s generation of safe and truthful responses.

Here is the list of steps involved in the reward modeling task-

  1. Multiple responses are generated for the given prompt.
  2. The human labeler compares the responses generated by the model and ranks them from best to worst.
  3. This ranking data is then used to train the reward model.
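The steps above can be sketched in a few lines. The example below turns one labeler ranking into preferred/rejected pairs and fits a toy reward model with a pairwise logistic loss, -log(sigmoid(r_preferred - r_rejected)). This is a common formulation for learning from comparisons, but the article does not specify the exact loss, so treat it as an assumption. The toy "reward model" is just one learnable score per response rather than a neural network, and the response names are placeholders.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(diff)): small when the preferred response scores higher."""
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))

def fit_rewards(pairs, steps=500, learning_rate=0.1):
    """Toy reward model: one learnable score per response, fit by gradient descent."""
    rewards = {name: 0.0 for pair in pairs for name in pair}
    for _ in range(steps):
        for preferred, rejected in pairs:
            diff = rewards[preferred] - rewards[rejected]
            # Derivative of the loss with respect to diff is -1 / (1 + exp(diff)).
            grad = -1.0 / (1.0 + math.exp(diff))
            rewards[preferred] -= learning_rate * grad  # pushes preferred score up
            rewards[rejected] += learning_rate * grad   # pushes rejected score down
    return rewards

# Step 1 and 2: three generated responses, ranked best to worst by a labeler.
ranked = ["response_B", "response_A", "response_C"]
pairs = ranking_to_pairs(ranked)

# Step 3: train the (toy) reward model on the comparison pairs.
rewards = fit_rewards(pairs)
mean_loss = sum(pairwise_loss(rewards[w], rewards[l]) for w, l in pairs) / len(pairs)
print(pairs)
print(sorted(rewards, key=rewards.get, reverse=True))
```

After training, the learned scores reproduce the labeler's ordering, which is what lets the reward model stand in for human judgment in the next step.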

In the final step, we learn the optimal policy against the reward function using the Proximal Policy Optimization (PPO) algorithm. PPO is a family of policy-gradient reinforcement learning methods introduced by OpenAI. The idea behind PPO is to stabilize agent training by avoiding policy updates that are too large.
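One way PPO avoids overly large updates is its clipped surrogate objective, sketched below for a single action. Here `prob_ratio` is the probability of the action under the new policy divided by its probability under the old policy, and `advantage` says how much better than expected the action turned out. The clipping range `epsilon=0.2` is a commonly used default and an assumption here; the article does not give the value used for ChatGPT.

```python
def clipped_surrogate(prob_ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one action (to be maximized)."""
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(prob_ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped estimates.
    return min(prob_ratio * advantage, clipped_ratio * advantage)

# Good action (positive advantage): the gain is capped once the ratio exceeds 1.2.
print(clipped_surrogate(1.5, advantage=1.0))
# Bad action (negative advantage): no extra credit for pushing the ratio below 0.8.
print(clipped_surrogate(0.5, advantage=-1.0))
```

Because the objective stops improving once the ratio leaves the clipping range, the optimizer has little incentive to move the new policy far from the old one in a single update.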

[Image: Steps involved in model training. Source: https://openai.com/blog/chatgpt/]

Hilarious Prompts of ChatGPT

Now, we will look at some hilarious prompts and the responses ChatGPT generated for them.

Prompt 1:

[Image: ChatGPT]

Prompt 2:

[Image: ChatGPT]

Prompt 3:

[Image: ChatGPT]

Conclusion

This brings us to the end of the article. In this article, we discussed ChatGPT and how it is trained using Deep Reinforcement Learning techniques. We also covered a brief history of GPT variants and how they led to ChatGPT.

ChatGPT is an absolute sensation in the history of AI, but there is still a long way to go before it achieves human-level intelligence. You can try ChatGPT here.

Hope you liked the article. Please let me know your thoughts and views on ChatGPT in the comments below.
