SingleStore sponsored this post. Insight Partners is an investor in SingleStore and TNS.
This is the second of two articles.
In the previous article, we discussed three considerations for developers when building GPT applications with an open source stack, such as LangChain. Let’s now use LangChain for a practical example where we want to store and analyze PDF documents.
We’ll obtain a PDF document, divide it into smaller parts, save the document text and its vector representations (embeddings*) in a database system and then query it. We’ll also use a GPT to help answer a question.
*In a GPT, an embedding is simply a numerical representation of a word or phrase. Vectors represent the semantic meaning of words and phrases in a way that a machine-learning model can understand.
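To make the idea concrete, here is a minimal sketch using hand-written three-dimensional vectors. The values are invented purely for illustration; real embeddings come from a model and have far more dimensions. The point is only that related terms should score as more similar than unrelated ones.

```python
import math

# Toy, hand-made "embeddings" (assumed values, not produced by any model).
toy_embeddings = {
    "database": [0.9, 0.1, 0.0],
    "table":    [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(toy_embeddings["database"], toy_embeddings["table"])
sim_unrelated = cosine_similarity(toy_embeddings["database"], toy_embeddings["banana"])

print(f"database vs. table:  {sim_related:.3f}")
print(f"database vs. banana: {sim_unrelated:.3f}")
```

With these toy values, "database" and "table" point in nearly the same direction, while "database" and "banana" are almost orthogonal.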
Designed for intelligent applications, SingleStore is the world’s only real-time data platform that can read, write and reason on petabyte-scale data in a few milliseconds. Insight Partners is an investor in SingleStore and TNS.
Create a SingleStoreDB Cloud Account
First, sign up for a free SingleStoreDB Cloud account. Once logged in, select CLOUD > Create new workspace group from the left-hand navigation pane. Next, choose Create Workspace and just work through the wizard. Here are the recommended settings for this example:
Create Workspace Group
Workspace Group Name: LangChain Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Click Next.
Create Workspace
Workspace Name: langchain-demo
Size: S-00
Click Create Workspace.
Once the workspace is created and available, from the left-hand navigation pane, select DEVELOP > SQL Editor to create a new database, as follows:
CREATE DATABASE IF NOT EXISTS pdf_db;
Create a Notebook
From the left-hand navigation pane, select DEVELOP > Notebooks. In the top right of the web page, select New Notebook > New Notebook, as shown in Figure 1 below.
Figure 1.
We’ll call the notebook langchain_demo. Select a Blank notebook template from the available options.
We’ll also select the Connection and Database using the drop-down menus above the notebook, as shown in Figure 2.
Figure 2. Connection and Database
Fill out the Notebook
First, we’ll install some libraries:
!pip install langchain --quiet
!pip install openai --quiet
!pip install pdf2image --quiet
!pip install tabulate --quiet
!pip install tiktoken --quiet
!pip install unstructured --quiet
Next, we’ll read in a PDF document. This is an article by Neal Leavitt titled “Whatever Happened to Object-Oriented Databases?” OODBs were an emerging technology during the late 1980s and early 1990s. We’ll add `leavcom.com` to the firewall by selecting the Edit Firewall option in the top right. Once the address has been added to the firewall, we’ll read the PDF file:
from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("http://leavcom.com/pdf/DBpdf.pdf")
data = loader.load()
We can use LangChain’s OnlinePDFLoader, which makes reading a PDF file easier.
Next, we’ll get some data on the document:
from langchain.text_splitter import RecursiveCharacterTextSplitter
print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")
The output should be:
You have 1 document(s) in your data
There are 13040 characters in your document
We’ll now split the document into chunks of at most 2,000 characters each (the code calls them pages), giving us seven pages:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
print(f"You have {len(texts)} pages")
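As a rough illustration of why a 13,040-character document yields seven pieces at this size, here is a minimal sketch of naive fixed-size splitting. This is not how `RecursiveCharacterTextSplitter` works internally: that class prefers to break on separators such as paragraph and line boundaries, so its chunks are typically somewhat shorter than the maximum. The `split_fixed` helper and the stand-in text are hypothetical.

```python
def split_fixed(text, chunk_size=2000):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Stand-in text with the same length as the article's PDF (13,040 characters).
sample_text = "x" * 13040
chunks = split_fixed(sample_text)

print(len(chunks))               # number of chunks
print([len(c) for c in chunks])  # size of each chunk
```

Six full 2,000-character pieces cover 12,000 characters, and the remaining 1,040 characters form the seventh.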
Next, we’ll create a table to store the text and embeddings. We can do this directly using the `%%sql` magic command:
%%sql
USE pdf_db;
DROP TABLE IF EXISTS pdf_docs;
CREATE TABLE IF NOT EXISTS pdf_docs (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
To use Python code to connect to our database, we can use the built-in `connection_url`, as follows:
from sqlalchemy import create_engine
db_connection = create_engine(connection_url)
We’ll set our OpenAI API Key:
import openai
openai.api_key = "OpenAI API Key"
and use LangChain’s `OpenAIEmbeddings`:
from langchain.embeddings import OpenAIEmbeddings
embedder = OpenAIEmbeddings(openai_api_key = openai.api_key)
Now we are ready to obtain the vector embeddings and store them in the database system:
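# Note: calling execute() on the engine with a raw SQL string requires a SQLAlchemy version before 2.0
db_connection.execute("TRUNCATE TABLE pdf_docs")
for i, document in enumerate(texts):
    text_content = document.page_content
    # Request the embedding for this chunk from OpenAI
    embedding = embedder.embed_documents([text_content])[0]
    stmt = """
        INSERT INTO pdf_docs (
            id,
            text,
            embedding
        )
        VALUES (
            %s,
            %s,
            JSON_ARRAY_PACK_F32(%s)
        )
    """
    # str(embedding) yields a JSON-style array literal such as "[0.1, 0.2, ...]"
    db_connection.execute(stmt, (i+1, text_content, str(embedding)))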
We truncate the table to ensure that we start with an empty table. Then we iterate through the pages of text, obtain the embeddings from OpenAI, and store the text and embeddings in the database table.
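`JSON_ARRAY_PACK_F32` turns a JSON array into a binary blob of 32-bit floats. To give a feel for what ends up in the `embedding` column, the sketch below packs a short list with Python's `struct` module. The little-endian byte order is an assumption here about the storage layout, and the `pack_f32` and `unpack_f32` helpers are hypothetical names, not part of any SingleStore library.

```python
import struct

def pack_f32(vector):
    """Pack a list of floats as little-endian 32-bit floats (assumed layout)."""
    return struct.pack(f"<{len(vector)}f", *vector)

def unpack_f32(blob):
    """Reverse of pack_f32: read 4 bytes per value."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

vector = [0.25, -0.5, 1.0]  # values chosen to be exactly representable as float32
blob = pack_f32(vector)

print(len(blob))                   # 4 bytes per value
print(unpack_f32(blob) == vector)  # round trip check
```

At four bytes per value, a 1,536-dimension embedding, such as those produced by OpenAI's `text-embedding-ada-002` model, would occupy 6,144 bytes per row.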
We can now ask a question, as follows:
query_text = "Will object-oriented databases be commercially successful?"
# Embed the question with the same model used for the document chunks
query_embedding = embedder.embed_documents([query_text])[0]
stmt = """
    SELECT
        text,
        DOT_PRODUCT_F32(JSON_ARRAY_PACK_F32(%s), embedding) AS score
    FROM pdf_docs
    ORDER BY score DESC
    LIMIT 1
"""
# Pass parameters as a tuple, matching the INSERT above
results = db_connection.execute(stmt, (str(query_embedding),))
for row in results:
    print(row[0])
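Here we convert the question into a vector embedding, compute its `DOT_PRODUCT_F32` with each stored embedding and return only the highest-scoring text. Because OpenAI embeddings are normalized to unit length, the dot product ranks results the same way cosine similarity would.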
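The same ranking logic can be sketched in plain Python, which may help when reasoning about what the SQL query returns. The rows and vectors below are made up for illustration and stand in for the contents of `pdf_docs`; the `dot_product` helper is hypothetical.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical rows of (text, embedding); values are invented for illustration.
rows = [
    ("chunk A", [0.1, 0.9, 0.2]),
    ("chunk B", [0.8, 0.3, 0.1]),
    ("chunk C", [0.2, 0.1, 0.9]),
]
toy_query = [0.9, 0.2, 0.1]

# Equivalent of: ORDER BY score DESC LIMIT 1
scored = [(text, dot_product(toy_query, emb)) for text, emb in rows]
best_text, best_score = max(scored, key=lambda pair: pair[1])

print(best_text)
```

Doing the scoring inside the database avoids pulling every embedding into the notebook, which matters once the table holds more than a handful of rows.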
Finally, we can use a GPT to provide an answer, based on the earlier question:
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0
prompt = f"The user asked: {query_text}. The most similar text from the document is: {row[0]}"
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)
print(response['choices'][0]['message']['content'])
Here is some example output:
Based on the information provided in the document, it seems that object-oriented databases are not expected to be commercially successful in the near future. While they are gaining some popularity in niche markets such as CAD and telecommunications, relational databases continue to dominate the market and are expected to do so for the foreseeable future. IDC predicts that the growth rate for relational databases will be significantly higher than that of OO databases through 2004. However, OO databases still have their place in certain niche markets.
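The prompt construction that precedes the `ChatCompletion` call reads the retrieved text from the `row` variable left over from the earlier loop. One possible way to make that dependency explicit is to wrap the step in a small function. The `build_messages` helper below is a hypothetical sketch, not part of LangChain or the OpenAI library, and the context string is a placeholder.

```python
def build_messages(question, context):
    """Return a chat message list that pairs the question with retrieved text."""
    prompt = (
        f"The user asked: {question}. "
        f"The most similar text from the document is: {context}"
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

messages = build_messages(
    "Will object-oriented databases be commercially successful?",
    "(text of the best-matching chunk goes here)",
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument used in the notebook, so it could be passed to the chat call unchanged.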
Summary
In this example, we saw the benefits of LangChain in the application development process. We also saw how easily we can convert documents from one format to another, store the content in a database system, generate vector embeddings and ask questions about the data stored in the database system. We also have the full power of SQL available if we are interested in performing additional query operations on the data.
I will host a workshop on June 22 and will go through building a ChatGPT application using LangChain. I hope you can join. Sign up here.