URL: https://dev.to/googleai/building-knowledge-graphs-with-gemini-3ail

Building Knowledge Graphs with Gemini


✨ Overview

In this exploration, we'll see how to turn raw, unstructured documents into structured knowledge graphs using Gemini. We'll start by prototyping to develop our intuition. Then, we'll optimize our prompts and outputs, and finally scale up to process entire books or dense legal contracts. By the end, we'll even visualize extracted book narratives and contractual network graphs!


A few notes before we start:

  • I'm a software engineer and developer advocate at Google Cloud, and I hope you'll learn a few things. Thoughts and opinions are entirely my own.
  • The complete source code is available in this notebook (including setup details and future updates) under the Apache 2.0 license. You can also directly open the notebook in Colab. This article reproduces all the results generated by a click on “Run all”.
  • You can experiment and build for free with Gemini in the following platforms:

🔥 Challenge

Documents are everywhere. We use them for business, daily operations, legal matters, technical docs, education, and even just for fun. However, documents are not databases. They're generally unstructured, and fully understanding them requires multiple reading passes.

So, can we extract structured knowledge from documents using only the following?

  • 1 document
  • 1 prompt
  • 1 request

Let's try with Gemini…


🏁 Setup

🐍 Python packages

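We'll use the following packages:

  • google-genai: the Google Gen AI SDK, to send requests to Gemini
  • networkx: to build network graphs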

We'll also need:

  • tenacity for request management (a dependency of google-genai)
  • matplotlib and pillow for data visualization (dependencies of networkx)
%pip install --quiet "google-genai>=2.6.0" "networkx[default]"

🤝 Gemini API

To use the Gemini API, we have two main options:

  1. Via Agent Platform (formerly Vertex AI) with a Google Cloud project
  2. Via Google AI Studio with a Gemini API key

🤖 Gen AI SDK

To send Gemini requests, we'll use a google.genai client:

from google import genai

check_environment()

client = genai.Client()

check_configuration(client)
✅ Using the Agent Platform API with project "lpdemo-…" in location "global"

🔣 Input data

We need a suite of test data to develop our solution.

Multimodality

We'll test the following types:

  • Text (text/plain): Classic books are good text sources of varying lengths and languages.
  • PDF (application/pdf): Legal agreements are also great examples of complex and dense documents.

Gemini is natively multimodal, which means it can process different types of inputs. Once we've built knowledge graphs from text or PDF inputs, the solution will also naturally support the following formats:

  • Image (image/*)
  • Audio (audio/*)
  • Video (video/*)

General knowledge

⚠️ LLMs are trained on general knowledge, which becomes part of their "long-term memory". To avoid generating memorized information, we'll explicitly instruct the model to use only the provided inputs.

Multilinguality

Gemini is also natively multilingual, which lets us process inputs and generate outputs in 100+ languages.

To keep things general, we'll use English for prompts and knowledge graphs, but you can use any of the 100+ supported languages, as long as your prompts remain clear and explicit.


🧠 Gemini model

Gemini comes in different versions and sizes (Flash-Lite, Flash, and Pro).

Let's get started with Gemini 3.1 Flash-Lite, as it offers high performance, low latency, and very high output speed:

  • GEMINI_3_1_FLASH_LITE = "gemini-3.1-flash-lite"

⚙️ Gemini configuration

Gemini can be used in different ways, ranging from factual to creative modes. We're essentially dealing with a data-extraction use case. We want the results to be as factual and deterministic as possible. To achieve this, we can adjust the content generation parameters.

We'll set the temperature, top_p, and seed parameters to minimize randomness:

  • temperature=0.0
  • top_p=0.0
  • seed=42 (arbitrary fixed value)
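As a minimal sketch, these three parameters could be grouped as shown below. It assumes the google-genai package installed earlier and its `types.GenerateContentConfig` class. The `GENERATION_PARAMS` name is only illustrative and is not taken from the notebook.

```python
# Generation parameters named in the text above (values from the article).
GENERATION_PARAMS = {
    "temperature": 0.0,  # least random sampling
    "top_p": 0.0,        # narrowest nucleus sampling
    "seed": 42,          # arbitrary fixed value
}

try:
    # Assumption: google-genai is installed (see the %pip command above).
    from google.genai import types

    config = types.GenerateContentConfig(**GENERATION_PARAMS)
except ImportError:
    # Keeps the sketch runnable without the SDK.
    config = None

print(GENERATION_PARAMS)
```

Such a config object would then typically be passed as the `config` argument of `client.models.generate_content(...)`.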

🛠️ Helpers


🏗️ Prototyping

Before diving into a solution, it helps to start by prototyping to build some intuition about the natural behavior of the model.

Let's define a short text of a few sentences:

text = """
- Henry Jones is a famous archaeologist. He is actually a "Junior" because he is named after his father.
- Sophie is Henry's daughter, shares his last name, and works as a software engineer.
- William Smith is an aerospace engineer and Sophie's lifelong friend. Everybody calls him Bill and Beau is his dog.
- Short Round met Henry as a child. They first became close friends, and Henry officially adopted him a few years later.
- Sophie and Bill both work at Acme Aerospace.
"""

📋 Listing characters

🧪 Let's see if Gemini can spot our characters…

prompt = """
Using only the input data, list all people and animals mentioned.
"""
generate_content(prompt, text)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 148
Output tokens : 67
------------------------------ Start of Response -------------------------------
Based on the input data provided, here are the people and animals mentioned:

**People:**
* Henry Jones (also known as Henry Jones Junior)
* Sophie Jones
* William Smith (also known as Bill)
* Short Round

**Animals:**
* Beau (Bill's dog)
------------------------------- End of Response --------------------------------

💡 All people and animals are detected as expected.


📋 Listing characters & relationships

🧪 Now, let's see if it can connect the dots and figure out who's who…

prompt = """
Using only the input data, list all people and animals mentioned, and how they relate to each other.
"""
generate_content(prompt, text)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 156
Output tokens : 168
------------------------------ Start of Response -------------------------------
Based on the input data provided, here are the people and animals mentioned and their relationships:

**People:**
* **Henry Jones (Junior):** A famous archaeologist. He is the father of Sophie, the adoptive father of Short Round, and is named after his own father.
* **Sophie Jones:** A software engineer at Acme Aerospace. She is the daughter of Henry Jones and a lifelong friend of Bill (William Smith).
* **William (Bill) Smith:** An aerospace engineer at Acme Aerospace. He is a lifelong friend of Sophie and the owner of Beau.
* **Short Round:** The adopted son of Henry Jones. He met Henry as a child and they became close friends before the adoption.

**Animals:**
* **Beau:** A dog owned by William (Bill) Smith.
------------------------------- End of Response --------------------------------

💡 Notes

  • The relationships between the characters can also be neatly consolidated.
  • Henry's father is not listed as a named character. Here, we can see the difference between explicit and implicit mentions. By default, only explicit entities are detailed (most likely because models are trained to summarize information).
  • Implied characters (like "he is named after his father") represent a level of indirection. We'll see in the next section how to detect them as well.

📚 Domain terminology

We're not domain experts in the field we're exploring (yet!).

An LLM processes instructions based on the given prompt and its training knowledge. This knowledge is part of its long-term memory, and we can learn a lot directly from the model itself.

🧪 Let's ask Gemini:

prompt = """
What is the terminology used when building a knowledge graph?
Please provide a simple data example in JSON.
"""
generate_content(prompt)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 21
Output tokens : 581
------------------------------ Start of Response -------------------------------
Building a knowledge graph involves representing information as a network of interconnected entities. Here is the core terminology and a simple data example.

### Core Terminology

1. **Entity (Node):** The "things" in your graph (e.g., a person, a place, a product).
2. **Relationship (Edge/Link):** The connection between two entities. It describes how they interact (e.g., "works at," "lives in," "is a friend of").
3. **Property (Attribute):** Key-value pairs that provide more detail about an entity or a relationship (e.g., a person's "age" or a relationship's "start_date").
4. **Label:** A category assigned to a node or edge to define its type (e.g., a node might have the label "Person").
5. **Schema (Ontology):** The formal structure or "blueprint" that defines the types of entities allowed and the rules for how they can be connected.
6. **Triple:** The fundamental unit of a knowledge graph, consisting of a **Subject → Predicate → Object** (e.g., *Alice* **works at** *Google*).

---

### Simple Data Example (JSON)

In a knowledge graph, data is often represented as a collection of **Nodes** and **Edges**.

```json
{
  "nodes": [
    {
      "id": "1",
      "label": "Person",
      "properties": {
        "name": "Alice",
        "age": 30
      }
    },
    {
      "id": "2",
      "label": "Company",
      "properties": {
        "name": "Google",
        "industry": "Technology"
      }
    }
  ],
  "edges": [
    {
      "id": "e1",
      "source": "1",
      "target": "2",
      "label": "WORKS_AT",
      "properties": {
        "since": 2020
      }
    }
  ]
}
```

### Breakdown of the Example:

* **Nodes:** We have two entities: "Alice" (a Person) and "Google" (a Company).
* **Edge:** We have one relationship: "WORKS_AT" connecting Alice to Google.
* **Properties:** We stored specific details like Alice's age and the year she started working at Google.
* **Triple representation:** This JSON effectively encodes the triple: **(Alice) —[WORKS_AT]—> (Google)**.
------------------------------- End of Response --------------------------------

💡 We learn that knowledge graphs are made of entities and relationships, also called nodes and edges, and we get a nice introduction to the field. Using domain terminology will help make our prompts explicit and precise.


⛏️ Tabular extraction

To extract knowledge graphs, we'll reason in terms of entities and relationships, adopting domain terminology.

If we think of the final result as a database, our goal is to generate two linked tables, allowing us to reason in terms of data and fields.

Here is a conceptual view of what we want to achieve:

Entities

| id | name | label |
|---|---|---|
| 0 | Henry Jones Jr. | person |
| 1 | Henry Jones Sr. | person |

Relationships

| source_id | link | target_id |
|---|---|---|
| 0 | child_of | 1 |

Let's call this approach "tabular extraction" and split our instructions to output two successive tables, while still using a single request…
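To make the "two linked tables" idea concrete, here is a small sketch using plain Python lists. It joins the relationships table to the entities table through the ids and prints the resulting triples. The variable names are illustrative only.

```python
# The two conceptual tables from above, as plain Python data.
entities = [
    {"id": 0, "name": "Henry Jones Jr.", "label": "person"},
    {"id": 1, "name": "Henry Jones Sr.", "label": "person"},
]
relationships = [
    {"source_id": 0, "link": "child_of", "target_id": 1},
]

# Index entities by id, like a primary key lookup.
names_by_id = {entity["id"]: entity["name"] for entity in entities}

# Resolve each relationship into a (subject, predicate, object) triple.
triples = [
    (names_by_id[rel["source_id"]], rel["link"], names_by_id[rel["target_id"]])
    for rel in relationships
]

for subject, predicate, obj in triples:
    print(f"{subject} --[{predicate}]--> {obj}")
```

Keeping names in one table and referencing them by id in the other means each name is written only once, however many relationships involve it.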


💠 Entities

In our prototype text, the entities we want to extract are characters (like people or animals). We can define an entity data schema with the fields id (0, 1, 2…), name (full name of the entity), and label (person|animal).

🧪 Let's extract the entities:

prompt = """
**Data Schema**

Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Full name of the entity.
- `label`: `person`|`animal`.

**Instructions**

1. Entity Extraction:
 - Extract every distinct entity from the input data that matches an allowed `label`.
 - Include entities that are explicitly named as well as implied entities whose names can be determined from the context.
2. Output the results as a JSON array inside a fenced code block.
"""

generate_content(prompt, text)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 249
Output tokens : 195
------------------------------ Start of Response -------------------------------
[{"id":0,"name":"Henry Jones Jr.","label":"person"},{"id":1,"name":"Henry Jones Sr.","label":"person"},{"id":2,"name":"Sophie Jones","label":"person"},{"id":3,"name":"William Smith","label":"person"},{"id":4,"name":"Beau","label":"animal"},{"id":5,"name":"Short Round","label":"person"}]
------------------------------- End of Response --------------------------------

💡 Remarks

  • Every entity is dynamically assigned a unique sequential identifier.
  • The name "Henry Jones Sr." is extracted even though he's not explicitly mentioned in the text. This stems from the instruction to "include entities that are explicitly named as well as implied entities whose names can be determined from the context".

🔗 Relationships

🧪 Now, let's extract both the entities and their relationships:

prompt = """
**Data Schema**

Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Full name of the entity.
- `label`: `person`|`animal`.

Relationship:
- `source_id`: `id` of the subject entity.
- `link`: `snake_case` predicate describing the relationship.
- `target_id`: `id` of the object entity.

**Instructions**

1. Entity Extraction:
 - Extract every distinct entity from the input data that matches an allowed `label`.
 - Include entities that are explicitly named as well as implied entities whose names can be determined from the context.
2. Relationship Extraction:
 - Extract every distinct relationship between them.
 - If a relationship changes over time, make sure to include every distinct stage of the relationship.
3. Output a JSON object with keys `entities` and `relationships` inside a fenced code block.
"""

response = generate_content(prompt, text, return_response=True)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 340
Output tokens : 456
------------------------------ Start of Response -------------------------------
{"entities":[{"id":0,"name":"Henry Jones Jr.","label":"person"},{"id":1,"name":"Henry Jones Sr.","label":"person"},{"id":2,"name":"Sophie Jones","label":"person"},{"id":3,"name":"William Smith","label":"person"},{"id":4,"name":"Beau","label":"animal"},{"id":5,"name":"Short Round","label":"person"}],"relationships":[{"source_id":0,"link":"child_of","target_id":1},{"source_id":2,"link":"child_of","target_id":0},{"source_id":3,"link":"friend_of","target_id":2},{"source_id":4,"link":"pet_of","target_id":3},{"source_id":5,"link":"friend_of","target_id":0},{"source_id":0,"link":"adopted","target_id":5},{"source_id":5,"link":"child_of","target_id":0}]}
------------------------------- End of Response --------------------------------

💡 Remarks

  • The output now includes the relationships array.
  • We get the expected relationships, notably the two successive stages of the relationship between Henry and Short Round (friendship, then adoption).
  • The link predicates are completely dynamic (a level of flexibility we left in the prompt). While it's interesting to see this natural behavior, we'll want to make it more deterministic for production since our prompt has too many degrees of freedom.
  • Our open prompt extracts relationships in only one direction. For example, "[Animal] pet_of [Person]" is an asymmetric relationship that could also be extracted as "[Person] owner_of [Animal]". This is another area where the prompt is too open-ended. In the finalization section, we'll see an example that asks the model to extract symmetric and asymmetric relationships in both directions.
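The one-direction remark can be checked with a few lines of Python. This sketch copies the relationships from the response above as tuples and lists the entity pairs that are linked in one direction only. It is a quick diagnostic, not part of the notebook.

```python
# Relationships from the response above: (source_id, link, target_id).
relationships = [
    (0, "child_of", 1),
    (2, "child_of", 0),
    (3, "friend_of", 2),
    (4, "pet_of", 3),
    (5, "friend_of", 0),
    (0, "adopted", 5),
    (5, "child_of", 0),
]

# Ordered (source, target) pairs that carry at least one relationship.
linked_pairs = {(source, target) for source, _, target in relationships}

# Pairs with no relationship going the other way.
one_way_pairs = sorted(
    pair for pair in linked_pairs if (pair[1], pair[0]) not in linked_pairs
)

print(one_way_pairs)
```

Only the Henry and Short Round pair is covered in both directions here, and that comes from the two stages of their relationship rather than from explicit inverse links.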

💎 Structured output

We've concluded our prototyping stage with promising results using a data schema.

To move to production, the next step is to control the generation with a specific structured output.


🗂️ JSON

The JSON format has industry-wide support and serves as a core or intermediate format in many use cases.

For the next step, we would typically define classes using the Pydantic library and request a pure JSON output with a response schema in the config parameters:

  • response_mime_type="application/json"
  • response_schema="CLASS_DERIVED_FROM_PYDANTIC_BASE_MODEL" (docs)

⚠️ However, JSON is a pretty verbose format, designed for interoperability but not optimized for size. Even if we generate compact JSON (also called minified JSON), it still has inherent verbosity due to:

  • repeated keys (object fields) for each object instance
  • opening and closing brackets
  • enclosing quotes for each key and string value

ℹ️ When using LLMs, once the first token is generated, the remaining generation time is roughly proportional to the number of output tokens. Similarly, the cost of a request is based on token usage (input + output), with output tokens being significantly more expensive than input tokens.

💡 A better output structure will positively impact both generation speed and request cost.

Let's explore an alternative…


📅 TSV

Our tabular-extraction problem clearly calls for table outputs. An interesting possibility is to ask for Tab-Separated Values (TSV) outputs. For example, we can define our output to be formatted as two consecutive TSV tables.

Example output format

```tsv filename="entities.tsv"
id{TAB}name{TAB}label
[rows]
```

```tsv filename="relationships.tsv"
source_id{TAB}link{TAB}target_id
[rows]
```

Will this work?

Generating structured outputs like TSV will work seamlessly, as Gemini excels at patterns. We just need to be explicit about what's expected.

Will this be efficient?

💡 For our use case, this structure looks optimal:

  • Field names are stated once in the header row.
  • Field values are raw, without outer quotes.
  • Tab and newline separators take at most a single token each and do not collide with field values (unless we're dealing with very special input data, but we could then prompt the model to escape these using different characters or strings).

ℹ️ CSV could be another alternative, but common separators like commas are everywhere in natural language and frequently appear in names and descriptions (e.g., if we decide to extend entity fields). If you're interested in this topic, check out the TOON format, which proposes a JSON alternative using a YAML+CSV mix.

To make an informed decision, we should actually compare the number of tokens needed to represent the same data…
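Before counting tokens, a rough character-level comparison can be sketched with the standard library alone. The `to_tsv_block` helper is hypothetical (the notebook's own `compare_json_vs_tsv` helper is not shown here), and character counts are only a proxy: actual token counts depend on the model's tokenizer.

```python
import json

# Same data as the latest response, written compactly as tuples.
entity_rows = [
    (0, "Henry Jones Jr.", "person"), (1, "Henry Jones Sr.", "person"),
    (2, "Sophie Jones", "person"), (3, "William Smith", "person"),
    (4, "Beau", "animal"), (5, "Short Round", "person"),
]
relationship_rows = [
    (0, "child_of", 1), (2, "child_of", 0), (3, "friend_of", 2),
    (4, "pet_of", 3), (5, "friend_of", 0), (0, "adopted", 5),
    (5, "child_of", 0),
]
graph = {
    "entities": [dict(zip(("id", "name", "label"), r)) for r in entity_rows],
    "relationships": [
        dict(zip(("source_id", "link", "target_id"), r))
        for r in relationship_rows
    ],
}

FENCE = "`" * 3  # built this way to avoid a literal fence inside the example


def to_tsv_block(filename, rows):
    """Hypothetical helper: render a list of dicts as one fenced TSV block."""
    header = "\t".join(rows[0])  # field names, stated once
    lines = ["\t".join(str(value) for value in row.values()) for row in rows]
    return "\n".join([f'{FENCE}tsv filename="{filename}"', header, *lines, FENCE])


formatted_json = json.dumps(graph, indent=2)
compact_json = json.dumps(graph, separators=(",", ":"))
tsv = "\n\n".join([
    to_tsv_block("entities.tsv", graph["entities"]),
    to_tsv_block("relationships.tsv", graph["relationships"]),
])

for name, text in [
    ("Formatted JSON", formatted_json),
    ("Compact JSON", compact_json),
    ("TSV", tsv),
]:
    print(f"{name:<15}{len(text):>6} chars")
```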


🗜️ Token comparison


🧪 First, let's compare how much data we can represent for the same number of characters based on our latest API response:

compare_json_vs_tsv(response, only_show_excerpts=True)
----------------- Formatted JSON - First 335/1122 chars (30%) ------------------
```json
{
 "entities": [
 {
 "id": 0,
 "name": "Henry Jones Jr.",
 "label": "person"
 },
 {
 "id": 1,
 "name": "Henry Jones Sr.",
 "label": "person"
 },
 {
 "id": 2,
 "name": "Sophie Jones",
 "label": "person"
 },
 {
 "id": 3,
 "name": "William Smith",
 …
------------------- Compact JSON - First 335/665 chars (50%) -------------------
```json
{"entities":[{"id":0,"name":"Henry Jones Jr.","label":"person"},{"id":1,"name":"Henry Jones Sr.","label":"person"},{"id":2,"name":"Sophie Jones","label":"person"},{"id":3,"name":"William Smith","label":"person"},{"id":4,"name":"Beau","label":"animal"},{"id":5,"name":"Short Round","label":"person"}],"relationships":[{"source_i…
------------------------------- TSV (335 chars) --------------------------------
```tsv filename="entities.tsv"
id name label
0 Henry Jones Jr. person
1 Henry Jones Sr. person
2 Sophie Jones person
3 William Smith person
4 Beau animal
5 Short Round person
```

```tsv filename="relationships.tsv"
source_id link target_id
0 child_of 1
2 child_of 0
3 friend_of 2
4 pet_of 3
5 friend_of 0
0 adopted 5
5 child_of 0
```

💡 Notice how much more information can be represented in the same number of text characters. This will apply similarly to token counts.


🧪 And now, let's compare the gains, especially for token counts:

compare_json_vs_tsv(response)
------------------------ Formatted JSON → Compact JSON -------------------------
                  Chars   Tokens
Formatted JSON     1122      456
Compact JSON        665      216
Gain              40.7%    52.6%
------------------------------ Compact JSON → TSV ------------------------------
                  Chars   Tokens
Compact JSON        665      216
TSV                 335      137
Gain              49.6%    36.6%
----------------------------- Formatted JSON → TSV -----------------------------
                  Chars   Tokens
Formatted JSON     1122      456
TSV                 335      137
Gain              70.1%    70.0%
--------------------------------------------------------------------------------

💡 Savings in output tokens:

  • 30 to 40% compared to compact JSON
  • 60 to 70% compared to standard JSON

With a double-digit percentage reduction in output tokens, building knowledge graphs with TSV outputs is significantly faster (and cheaper)!
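For reference, the gains reported above follow from a simple ratio, gain = 1 − after / before. Here is a short sketch that recomputes them from the character and token counts in the comparison output. The `gain` function name is only illustrative.

```python
def gain(before, after):
    """Relative reduction from `before` to `after`, in percent (1 decimal)."""
    return round((1 - after / before) * 100, 1)


# (chars, tokens) as reported in the comparison above.
sizes = {
    "Formatted JSON": (1122, 456),
    "Compact JSON": (665, 216),
    "TSV": (335, 137),
}

comparisons = [
    ("Formatted JSON", "Compact JSON"),
    ("Compact JSON", "TSV"),
    ("Formatted JSON", "TSV"),
]

gains = {}
for source, target in comparisons:
    chars_gain = gain(sizes[source][0], sizes[target][0])
    tokens_gain = gain(sizes[source][1], sizes[target][1])
    gains[(source, target)] = (chars_gain, tokens_gain)
    print(f"{source} → {target}: chars {chars_gain}%, tokens {tokens_gain}%")
```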

Now, let's finalize our code with optimized structures…


🚀 Finalization

🧩 Structure

First, it helps to define a structured prompt template, so we can focus on specific parts of our solution using a divide-and-conquer approach.

Here's a possible prompt template:

KNOWLEDGE_GRAPH_PROMPT_TEMPLATE = """
**Data Schema**

{data_schema}

**Instructions**

{instructions}

**Output Format**

{output_format}
"""
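Since the template uses `{…}` placeholders, one straightforward way to fill it is `str.format`. The sketch below assumes that approach. The `build_prompt` function and the short placeholder values are hypothetical, and the notebook's own helper may assemble the prompt differently.

```python
KNOWLEDGE_GRAPH_PROMPT_TEMPLATE = """
**Data Schema**

{data_schema}

**Instructions**

{instructions}

**Output Format**

{output_format}
"""


def build_prompt(data_schema, instructions, output_format):
    """Hypothetical helper: fill the three template sections."""
    return KNOWLEDGE_GRAPH_PROMPT_TEMPLATE.format(
        data_schema=data_schema.strip(),
        instructions=instructions.strip(),
        output_format=output_format.strip(),
    ).strip()


# Short stand-in values, just to show the assembled structure.
prompt = build_prompt(
    data_schema="Entity:\n- `id`: Unique integer identifier.",
    instructions="- Extract every distinct entity.",
    output_format="Format the output strictly as two TSV code blocks.",
)
print(prompt)
```

Each section can then be refined independently without touching the others.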

Then, here are some possible Entity, Relationship, and KnowledgeGraph data classes with the matching output format:

from dataclasses import dataclass, field


@dataclass
class Entity:
    id: int
    name: str
    label: str


@dataclass
class Relationship:
    source_id: int
    link: str
    target_id: int


@dataclass
class KnowledgeGraph:
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)


TAB = "\t"
KNOWLEDGE_GRAPH_OUTPUT_FORMAT = f"""
Format the output strictly as two TSV code blocks (including the header row):

```tsv filename="entities.tsv"
id{TAB}name{TAB}label
[data_rows]
```

```tsv filename="relationships.tsv"
source_id{TAB}link{TAB}target_id
[data_rows]
```
"""

💡 While the Gen AI SDK natively supports Pydantic models for JSON structured outputs, we're using standard Python data classes here and TSV outputs to maximize our token efficiency.

ℹ️ If you use multiple entity or relationship data classes in your solution, you can dynamically generate the output format specification using features of the dataclasses module (like class docstrings and field descriptions).
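As a minimal sketch of that idea, the header row of each TSV block can be derived from the data class itself with `dataclasses.fields`. The `tsv_block_spec` function is hypothetical and only covers field names, not docstrings or descriptions.

```python
from dataclasses import dataclass, fields


@dataclass
class Entity:
    id: int
    name: str
    label: str


@dataclass
class Relationship:
    source_id: int
    link: str
    target_id: int


TAB = "\t"
FENCE = "`" * 3  # built this way to avoid a literal fence inside the example


def tsv_block_spec(cls, filename):
    """Hypothetical helper: TSV block specification derived from a data class."""
    header = TAB.join(f.name for f in fields(cls))
    return "\n".join([f'{FENCE}tsv filename="{filename}"', header, "[data_rows]", FENCE])


output_format = "\n\n".join([
    "Format the output strictly as two TSV code blocks (including the header row):",
    tsv_block_spec(Entity, "entities.tsv"),
    tsv_block_spec(Relationship, "relationships.tsv"),
])
print(output_format)
```

With this approach, adding a field to a data class automatically updates the requested header row.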


And here is a possible data schema with some instructions to generate a knowledge graph for our book analysis use case:

from enum import StrEnum, auto


class BookAnalysisEntityLabel(StrEnum):
    PERSON = auto()
    ANIMAL = auto()
    ORGANIZATION = auto()


def pipe_delimited_union(enum: type[StrEnum]) -> str:
    return "|".join(f"`{e.value}`" for e in enum)


BOOK_ANALYSIS_DATA_SCHEMA = f"""
Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Most complete name as exclusively determined from the input data.
- `label`: {pipe_delimited_union(BookAnalysisEntityLabel)}.

Relationship:
- `source_id`: `id` of the subject entity.
- `link`: `snake_case` predicate.
- `target_id`: `id` of the object entity.
"""

BOOK_ANALYSIS_INSTRUCTIONS = """
- Extract every distinct entity:
 - Treat distinct pseudonyms/identities as separate entities.
 - Include implied entities whose names can be exclusively determined from the input data.
- Extract every distinct relationship between them:
 - Use specific `link` predicates in `snake_case` as needed (e.g., `alias_of`, `son_of`, `fiancée_of`, `friend_of`, `murderer_of`, `employer_of`, `in_love_with`, `rival_of`).
 - If a relationship changes over time, make sure to include every distinct stage of the relationship.
 - For every asymmetric relationship extracted, make sure to include the logical inverse relationship (e.g., `A husband_of B` AND `B wife_of A`, `A employer_of B` AND `B employee_of A`).
 - For every symmetric relationship extracted, make sure to include both directions (e.g., `A friend_of B` AND `B friend_of A`).
"""

Verify the structured prompt:

show_knowledge_graph_prompt(
    BOOK_ANALYSIS_DATA_SCHEMA,
    BOOK_ANALYSIS_INSTRUCTIONS,
    text,
    show_as=ShowAs.TEXT,
)
------------------------------------ Prompt ------------------------------------
==Start of input data==
- Henry Jones is a famous archaeologist. He is actually a "Junior" because he is named after his father.
- Sophie is Henry's daughter, shares his last name, and works as a software engineer.
- William Smith is an aerospace engineer and Sophie's lifelong friend. Everybody calls him Bill and Beau is his dog.
- Short Round met Henry as a child. They first became close friends, and Henry officially adopted him a few years later.
- Sophie and Bill both work at Acme Aerospace.
==End of input data==
==Start of user prompt==
**Data Schema**

Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Most complete name as exclusively determined from the input data.
- `label`: `person`|`animal`|`organization`.

Relationship:
- `source_id`: `id` of the subject entity.
- `link`: `snake_case` predicate.
- `target_id`: `id` of the object entity.

**Instructions**

- Extract every distinct entity:
 - Treat distinct pseudonyms/identities as separate entities.
 - Include implied entities whose names can be exclusively determined from the input data.
- Extract every distinct relationship between them:
 - Use specific `link` predicates in `snake_case` as needed (e.g., `alias_of`, `son_of`, `fiancée_of`, `friend_of`, `murderer_of`, `employer_of`, `in_love_with`, `rival_of`).
 - If a relationship changes over time, make sure to include every distinct stage of the relationship.
 - For every asymmetric relationship extracted, make sure to include the logical inverse relationship (e.g., `A husband_of B` AND `B wife_of A`, `A employer_of B` AND `B employee_of A`).
 - For every symmetric relationship extracted, make sure to include both directions (e.g., `A friend_of B` AND `B friend_of A`).

**Output Format**

Format the output strictly as two TSV code blocks (including the header row):

```tsv filename="entities.tsv"
id name label
[data_rows]
```

```tsv filename="relationships.tsv"
source_id link target_id
[data_rows]
```
==End of user prompt==
--------------------------------------------------------------------------------

🧪 Let's generate a knowledge graph:

knowledge_graph = generate_knowledge_graph(
    BOOK_ANALYSIS_DATA_SCHEMA,
    BOOK_ANALYSIS_INSTRUCTIONS,
    text,
    show_response=ShowAs.TEXT,
)

print(knowledge_graph.entities)
print(knowledge_graph.relationships)
----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 534
Output tokens : 244
------------------------------ Start of Response -------------------------------
```tsv filename="entities.tsv"
id name label
0 Henry Jones Jr. person
1 Henry Jones Sr. person
2 Sophie Jones person
3 William Smith person
4 Bill person
5 Beau animal
6 Short Round person
7 Acme Aerospace organization
```

```tsv filename="relationships.tsv"
source_id link target_id
0 son_of 1
1 father_of 0
0 father_of 2
2 daughter_of 0
2 friend_of 3
3 friend_of 2
3 alias_of 4
4 alias_of 3
3 employee_of 7
7 employer_of 3
2 employee_of 7
7 employer_of 2
3 owner_of 5
5 pet_of 3
6 friend_of 0
0 friend_of 6
0 adopted_father_of 6
6 adopted_son_of 0
```
------------------------------- End of Response --------------------------------
----------------------------- Knowledge Graph Info -----------------------------
Entities : 8
Relationships : 18
--------------------------------------------------------------------------------
[Entity(id=0, name='Henry Jones Jr.', label='person'), Entity(id=1, name='Henry Jones Sr.', label='person'), Entity(id=2, name='Sophie Jones', label='person'), Entity(id=3, name='William Smith', label='person'), Entity(id=4, name='Bill', label='person'), Entity(id=5, name='Beau', label='animal'), Entity(id=6, name='Short Round', label='person'), Entity(id=7, name='Acme Aerospace', label='organization')]
[Relationship(source_id=0, link='son_of', target_id=1), Relationship(source_id=1, link='father_of', target_id=0), Relationship(source_id=0, link='father_of', target_id=2), Relationship(source_id=2, link='daughter_of', target_id=0), Relationship(source_id=2, link='friend_of', target_id=3), Relationship(source_id=3, link='friend_of', target_id=2), Relationship(source_id=3, link='alias_of', target_id=4), Relationship(source_id=4, link='alias_of', target_id=3), Relationship(source_id=3, link='employee_of', target_id=7), Relationship(source_id=7, link='employer_of', target_id=3), Relationship(source_id=2, link='employee_of', target_id=7), Relationship(source_id=7, link='employer_of', target_id=2), Relationship(source_id=3, link='owner_of', target_id=5), Relationship(source_id=5, link='pet_of', target_id=3), Relationship(source_id=6, link='friend_of', target_id=0), Relationship(source_id=0, link='friend_of', target_id=6), Relationship(source_id=0, link='adopted_father_of', target_id=6), Relationship(source_id=6, link='adopted_son_of', target_id=0)]

💡 This is looking good!
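The `generate_knowledge_graph` helper isn't shown here, so the following is only a sketch of how the two fenced TSV blocks could be parsed into the data classes, using the standard `re` and `csv` modules. The `parse_knowledge_graph` function and the shortened sample response are hypothetical.

```python
import csv
import re
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: int
    name: str
    label: str


@dataclass
class Relationship:
    source_id: int
    link: str
    target_id: int


@dataclass
class KnowledgeGraph:
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)


FENCE = "`" * 3  # built this way to avoid a literal fence inside the example
TSV_BLOCK = re.compile(
    FENCE + r'tsv filename="(?P<filename>[^"]+)"\n(?P<body>.*?)\n' + FENCE,
    re.DOTALL,
)


def read_rows(body):
    # QUOTE_NONE: quotes inside names are kept as plain characters.
    return csv.DictReader(body.splitlines(), delimiter="\t", quoting=csv.QUOTE_NONE)


def parse_knowledge_graph(response_text):
    """Hypothetical parser for a response made of two fenced TSV blocks."""
    blocks = {m["filename"]: m["body"] for m in TSV_BLOCK.finditer(response_text)}
    graph = KnowledgeGraph()
    for row in read_rows(blocks.get("entities.tsv", "")):
        graph.entities.append(Entity(int(row["id"]), row["name"], row["label"]))
    for row in read_rows(blocks.get("relationships.tsv", "")):
        graph.relationships.append(
            Relationship(int(row["source_id"]), row["link"], int(row["target_id"]))
        )
    return graph


# Shortened sample in the same shape as the response above.
sample_response = "\n".join([
    FENCE + 'tsv filename="entities.tsv"',
    "id\tname\tlabel",
    "0\tHenry Jones Jr.\tperson",
    "1\tHenry Jones Sr.\tperson",
    FENCE,
    "",
    FENCE + 'tsv filename="relationships.tsv"',
    "source_id\tlink\ttarget_id",
    "0\tson_of\t1",
    "1\tfather_of\t0",
    FENCE,
])

parsed = parse_knowledge_graph(sample_response)
print(parsed.entities)
print(parsed.relationships)
```

A production parser would probably also validate that every `source_id` and `target_id` refers to an existing entity.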

Now, let's go to the next stage and build a network graph from our data…


🪢 Network graph

Now that we have our entities and relationships neatly packed into data classes, let's bring them to life. We'll use networkx to build a network graph. Using domain terminology, entities become nodes and relationships become directed edges. We'll also calculate node centralities to identify key nodes and use the Louvain method to detect communities (clusters of closely related nodes)…

Let's test this:

graph_data = GraphData(knowledge_graph)

print(f"{graph_data.graph = !s}")
print(f"{graph_data.nodes = }")
graph_data.graph = DiGraph with 8 nodes and 16 edges
graph_data.nodes = ['2_sophie_jones', '3_william_smith', '7_acme_aerospace', '5_beau', '4_bill', '0_henry_jones_jr.', '1_henry_jones_sr.', '6_short_round']
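One detail worth noting: the knowledge graph has 18 relationships, but the graph reports 16 edges. A likely explanation, assuming `GraphData` adds one directed edge per (source, target) pair, is that parallel relationships between the same two entities collapse into a single edge. The sketch below reproduces this with the standard library only. The degree centrality shown is one simple choice and may differ from the centrality the notebook actually computes.

```python
from collections import defaultdict

# Relationships from the generated knowledge graph: (source_id, link, target_id).
relationships = [
    (0, "son_of", 1), (1, "father_of", 0),
    (0, "father_of", 2), (2, "daughter_of", 0),
    (2, "friend_of", 3), (3, "friend_of", 2),
    (3, "alias_of", 4), (4, "alias_of", 3),
    (3, "employee_of", 7), (7, "employer_of", 3),
    (2, "employee_of", 7), (7, "employer_of", 2),
    (3, "owner_of", 5), (5, "pet_of", 3),
    (6, "friend_of", 0), (0, "friend_of", 6),
    (0, "adopted_father_of", 6), (6, "adopted_son_of", 0),
]

# One directed edge per (source, target) pair, keeping every link label.
edges = defaultdict(list)
for source, link, target in relationships:
    edges[(source, target)].append(link)

nodes = {node for pair in edges for node in pair}
print(f"{len(relationships)} relationships, {len(nodes)} nodes, {len(edges)} edges")

# Edges carrying more than one relationship (Henry and Short Round here).
print({pair: links for pair, links in edges.items() if len(links) > 1})

# Degree centrality: (in-degree + out-degree) / (number of nodes - 1).
degree = defaultdict(int)
for source, target in edges:
    degree[source] += 1
    degree[target] += 1
centrality = {node: degree[node] / (len(nodes) - 1) for node in nodes}
print(max(centrality, key=centrality.get))  # id of the most connected node
```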

🎨 Data visualization

The extracted data is much easier to understand when you can actually see it! We can use matplotlib to draw our network graphs. We'll size the nodes based on their centrality and color-code them by community. To make it even easier to digest, we'll generate an animated sequence highlighting each character's connections one by one…


Let's test this:

display_knowledge_graph(knowledge_graph, text)

[Image: network graph visualization]

💡 We can now quickly visualize and understand our knowledge graphs. This helps us iterate faster when refining prompts.

ℹ️ While this simple approach is great for a quick overview, you might want to swap it out for more specialized libraries if you want to explore the graph interactively.


✅ Challenge completed

Let's define a book analysis function:

def analyze_book(
    source: Source,
    model: Model | None = None,
    animated: bool = False,
) -> None:
    extract_knowledge_graph(
        BOOK_ANALYSIS_DATA_SCHEMA,
        BOOK_ANALYSIS_INSTRUCTIONS,
        source,
        model,
        domain="Character Connections",
        animated=animated,
    )

🧪 First, let's build a knowledge graph based on Zola's Thérèse Raquin:

analyze_book(Classic.fr_zola_thérèse_raquin)

Input data (fr_zola_thérèse_raquin)

----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 119,518
Cached tokens : 118,753
Output tokens : 357
----------------------------- Knowledge Graph Info -----------------------------
Entities : 11
Relationships : 28
--------------------------------------------------------------------------------

[Image: network graph visualization]

💡 We can extract and understand the book's narrative in seconds: The love triangle between Thérèse, Camille, and Laurent clearly stands out. Despite being over 200 pages long (and 100k+ tokens), this novel is incredibly minimalistic, revolving around a limited set of characters, which is reflected in the network graph. Adding locations to the extracted entities would also reveal the claustrophobic atmosphere of the book in the resulting knowledge graph.

ℹ️ We used the original French version with English instructions, which works seamlessly. If you translate the instructions, you can also generate dynamic relationship links in different languages (see the 100+ supported languages).


🧪 Let's see how the model handles the interconnected cast of Hugo's Les Misérables:

analyze_book(
    Classic.en_hugo_les_misérables,
    model=Model.GEMINI_3_5_FLASH,
)

Input data (en_hugo_les_misérables)

-------------------------- Request / gemini-3.5-flash --------------------------
Input tokens : 803,328
Output tokens : 2,908
----------------------------- Knowledge Graph Info -----------------------------
Entities : 65
Relationships : 225
--------------------------------------------------------------------------------

[Figure: knowledge graph of character connections in Les Misérables]

💡 We quickly get the gist of the novel.

ℹ️ Donald Knuth famously used Les Misérables as an example back in 1994 (see the Stanford GraphBase). The novel has been consistently used in Natural Language Processing (NLP) because of its massive scale, linguistic complexity, and multiple high-quality translations. At 800k+ tokens, this book is still a true stress test. Note that we used Gemini 3.5 Flash here (instead of 3.1 Flash-Lite): larger models can infer more and perform deeper consolidation.


🧪 Now, let's process just the first volume of Le Comte de Monte-Cristo (in French):

analyze_book(Classic.fr_dumas_comte_de_monte_cristo_1)

Input data (fr_dumas_comte_de_monte_cristo_1)

----------------------- Request / gemini-3.1-flash-lite ------------------------
Input tokens : 215,582
Output tokens : 713
----------------------------- Knowledge Graph Info -----------------------------
Entities : 36
Relationships : 36
--------------------------------------------------------------------------------

[Figure: knowledge graph of character connections in Le Comte de Monte-Cristo, volume one]

💡 The graph shows the initial setup of Edmond Dantès' world, his friends, and his betrayers. The love triangle between Edmond, Mercédès, and Fernand also comes to light.

ℹ️ Note that the antagonist is referred to only as Fernand, which is expected. We learn his full name, Fernand Mondego, only in volume three. In the prompt, we asked the model to "extract" entities and determine names "from the input data" to avoid generating memorized knowledge. To detect regressions when updating the prompt, we can use this as a unit test to ensure that our response actually provides extracted data and not memorized information.
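
The regression check described above could be sketched as follows. This is a minimal sketch and not the notebook's code: `find_memorized_names` and the sample entity names are hypothetical, and we assume the extracted entity names are available as plain strings.

```python
def find_memorized_names(entity_names: list[str], forbidden: list[str]) -> list[str]:
    """Return the extracted names containing a term the input data never states."""
    lowered_terms = [term.casefold() for term in forbidden]
    return [
        name
        for name in entity_names
        if any(term in name.casefold() for term in lowered_terms)
    ]


# Hypothetical entity names, as an extraction from volume one alone might return them
volume_1_names = ["Edmond Dantès", "Mercédès", "Fernand", "Danglars"]

# Per the note above, "Mondego" is only revealed in volume three
leaked = find_memorized_names(volume_1_names, forbidden=["Mondego"])
assert leaked == [], f"Memorized knowledge detected: {leaked}"
```

Running such a check after each prompt change gives an early signal when the model starts completing names from memory instead of extracting them from the input.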


🧪 And finally, let's analyze the entire saga of Le Comte de Monte-Cristo, all four volumes at once (800k+ tokens):

analyze_book(
    Collection.fr_dumas_comte_de_monte_cristo,
    model=Model.GEMINI_3_5_FLASH,
    animated=True,
)

Input data (fr_dumas_comte_de_monte_cristo_1, fr_dumas_comte_de_monte_cristo_2, fr_dumas_comte_de_monte_cristo_3, fr_dumas_comte_de_monte_cristo_4)

-------------------------- Request / gemini-3.5-flash --------------------------
Input tokens : 840,385
Cached tokens : 835,551
Output tokens : 2,659
----------------------------- Knowledge Graph Info -----------------------------
Entities : 50
Relationships : 212
--------------------------------------------------------------------------------

[Figure: animated knowledge graph of character connections in Le Comte de Monte-Cristo, all four volumes]

💡 The multiple aliases of Edmond Dantès (and other characters) are nicely extracted. The complex plot of this book is built on characters juggling multiple identities in a psychological chess match.

🎉 This shows how we can complete the challenge in a single request, with a compact output (fewer than 3,000 output tokens for 840k+ input tokens) and zero thinking tokens, while providing multiple levels of consolidation.

ℹ️ Note that this is a proof of concept. For an exhaustive extraction in a fully professional solution, we would probably set up a multi-step workflow and consider the following:

  • As a preliminary step, cache the input data (cached tokens get a 90% discount; learn more about context caching).
  • In the first request, based on the cached content, focus on extracting the entities (possibly extending the entity labels with location, date…), and then save them to a database.
  • In the second request, based on the cached content and the entity table, focus on extracting the stated relationships using additional fields (such as evidence or excerpt to serve as direct proof, or chapter or page for source attribution). In this step, we could also dig deeper and differentiate literal/figurative relationships (e.g., biological vs. figurative parents).
  • In subsequent requests, focus on additional extractions (e.g., interactions between entities) or specific consolidations (e.g., adding inverse relationships, classifying the relationships or interactions into categories).
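
One of the consolidations listed above, adding inverse relationships, can also be done locally once the relationships are extracted. The following is a minimal sketch under stated assumptions: the `INVERSE_LINKS` mapping, the predicate names, and the tuple layout are hypothetical and not taken from the notebook.

```python
# Hypothetical mapping from a predicate to its inverse (an assumption for illustration)
INVERSE_LINKS = {
    "parent_of": "child_of",
    "child_of": "parent_of",
    "employer_of": "employee_of",
    "employee_of": "employer_of",
}

Relationship = tuple[int, str, int]  # (source_id, link, target_id)


def add_inverse_relationships(relationships: list[Relationship]) -> list[Relationship]:
    """Return the relationships plus any missing inverse counterparts."""
    seen = set(relationships)
    result = list(relationships)
    for source_id, link, target_id in relationships:
        inverse_link = INVERSE_LINKS.get(link)
        if inverse_link is None:
            continue  # No known inverse for this predicate
        inverse = (target_id, inverse_link, source_id)
        if inverse not in seen:
            seen.add(inverse)
            result.append(inverse)
    return result


# Made-up sample: the first two rows already mirror each other, the third does not
extracted = [(0, "parent_of", 1), (1, "child_of", 0), (2, "employer_of", 0)]
consolidated = add_inverse_relationships(extracted)
```

Doing this step in code rather than in a request keeps the result deterministic, at the cost of maintaining the inverse mapping by hand.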

🔭 Generalization

With our current setup, generalizing to other types of content is as simple as adapting our data schema and instructions.

For example, legal agreements are among the densest types of documents. They're usually made up of many articles and clauses, with every sentence carrying specific legal weight, outlining obligations, or providing definitions. But what kind of knowledge graph do we want to build from a legal agreement?


🧪 Let's test a minimalistic, open-ended prompt:

source = Document.en_pharma_dev_agreement
model = Model.GEMINI_3_5_FLASH

AGREEMENT_OPEN_DATA_SCHEMA = """
Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Name of the entity.
- `label`: Entity type.

Relationship:
- `source_id`: `id` of the subject entity.
- `link`: `snake_case` predicate.
- `target_id`: `id` of the object entity.
"""

AGREEMENT_OPEN_INSTRUCTIONS = """
Perform a comprehensive, highly-granular entity/relationship extraction.
"""

extract_knowledge_graph(
    AGREEMENT_OPEN_DATA_SCHEMA,
    AGREEMENT_OPEN_INSTRUCTIONS,
    source,
    model,
    domain="Agreement High-Level Extraction",
)

Input data (en_pharma_dev_agreement)

-------------------------- Request / gemini-3.5-flash --------------------------
Input tokens : 51,276
Output tokens : 221
----------------------------- Knowledge Graph Info -----------------------------
Entities : 10
Relationships : 9
--------------------------------------------------------------------------------

[Figure: knowledge graph of the agreement's high-level extraction]

💡 Remarks

  • Notice how we just passed a PDF directly! Gemini processes the document natively, automatically extracting text and images while performing OCR along the way.
  • This minimalistic prompt gives the model the greatest degree of freedom, but the model can't guess exactly what data you want or how you plan to use it. As a result, only high-level entities and relationships are extracted, providing a nice summary of the agreement.
  • LLMs are trained to synthesize information. With such open-ended instructions, the model defaults to summarizing, which avoids generating unnecessary tokens.

🧪 Then, let's try semi-open instructions focusing on legal obligations:

class AgreementEntityLabel(StrEnum):
    PARTY = auto()
    PERSON = auto()
    ROLE = auto()
    LOCATION = auto()
    JURISDICTION = auto()
    FINANCIAL_AMOUNT = auto()
    DATE = auto()
    EVENT = auto()
    ASSET = auto()
    PRODUCT = auto()
    INTELLECTUAL_PROPERTY = auto()
    OBLIGATION_TYPE = auto()


AGREEMENT_SEMI_OPEN_DATA_SCHEMA = f"""
Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Name of the entity.
- `label`: {pipe_delimited_union(AgreementEntityLabel)}.

Relationship: Obligation, right, or transfer from a source entity to a target entity.
- `source_id`: The `id` of the subject entity.
- `link`: Specific `snake_case` predicate.
- `target_id`: The `id` of the object entity.
"""

AGREEMENT_SEMI_OPEN_INSTRUCTIONS = """
- Analyze the input data for every covenant (e.g., "shall", "will", "must", "is obligated to") and perform an exhaustive extraction.
- Make sure to deconstruct complex obligations: For complex clauses (e.g., "A shall pay B $X Million within Y days of the Effective Date"), extract:
 - The primary obligation: `[A] is_obligated_to_pay [B]`
 - The value: `[A's obligation] subject_to [$X Million]`
 - The trigger: `[A's obligation] triggered_by [Effective Date]`
"""

extract_knowledge_graph(
    AGREEMENT_SEMI_OPEN_DATA_SCHEMA,
    AGREEMENT_SEMI_OPEN_INSTRUCTIONS,
    source,
    model,
    domain="Agreement Obligations",
)

Input data (en_pharma_dev_agreement)

-------------------------- Request / gemini-3.5-flash --------------------------
Input tokens : 51,466
Output tokens : 2,002
----------------------------- Knowledge Graph Info -----------------------------
Entities : 85
Relationships : 93
--------------------------------------------------------------------------------

[Figure: knowledge graph of the agreement's obligations]

💡 Remarks

  • This semi-open prompt lists the types of entities to extract, which naturally increases the number of extracted entities and their specificity.
  • The resulting graph is denser and reflects the legal complexity of the document, rather than just giving us a high-level summary.
  • The extracted relationships are still rather high-level because the relationship part of our prompt remains open-ended. For more specific extractions, it's possible to extend the entity and relationship data schemas with additional fields, or even to request specific tabular outputs.
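
As one illustration of extending the relationship schema with additional fields, each row could carry an excerpt that is then checked against the source text. This is a sketch under assumptions: the `ObligationRow` class, its `clause` and `evidence` fields, and the sample text are hypothetical and not part of the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObligationRow:
    source_id: int
    link: str
    target_id: int
    clause: str  # Hypothetical extra field: where the obligation is stated
    evidence: str  # Hypothetical extra field: excerpt expected to be verbatim


def normalize(text: str) -> str:
    """Collapse whitespace so line breaks in the document don't break matching."""
    return " ".join(text.split())


def unsupported_rows(rows: list[ObligationRow], document_text: str) -> list[ObligationRow]:
    """Return the rows whose evidence can't be found verbatim in the document."""
    haystack = normalize(document_text)
    return [row for row in rows if normalize(row.evidence) not in haystack]


# Made-up sample text and rows, for illustration only
document_text = "2.1 Company A shall pay Company B\n$5 Million within 30 days of the Effective Date."
rows = [
    ObligationRow(0, "is_obligated_to_pay", 1, "2.1", "Company A shall pay Company B $5 Million"),
    ObligationRow(1, "is_obligated_to_deliver", 0, "2.2", "Company B shall deliver the Product"),
]
suspicious = unsupported_rows(rows, document_text)  # Only the second row lacks support
```

A verbatim match is a strict test. It flags paraphrased excerpts as well as invented ones, so the flagged rows are best treated as candidates for review rather than as errors.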

What if we care less about legal obligations than about the document's architecture? Let's shift our focus to the structure itself and extract how sections, clauses, and defined terms are hierarchically organized…


🧪 And now, let's test closed instructions focusing on the document structure:

class AgreementStructureEntityLabel(StrEnum):
    DEFINED_TERM = auto()
    DOCUMENT_SECTION = auto()
    DOCUMENT = auto()


class AgreementStructureRelationshipType(StrEnum):
    DEFINED_IN = auto()
    CONTAINS = auto()

AGREEMENT_STRUCTURAL_DATA_SCHEMA = f"""
Entity:
- `id`: Unique integer identifier (0, 1, 2…).
- `name`: Name of the entity.
- `label`: {pipe_delimited_union(AgreementStructureEntityLabel)}.

Relationship: Connection from a source entity to a target entity.
- `source_id`: The `id` of the subject entity.
- `link`: {pipe_delimited_union(AgreementStructureRelationshipType)}.
- `target_id`: The `id` of the object entity.
"""

AGREEMENT_STRUCTURAL_OPEN_INSTRUCTIONS = """
- Extract every distinct entity that matches an allowed `label`.
- Extract every distinct relationship representing a structural connection (hierarchical organization) between these entities:
 - You must be comprehensive and highly granular. If multiple distinct relationships exist between the same pair of entities, create a separate entry for each.
"""

extract_knowledge_graph(
    AGREEMENT_STRUCTURAL_DATA_SCHEMA,
    AGREEMENT_STRUCTURAL_OPEN_INSTRUCTIONS,
    source,
    model,
    domain="Agreement Structure",
)

Input data (en_pharma_dev_agreement)

-------------------------- Request / gemini-3.5-flash --------------------------
Input tokens : 51,351
Output tokens : 9,108
----------------------------- Knowledge Graph Info -----------------------------
Entities : 314
Relationships : 513
--------------------------------------------------------------------------------

[Figure: knowledge graph of the agreement's structure]

💡 If you extract hundreds of entities from a massive document, your graph will quickly turn into an unreadable hairball. For larger datasets, you'll want to export your nodes and edges to a dedicated graph database, which typically comes with its own visualization and exploration tools.
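
Many graph databases can bulk-import nodes and edges from CSV files, so a simple export may be enough to get started. Here is a minimal sketch using only the standard library. The `export_graph_csv` function and the sample rows are hypothetical, and the column names are assumed to mirror the fields of our data schema.

```python
import csv
import io


def export_graph_csv(entities: list[dict], relationships: list[dict]) -> tuple[str, str]:
    """Serialize entities and relationships as two CSV strings (nodes, edges)."""
    nodes_buffer = io.StringIO()
    node_writer = csv.DictWriter(
        nodes_buffer, fieldnames=["id", "name", "label"], lineterminator="\n"
    )
    node_writer.writeheader()
    node_writer.writerows(entities)

    edges_buffer = io.StringIO()
    edge_writer = csv.DictWriter(
        edges_buffer, fieldnames=["source_id", "link", "target_id"], lineterminator="\n"
    )
    edge_writer.writeheader()
    edge_writer.writerows(relationships)

    return nodes_buffer.getvalue(), edges_buffer.getvalue()


# Made-up sample rows, for illustration only
entities = [
    {"id": 0, "name": "Agreement", "label": "document"},
    {"id": 1, "name": "Article 1", "label": "document_section"},
]
relationships = [{"source_id": 0, "link": "contains", "target_id": 1}]
nodes_csv, edges_csv = export_graph_csv(entities, relationships)
```

The two strings can then be written to files and loaded with the import tool of the chosen database. The exact import syntax depends on the product.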


🏁 Conclusion

We successfully extracted data and built knowledge graphs from documents by following these steps:

  • Prototyping with open prompts to develop an intuition for the model's natural strengths
  • Crafting increasingly specific prompts using a tabular-extraction strategy
  • Structuring our inputs to move towards production-ready and generalizable code
  • Structuring and optimizing our outputs for faster and cheaper generation
  • Adding data visualization for easier interpretation of responses and smoother iterations
  • Conducting more tests, iterating, and enriching the extracted data

These principles apply to many other data-extraction domains and will allow you to solve your own complex problems. Have fun and happy solving!


➕ More!