Extract Plain Text from Medium Posts for RAG and Search Indexes
HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.
Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB.
Pipeline
- Discover ids via user feed or search.
- GET /article/{id}/content → plain text.
- Optional: GET /article/{id} for title, tags, author metadata.
- Chunk → embed → upsert vector store.
- Query in your chat UI or internal search.
Ingest script
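const API = 'https://api.zenndra.com';
const headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };
// Fetch the plain-text body and the metadata for one article in parallel.
export async function fetchArticleText(articleId) {
  const [contentRes, metaRes] = await Promise.all([
    fetch(`${API}/article/${articleId}/content`, { headers }),
    fetch(`${API}/article/${articleId}`, { headers }),
  ]);
  // Fail loudly on auth errors, unknown ids, or rate limits instead of indexing empty text.
  if (!contentRes.ok || !metaRes.ok) {
    throw new Error(`Fetch failed for ${articleId}: ${contentRes.status}/${metaRes.status}`);
  }
  const { content } = await contentRes.json();
  const meta = await metaRes.json();
  return {
    id: articleId,
    title: meta.title,
    tags: meta.tags,
    text: content,
  };
}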
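// Split text into overlapping chunks; size and overlap are counted in words.
export function chunkText(text, { size = 800, overlap = 100 } = {}) {
  // A step of zero or less would loop forever.
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    // Join with a space so word boundaries survive chunking.
    chunks.push(words.slice(i, i + size).join(' '));
  }
  return chunks;
}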
Pass the output of chunkText to OpenAI embeddings, Ollama, or your host’s model; swap the vector client, keep the ingest shape.
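One way to keep the ingest shape stable while swapping clients is to hide the embeddings model and the vector DB behind two small interfaces. The following is a minimal sketch, not a definitive implementation: `Embedder`, `VectorStore`, `ChunkRecord`, and `ingestChunks` are hypothetical names introduced for illustration, and the chunks are assumed to come from a chunker such as chunkText above.

```typescript
// Hypothetical interfaces: adapt your real embeddings client and vector DB to these.
export interface ChunkRecord {
  id: string;
  vector: number[];
  text: string;
  metadata: { article_id: string; chunk_index: number };
}

// Takes a batch of texts, returns one vector per text in the same order.
export type Embedder = (texts: string[]) => Promise<number[][]>;

export interface VectorStore {
  upsert(records: ChunkRecord[]): Promise<void>;
}

// Embed pre-chunked text in batches and upsert one record per chunk.
// Returns the number of records written.
export async function ingestChunks(
  articleId: string,
  chunks: string[],
  embed: Embedder,
  store: VectorStore,
  batchSize = 64,
): Promise<number> {
  let written = 0;
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error(`embedder returned ${vectors.length} vectors for ${batch.length} chunks`);
    }
    const records: ChunkRecord[] = batch.map((text, i) => ({
      // Deterministic id, so re-ingesting the same article overwrites instead of duplicating.
      id: `${articleId}:${start + i}`,
      vector: vectors[i],
      text,
      metadata: { article_id: articleId, chunk_index: start + i },
    }));
    await store.upsert(records);
    written += records.length;
  }
  return written;
}
```

Because both dependencies are passed in as arguments, the same function can be exercised with in-memory fakes before pointing it at a hosted model or a real vector store.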
Chunking tips
- Include title + tags in the embedding preamble for better retrieval.
- Store article_id and chunk_index in metadata for citations.
- Deduplicate re-ingest with content hash if posts are edited rarely.
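The three tips above can be sketched together as follows. This is a minimal sketch under stated assumptions: `ArticleInput`, `contentHash`, `buildChunkDocs`, and `needsReingest` are hypothetical names, the input shape mirrors the object returned by fetchArticleText with `tags` assumed to be an array of strings, and SHA-256 from Node's `node:crypto` module is one reasonable choice of content hash rather than anything the API prescribes.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape, mirroring what fetchArticleText returns (tags assumed to be strings).
export interface ArticleInput {
  id: string;
  title: string;
  tags: string[];
  text: string;
}

// Stable fingerprint of the article body.
export function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// One document per chunk: the preamble goes into the embedded text,
// while article_id and chunk_index go into metadata for citations.
export function buildChunkDocs(article: ArticleInput, chunks: string[]) {
  const preamble = `Title: ${article.title}\nTags: ${article.tags.join(', ')}`;
  const hash = contentHash(article.text);
  return chunks.map((chunk, chunk_index) => ({
    embedText: `${preamble}\n\n${chunk}`,
    metadata: { article_id: article.id, chunk_index, content_hash: hash },
  }));
}

// True when the body differs from the hash stored at the last ingest
// (or when no hash has been stored yet).
export function needsReingest(article: ArticleInput, storedHash?: string): boolean {
  return contentHash(article.text) !== storedHash;
}
```

Hashing the whole body rather than each chunk keeps the check to one comparison per article; an edited post then re-embeds every chunk, which is an acceptable cost when edits are rare.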
Compliance (non-optional)
- Respect Medium’s Terms of Service and author rights.
- Many teams only index their own posts or licensed partners.
- Do not expose paywalled or member-only content through public bots without permission.
For human-readable syndication, see embed articles—different threat model than LLM training.
Keywords
medium plain text api, medium rag pipeline, medium embeddings, medium article content extraction, llm medium.
