PDF Text Extractor

Pricing

Pay per usage

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

Developer

Jiří Moravčík

Maintained by Community

Actor stats

Total users: 1.1K

Last modified: a year ago

Input

URLs - URLs of the PDF files to extract text from.
Chunk size - The maximum size of a single chunk of text.
Chunk overlap - The number of characters shared between neighbouring chunks of text.
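
To illustrate how chunk size and chunk overlap interact, here is a minimal sliding-window sketch. It is not the Actor's actual splitting algorithm, which the description does not specify. The function name chunk_text is hypothetical, and the sketch assumes chunk size is measured in characters, like the overlap.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    Assumption: sizes are counted in characters (illustrative only).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same document.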

Output

Each output item contains the URL of the source PDF (url), an index that identifies the chunk's position in the extracted text (index), and the extracted text itself (text).

Sample output

[{
"url":"https://arxiv.org/pdf/2307.12856.pdf",
"index":0,
"text":"Preprint\nA REAL-WORLD WEBAGENT WITH PLANNING,\nLONG CONTEXT UNDERSTANDING, AND\nPROGRAM SYNTHESIS\nIzzeddin Gur1∗ Hiroki Furuta1,2∗† Austin Huang1 Mustafa Safdari1 Yutaka Matsuo2\nDouglas Eck1 Aleksandra Faust1\n1Google DeepMind, 2The University of Tokyo\nizzeddin@google.com, furuta@weblab.t.u-tokyo.ac.jp\nABSTRACT\nPre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We introduce\nWebAgent, an LLM-driven agent that learns from self-experience to complete tasks\non real websites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via Python programs"
},
{
"url":"https://arxiv.org/pdf/2307.12856.pdf",
"index":1,
"text":"generated from those. We design WebAgent with Flan-U-PaLM, for grounded code\ngeneration, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span denoising\nobjectives, for planning and summarization. We empirically demonstrate that our\nmodular recipe improves the success on real websites by over 50%, and that HTMLT5 is the best model to solve various HTML understanding tasks; achieving 18.7%\nhigher success rate than the prior method on MiniWoB web automation benchmark,\nand SoTA performance on Mind2Web, an offline task planning evaluation.\n1 INTRODUCTION\nLarge language models (LLM) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023) can\nsolve variety of natural language tasks, such as arithmetic, commonsense, logical reasoning, question\nanswering, text generation (Brown et al., 2020; Kojima et al., 2022; Wei et al., 2022), and even"
},
{
"url":"https://arxiv.org/pdf/2307.12856.pdf",
"index":2,
"text":"interactive decision making tasks (Ahn et al., 2022; Yao et al., 2022b). Recently, LLMs have also\ndemonstrated success in autonomous web navigation, where the agents control computers or browse\nthe internet to satisfy the given natural language instructions through the sequence of computer\nactions, by leveraging the capability of HTML comprehension and multi-step reasoning (Furuta et al.,\n2023; Gur et al., 2022; Kim et al., 2023).\nHowever, web automation on real-world websites has still suffered from (1) the lack of pre-defined\naction space, (2) much longer HTML observations than simulators, and (3) the absence of domain\nknowledge for HTML in LLMs (Figure 1). Considering the open-ended real-world websites and the\ncomplexity of instructions, defining appropriate action space in advance is challenging. In addition,\nalthough several works have argued that recent LLMs with instruction-finetuning or reinforcement"
}]
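
Because every item carries its source URL and an index, the chunks can be put back in document order after download. The following is a minimal sketch, assuming the items have already been loaded as a list of dictionaries with the url, index, and text keys shown in the sample. The helper name group_chunks_by_url is hypothetical.

```python
from collections import defaultdict


def group_chunks_by_url(items: list[dict]) -> dict[str, list[str]]:
    """Group chunk texts by source URL, ordered by their index."""
    grouped = defaultdict(list)
    # Sorting by (url, index) restores document order within each PDF,
    # even if the items arrive shuffled.
    for item in sorted(items, key=lambda it: (it["url"], it["index"])):
        grouped[item["url"]].append(item["text"])
    return dict(grouped)


# Shortened stand-ins for real output items.
items = [
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "index": 1, "text": "second chunk"},
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "index": 0, "text": "first chunk"},
]
for url, chunks in group_chunks_by_url(items).items():
    print(url, len(chunks))
```

Note that if a non-zero chunk overlap was used, simply concatenating the grouped chunks would repeat the overlapping characters.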

How to use PDF Text Extractor

Follow this tutorial to learn how to use PDF Text Extractor and combine it with LangChain to build an intelligent QA system that can extract answers from PDF documents.
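
Before the extracted chunks can be passed to LangChain or any other framework, the Actor has to be run and its dataset read. The sketch below uses the apify-client Python package for that. The input key names (urls, chunkSize, chunkOverlap) are assumptions derived from the input labels above, not confirmed names, so check the Actor's input schema before relying on them. The helper names build_run_input and fetch_chunks are hypothetical.

```python
def build_run_input(urls: list[str], chunk_size: int = 1000,
                    chunk_overlap: int = 0) -> dict:
    """Build the Actor input. Key names are assumed, not confirmed."""
    return {
        "urls": list(urls),
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
    }


def fetch_chunks(token: str, run_input: dict) -> list[dict]:
    """Run the Actor and return its dataset items (url, index, text)."""
    # Imported lazily so the module loads without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("jirimoravcik/pdf-text-extractor").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish successfully")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://arxiv.org/pdf/2307.12856.pdf"],
                            chunk_size=1000, chunk_overlap=100)
# chunks = fetch_chunks("<YOUR_APIFY_TOKEN>", run_input)  # needs a real API token
```

The network call is left commented out because it requires an Apify API token and consumes paid usage.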

URL: https://apify.com/jirimoravcik/pdf-text-extractor
