URL: https://apify.com/consummate_hickory/s3filetomarkdown

S3 to Markdown · Apify


Pricing

$10.00 / 1,000 extractions

Transform S3 documents into perfect AI training data! Converts PDFs, Word, Excel, images, audio to clean Markdown that LLMs love. Uses Microsoft's markitdown engine. Ideal for RAG systems, AI agents, and machine learning pipelines.

Rating

5.0 (2)

Developer

๐Ÿ‘ Lorenzo Dalmazzo

Lorenzo Dalmazzo

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 4

Monthly active users: 2

Last modified: a year ago

S3 File to Markdown Converter

This Apify Actor downloads multiple files from Amazon S3 and converts them to Markdown using markitdown.

Features

  • Bulk Processing: Process multiple files in a single run for efficiency
  • Downloads files from S3 buckets
  • Converts various file formats to Markdown (PDF, Word, PowerPoint, Excel, Images, Audio, HTML, etc.)
  • Secure credential management via encrypted input fields
  • Robust Error Handling: Individual file failures don't stop the entire batch
  • Progress tracking and detailed logging
  • Pay-per-conversion: You only pay $0.01 for each successfully converted file

Input Configuration

The actor requires the following input parameters:

  • aws_access_key_id (required, secret): Your AWS access key ID for S3 access
  • aws_secret_access_key (required, secret): Your AWS secret access key for S3 access
  • s3_bucket (required): The name of the S3 bucket containing the files
  • s3_keys (required): Array of S3 object keys (paths) of the files to convert in the S3 bucket
  • aws_region (required): The AWS region where the S3 bucket is located
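
As a rough illustration, the required fields listed above could be checked on the client side before a run is started. This is only a sketch: `REQUIRED_FIELDS` and `validate_input` are hypothetical names, not part of the Actor, and the Actor's own validation may behave differently.

```python
# Hypothetical pre-flight check; the Actor's own validation may differ.
REQUIRED_FIELDS = (
    "aws_access_key_id",
    "aws_secret_access_key",
    "s3_bucket",
    "s3_keys",
    "aws_region",
)


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in run_input (empty if none)."""
    # Any required field that is absent or empty is reported as missing.
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not run_input.get(name)
    ]
    # s3_keys is documented as an array of object keys.
    keys = run_input.get("s3_keys")
    if keys and not (
        isinstance(keys, list) and all(isinstance(k, str) and k for k in keys)
    ):
        problems.append("s3_keys must be a list of non-empty strings")
    return problems
```

Running a check like this locally catches a missing field or a single string passed in place of the `s3_keys` array before any run is started.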

AWS Credentials

AWS credentials are provided directly in the actor input as encrypted secret fields. The credentials are automatically encrypted by Apify and only decrypted during actor execution for maximum security.

Pricing

This actor uses pay-per-conversion pricing:

  • 💰 $0.01 per successfully converted file
  • ❌ No charge for failed conversions (missing files, conversion errors, etc.)
  • 🚀 Cost-effective for batch processing - process many files efficiently
  • 📊 Transparent billing - you can see exactly which files were charged in the logs (look for "charged $0.01" messages)

Example: If you process 100 files and 95 succeed, you pay $0.95 (only for the 95 successful conversions).
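
The arithmetic in this example can be sketched as a small helper. `estimate_cost_usd` and `PRICE_CENTS_PER_FILE` are hypothetical names used only for illustration; the rate itself is the $0.01 per successful conversion stated above.

```python
PRICE_CENTS_PER_FILE = 1  # $0.01 per successfully converted file, as stated above


def estimate_cost_usd(successful_conversions: int) -> float:
    """Return the expected charge in dollars; failed files are not counted."""
    if successful_conversions < 0:
        raise ValueError("successful_conversions must be non-negative")
    # Multiply in whole cents first, then convert to dollars once.
    return successful_conversions * PRICE_CENTS_PER_FILE / 100


# 100 files attempted, 95 converted successfully
print(f"${estimate_cost_usd(95):.2f}")  # prints "$0.95"
```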

Example Input

{
  "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
  "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
  "s3_bucket": "my-documents-bucket",
  "s3_keys": [
    "documents/report.pdf",
    "documents/invoice.docx",
    "documents/presentation.pptx"
  ],
  "aws_region": "us-west-2"
}

Note: The AWS credentials will appear as password fields in the Apify Console and will be automatically encrypted.

Output

The actor processes multiple files and saves one record per converted file to the dataset. Each record has the following structure:

  • s3_bucket: The source S3 bucket name
  • s3_key: The specific S3 object key that was converted
  • markdown_content: The converted Markdown content from that file
  • file_size_chars: The size of the Markdown content in characters

The output is displayed in a user-friendly table format in the Apify Console's Output tab, with one row per converted file.
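
For readers consuming the dataset from Python, the record structure above could be modelled roughly as follows. `ConversionRecord` and `make_record` are hypothetical names, and the sketch assumes that `file_size_chars` equals the Python `len()` of `markdown_content`; the source only says it is the size "in characters".

```python
from typing import TypedDict


class ConversionRecord(TypedDict):
    """One dataset item, mirroring the four documented fields."""
    s3_bucket: str
    s3_key: str
    markdown_content: str
    file_size_chars: int


def make_record(bucket: str, key: str, markdown: str) -> ConversionRecord:
    # Assumption: the size is the character count of the Markdown text,
    # not the byte size of the original file in S3.
    return {
        "s3_bucket": bucket,
        "s3_key": key,
        "markdown_content": markdown,
        "file_size_chars": len(markdown),
    }
```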

Example Output

For the input with multiple files above, you would get multiple records:

{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/report.pdf",
  "markdown_content": "# Report Title\n\nThis is the converted markdown content...",
  "file_size_chars": 1234
}
{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/invoice.docx",
  "markdown_content": "# Invoice\n\nInvoice Number: 12345...",
  "file_size_chars": 856
}
{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/presentation.pptx",
  "markdown_content": "# Presentation Title\n\n## Slide 1...",
  "file_size_chars": 2048
}

Supported File Formats

Thanks to markitdown, this actor supports:

  • PDF documents
  • Microsoft Office files (Word, PowerPoint, Excel)
  • Images (with OCR)
  • Audio files (with transcription)
  • HTML files
  • Text-based formats (CSV, JSON, XML)
  • ZIP archives
  • EPub files
  • And more!

Error Handling

The actor provides robust error handling for batch processing:

  • Batch Resilience: If one file fails, the actor continues processing other files
  • Detailed Logging: Each file's processing status is logged individually
  • No charges for failures: You're only charged for successfully converted files
  • Clear Error Messages: Specific error messages for common issues:
    • Missing AWS credentials
    • Invalid S3 bucket
    • Missing S3 objects (individual files are skipped, not charged)
    • Access denied errors (individual files are skipped, not charged)
    • File conversion failures (individual files are skipped, not charged)
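
The batch-resilience behaviour described above can be sketched as a loop that isolates each file in its own try/except. This is an illustration of the pattern, not the Actor's actual code: `convert_batch` and the injected `convert_one` callable are hypothetical, and the logger name is made up.

```python
import logging
from typing import Callable

logger = logging.getLogger("s3_to_markdown_sketch")


def convert_batch(
    keys: list[str], convert_one: Callable[[str], str]
) -> tuple[list[dict], list[dict]]:
    """Convert each key independently; one failure never aborts the batch."""
    converted, failed = [], []
    for index, key in enumerate(keys, start=1):
        try:
            markdown = convert_one(key)
        except Exception as exc:  # deliberately broad: skip this file, keep going
            logger.warning("[%d/%d] %s failed: %s", index, len(keys), key, exc)
            failed.append({"s3_key": key, "error": str(exc)})
            continue
        logger.info("[%d/%d] %s converted", index, len(keys), key)
        converted.append({"s3_key": key, "markdown_content": markdown})
    return converted, failed
```

Passing the per-file work in as `convert_one` keeps the loop independent of how a file is fetched and converted; only the entries in `converted` would count toward billing under the pricing described earlier.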

Usage Example

from apify_client import ApifyClient

client = ApifyClient("your-api-token")

# Run the actor and wait for it to finish
run = client.actor("your-actor-id").call(run_input={
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
    "s3_bucket": "my-documents",
    "s3_keys": ["files/document.pdf", "files/report.docx"],
    "aws_region": "us-east-1",
})

# Get the markdown content (one dataset item per converted file)
converted = 0
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    markdown_content = item["markdown_content"]
    s3_key = item["s3_key"]
    print(f"Converted {s3_key}: {len(markdown_content)} characters")
    converted += 1

# Note: You'll be charged $0.01 for each successfully converted file
print(f"Total cost: ${converted * 0.01:.2f}")
