
URL: https://apify.com/parseforge/uniprot-scraper

UniProt Protein Knowledgebase Scraper - 250M+ Entries · Apify


UniProt Protein Sequence & Annotation Scraper

Pricing

from $28.12 / 1,000 results

Export UniProt Knowledgebase entries: search Swiss-Prot by organism, keyword, gene, or any UniProt query, or fetch a single accession. Returns names, genes, organism, sequence length & molecular weight, keywords, comments, features, and PDB/RefSeq/Ensembl/KEGG cross-refs.

Rating

0.0

(0)

Developer

ParseForge

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

a month ago

Last modified

UniProt Protein Sequence & Annotation Scraper

Export UniProt Knowledgebase entries in seconds. Query Swiss-Prot and TrEMBL by organism, gene, keyword, subcellular location, length range, or any UniProt field, or fetch a single accession with full annotations. No API key, no SPARQL, no XML parsing.

Last updated: 2026-05-13 · 25 fields per entry · 250M+ UniProt entries · every kingdom of life

The UniProt Protein Scraper queries the official UniProt REST API and returns standardized protein records from the world's largest protein-sequence knowledgebase. Each entry carries the primary accession, UniProtKB ID, entry type (reviewed Swiss-Prot vs unreviewed TrEMBL), protein name, alternative names, gene names, organism (scientific + common + taxon ID + lineage), evidence level, annotation score, sequence length, molecular weight, CRC64 / MD5 sequence hashes, keywords (with categories), curated comments (function, subunit, subcellular location, etc.), structural features, reference counts, last-update date, entry version, and the canonical UniProt URL.

UniProt is maintained jointly by EMBL-EBI, SIB, and PIR and is the de facto reference for protein biology in research, pharma, and bioinformatics. Coverage spans 250 million+ entries across 2.7 million+ species in TrEMBL, with ~570,000 manually curated entries in Swiss-Prot. This Actor flattens UniProt's nested JSON into rows that drop into pandas, R, or any warehouse.

| Target Audience | Primary Use Cases |
| --- | --- |
| Bioinformatics teams, computational biologists, pharma research, structural biologists, drug-discovery startups, science journalists | Proteome exports, gene-to-protein mapping, target dossier builds, organism-level annotation, sequence + feature retrieval, cross-database joining |

What the UniProt Scraper does

Two lookup modes in one Actor:

  • Query mode. Pass any UniProt query (reviewed:true AND organism_id:9606, keyword:KW-0181, gene:BRCA1, cc_subcellular_location:nucleus, existence:1, taxonomy_id:10090 AND length:[100 TO 500]).
  • Accession mode. Set accession (e.g. P00533) for a single full-entry pull. Skips the search query entirely.
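As a rough illustration of query mode, the sketch below assembles a query string from individual clauses. The helper names `build_query` and `length_range` are hypothetical and not part of the Actor; only the field syntax comes from the examples above.

```python
def length_range(lo: int, hi: int) -> str:
    """Format a length filter in the length:[X TO Y] form shown above."""
    if lo < 0 or hi < lo:
        raise ValueError("expected 0 <= lo <= hi")
    return f"length:[{lo} TO {hi}]"


def build_query(*clauses: str) -> str:
    """Join non-empty clauses with AND, wrapping OR-clauses in parentheses."""
    parts = [c.strip() for c in clauses if c and c.strip()]
    if not parts:
        raise ValueError("at least one clause is required")
    # Parenthesise any clause that itself contains OR so AND groups as intended.
    return " AND ".join(f"({p})" if " OR " in p else p for p in parts)


query = build_query("taxonomy_id:10090", length_range(100, 500))
run_input = {"query": query, "maxItems": 10}
print(run_input["query"])  # taxonomy_id:10090 AND length:[100 TO 500]
```

Building the string from clauses keeps each filter testable on its own before it is pasted into the `query` input.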

Each record carries identifiers (primary accession, UniProtKB ID, entry type), names (protein name, alternative names, gene names), taxonomy (scientific + common organism, taxon ID, lineage), evidence (protein existence, annotation score), sequence facts (length, molecular weight, CRC64, MD5, plus optional full sequence string), curated annotations (keywords, comments, features), reference + feature counts, last-updated date, version, and the canonical UniProt URL.

Why it matters: UniProt's REST API is rich but verbose. Researchers and engineering teams spend days writing parsers for keywords, comments, and features. This Actor flattens the response into 25 spreadsheet-ready fields so target dossiers, comparative proteomics, and dataset prep land in one query.


Full Demo

Coming soon: a 3-minute walkthrough showing a human proteome pull, gene lookup, and accession fetch.


Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| query | string | "reviewed:true AND organism_id:9606" | UniProt query syntax. Supports reviewed:, organism_id:, taxonomy_id:, gene:, keyword:, cc_subcellular_location:, existence:, length:[X TO Y], and more. Ignored when accession is set. |
| accession | string | "" | Single UniProt accession (e.g. P00533). Bypasses the search query when set. |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| fetchSequence | boolean | false | When true, embeds the full amino-acid sequence string in every record. Sequence length and molecular weight are always returned. |
| pageSize | integer | 500 | Entries per API request. UniProt hard max is 500. |
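The input rules in the table can be checked client-side before a run is started. This is a minimal sketch, assuming the documented defaults and limits; `make_input` is a hypothetical helper, and the Actor may validate its input differently.

```python
from typing import Any

UNIPROT_MAX_PAGE_SIZE = 500  # "UniProt hard max is 500" per the input table
DEFAULT_QUERY = "reviewed:true AND organism_id:9606"


def make_input(query: str = DEFAULT_QUERY, accession: str = "",
               max_items: int = 10, fetch_sequence: bool = False,
               page_size: int = 500) -> dict[str, Any]:
    """Build an input object using the documented field names and defaults."""
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    if not 1 <= page_size <= UNIPROT_MAX_PAGE_SIZE:
        raise ValueError("pageSize must be between 1 and 500")
    run_input: dict[str, Any] = {
        "maxItems": max_items,
        "fetchSequence": fetch_sequence,
        "pageSize": page_size,
    }
    accession = accession.strip()
    if accession:
        # The query is ignored when accession is set, so it is left out here.
        run_input["accession"] = accession
    else:
        run_input["query"] = query
    return run_input


print(make_input(accession="P00533", fetch_sequence=True))
```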

Example: every reviewed human Swiss-Prot entry.

{
  "query": "reviewed:true AND organism_id:9606",
  "maxItems": 1000,
  "pageSize": 500
}

Example: single accession, full sequence included.

{
  "accession": "P00533",
  "fetchSequence": true
}

Good to Know: the accession field is for a single entry. To resolve a list of accessions, use the query syntax: accession:P00533 OR accession:P04637. Use fetchSequence: false (default) when you do not need the raw amino-acid string. Sequence length and molecular weight are always returned regardless.
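For a list of accessions, the OR-joined query can be generated rather than typed by hand. The sketch below is illustrative: `accessions_to_query` is a hypothetical helper, and the regular expression is the accession pattern from UniProt's help pages as recalled here, so treat it as an assumption and confirm it against UniProt's documentation.

```python
import re

# Assumed UniProt accession pattern (6- or 10-character accessions).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def accessions_to_query(accessions: list[str]) -> str:
    """Turn a list of accessions into 'accession:X OR accession:Y ...'."""
    unique: list[str] = []
    for raw in accessions:
        acc = raw.strip().upper()
        if not ACCESSION_RE.fullmatch(acc):
            raise ValueError(f"not a UniProt accession: {raw!r}")
        if acc not in unique:  # drop duplicates, keep first-seen order
            unique.append(acc)
    if not unique:
        raise ValueError("no accessions given")
    return " OR ".join(f"accession:{acc}" for acc in unique)


print(accessions_to_query(["P00533", "p04637", "P00533"]))
# accession:P00533 OR accession:P04637
```

The resulting string goes into the `query` input; the `accession` input stays empty.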


Output

Each entry carries 25 fields. Download as CSV, Excel, JSON, or XML.

Schema

| Field | Type | Example |
| --- | --- | --- |
| primaryAccession | string | "A0A0C5B5G6" |
| uniProtkbId | string | "MOTSC_HUMAN" |
| entryType | string | "UniProtKB reviewed (Swiss-Prot)" |
| proteinName | string | "Mitochondrial-derived peptide MOTS-c" |
| alternativeNames | string[] | ["Mitochondrial open reading frame of the 12S rRNA-c"] |
| geneNames | string[] | ["MT-RNR1"] |
| organismScientific | string | "Homo sapiens" |
| organismCommon | string | "Human" |
| taxonId | number | 9606 |
| organismLineage | string[] | ["Eukaryota","Metazoa","Chordata",...] |
| proteinExistence | string | "1: Evidence at protein level" |
| annotationScore | number | 5 |
| sequenceLength | number | 16 |
| sequenceMolWeight | number | 2175 |
| sequenceCrc64 | string | "361DE748426DD505" |
| sequenceMd5 | string | "AE72B6C4E87692429C0D558B92BD7B3D" |
| keywords | object[] | [{ "id": "KW-0238", "category": "Molecular function", "name": "DNA-binding" }] |
| comments | object[] | [{ "type": "FUNCTION", "text": "Regulates insulin sensitivity ..." }] |
| features | object[] | [{ "type": "Chain", "description": "MOTS-c", "start": 1, "end": 16 }] |
| referenceCount | number | 17 |
| featureCount | number | 6 |
| lastUpdated | date | "2026-01-28" |
| entryVersion | number | 30 |
| url | string | "https://www.uniprot.org/uniprotkb/A0A0C5B5G6/entry" |
| scrapedAt | ISO 8601 | "2026-05-13T22:25:18.386Z" |
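The array-valued fields (keywords, comments, features) need one more step before they fit a single spreadsheet cell. Here is a small sketch of one way to do that with the standard library, using a partial record built from the example values in the schema; `to_row` and its output column names are hypothetical choices, not something the Actor produces.

```python
import csv
import io

# Partial record assembled from the schema's example values.
record = {
    "primaryAccession": "A0A0C5B5G6",
    "geneNames": ["MT-RNR1"],
    "sequenceLength": 16,
    "keywords": [{"id": "KW-0238", "category": "Molecular function", "name": "DNA-binding"}],
    "comments": [{"type": "FUNCTION", "text": "Regulates insulin sensitivity ..."}],
}


def to_row(rec: dict) -> dict:
    """Collapse array fields into single cells; missing fields become empty."""
    comments = rec.get("comments") or []
    function = next((c.get("text", "") for c in comments if c.get("type") == "FUNCTION"), "")
    return {
        "accession": rec.get("primaryAccession", ""),
        "genes": ";".join(rec.get("geneNames") or []),
        "length": rec.get("sequenceLength", ""),
        "keywords": ";".join(k.get("name", "") for k in (rec.get("keywords") or [])),
        "function": function,
    }


row = to_row(record)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Joining with a semicolon keeps multi-valued cells readable; serialising the arrays as JSON strings is an alternative when the structure must survive a round trip.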


Why choose this Actor

  • Authoritative knowledgebase. Pulls directly from the official UniProt REST API.
  • Full query syntax. Every UniProt search field works: organism, gene, keyword, location, length range, evidence, taxonomy.
  • Accession fast-path. Set accession: to pull one entry without writing a query.
  • Sequence facts built in. Length and molecular weight always returned. Full sequence string available on demand.
  • Curated annotations exposed. Keywords, comments, and features come through as structured arrays.
  • No API key. UniProt is a free public service.
  • Always fresh. Reflects the current UniProt release.

UniProt entries are widely referenced in papers on protein biology, drug discovery, and structural biology.


How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Format | Setup |
| --- | --- | --- | --- | --- | --- |
| UniProt Scraper (this Actor) | $5 free credit, then pay-per-use | UniProtKB (Swiss-Prot + TrEMBL) | Live per run | Flat JSON / CSV | 2 min |
| Direct REST API calls | Free | Same | Live | Nested JSON | Hours |
| Full release FASTA + XML download | Free | Full UniProt | 8-week release | Massive flatfiles | Days |
| Commercial bioinformatics platform | $$$ | Curated subset | Real-time | Web UI / API | Vendor onboarding |

Pick this Actor when you want UniProt records in a flat table without writing a client or downloading the release.


How to use

  1. Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. Open the Actor. Go to the UniProt Protein Scraper page on the Apify Store.
  3. Set input. Pick a query (reviewed:true AND organism_id:9606 is a great starter) or an accession.
  4. Run it. Click Start and let the Actor walk the UniProt API.
  5. Download. Grab results in the Dataset tab as CSV, Excel, JSON, or XML.

Total time from signup to a downloaded proteome slice: 3-5 minutes. No coding required.


Business use cases

Drug Discovery & Pharma

  • Target dossier builds for new programs
  • Cross-organism homolog comparisons
  • Subcellular location filters for druggability
  • Evidence-level scoring for prioritization

Bioinformatics & Genomics

  • Gene-to-protein lookups across organisms
  • Proteome exports for comparative analysis
  • Annotation enrichment for variant calling
  • Keyword and feature-based cohort building

Structural Biology

  • Length and molecular-weight filters for crystallography candidates
  • Feature-table mining for domain boundaries
  • Sequence hash joins to PDB or AlphaFold IDs
  • Reference-count signals for popular targets

LLM & Bio AI

  • Ground LLM responses in UniProt-authoritative data
  • Build RAG indexes for protein chatbots
  • Training data for sequence-attribute models
  • Validation layers for bio AI agents

Automating UniProt Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • Node.js. Install the apify-client NPM package.
  • Python. Use the apify-client PyPI package.
  • See the Apify API documentation for full details.
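With the Python client, a programmatic run might look like the sketch below. It assumes the Actor ID matches the store URL path (`parseforge/uniprot-scraper`) and that an API token is supplied through an environment variable named `APIFY_TOKEN`; both are assumptions, and `run_actor` is a hypothetical wrapper around the client's `actor(...).call(...)` and `dataset(...).iterate_items()` methods.

```python
import os

ACTOR_ID = "parseforge/uniprot-scraper"  # assumed from the store URL path
RUN_INPUT = {"accession": "P00533", "fetchSequence": False}


def run_actor(run_input: dict, token: str) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    token = os.environ.get("APIFY_TOKEN")  # hypothetical variable name
    if token:
        for item in run_actor(RUN_INPUT, token):
            print(item.get("primaryAccession"), item.get("proteinName"))
    else:
        print("Set APIFY_TOKEN to run this sketch against Apify.")
```

The import sits inside the function so the module still loads on machines where apify-client is not installed.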

The Apify Schedules feature lets you trigger this Actor on any cron interval. UniProt has an eight-week release cycle. Schedule a refresh on the same cadence to stay current.


Beyond business use cases

UniProt data feeds far more than commercial pharma. The same structured records support research, education, and open-science work.

Research and academia

  • Reproducible proteome datasets for papers
  • Coursework on protein annotation and biocuration
  • Comparative-genomics theses with structured features
  • Open-data benchmarks for sequence-based ML

Personal and creative

  • Hobbyist bioinformatics portfolio projects
  • Sci-comm visualizations of protein families
  • Personal target tracker for citizen scientists
  • Indie tools for amateur synthetic biology

Non-profit and civic

  • Pandemic preparedness datasets keyed to UniProt
  • Public-health reports on pathogen proteomes
  • Open-source vaccine candidate research
  • Civic transparency on bio-research outputs

Experimentation

  • Train sequence-attribute ML classifiers
  • Prototype agents that build target dossiers
  • Test bio chatbot grounding against real records
  • Benchmark protein-NER models


Frequently Asked Questions

How does it work?

Either supply a UniProt query (reviewed:true AND organism_id:9606) or an accession (P00533), then click Start. The Actor pages through the UniProt REST API, flattens nested fields, and emits a row per entry with 25 columns including keywords, comments, and features.
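The paging arithmetic is simple to reason about. How the Actor pages internally is not documented here, so the sketch below only shows how many requests a given maxItems and pageSize would imply under the documented 500-entry cap; `plan_pages` is a hypothetical helper.

```python
def plan_pages(max_items: int, page_size: int = 500) -> list[int]:
    """Return the number of entries each request would need to fetch."""
    if max_items < 0:
        raise ValueError("max_items must not be negative")
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be between 1 and 500")
    full_pages, remainder = divmod(max_items, page_size)
    # Full pages first, then one short page for whatever is left over.
    return [page_size] * full_pages + ([remainder] if remainder else [])


print(plan_pages(1000))  # [500, 500]
print(plan_pages(1200))  # [500, 500, 200]
print(plan_pages(10))    # [10]
```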

What query syntax can I use?

Everything UniProt supports in its own search bar. Common fields: reviewed:, organism_id:, taxonomy_id:, gene:, keyword:, cc_subcellular_location:, existence:, length:[X TO Y], accession:, plus boolean AND/OR/NOT. See the UniProt query fields docs for the full list.

How do I look up a single accession?

Set the accession field (e.g. P00533). It bypasses the query and pulls the full entry directly.

How do I look up many accessions at once?

Use the query syntax with OR: accession:P00533 OR accession:P04637 OR accession:Q9Y6K8.

Does it include the full sequence string?

Only when fetchSequence: true. Sequence length and molecular weight are always returned. Skip the full string for big proteomes to keep dataset sizes manageable.
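When the full sequence is fetched, it can be checked against the sequenceMd5 field. This sketch assumes that field is the uppercase hexadecimal MD5 digest of the raw amino-acid string, which matches the format of the schema example but is not confirmed by this page; `md5_matches` is a hypothetical helper that takes the sequence explicitly because the name of the sequence field is not listed in the schema.

```python
import hashlib


def md5_matches(sequence: str, expected_md5: str) -> bool:
    """Compare a sequence's MD5 digest with an expected hex digest."""
    # Assumption: the hash covers the plain one-letter sequence, no whitespace.
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest().upper()
    return digest == expected_md5.strip().upper()


# Self-check with a made-up three-residue sequence, not a real UniProt entry.
toy_sequence = "MKT"
toy_md5 = hashlib.md5(toy_sequence.encode("ascii")).hexdigest()
print(md5_matches(toy_sequence, toy_md5))  # True
```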

How fresh is the data?

UniProt releases every eight weeks. Every run hits the live API, so output reflects the current release.

What is the difference between Swiss-Prot and TrEMBL?

Swiss-Prot is manually curated (reviewed:true, ~570K entries). TrEMBL is automatically annotated (reviewed:false, hundreds of millions of entries). Pick the slice your work needs.
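In a mixed result set, the two slices can be separated by the entryType field. The sketch below relies on the reviewed value shown in the schema, "UniProtKB reviewed (Swiss-Prot)"; the unreviewed value and the second accession used in the sample data are assumptions for illustration only.

```python
def is_reviewed(record: dict) -> bool:
    """True when entryType marks a Swiss-Prot (reviewed) entry."""
    return "Swiss-Prot" in record.get("entryType", "")


def split_by_review_status(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (reviewed, unreviewed) lists, preserving input order."""
    reviewed = [r for r in records if is_reviewed(r)]
    unreviewed = [r for r in records if not is_reviewed(r)]
    return reviewed, unreviewed


sample = [
    {"primaryAccession": "A0A0C5B5G6", "entryType": "UniProtKB reviewed (Swiss-Prot)"},
    # Hypothetical accession and assumed TrEMBL label, for illustration only.
    {"primaryAccession": "A0A000XXX0", "entryType": "UniProtKB unreviewed (TrEMBL)"},
]
reviewed, unreviewed = split_by_review_status(sample)
print(len(reviewed), len(unreviewed))  # 1 1
```

Filtering in the query with reviewed:true or reviewed:false avoids fetching the unwanted slice in the first place; the split above is for result sets that already contain both.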

Do I need an API key?

No. The UniProt REST API is free and public.

Can I schedule recurring runs?

Yes. Use Apify Schedules to refresh on the UniProt release cadence and pipe results into your pipeline.

Is this data legal to use?

Yes. UniProt is released under CC BY 4.0. Attribute UniProt in any downstream publication or product, as their license requires.

Do I need a paid Apify plan?

No. The free plan covers small runs (10 records). A paid plan unlocks higher limits and scheduling.

What if I need help?

Reach out via the contact form below to request a custom protein workflow.


Integrate with any app

UniProt Protein Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step research workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get release notifications in your channels
  • Airbyte - Pipe protein records into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh UniProt entries into your bio pipeline or alert your team in Slack.


Recommended Actors

Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by EMBL-EBI, the SIB Swiss Institute of Bioinformatics, the Protein Information Resource (PIR), the UniProt Consortium, or any of their funding agencies. All trademarks mentioned are the property of their respective owners. Only publicly available UniProtKB data is collected. Please cite UniProt as required by their CC BY 4.0 license.
