
URL: https://huggingface.co/datasets/scholarweave/arxiv-latex

scholarweave/arxiv-latex · Datasets at Hugging Face


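Dataset preview. Each record below lists its fields one per line, in this column order (name: type):
id: string, yymm_id: string, submitter: string, authors: string, title: string, comments: string, journal-ref: string, doi: string, report-no: string, categories: string, license: string, abstract: string, versions: list, update_date: string, authors_parsed: list, latex: large_string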
hep-lat/9107001
9107.001
Urs Heller
U.M. Heller, H. Neuberger and P. Vranas
How to Put a Heavier Higgs on the Lattice
null
Phys.Lett. B283 (1992) 335-340
10.1016/0370-2693(92)90028-3
null
hep-lat
null
Lattice work, exploring the Higgs mass triviality bound, seems to indicate that a strongly interacting scalar sector in the minimal standard model cannot exist while low energy QCD phenomenology seems to indicate that it could. We attack this puzzle using the 1/N expansion and discover a simple criterion for selectin...
[ { "version": "v1", "created": "Wed, 8 Apr 1992 22:58:34 GMT" } ]
2009-10-22
[ [ "Heller", "U. M.", "" ], [ "Neuberger", "H.", "" ], [ "Vranas", "P.", "" ] ]
================================================ FILE: 9107001.tex ================================================ % FSU-SCRI-91-94 submitted to Phys. Rev. Letters % FIG1.EPS renamed to SCRI91-94fig1.eps and archived % 12-JUL-1991 RUN THRU SPELL % % Instructions for tex-ing % Get preprint91.format, 14pt.font...
hep-lat/9107002
9107.002
Bernd Berg
Nelsons A. Alves, Bernd Al. Berg and Sergiu Sanielevici
Spectral Density Study of the SU(3) Deconfining Phase Transition
null
Nucl.Phys. B376 (1992) 218-252
10.1016/0550-3213(92)90075-M
null
hep-lat
null
We present spectral density reweighting techniques adapted to the analysis of a time series of data with a continuous range of allowed values. In a first application we analyze action and Polyakov line data from a Monte Carlo simulation on $L_t L^3 (L_t=2,4)$ lattices for the SU(3) deconfining phase transition. We ca...
[ { "version": "v1", "created": "Wed, 8 Apr 1992 22:58:34 GMT" } ]
2009-10-22
[ [ "Alves", "Nelsons A.", "" ], [ "Berg", "Bernd Al.", "" ], [ "Sanielevici", "Sergiu", "" ] ]
================================================ FILE: hep-lat9107002.tex ================================================ %FSU-SCRI-91-93 preprint Submitted to Nuclear Physics B % % % %%%% arXiv admin additions to make this compile 12/2010 \input harvmac \def\heading{\newsec} \def\subheading{\subsec} \newdimen\cbox...
hep-th/9108001
9108.001
Jim Horne
James H. Horne and Gary T. Horowitz
Exact Black String Solutions in Three Dimensions
17 pages
Nucl.Phys. B368 (1992) 444-462
10.1016/0550-3213(92)90536-K
null
hep-th
null
" A family of exact conformal field theories is constructed which describe\ncharged black strings i(...TRUNCATED)
[ { "version": "v1", "created": "Wed, 14 Aug 1991 22:25:19 GMT" } ]
2009-10-22
[ [ "Horne", "James H.", "" ], [ "Horowitz", "Gary T.", "" ] ]
"================================================\nFILE: 9108001.tex\n==============================(...TRUNCATED)
hep-th/9108002
9108.002
null
A. Mikovic
Hamiltonian construction of W-gravity actions
9 pages
Phys.Lett.B278:51-55,1992
10.1016/0370-2693(92)90710-L
null
hep-th
null
" We show that all W-gravity actions can be easilly constructed and understood\nfrom the point of v(...TRUNCATED)
[ { "version": "v1", "created": "Thu, 15 Aug 1991 16:35:28 GMT" } ]
2009-01-16
[ [ "Mikovic", "A.", "" ] ]
"================================================\nFILE: hep-th9108002.tex\n========================(...TRUNCATED)
hep-th/9108003
9108.003
Katri Huitu
Katri Huitu and Dennis Nemeschansky
Supersymmetric Gelfand-Dickey Algebra
13 pages
Mod. Phys. Lett. A6 (1991) 3179-3190
10.1142/S0217732391003675
null
hep-th
null
" We study the classical version of supersymmetric $W$-algebras. Using the\nsecond Gelfand-Dickey H(...TRUNCATED)
[ { "version": "v1", "created": "Thu, 15 Aug 1991 17:12:38 GMT" } ]
2015-06-26
[ [ "Huitu", "Katri", "" ], [ "Nemeschansky", "Dennis", "" ] ]
"================================================\nFILE: hep-th9108003.tex\n========================(...TRUNCATED)
hep-th/9108004
9108.004
null
Edward Witten
Ground Ring Of Two Dimensional String Theory
null
Nucl.Phys.B373:187-213,1992
10.1016/0550-3213(92)90454-J
null
hep-th
null
" String theories with two dimensional space-time target spaces are\ncharacterized by the existence(...TRUNCATED)
[ { "version": "v1", "created": "Fri, 16 Aug 1991 19:18:00 GMT" } ]
2010-04-07
[ [ "Witten", "Edward", "" ] ]
"================================================\nFILE: hep-th9108004.tex\n========================(...TRUNCATED)
hep-th/9108005
9108.005
Kenneth A. Intriligator
Kenneth Intriligator
Fusion Residues
16 pages
Mod.Phys.Lett. A6 (1991) 3543-3556
10.1142/S0217732391004097
null
hep-th
null
" We discuss when and how the Verlinde dimensions of a rational conformal field\ntheory can be expr(...TRUNCATED)
[ { "version": "v1", "created": "Mon, 19 Aug 1991 14:39:33 GMT" } ]
2015-06-26
[ [ "Intriligator", "Kenneth", "" ] ]
"================================================\nFILE: hep-th9108005.tex\n========================(...TRUNCATED)
hep-th/9108006
9108.006
null
Hirosi Ooguri and Naoki Sasakura
Discrete and Continuum Approaches to Three-Dimensional Quantum Gravity
14 pages
Mod.Phys.Lett.A6:3591-3600,1991
10.1142/S0217732391004140
null
hep-th
null
" It is shown that, in the three-dimensional lattice gravity defined by Ponzano\nand Regge, the spa(...TRUNCATED)
[{"version":"v1","created":"Tue, 20 Aug 1991 06:50:29 GMT"},{"version":"v2","created":"Fri, 6 Sep 19(...TRUNCATED)
2009-09-17
[ [ "Ooguri", "Hirosi", "" ], [ "Sasakura", "Naoki", "" ] ]
"================================================\nFILE: hep-th9108006.tex\n========================(...TRUNCATED)
hep-th/9108007
9108.007
Andre LeClair
A. LeCLair and F. Smirnov
Infinite Quantum Group Symmetry of Fields in Massive 2D Quantum Field Theory
29 pages
Int. J. Mod. Phys. A7 (1992) 2997-3022
10.1142/S0217751X92001332
null
hep-th
null
" Starting from a given S-matrix of an integrable quantum field theory in $1+1$\ndimensions, and kn(...TRUNCATED)
[ { "version": "v1", "created": "Tue, 20 Aug 1991 19:52:20 GMT" } ]
2015-06-26
[ [ "LeCLair", "A.", "" ], [ "Smirnov", "F.", "" ] ]
"================================================\nFILE: hep-th9108007.tex\n========================(...TRUNCATED)
hep-th/9108008
9108.008
null
J. Sonnenschein and S. Yankielowicz
Novel Symmetries of Topological Conformal Field theories
26 pages
null
null
null
hep-th
null
" We show that various actions of topological conformal theories that were\nsuggested recentely are(...TRUNCATED)
[ { "version": "v1", "created": "Tue, 20 Aug 1991 23:17:00 GMT" } ]
2007-05-23
[ [ "Sonnenschein", "J.", "" ], [ "Yankielowicz", "S.", "" ] ]
"================================================\nFILE: hep-th9108008.tex\n========================(...TRUNCATED)

arXiv LaTeX Source Dataset

This dataset provides the entire corpus of arXiv's LaTeX source files, pre-parsed, formatted, and aligned with official metadata in ready-to-query Parquet files.


Why I Built This

If you have ever tried to work with the complete history of arXiv papers at scale, you have likely run into two massive hurdles:

  1. Network Egress Costs: While arXiv does offer public bulk access to its source files via S3 (s3://arxiv), the bucket is configured as "requester-pays." If you attempt to download the full 5 TB corpus of over 3 million papers to any machine outside of the AWS us-east-1 (N. Virginia) region, you are hit with standard AWS egress fees. At $0.09 per GB, a single full download costs more than $450.
  2. Computational Friction: The raw S3 data is packaged as hundreds of nested .tar archives, each containing gzip payloads (.gz) of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is extremely CPU-heavy, requiring complex local pipeline architecture.
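
As a quick check on the egress figure in item 1, the arithmetic can be sketched as follows. The $0.09 per GB rate and the 5 TB size come from the text above; the function name is illustrative, and whether a terabyte is counted as 1,000 GB or 1,024 GB is an assumption that shifts the total slightly.

```python
def egress_cost_usd(size_gb: float, rate_per_gb: float = 0.09) -> float:
    """Estimate the cost of transferring `size_gb` gigabytes out of the region."""
    return size_gb * rate_per_gb


# 5 TB counted as 5,000 GB (decimal) or as 5,120 GB (binary).
decimal_cost = egress_cost_usd(5 * 1000)
binary_cost = egress_cost_usd(5 * 1024)

print(f"decimal TB: ${decimal_cost:,.2f}")
print(f"binary TB:  ${binary_cost:,.2f}")
```

Either way the total lands at roughly $450 or a little above, which is consistent with the estimate in item 1.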

This dataset acts as an open mirror that solves both issues. The project ingests the S3 data inside us-east-1 (where data transfer is free), indexes and processes the LaTeX source documents, aligns them directly with their metadata snapshot, and uploads the finalized Parquet files here. Researchers and developers can download clean, structured data without worrying about network egress bills or spending days writing ingestion code.
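
A minimal sketch of reading the mirror might look like the following. It assumes the Hugging Face `datasets` library is installed and that the data is exposed under a split named `train`; neither is stated on this page, so treat both as assumptions. The helper names `stream_rows` and `in_category` are illustrative, and the filter relies on the `categories` column being a space-separated string, as the schema describes.

```python
from typing import Dict, Iterable, Iterator


def stream_rows(repo_id: str = "scholarweave/arxiv-latex",
                split: str = "train") -> Iterator[Dict]:
    """Iterate over rows without downloading every Parquet file first.

    The split name "train" is an assumption, not something the card confirms.
    """
    # Imported lazily so the pure helper below works without the library.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


def in_category(rows: Iterable[Dict], category: str) -> Iterator[Dict]:
    """Yield only rows whose space-separated `categories` field contains `category`."""
    for row in rows:
        # Guard against a missing or null field before splitting.
        if category in (row.get("categories") or "").split():
            yield row
```

For example, `itertools.islice(in_category(stream_rows(), "hep-th"), 3)` would pull the first three `hep-th` rows. Streaming keeps memory use flat, which matters here because the `latex` column can be large.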


Ingest Schedule & The Parquet Manifest

  • Update Cycle: I sync the latest publications and revisions from arXiv S3 once every month.
  • Manifest Tracking: To support crash-resilient resuming, validation, and incremental syncing, the project maintains a central XML manifest file: arxiv_parquet_manifest.xml. This manifest maps each Parquet partition file to its size, MD5 checksum, processed timestamp, range of paper IDs (first_item and last_item), and the list of raw S3 .tar files that were unpacked to generate it.
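
To illustrate how the manifest could be used for validation, here is a hedged sketch that parses a manifest and checks a downloaded partition against its recorded size and MD5 checksum. The actual layout of arxiv_parquet_manifest.xml is not shown on this page, so the element names (`file`, `filename`, `size`, `md5sum`) and the sample file name are hypothetical; only `first_item` and `last_item` are named in the description above.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical manifest layout, for illustration only.
SAMPLE_MANIFEST = """\
<manifest>
  <file>
    <filename>part-00000.parquet</filename>
    <size>11</size>
    <md5sum>5eb63bbbe01eeed093cb22bb8f5acdc3</md5sum>
    <first_item>hep-lat/9107001</first_item>
    <last_item>hep-th/9108008</last_item>
  </file>
</manifest>
"""


def parse_manifest(xml_text: str) -> List[Dict]:
    """Return one dict per <file> element (element names are assumed)."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("file"):
        entries.append({
            "filename": node.findtext("filename"),
            "size": int(node.findtext("size")),
            "md5sum": node.findtext("md5sum"),
            "first_item": node.findtext("first_item"),
            "last_item": node.findtext("last_item"),
        })
    return entries


def matches_manifest(data: bytes, entry: Dict) -> bool:
    """Check that the byte length and MD5 digest agree with the manifest entry."""
    return (len(data) == entry["size"]
            and hashlib.md5(data).hexdigest() == entry["md5sum"])
```

A resumable downloader could skip any partition for which `matches_manifest` already returns true and re-fetch the rest. For multi-gigabyte files, hashing in chunks rather than reading the whole file into memory would be the safer choice.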

Dataset Schema

Every row represents a single paper with metadata and parsed LaTeX source contents:

| Column Name | Type | Description |
|---|---|---|
| id | string | arXiv paper identifier (e.g., 0704.0001 or hep-th/9901001). |
| yymm_id | string | Normalized ID mapped to YYMM format for chronological sorting. |
| submitter | string | Name of the user who uploaded the paper. |
| authors | string | Raw authors string. |
| title | string | Title of the paper. |
| comments | string | Submitter comments or journal references. |
| journal-ref | string | Official journal publication reference (if published). |
| doi | string | Digital Object Identifier (DOI). |
| report-no | string | Report or document series numbers. |
| categories | string | Space-separated arXiv categories (e.g., cs.CL math.PR). |
| license | string | License under which the paper was published. |
| abstract | string | The paper's abstract. |
| versions | list<struct> | Struct list of versions with creation timestamps. |
| update_date | string | Date the paper record was last modified by arXiv. |
| authors_parsed | list<list<string>> | Split author names (structured by Last Name, First Name, suffix). |
| latex | large_string | The parsed, compiled LaTeX source code from the paper. All source files (.tex, .bib, .sty, etc.) are bundled into a single readable Markdown-style tree structure. |
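
Since the `latex` column bundles every source file into one string, a consumer usually needs to split it back into files. The sketch below assumes each file is introduced by a line of `=` characters, a `FILE: <name>` line, and another line of `=` characters, which is what the preview rows on this page suggest. The exact delimiter width and any extra tree listing in the bundle are assumptions, and `split_latex_bundle` is an illustrative name.

```python
import re
from typing import Dict

# Header inferred from the preview rows: a bar of '=', "FILE: <name>", a bar of '='.
FILE_HEADER = re.compile(r"^={10,}\nFILE: (?P<name>.+)\n={10,}\n", re.MULTILINE)


def split_latex_bundle(bundle: str) -> Dict[str, str]:
    """Map each bundled file name to its contents.

    Text before the first header (if any) is ignored.
    """
    files: Dict[str, str] = {}
    headers = list(FILE_HEADER.finditer(bundle))
    for i, header in enumerate(headers):
        # A file's body runs until the next header or the end of the bundle.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bundle)
        files[header.group("name").strip()] = bundle[header.end():end]
    return files
```

Calling `split_latex_bundle(row["latex"])` on a row would then give a dictionary keyed by file name, from which the `.tex` entries can be selected. If two files in a bundle share a name, the later one overwrites the earlier one in this sketch.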

Curation & Licensing

This dataset mirrors data provided under arXiv's Terms of Use. The copyright and licenses of individual paper contents are retained by their respective authors, and correspond to the license identifier specified in the license column.
