id string | yymm_id string | submitter string | authors string | title string | comments string | journal-ref string | doi string | report-no string | categories string | license string | abstract string | versions list | update_date string | authors_parsed list | latex large_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hep-lat/9107001 | 9107.001 | Urs Heller | U.M. Heller, H. Neuberger and P. Vranas | How to Put a Heavier Higgs on the Lattice | null | Phys.Lett. B283 (1992) 335-340 | 10.1016/0370-2693(92)90028-3 | null | hep-lat | null | Lattice work, exploring the Higgs mass triviality bound, seems to indicate
that a strongly interacting scalar sector in the minimal standard model cannot
exist while low energy QCD phenomenology seems to indicate that it could. We
attack this puzzle using the 1/N expansion and discover a simple criterion for
selectin... | [
{
"version": "v1",
"created": "Wed, 8 Apr 1992 22:58:34 GMT"
}
] | 2009-10-22 | [
[
"Heller",
"U. M.",
""
],
[
"Neuberger",
"H.",
""
],
[
"Vranas",
"P.",
""
]
] | ================================================
FILE: 9107001.tex
================================================
% FSU-SCRI-91-94 submitted to Phys. Rev. Letters
% FIG1.EPS renamed to SCRI91-94fig1.eps and archived
% 12-JUL-1991 RUN THRU SPELL
%
% Instructions for tex-ing
% Get preprint91.format, 14pt.font... |
hep-lat/9107002 | 9107.002 | Bernd Berg | Nelsons A. Alves, Bernd Al. Berg and Sergiu Sanielevici | Spectral Density Study of the SU(3) Deconfining Phase Transition | null | Nucl.Phys. B376 (1992) 218-252 | 10.1016/0550-3213(92)90075-M | null | hep-lat | null | We present spectral density reweighting techniques adapted to the analysis of
a time series of data with a continuous range of allowed values. In a first
application we analyze action and Polyakov line data from a Monte Carlo
simulation on $L_t L^3 (L_t=2,4)$ lattices for the SU(3) deconfining phase
transition. We ca... | [
{
"version": "v1",
"created": "Wed, 8 Apr 1992 22:58:34 GMT"
}
] | 2009-10-22 | [
[
"Alves",
"Nelsons A.",
""
],
[
"Berg",
"Bernd Al.",
""
],
[
"Sanielevici",
"Sergiu",
""
]
] | ================================================
FILE: hep-lat9107002.tex
================================================
%FSU-SCRI-91-93 preprint Submitted to Nuclear Physics B
%
%
%
%%%% arXiv admin additions to make this compile 12/2010
\input harvmac
\def\heading{\newsec}
\def\subheading{\subsec}
\newdimen\cbox... |
hep-th/9108001 | 9108.001 | Jim Horne | James H. Horne and Gary T. Horowitz | Exact Black String Solutions in Three Dimensions | 17 pages | Nucl.Phys. B368 (1992) 444-462 | 10.1016/0550-3213(92)90536-K | null | hep-th | null | " A family of exact conformal field theories is constructed which describe\ncharged black strings i(...TRUNCATED) | [
{
"version": "v1",
"created": "Wed, 14 Aug 1991 22:25:19 GMT"
}
] | 2009-10-22 | [
[
"Horne",
"James H.",
""
],
[
"Horowitz",
"Gary T.",
""
]
] | "================================================\nFILE: 9108001.tex\n==============================(...TRUNCATED) |
hep-th/9108002 | 9108.002 | null | A. Mikovic | Hamiltonian construction of W-gravity actions | 9 pages | Phys.Lett.B278:51-55,1992 | 10.1016/0370-2693(92)90710-L | null | hep-th | null | " We show that all W-gravity actions can be easilly constructed and understood\nfrom the point of v(...TRUNCATED) | [
{
"version": "v1",
"created": "Thu, 15 Aug 1991 16:35:28 GMT"
}
] | 2009-01-16 | [
[
"Mikovic",
"A.",
""
]
] | "================================================\nFILE: hep-th9108002.tex\n========================(...TRUNCATED) |
hep-th/9108003 | 9108.003 | Katri Huitu | Katri Huitu and Dennis Nemeschansky | Supersymmetric Gelfand-Dickey Algebra | 13 pages | Mod. Phys. Lett. A6 (1991) 3179-3190 | 10.1142/S0217732391003675 | null | hep-th | null | " We study the classical version of supersymmetric $W$-algebras. Using the\nsecond Gelfand-Dickey H(...TRUNCATED) | [
{
"version": "v1",
"created": "Thu, 15 Aug 1991 17:12:38 GMT"
}
] | 2015-06-26 | [
[
"Huitu",
"Katri",
""
],
[
"Nemeschansky",
"Dennis",
""
]
] | "================================================\nFILE: hep-th9108003.tex\n========================(...TRUNCATED) |
hep-th/9108004 | 9108.004 | null | Edward Witten | Ground Ring Of Two Dimensional String Theory | null | Nucl.Phys.B373:187-213,1992 | 10.1016/0550-3213(92)90454-J | null | hep-th | null | " String theories with two dimensional space-time target spaces are\ncharacterized by the existence(...TRUNCATED) | [
{
"version": "v1",
"created": "Fri, 16 Aug 1991 19:18:00 GMT"
}
] | 2010-04-07 | [
[
"Witten",
"Edward",
""
]
] | "================================================\nFILE: hep-th9108004.tex\n========================(...TRUNCATED) |
hep-th/9108005 | 9108.005 | Kenneth A. Intriligator | Kenneth Intriligator | Fusion Residues | 16 pages | Mod.Phys.Lett. A6 (1991) 3543-3556 | 10.1142/S0217732391004097 | null | hep-th | null | " We discuss when and how the Verlinde dimensions of a rational conformal field\ntheory can be expr(...TRUNCATED) | [
{
"version": "v1",
"created": "Mon, 19 Aug 1991 14:39:33 GMT"
}
] | 2015-06-26 | [
[
"Intriligator",
"Kenneth",
""
]
] | "================================================\nFILE: hep-th9108005.tex\n========================(...TRUNCATED) |
hep-th/9108006 | 9108.006 | null | Hirosi Ooguri and Naoki Sasakura | Discrete and Continuum Approaches to Three-Dimensional Quantum Gravity | 14 pages | Mod.Phys.Lett.A6:3591-3600,1991 | 10.1142/S0217732391004140 | null | hep-th | null | " It is shown that, in the three-dimensional lattice gravity defined by Ponzano\nand Regge, the spa(...TRUNCATED) | [{"version":"v1","created":"Tue, 20 Aug 1991 06:50:29 GMT"},{"version":"v2","created":"Fri, 6 Sep 19(...TRUNCATED) | 2009-09-17 | [
[
"Ooguri",
"Hirosi",
""
],
[
"Sasakura",
"Naoki",
""
]
] | "================================================\nFILE: hep-th9108006.tex\n========================(...TRUNCATED) |
hep-th/9108007 | 9108.007 | Andre LeClair | A. LeCLair and F. Smirnov | Infinite Quantum Group Symmetry of Fields in Massive 2D Quantum Field
Theory | 29 pages | Int. J. Mod. Phys. A7 (1992) 2997-3022 | 10.1142/S0217751X92001332 | null | hep-th | null | " Starting from a given S-matrix of an integrable quantum field theory in $1+1$\ndimensions, and kn(...TRUNCATED) | [
{
"version": "v1",
"created": "Tue, 20 Aug 1991 19:52:20 GMT"
}
] | 2015-06-26 | [
[
"LeCLair",
"A.",
""
],
[
"Smirnov",
"F.",
""
]
] | "================================================\nFILE: hep-th9108007.tex\n========================(...TRUNCATED) |
hep-th/9108008 | 9108.008 | null | J. Sonnenschein and S. Yankielowicz | Novel Symmetries of Topological Conformal Field theories | 26 pages | null | null | null | hep-th | null | " We show that various actions of topological conformal theories that were\nsuggested recentely are(...TRUNCATED) | [
{
"version": "v1",
"created": "Tue, 20 Aug 1991 23:17:00 GMT"
}
] | 2007-05-23 | [
[
"Sonnenschein",
"J.",
""
],
[
"Yankielowicz",
"S.",
""
]
] | "================================================\nFILE: hep-th9108008.tex\n========================(...TRUNCATED) |
arXiv LaTeX Source Dataset
This dataset provides the entire corpus of arXiv's LaTeX source files, pre-parsed, formatted, and aligned with official metadata in ready-to-query Parquet files.
Why I Built This
If you have ever tried to work with the complete history of arXiv papers at scale, you have likely run into two massive hurdles:
- Network Egress Costs: While arXiv does offer public bulk access to its source files via S3 (`s3://arxiv`), the bucket is configured as "requester-pays." If you attempt to download the full 5 TB corpus of over 3 million papers to any machine outside of the AWS `us-east-1` (N. Virginia) region, you are hit with standard AWS egress fees. At $0.09 per GB, a single full download costs more than $450.
- Computational Friction: The raw S3 data is packaged as hundreds of nested `.tar` archives, each containing gzip payloads (`.gz`) of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is extremely CPU-heavy, requiring complex local pipeline architecture.
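To give a sense of the unpacking step described above, here is a minimal sketch using only the Python standard library. It is not the pipeline used to build this dataset: the member name is hypothetical, the demo archive is built in memory, and a real payload may itself be a gzipped tarball of several files, which this sketch does not handle.

```python
import gzip
import io
import tarfile


def iter_gzip_members(tar_bytes: bytes):
    """Yield (member_name, decompressed_bytes) for each .gz member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            # Skip directories and anything that is not a gzip payload.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield member.name, gzip.decompress(handle.read())


def build_demo_tar() -> bytes:
    """Build a tiny in-memory tar holding one gzip payload (hypothetical name)."""
    payload = gzip.compress(b"\\documentclass{article}")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="9108/hep-th9108001.gz")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Map each member name to its decompressed source bytes.
demo = dict(iter_gzip_members(build_demo_tar()))
```

Repeating this across hundreds of archives and millions of payloads is where the CPU cost comes from.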
This dataset acts as an open mirror that solves both issues. The project ingests the S3 data inside us-east-1 (where data transfer is free), indexes and processes the LaTeX source documents, aligns them directly to their metadata snapshot, and uploads the finalized Parquet files here. Researchers and developers can download clean, structured data without worrying about network egress bills or spending days writing ingestion code.
Ingest Schedule & The Parquet Manifest
- Update Cycle: I sync the latest publications and revisions from arXiv S3 once every month.
- Manifest Tracking: To support crash-resilient resuming, validation, and incremental syncing, the project maintains a central XML manifest file: `arxiv_parquet_manifest.xml`. This manifest maps each Parquet partition file to its size, MD5 checksum, processed timestamp, range of paper IDs (`first_item` and `last_item`), and the list of raw S3 `.tar` files that were unpacked to generate it.
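As an illustration of how the manifest could be used to validate a downloaded partition, here is a hedged sketch. The XML layout is an assumption: only `first_item` and `last_item` are names taken from the description above, while the `manifest`, `file`, `filename`, `size`, and `md5sum` tags and the partition name are hypothetical. Check the real `arxiv_parquet_manifest.xml` before relying on any of them.

```python
import hashlib
import xml.etree.ElementTree as ET

# Stand-in for the bytes of a downloaded Parquet partition.
DEMO_BYTES = b"hello world"

# Hypothetical manifest layout; every tag except first_item/last_item is assumed.
SAMPLE_MANIFEST = f"""
<manifest>
  <file>
    <filename>part-0001.parquet</filename>
    <size>{len(DEMO_BYTES)}</size>
    <md5sum>{hashlib.md5(DEMO_BYTES).hexdigest()}</md5sum>
    <first_item>hep-lat/9107001</first_item>
    <last_item>hep-th/9108008</last_item>
  </file>
</manifest>
"""


def load_manifest(xml_text: str) -> dict:
    """Map each partition filename to its recorded size, checksum, and ID range."""
    root = ET.fromstring(xml_text.strip())
    entries = {}
    for node in root.iter("file"):
        entries[node.findtext("filename")] = {
            "size": int(node.findtext("size")),
            "md5": node.findtext("md5sum"),
            "first_item": node.findtext("first_item"),
            "last_item": node.findtext("last_item"),
        }
    return entries


def matches_manifest(data: bytes, entry: dict) -> bool:
    """Return True when the data has the size and MD5 digest recorded in the entry."""
    return len(data) == entry["size"] and hashlib.md5(data).hexdigest() == entry["md5"]


entries = load_manifest(SAMPLE_MANIFEST)
is_valid = matches_manifest(DEMO_BYTES, entries["part-0001.parquet"])
```

Comparing the size first is a cheap way to reject truncated downloads before hashing a large file.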
Dataset Schema
Every row represents a single paper with metadata and parsed LaTeX source contents:
| Column Name | Type | Description |
|---|---|---|
| `id` | `string` | arXiv paper identifier (e.g. 0704.0001 or hep-th/9901001). |
| `yymm_id` | `string` | Normalized ID mapped to YYMM format for chronological sorting. |
| `submitter` | `string` | Name of the user who uploaded the paper. |
| `authors` | `string` | Raw authors string. |
| `title` | `string` | Title of the paper. |
| `comments` | `string` | Submitter comments or journal references. |
| `journal-ref` | `string` | Official journal publication reference (if published). |
| `doi` | `string` | Digital Object Identifier (DOI). |
| `report-no` | `string` | Report or document series numbers. |
| `categories` | `string` | Space-separated arXiv categories (e.g., cs.CL math.PR). |
| `license` | `string` | License under which the paper was published. |
| `abstract` | `string` | The paper's abstract. |
| `versions` | `list<struct>` | Struct list of versions with creation timestamps. |
| `update_date` | `string` | Date the paper record was last modified by arXiv. |
| `authors_parsed` | `list<list<string>>` | Split author names (structured by Last Name, First Name, suffix). |
| `latex` | `large_string` | The parsed, compiled LaTeX source code from the paper. All source files (.tex, .bib, .sty, etc.) are bundled into a single readable Markdown-style tree structure. |
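The `latex` column bundles every source file into one string. Judging from the preview rows, each file appears to be introduced by a rule of `=` characters, a `FILE: <name>` line, and a second rule. The sketch below splits a bundle back into files under that assumption. The exact rule length, the `refs.bib` file name, and the sample contents are illustrative only, and a source file that happens to contain the same three-line pattern would be split incorrectly.

```python
import re

# Assumed header shape: a line of '=' characters, "FILE: <name>", another '=' line.
FILE_HEADER = re.compile(r"^={10,}\nFILE: (?P<name>.+)\n={10,}\n", re.MULTILINE)


def split_latex_bundle(bundle: str) -> dict:
    """Split a bundled `latex` string into a {filename: contents} mapping."""
    matches = list(FILE_HEADER.finditer(bundle))
    files = {}
    for index, match in enumerate(matches):
        # A file's contents run up to the next header, or to the end of the bundle.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(bundle)
        files[match.group("name").strip()] = bundle[match.end():end]
    return files


RULE = "=" * 48
sample = (
    f"{RULE}\nFILE: 9107001.tex\n{RULE}\n% FSU-SCRI-91-94\n\\input harvmac\n"
    f"{RULE}\nFILE: refs.bib\n{RULE}\n@article{{demo}}\n"
)

files = split_latex_bundle(sample)
```

Keying the result by file name makes it easy to pull out only the `.tex` sources, for example with `{name: text for name, text in files.items() if name.endswith(".tex")}`.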
Curation & Licensing
This dataset mirrors data provided under arXiv's Terms of Use. The copyright and licenses of individual paper contents are retained by their respective authors, and correspond to the license identifier specified in the license column.
