audio (duration 102 s to 9.7k s) | file_name (string, length 6 to 9) | transcript (string, length 922 to 47.5k) | spanish_share (float64, 0 to 0.95) | languages (list, length 2) | duration_s (float64, 103 to 9.72k) |
|---|---|---|---|---|---|
herring1 | well she was telling me about this thing with Oprah .
that she's a new age thing .
and she's a book .
and she has a video blog .
and she's promoting her book after her show .
it's weird .
I'm gonna go online to (.) see what it's all about .
a new age thing ?
yeah like that your own you're (.) your own god .
like this
a... | 0.0446 | [
"en",
"es"
] | 1,941.49 | |
herring10 | oh now that I now that I remember antes de que se me olvide .
this lady from the school .
mmhm .
she told me she told me that she bought the subway passes (.) for the whole year .
subway passes to where to New_York ?
to New_York it's forty bucks .
yeah but you know what she bought I told you she bought that pass .
whic... | 0.2322 | [
"en",
"es"
] | 1,814.32 | |
herring11 | es en la en la primera vez que lo tocaron en mil ochocientos seis .
uhuh .
se oyó y ninguno le gustó .
sí .
y ah y lo tocaron entre otras piezas .
como él tenía esa costumbre de tener tanta tanto material para tocar .
y entonces después dice .
el con el con siempre ha venido a largo tiempo .
después de su primera creac... | 0.9549 | [
"en",
"es"
] | 1,863.14 | |
herring12 | cuándo vamos a salir ?
no sé un día de estos .
pero xxx este fin de semana .
cómo ?
que a la calle xxx solo xxx hoy .
por qué ?
ay bueno porque es el cumpleaños de la Estrella en la noche .
qué Estrella ?
la prima de Toni y lo va a celebrar en xxx .
vas a ir .
yo quería ir .
pero allá imaginate nos caen todas no jodás ... | 0.9416 | [
"en",
"es"
] | 1,990.1 | |
herring13 | o_k o_k so eso these are new right ?
o_k .
those are new o_k .
this is
alright she left these here for you to I guess process them or whatever .
mmhm .
right right .
whatever her it's she needs to sign it .
o_k .
o_k ?
eh .
so everything is the same except the finances ?
everything is the same porque we switched it fro... | 0.2605 | [
"en",
"es"
] | 1,796.48 | |
herring14 | xxx .
yeah .
I was like +"/.
+" oh vas a coger tu pelo cortico como Connor ?
entonces él dijo +"/.
+" no cortico no es muy es muy chiquito para mí .
mal .
so he said he was getting like (.) medium haircut .
I know it because his (.) used to his : long hair and stuff
yeah .
pero qué más ?
qué estábamos hablando antes de... | 0.5644 | [
"en",
"es"
] | 1,807.39 | |
herring15 | "so you don't even look at a computer .\nlike he ju(st) like he does it .\nbut (.) he just does as f(...TRUNCATED) | 0.0046 | [
"en",
"es"
] | 1,796.48 | |
herring16 | "ya saliste [=! laughs] ?\n&=laughs .\nno pero todavía no xxx cada palabra hace and and\nay Dios m(...TRUNCATED) | 0.4083 | [
"en",
"es"
] | 1,853.95 | |
herring17 | "there are plenty of new um alternatives .\naha .\nregarding um birth control xxx pills .\nyeah .\ns(...TRUNCATED) | 0.3729 | [
"en",
"es"
] | 1,801.29 | |
herring2 | "no se no se andan juntando con todo el mundo me entendés o sea que no hay\nno hay drama .\nsí exa(...TRUNCATED) | 0.9541 | [
"en",
"es"
] | 1,844.9 |
Bangor Miami Spanish-English Corpus
The Bangor Miami Corpus is a naturalistic Spanish-English code-switching speech dataset collected by Jon Russell Herring at Bangor University. It captures spontaneous bilingual conversations recorded in Miami, Florida, involving proficient Spanish-English bilinguals across multiple speaker groups.
Dataset description
| Property | Value |
|---|---|
| Total recordings | 56 |
| Total duration | ~32 h |
| Languages | English (en), Spanish (es) |
| Format | MP3 audio + plain-text transcript |
| Speaker groups | herring (16), maria (15), sastre (13), zeledon (12) |
Each row is one full recording session and contains:
| Column | Type | Description |
|---|---|---|
| audio | Audio | Raw waveform decoded at original sampling rate |
| file_name | string | Recording identifier (e.g. herring1, sastre3) |
| transcript | string | Full conversation transcript, one utterance per line, cleaned of CHAT annotation markers |
| spanish_share | float | Fraction of Spanish words in the recording (0 = all English, 1 = all Spanish), computed from word-level language-ID annotations in the corpus |
| languages | list[string] | Always ["en", "es"] |
| duration_s | float | Recording duration in seconds |
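Because each transcript stores one utterance per line, a row can be split into utterances with plain string handling. The following is a minimal sketch, assuming a row is available as a Python dict keyed by the column names above; the `row` literal is a hand-built stand-in (audio omitted, transcript shortened) and the `utterances` helper is hypothetical, not part of the dataset.

```python
# Hypothetical row shaped like the columns described above (audio omitted,
# transcript shortened to its first three utterances).
row = {
    "file_name": "herring1",
    "transcript": (
        "well she was telling me about this thing with Oprah .\n"
        "that she's a new age thing .\n"
        "and she's a book ."
    ),
    "spanish_share": 0.0446,
    "languages": ["en", "es"],
    "duration_s": 1941.49,
}


def utterances(transcript: str) -> list[str]:
    """Split a transcript into utterances, one per line, dropping blank lines."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]


utts = utterances(row["transcript"])
minutes = row["duration_s"] / 60  # duration_s is given in seconds
print(f"{row['file_name']}: {len(utts)} utterances, {minutes:.1f} min")
```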
Spanish share distribution
spanish_share is computed as:
spanish_share = count(langid == "spa") / count(langid != "999")
where langid comes from the word-level TSV annotations shipped with the corpus and 999 marks punctuation tokens.
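The formula can be sketched in a few lines of Python. This is an illustration only, not the script used to build the dataset: it assumes the word-level tags have already been read into a list, and the `"eng"` tag in the example is an assumed label for English words (the description above only names `"spa"` and `"999"`).

```python
def spanish_share(langids):
    """Fraction of non-punctuation tokens tagged as Spanish.

    `langids` is a sequence of word-level language-ID tags; "999" marks
    punctuation and is excluded from the denominator.
    """
    counted = [tag for tag in langids if tag != "999"]
    if not counted:
        # No countable words: return 0.0 rather than dividing by zero
        # (an assumption; the source does not say how this case is handled).
        return 0.0
    return sum(tag == "spa" for tag in counted) / len(counted)


# Toy example: 3 Spanish tokens out of 5 words, plus one punctuation token.
# "eng" is an assumed tag for English words.
tags = ["eng", "eng", "spa", "spa", "spa", "999"]
share = spanish_share(tags)
```

Any tag other than `"spa"` and `"999"` counts toward the denominator only, which matches the formula as written.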
The corpus ranges from near-monolingual English (< 2% Spanish words) to near-monolingual Spanish (> 95% Spanish words), with a mean of ~34% Spanish words.
Configurations
default
All 56 recordings (~32 h total).
mixed
A ~2.5 h subset of 5 recordings restricted to genuinely mixed conversations, where 20%–80% of words are Spanish (spanish_share between 0.2 and 0.8). Recordings are drawn from all four speaker groups; their Spanish shares range from 0.37 to 0.79.
| Speaker | File | Duration | Spanish share |
|---|---|---|---|
| herring | herring17 | ~30 min | 0.37 |
| maria | maria20 | ~32 min | 0.37 |
| sastre | sastre8 | ~33 min | 0.39 |
| zeledon | zeledon4 | ~22 min | 0.43 |
| zeledon | zeledon14 | ~33 min | 0.79 |
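The 0.2 to 0.8 criterion can be applied to any set of rows as a simple range filter. The sketch below uses `file_name` and `spanish_share` values taken from the preview rows at the top of this page; the `is_mixed` helper and its inclusive bounds are assumptions. It shows only the range test and does not reproduce how the five recordings above were chosen.

```python
# (file_name, spanish_share) pairs copied from the preview rows of this card.
rows = [
    ("herring1", 0.0446),
    ("herring10", 0.2322),
    ("herring11", 0.9549),
    ("herring12", 0.9416),
    ("herring13", 0.2605),
    ("herring14", 0.5644),
]


def is_mixed(share, low=0.2, high=0.8):
    """True when the Spanish share lies in the mixed range (bounds inclusive)."""
    return low <= share <= high


mixed = [name for name, share in rows if is_mixed(share)]
print(mixed)
```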
Source & citation
The original corpus was collected and transcribed at Bangor University. If you use this dataset, please cite the original work:
@incollection{bangor_miami,
author = {Deuchar, Margaret and Davies, Peredur and Herring, Jon Russell
and Parafita Couto, Maria Carmen and Carter, Diana},
title = {Building bilingual corpora},
booktitle = {Advances in the Study of Bilingualism},
editor = {Thomas, Enlli and Mennen, Ineke},
year = {2014},
publisher = {Multilingual Matters},
address = {Bristol}
}
The corpus is available under CC BY-SA 3.0.