URL: https://huggingface.co/datasets/BrunoHays/Bangor-Miami-Spanish-English-Corpus

Bangor Miami Spanish-English Corpus

The Bangor Miami Corpus is a naturalistic Spanish-English code-switching speech dataset collected by Jon Russell Herring at Bangor University. It captures spontaneous bilingual conversations recorded in Miami, Florida, involving proficient Spanish-English bilinguals across multiple speaker groups.

Dataset description

| Property | Value |
| --- | --- |
| Total recordings | 56 |
| Total duration | ~32 h |
| Languages | English (en), Spanish (es) |
| Format | MP3 audio + plain-text transcript |
| Speaker groups | herring (16), maria (15), sastre (13), zeledon (12) |

Each row is one full recording session and contains:

| Column | Type | Description |
| --- | --- | --- |
| audio | Audio | Raw waveform decoded at original sampling rate |
| file_name | string | Recording identifier (e.g. herring1, sastre3) |
| transcript | string | Full conversation transcript, one utterance per line, cleaned of CHAT annotation markers |
| spanish_share | float | Fraction of Spanish words in the recording (0 = all English, 1 = all Spanish), computed from word-level language-ID annotations in the corpus |
| languages | list[string] | Always ["en", "es"] |
| duration_s | float | Recording duration in seconds |
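
A minimal sketch of loading the corpus and summarizing a row, assuming the Hugging Face `datasets` library is installed. The repository ID is taken from the dataset URL; the split name `train` is an assumption (the card does not state it), and `load_corpus`, `describe_row`, and `print_summaries` are hypothetical helper names used only for illustration.

```python
REPO_ID = "BrunoHays/Bangor-Miami-Spanish-English-Corpus"


def load_corpus(config="default", split="train"):
    """Load one configuration of the corpus from the Hugging Face Hub.

    The split name "train" is an assumption; adjust it if the repository
    uses a different one. Requires `pip install datasets` and network access.
    """
    # Imported lazily so that describe_row() can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


def describe_row(row):
    """Return a one-line summary built from the documented columns."""
    minutes = row["duration_s"] / 60
    n_utterances = len(row["transcript"].splitlines())
    return (
        f"{row['file_name']}: {minutes:.1f} min, "
        f"{n_utterances} utterances, "
        f"{row['spanish_share']:.0%} Spanish"
    )


def print_summaries(config="mixed"):
    """Print a summary line for every recording in a configuration."""
    corpus = load_corpus(config)
    # Dropping the audio column avoids decoding MP3 data for metadata-only use.
    for row in corpus.remove_columns("audio"):
        print(describe_row(row))
```

Calling `print_summaries("mixed")` would download the smaller configuration and print one line per recording; `describe_row` works on any dictionary that has the columns listed above.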

Spanish share distribution

spanish_share is computed as:

spanish_share = count(langid == "spa") / count(langid != "999")

where langid comes from the word-level TSV annotations shipped with the corpus and 999 marks punctuation tokens. The corpus spans from near-monolingual English (< 2 %) to near-monolingual Spanish (> 95 %), with a mean of ~34 % Spanish words.
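
The formula can be sketched in a few lines of Python. This is an illustration only: it takes a plain list of tags rather than parsing the TSV files (whose column layout is not described here), and the `"eng"` tag in the example is an assumption. Following the formula, every tag other than `"999"` counts toward the denominator.

```python
def spanish_share(langids):
    """Fraction of non-punctuation tokens tagged as Spanish.

    `langids` is an iterable of word-level language tags: "spa" for Spanish,
    "999" for punctuation, and anything else (e.g. "eng", assumed here)
    for other tokens.
    """
    tags = [tag for tag in langids if tag != "999"]
    if not tags:
        raise ValueError("no non-punctuation tokens")
    return sum(tag == "spa" for tag in tags) / len(tags)


# 3 Spanish tokens out of 6 non-punctuation tokens.
tags = ["eng", "eng", "spa", "spa", "spa", "999", "eng", "999"]
print(spanish_share(tags))  # 0.5
```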

Configurations

default

All 56 recordings (~32 h total).

mixed

A ~2.5 h subset of 5 recordings restricted to genuinely mixed conversations, in which 20%–80% of words are Spanish (spanish_share between 0.2 and 0.8). Recordings are drawn from all four speaker groups; their Spanish shares range from 0.37 to 0.79.

| Speaker | File | Duration | Spanish share |
| --- | --- | --- | --- |
| herring | herring17 | ~30 min | 0.37 |
| maria | maria20 | ~32 min | 0.37 |
| sastre | sastre8 | ~33 min | 0.39 |
| zeledon | zeledon4 | ~22 min | 0.43 |
| zeledon | zeledon14 | ~33 min | 0.79 |
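
As a rough check of the figures above, the sketch below recomputes the subset's total duration from the rounded values in the table and applies the 0.2–0.8 criterion. The helper names `is_mixed` and `select_mixed` are hypothetical, and treating the bounds as inclusive is an assumption. The mixed configuration is a selection of five recordings, so applying the same filter to the default configuration may return more rows than the ones listed here.

```python
# (file_name, approximate duration in minutes, spanish_share), copied from the table.
MIXED = [
    ("herring17", 30, 0.37),
    ("maria20", 32, 0.37),
    ("sastre8", 33, 0.39),
    ("zeledon4", 22, 0.43),
    ("zeledon14", 33, 0.79),
]


def is_mixed(share, low=0.2, high=0.8):
    """True if a Spanish share lies in the mixed range (bounds assumed inclusive)."""
    return low <= share <= high


def select_mixed(rows, low=0.2, high=0.8):
    """Keep rows (dicts with a `spanish_share` key) that fall in the mixed range."""
    return [row for row in rows if is_mixed(row["spanish_share"], low, high)]


total_hours = sum(minutes for _, minutes, _ in MIXED) / 60
all_in_range = all(is_mixed(share) for _, _, share in MIXED)
print(f"{len(MIXED)} recordings, ~{total_hours:.1f} h, all in range: {all_in_range}")
```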

Source & citation

The original corpus was collected and transcribed at Bangor University. If you use this dataset, please cite the original work:

@incollection{bangor_miami,
 author = {Deuchar, Margaret and Davies, Peredur and Herring, Jon Russell
 and Parafita Couto, Maria Carmen and Carter, Diana},
 title = {Building bilingual corpora},
 booktitle = {Advances in the Study of Bilingualism},
 editor = {Thomas, Enlli and Mennen, Ineke},
 year = {2014},
 publisher = {Multilingual Matters},
 address = {Bristol}
}

The corpus is available under CC BY-SA 3.0.
