Bangor Miami Spanish-English Corpus

The Bangor Miami Corpus is a naturalistic Spanish-English code-switching speech dataset collected by Jon Russell Herring at Bangor University. It captures spontaneous bilingual conversations recorded in Miami, Florida, involving proficient Spanish-English bilinguals across multiple speaker groups.

Dataset description


Property	Value
Total recordings	56
Total duration	~32 h
Languages	English (`en`), Spanish (`es`)
Format	MP3 audio + plain-text transcript
Speaker groups	herring (16), maria (15), sastre (13), zeledon (12)

Each row is one full recording session and contains:

Column	Type	Description
`audio`	Audio	Raw waveform decoded at original sampling rate
`file_name`	string	Recording identifier (e.g. `herring1`, `sastre3`)
`transcript`	string	Full conversation transcript, one utterance per line, with most CHAT annotation markers removed (some, such as `(.)` for pauses and `xxx` for unintelligible speech, still appear)
`spanish_share`	float	Fraction of Spanish words in the recording (0 = all English, 1 = all Spanish), computed from word-level language-ID annotations in the corpus
`languages`	list[string]	Always `["en", "es"]`
`duration_s`	float	Recording duration in seconds
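
A minimal sketch of loading the dataset and summarizing one row, assuming the Hugging Face `datasets` library is installed. The repository id comes from the dataset URL and the column names from the table above. The split name `train` is an assumption that this card does not confirm, and `load_corpus`, `split_utterances`, and `summarize_row` are hypothetical helper names.

```python
REPO_ID = "BrunoHays/Bangor-Miami-Spanish-English-Corpus"
KNOWN_CONFIGS = ("default", "mixed")  # configuration names listed in this card


def load_corpus(config="default", split="train"):
    """Load one configuration; the split name "train" is an assumption."""
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the pure helpers below work without the library.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)


def split_utterances(transcript):
    """Return the non-empty lines of a transcript, one utterance each."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]


def summarize_row(row):
    """Format the metadata columns of one row (any dict-like object)."""
    minutes = row["duration_s"] / 60
    n_utterances = len(split_utterances(row["transcript"]))
    return (
        f"{row['file_name']}: {minutes:.1f} min, "
        f"{row['spanish_share']:.0%} Spanish, {n_utterances} utterances"
    )
```

With network access, `summarize_row(load_corpus()[0])` should describe the first recording. Accessing `row["audio"]` triggers MP3 decoding, so reading only the metadata columns is cheaper when the waveform is not needed.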

Spanish share distribution

`spanish_share` is computed as:

spanish_share = count(langid == "spa") / count(langid != "999")

where `langid` comes from the word-level TSV annotations shipped with the corpus and `999` marks punctuation tokens, which are excluded from the denominator. The corpus ranges from near-monolingual English (< 2% Spanish) to near-monolingual Spanish (> 95% Spanish), with a mean of ~34% Spanish words.
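
The formula can be sketched in Python as follows. The tags `"spa"` and `"999"` come from the definition above. The TSV layout (a header line with a `langid` column), the `surface` column, and the `"eng"` tag in the sample are assumptions made for illustration, as is returning 0 for a recording with no words.

```python
import csv
import io


def spanish_share(langids):
    """Fraction of non-punctuation tokens whose langid is "spa"."""
    words = [tag for tag in langids if tag != "999"]  # drop punctuation tokens
    if not words:
        return 0.0  # assumption: an empty recording has share 0
    return sum(tag == "spa" for tag in words) / len(words)


def langids_from_tsv(text, column="langid"):
    """Read one langid per row from TSV text with a header line (layout assumed)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row[column] for row in reader]


# Hypothetical five-token sample: two Spanish words, two English words, one period.
sample = "surface\tlangid\nantes\tspa\nde\tspa\nthis\teng\nlady\teng\n.\t999\n"
print(spanish_share(langids_from_tsv(sample)))  # 2 Spanish out of 4 words -> 0.5
```

Any tag other than `"spa"` and `"999"` counts toward the denominator only, so tokens with mixed or undetermined language labels lower the share under this reading of the formula.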

Configurations

`default`

All 56 recordings (~32 h total).

`mixed`

A ~2.5 h subset of 5 recordings restricted to genuinely mixed conversations, i.e. those whose `spanish_share` lies between 0.2 and 0.8. Recordings are drawn from all four speaker groups; their Spanish shares range from 0.37 to 0.79.

Speaker	File	Duration	Spanish share
herring	herring17	~30 min	0.37
maria	maria20	~32 min	0.37
sastre	sastre8	~33 min	0.39
zeledon	zeledon4	~22 min	0.43
zeledon	zeledon14	~33 min	0.79
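
The figures in this table can be cross-checked with a few lines of Python. The rows are copied from the table, with durations rounded to whole minutes as listed. Treating the 0.2 and 0.8 bounds as inclusive is an assumption, and `in_mixed_range` is a hypothetical helper name.

```python
# (speaker group, file_name, approximate minutes, spanish_share) from the table
MIXED = [
    ("herring", "herring17", 30, 0.37),
    ("maria", "maria20", 32, 0.37),
    ("sastre", "sastre8", 33, 0.39),
    ("zeledon", "zeledon4", 22, 0.43),
    ("zeledon", "zeledon14", 33, 0.79),
]


def in_mixed_range(share, low=0.2, high=0.8):
    """True if a Spanish share falls in the mixed band (bounds assumed inclusive)."""
    return low <= share <= high


total_minutes = sum(minutes for _, _, minutes, _ in MIXED)
groups = sorted({group for group, _, _, _ in MIXED})

print(total_minutes / 60)                                # 2.5 hours
print(groups)                                            # all four speaker groups
print(all(in_mixed_range(share) for *_, share in MIXED)) # True
```

The same predicate could be applied to the `spanish_share` column of the `default` configuration to build a larger mixed subset, although the criteria used to pick these five recordings are not stated here.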

Source & citation

The original corpus was collected and transcribed at Bangor University. If you use this dataset, please cite the original work:

@incollection{bangor_miami,
 author = {Deuchar, Margaret and Davies, Peredur and Herring, Jon Russell
 and Parafita Couto, Maria Carmen and Carter, Diana},
 title = {Building bilingual corpora},
 booktitle = {Advances in the Study of Bilingualism},
 editor = {Thomas, Enlli and Mennen, Ineke},
 year = {2014},
 publisher = {Multilingual Matters},
 address = {Bristol}
}

The corpus is available under CC BY-SA 3.0.


URL: https://huggingface.co/datasets/BrunoHays/Bangor-Miami-Spanish-English-Corpus
