FluencyBank English Davis Corpus
Eleonora Beier
Department of Psychology
University of California, Davis
ejbeier@ucdavis.edu

Nene (Suphasiree) Chantavarin
Department of Psychology
University of California, Davis
schantavarin@ucdavis.edu

Fernanda Ferreira
Department of Psychology
University of California, Davis
fferreira@ucdavis.edu
Participants: 20 celebrities
Type of Study: longitudinal
Location: United States
Media type: audio
DOI: doi:10.21415/388K-5726
Beier, E., Chantavarin, S., & Ferreira, F. (submitted). Age doesn’t
matter, but speech rate does: A longitudinal corpus study of
disfluencies.
In accordance with TalkBank rules, any use of data from this corpus must
be accompanied by at least one of the above references.
Project Description
The disfluency_data.csv file includes these variables:
- Interviewee: person being interviewed
- Interview_age: age of the interviewee
- Interview_year: year the interview took place
- Occupation: occupation category for the interviewee
- N_segments: number of interview segments selected from longer
interviews in order to reach 2 minutes of the interviewee’s speech
- N_interviewer_segments: number of short interjections by someone
other than the interviewee (e.g., the interviewer) within the selected
interview segments
- Total_time: duration of all interview segments in seconds
- Interviewer_time: duration of all interviewer interjections in seconds
- Speech_only_time: duration of interviewee-only speech in seconds (Total_time – Interviewer_time)
- N_words: number of words uttered by the interviewee
- UhUhm_Count: number of filled pauses uttered by the interviewee
- Repeats_Count: number of repeats uttered by the interviewee
- Repairs_Count: number of repairs uttered by the interviewee
- N_syllables: number of syllables uttered by the interviewee,
calculated using the nsyllable function in the R package quanteda
(version 2.0.1, Benoit et al., 2018)
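To illustrate how these columns relate to one another, the following Python sketch derives speech-only time and per-interview rates from a row shaped like those in disfluency_data.csv. It is not the authors' analysis code, which is written in R. The helper name `derive_rates`, the choice of Speech_only_time as the rate denominator, the per-100-words normalization of filled pauses, and the sample values are all assumptions made for illustration.

```python
import csv
import io


def derive_rates(row):
    """Derive rate measures for one CSV row (hypothetical helper, not from the corpus)."""
    total = float(row["Total_time"])
    interviewer = float(row["Interviewer_time"])
    # Follows the documented definition: Total_time minus Interviewer_time.
    speech_only = total - interviewer
    n_words = int(row["N_words"])
    n_syllables = int(row["N_syllables"])
    filled_pauses = int(row["UhUhm_Count"])
    return {
        "speech_only_time": speech_only,
        # Assumption: rates are taken over interviewee-only speech time.
        "words_per_second": n_words / speech_only,
        "syllables_per_second": n_syllables / speech_only,
        # Assumption: filled pauses normalized per 100 words.
        "filled_pauses_per_100_words": 100 * filled_pauses / n_words,
    }


# Made-up sample row for demonstration; these values are not corpus data.
SAMPLE_CSV = """Interviewee,Total_time,Interviewer_time,N_words,N_syllables,UhUhm_Count
Example Person,130,10,360,480,9
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = derive_rates(rows[0])
print(rates)
```

To run this against the real file, replace the `io.StringIO(SAMPLE_CSV)` argument with an open file handle for disfluency_data.csv. The column names used above are the ones listed in the variable descriptions.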
Additional files include:
- CohMetrix_data.csv: data used to perform exploratory analyses.
Interview transcriptions were analyzed using Coh-Metrix 3.0 (Graesser et
al., 2004; McNamara et al., 2014). For a description of each variable,
see http://cohmetrix.com/.
- Transcriptions_full: folder containing text files for all interview
transcriptions in their “raw” form. Each transcription contains all
interview segments, and includes short interjections by the interviewer
or others. {Curly brackets} denote the beginning and end of each
interview segment. [Square brackets] denote the beginning and end of
speech by anyone other than the interviewee (e.g., the interviewer).
Unintelligible words are denoted as (?). These transcriptions also
include non-words (e.g., uhm, hmm, laughs), which were later discarded; a
separate count of uhs and ums was used in the analyses.
- Transcriptions_clean: folder containing text files for all
interview transcriptions as they were used in our analyses (e.g., to
compute word and syllable rate). Speech by anyone other than the
interviewee was removed. All punctuation and all non-words were also
removed.
- statistical_analyses.R: R script with the mixed-effects models used for
all analyses, with words per second as the measure of speech rate
- statistical_analyses_syllablerate.R: R script with the mixed-effects
models used for all analyses, with syllables per second as the measure of
speech rate
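The markup conventions described for Transcriptions_full suggest how a raw transcription maps onto its Transcriptions_clean counterpart. The Python sketch below approximates that mapping and is not the authors' procedure. It assumes that square brackets enclose the other speaker's words, that (?) markers are simply dropped, and that apostrophes inside words are kept. The `NON_WORDS` set is hypothetical, since the description gives only examples of non-words, and the sample string is invented.

```python
import re

# Hypothetical non-word list; the corpus description gives only examples.
NON_WORDS = {"uh", "um", "uhm", "hmm", "laughs"}


def clean_transcription(raw):
    """Approximate the raw-to-clean conversion (a sketch under stated assumptions)."""
    # Remove speech by anyone other than the interviewee: [ ... ]
    text = re.sub(r"\[[^\]]*\]", " ", raw)
    # Remove unintelligible-word markers: (?)
    text = text.replace("(?)", " ")
    # Remove segment delimiters but keep the speech inside them.
    text = text.replace("{", " ").replace("}", " ")
    # Remove punctuation; keeping apostrophes inside words is an assumption.
    text = re.sub(r"[^\w\s'’]", " ", text)
    # Discard non-words, comparing case-insensitively.
    words = [w for w in text.split() if w.lower() not in NON_WORDS]
    return " ".join(words)


# Invented sample using the documented conventions; not corpus text.
raw = "{Well, uhm, I think [Right.] it's, it's (?) fine.} {Hmm, yes!}"
cleaned = clean_transcription(raw)
print(cleaned)
print(len(cleaned.split()), "words")
```

A word count taken from the cleaned text in this way corresponds in spirit to N_words. The exact tokenization the authors used is not specified here, so counts from this sketch may differ from those in disfluency_data.csv.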