Are you thinking of donating a corpus of speech/language data to
We typically accept only transcribed files. If you only have raw
audio or video, we would need to discuss the best ways to get your data
into the linked CHAT transcript format that FluencyBank uses. However,
we can convert virtually any other format to CHAT, including SALT. If
your data contain personal identifiers, we can work with you to remove
We can host your data as fully open access or behind a password,
depending upon the level of consent that you obtained from participants,
the type of media (we can extract audio from video), and how
identifiable the transcripts would be.
Corpora can be longitudinal or cross-sectional, mixed, or
Each contributed corpus should have a documentation file. This “readme”
file (preferably in plain text) should contain a basic set of facts that
are indispensable for the proper interpretation of the data by other
researchers or users. At a minimum, each readme file should cover the
following items:
- In a longitudinal corpus, one or more speakers are transcribed in a
series of interactions over time.
- Cross-sectional studies typically have groups of different speakers
(perhaps divided by age, diagnostic group [e.g., individuals who do/do
not stutter], or both).
- A number of well-known corpora in fluency, such as the Illinois
International Stuttering Project, are both cross-sectional and
longitudinal. These corpora track individual children over time to
identify potential factors in persistence and recovery.
- Finally, some corpora in our teaching section illustrate specific
behaviors, such as features of stuttering and typical fluency, response
to DAF, discussions about affective/cognitive components of stuttering,
and sample therapies.
When these data are complete, please contact Brian MacWhinney
(email@example.com) and Nan Bernstein Ratner (firstname.lastname@example.org) for
instructions on how to transfer data through the WeTransfer system as
described at https://talkbank.org/share/contrib.html.
THANK YOU for supporting FluencyBank.
- Donor information. We post pictures of contributors and their
contact information on the home page for all corpora. Please see
individual corpora already in FluencyBank or CHILDES for examples.
- Acknowledgments. There should be a statement that asks the user to
cite some particular reference when using the corpus. For example,
researchers using the Adam, Eve, and Sarah corpora from Roger Brown and
his colleagues are asked to cite Brown (1973). In addition, all users
can cite this current manual as the source for the TalkBank system in
- Restrictions. If the data are being contributed to TalkBank,
contributors can set particular restrictions on the use of their data.
For example, researchers may ask that they be sent copies of articles
that make use of their data. Many researchers have chosen to set no
limitations at all on the use of their data.
- Warnings. This documentation file should also warn other
researchers about limitations on the use of the data. For example, if an
investigator paid no attention to correct transcription of speech
errors, this should be noted.
- Pseudonyms. The readme file should also include information on
whether informants gave informed consent for the use of their data and
whether pseudonyms have been used to preserve informant anonymity. In
general, real names should be replaced by pseudonyms. Anonymization is
not necessary when the subject of the transcriptions is the researcher's
own child, as long as the child grants permission for the use of the
- Project Description. There should be detailed information on the
history, motivation, and procedures of the project. How was funding
obtained? What were the goals of the project? How were the data collected?
What was the sampling procedure? How was transcription done? What was
ignored in transcription? Were transcribers trained? Was reliability
checked? Was coding done? What codes were used? Was the material
- Codes. If there are project-specific codes, these should be
- Demographic data. Wherever possible, demographic, dialectological,
and psychometric data should be provided for each informant.
Particularly for research data, there should be information on topics
such as age, gender, schooling, social class, and occupation.
- Situational descriptions. The readme file should include
descriptions of the contexts of the recordings, such as the task or the
nature of the activities being recorded. Additional specific situational
information should be included in the @Situation and @Comment fields in
each file, as appropriate. For example, in fluency, we’d certainly like
to distinguish between conversation, monologue, narrative, experimental
- For data contributed specifically for teaching purposes, it helps
us if you group the files and label each one, both in its filename and
in its headers (@Comment), with the concept you think it best
illustrates. This helps us to organize activities around your
contribution. Your own exercises using these files are also welcome.
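
Before sending files, contributors may want to confirm that each transcript carries the @Situation and @Comment headers mentioned above. The following is a minimal sketch, not a TalkBank tool. It assumes only that CHAT header lines begin with "@" and that continuation lines begin with a tab. The function names, the list of required headers, and the sample transcript are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def extract_headers(chat_text):
    """Collect header lines (those starting with '@') into a dict of lists.

    Assumes CHAT-style layout: '@Name:<tab>value', with continuation
    lines starting with a tab.
    """
    headers = defaultdict(list)
    current = None  # name of the header a continuation line would extend
    for line in chat_text.splitlines():
        if line.startswith("@"):
            name, _, value = line[1:].partition(":")
            current = name.strip()
            headers[current].append(value.strip())
        elif line.startswith("\t") and current is not None:
            # Continuation of the most recent header value.
            headers[current][-1] += " " + line.strip()
        else:
            # Utterance or dependent tier: stop extending headers.
            current = None
    return dict(headers)


def missing_headers(chat_text, required=("Situation", "Comment")):
    """Return the required header names that are absent or empty.

    The default 'required' tuple is an assumption for this sketch.
    """
    found = extract_headers(chat_text)
    return [name for name in required if not any(found.get(name, []))]


# Hypothetical transcript fragment, invented for this example.
sample = (
    "@Begin\n"
    "@Languages:\teng\n"
    "@Participants:\tPAR Participant, INV Investigator\n"
    "@Situation:\tconversation about a recent trip\n"
    "*PAR:\tI went to the store .\n"
    "@End\n"
)

print(missing_headers(sample))  # prints ['Comment']
```

Running a check like this over a folder of transcripts gives a quick list of files that still need a situational description or a teaching-concept label before submission.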