Corpus types included in LDC:
"T" indicates a text corpus
"S" indicates a speech audio corpus
"V" indicates a video corpus
"L" indicates a lexicon
The Linguistic Data Consortium (LDC) creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. LDC is an open consortium of universities, companies and government research laboratories, hosted by the University of Pennsylvania. It was founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA) and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.
Authorized MIT users (current faculty, students, and staff) have access through the MIT Libraries to all corpora that LDC has produced from 2016 to the present.
Browse the LDC catalog here: https://catalog.ldc.upenn.edu/
You need an individual account, approved by the libraries, to download data. For account creation directions, see below.
For corpora produced before 2016, we may be able to purchase individual titles on an as-needed basis, depending on cost and licensing terms. Contact the MIT Libraries LDC team at firstname.lastname@example.org with the following information: the name of the corpus and a link to its LDC catalog entry; your name, email, department and (if applicable) lab or group; and a brief description of what you need the corpus for.
Special subscriptions are available to Lincoln Laboratory affiliates, who should contact Douglas Reynolds (dar[at]ll.mit.edu).
To access corpora from 2016-present that are available for download:
For all MIT users, available corpora are listed by year on the LDC site (MIT Libraries’ access to LDC corpora is limited to corpora published from 2016 onward). Note that corpora from 2016 to the present that are available as CDs or DVDs can be accessed by individual title through the Library’s Catalog.