LibGuides: LDC data: Home

Corpus types included in LDC

"T" indicates a text corpus
"S" indicates a speech audio corpus
"V" indicates a video corpus
"L" indicates a lexicon

About LDC

The Linguistic Data Consortium (LDC) creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. LDC is an open consortium of universities, companies and government research laboratories, hosted by the University of Pennsylvania. LDC was founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.

What is available

Authorized MIT users (current faculty, students, and staff) have access through the MIT Libraries to all of the corpora that LDC has published from 2016 to the present.

Browse the LDC catalog here: https://catalog.ldc.upenn.edu/

You need an individual account, approved by the libraries, to download data. For account creation directions, see below.

For corpora published before 2016, we may be able to purchase them on an as-needed basis, depending on cost and licensing terms. Contact the MIT Libraries LDC team at ldc-lib@mit.edu with the following information: the name of the corpus and a link to its LDC catalog entry; your name, email, department, and (if applicable) lab or group; and a brief description of what you need the corpus for.

Special subscriptions are available to Lincoln Laboratory affiliates, who should get in touch with Douglas Reynolds (dar[at]ll.mit.edu).

How to access our LDC collections

To access corpora from 2016-present that are available for download:

First, register with LDC using your MIT email address and your current department or lab.
1. Under "Organization", look for "Massachusetts Institute of Technology - MIT - Libraries" and select that option from the resulting drop-down menu. Do not use an abbreviated form (e.g. "MIT"); type the whole string above to reach the correct account.
2. The libraries will approve your account within one business day. (If you don't receive an email, check your spam filter.)
After your account is approved, log in to LDC and go to the downloads tab under "your account options" to access data.
1. The datasets that MIT Libraries has access to will be listed.
2. If you need a corpus that is only available on hard drive or that requires a separate license before download, please email ldc-lib@mit.edu. Please note that if you use a dataset that requires a special license signed by the individual user, your name may be visible to other MIT LDC users.
3. If you need a pre-2016 corpus that we don’t have subscription access to, please fill out this form. Typical delivery time is 5-7 business days, but this varies with corpus cost and licensing, so some requests may take longer.

Available corpora are listed by year on the LDC site; MIT Libraries’ access is limited to corpora published from 2016 on. Corpora from 2016 to the present that are distributed on CD or DVD can be accessed by individual title through the Library’s Catalog.

Linguistics Librarian

Ece Turnator

Contact:

Ece (pronounced AJ)
turnator@mit.edu
Hayden Library Consultation Suite 14S - 2nd floor
617.253.4979

Librarian for Electrical Engineering & Computer Science, IDSS, and Mathematics

Phoebe Ayers

Contact:

psayers@mit.edu
Room 10-500
617.253.4442