
MIT Libraries


LDC data: Home

A guide to accessing datasets via the LDC (Linguistic Data Consortium)

Corpus types included in LDC

"T" indicates a text corpus
"S" indicates a speech audio corpus
"V" indicates a video corpus
"L" indicates a lexicon

About LDC

The Linguistic Data Consortium (LDC) creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes. LDC is an open consortium of universities, companies, and government research laboratories, hosted by the University of Pennsylvania. LDC was founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.

What is available

Authorized MIT users (current faculty, students, and staff) have access through the MIT Libraries to all corpora that LDC has produced from 2016 to the present. For access, see the directions below.

For corpora produced before 2016, we may be able to purchase them on an as-needed basis, depending on cost and licensing terms. Contact the MIT Libraries LDC team at ldc-lib@mit.edu with the following information: the name of the corpus and a link to its LDC catalog entry; your name, email, department, and (if applicable) lab or group; and a brief description of what you need the corpus for.

Special subscriptions are available to affiliates of CSAIL and Lincoln Laboratory. CSAIL affiliates should contact Marcia Davidson (marcia[at]csail.mit.edu); Lincoln Laboratory affiliates should contact Douglas Reynolds (dar[at]ll.mit.edu).

How to access our LDC collections

To access corpora from 2016-present that are available for download: 

  1. Register with LDC using your MIT email address and your current department or lab.
    1. Under "Organization" enter "MIT Libraries" and select that option from the resulting drop-down menu.
    2. The libraries will approve your account within one business day. (If you don't receive an email, check your spam filter.)
  2. After your account is approved, log in to LDC and go to the downloads tab under "your account options".
    1. The datasets that MIT Libraries has access to will be listed.
    2. If you need a corpus that is only available by hard drive or requires a separate license before download, please email ldc-lib@mit.edu.
    3. If you need a pre-2016 corpus that we don’t have subscription access to, please fill out this form.

For all MIT users, available corpora are listed by year on the LDC site (MIT Libraries’ access to LDC corpora is limited to corpora published from 2016 on). Note that the corpora available as CDs or DVDs from 2016-present can be accessed by individual title through the Library’s Barton Catalog.

 

Linguistics Librarian

Ece Turnator
Contact:
Ece (pronounced AJ)
turnator@mit.edu
Dewey Library, E53-168-C
617.253.4979

Librarian for Electrical Engineering & Computer Science, IDSS, and Mathematics