What's on this page?
Here are some things that might be of interest to linguists, such as open access materials that may not be officially in our collection. Please recommend additions by emailing me (see contact info to the right)!
Data for Linguists
The Harvard Dataverse Network is available to MIT researchers to store and make available final versions of the data that they create or compile. For more information, see Social Science Data Services: Harvard Dataverse Network Deposit Guidelines. Additional sources of linguistic data are below.
Chinese-English Parallel Translations from TranslateFX
Available in .tsv format. Sources include government legislation, regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, and corporate constitutional documents.
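For readers who have not worked with .tsv files before, here is a minimal sketch of reading tab-separated sentence pairs with Python's standard library. The two-column layout (Chinese first, English second), the sample sentences, and the function name `read_parallel_tsv` are assumptions made for illustration; check the actual column order of the download before relying on it.

```python
import csv
import io


def read_parallel_tsv(handle):
    """Yield (source, target) pairs from a tab-separated file object.

    Assumes column 0 holds the source sentence and column 1 the
    translation; this layout is a guess, not documented by the dataset.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


# Tiny in-memory stand-in for a real file (invented sample sentences).
sample = io.StringIO("你好\tHello\n谢谢\tThank you\n")
pairs = list(read_parallel_tsv(sample))
print(pairs)
```

To read a downloaded file instead, pass `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.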
Corpus of Contemporary American English (COCA)
The largest structured corpus of American English, composed of more than 450 million words in 189,431 texts, including 20 million words for each year from 1990 to 2012. The corpus is divided equally among spoken, fiction, popular magazine, newspaper, and academic texts (2015-08-24).
Corpus of Historical American English (COHA)
COHA is composed of more than 400 million words in more than 100,000 individual texts.
Linguistic Data Consortium
MIT Libraries provides access to LDC data corpora published in 2016 and after. See: https://libguides.mit.edu/ldc
Linguistics Data Repositories
From the centralized re3data.org data repository directory.
OPUS is a collection of translated texts from the web.
Oregon Health and Science University # 0681-G Kids' Speech v1.1 Corpus
The Kids' Speech Corpus was developed to facilitate research about the characteristics of kids' speech at different ages and to train and evaluate recognizers for use in language training and other interactive tasks involving children.
Papuan Malay Data
The data set includes the Papuan Malay word list sound files and database, compiled and recorded in the context of Kluge’s (2014) dissertation ‘A grammar of Papuan Malay’.
SIL International Language & Culture Archives
The Language & Culture Archives exists so that ethnolinguistic minority communities benefit from the preservation of and/or open access to knowledge and resources collected, compiled, or created as the result of SIL's service to these communities in the pursuit of their language development goals.
The Tromsø Repository of Language and Linguistics
Natural Language Toolkit
NLTK is a platform for building Python programs to work with human language data. It provides interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. NLTK is available for Windows, Mac OS X, and Linux, and is a free, open source, community-driven project.
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Natural language processing tool.
World Atlas of Language Structures (WALS)
WALS: The first of these links goes to the record describing the 2005 book. The second is the newest edition of this work, an online version, which has an accompanying interactive reference tool that you may wish to download.
The World Atlas of Language Structures by
Publication Date: 2005-10-06
The World Atlas of Language Structures is a book and CD combination displaying the structural properties of the world's languages. 142 world maps and numerous regional maps - all in colour - display the geographical distribution of features of pronunciation and grammar, such as number of vowels, tone systems, gender, plurals, tense, word order, and body part terminology. Each world map shows an average of 400 languages and is accompanied by a fully referenced description of the structural feature in question. The CD provides an interactive electronic version of the database which allows the reader to zoom in on or customize the maps, to display bibliographical sources, and to establish correlations between features.
World Atlas of Language Structures (WALS)
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
WALS Interactive Reference Tool
The Interactive Reference Tool (available on the first CD-ROM as well as on the web) allows the atlas user to view the maps in a variety of forms and to combine features, i.e. to generate compound features and display these as well.
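The idea of a compound feature can be illustrated with a short Python sketch: for every language that has a value for both features, pair the two values and group the languages by that pair. This is not the tool's own implementation; the function name `combine_features`, the language labels, and the feature values are invented placeholders rather than actual WALS data.

```python
from collections import defaultdict


def combine_features(feature_a, feature_b):
    """Group languages by their pair of values for two features.

    Each argument maps a language name to its value for one feature.
    Languages missing from either mapping are left out.
    """
    groups = defaultdict(list)
    for language in sorted(feature_a.keys() & feature_b.keys()):
        key = (feature_a[language], feature_b[language])
        groups[key].append(language)
    return dict(groups)


# Placeholder data for illustration only (not taken from WALS).
word_order = {"lang_1": "SOV", "lang_2": "SVO", "lang_3": "SOV"}
adposition = {
    "lang_1": "postpositions",
    "lang_2": "prepositions",
    "lang_3": "postpositions",
    "lang_4": "prepositions",
}

compound = combine_features(word_order, adposition)
for values, languages in compound.items():
    print(values, languages)
```

Each key of `compound` is one value of the compound feature, and the list under it holds the languages that would share a map symbol for that value.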