LibGuides: Linguistics: Corpora and Tools of Interest to Linguistics

Data for Linguists

The Harvard Dataverse Network is available to MIT researchers to store and make available final versions of the data that they create or compile. For more information, see Social Science Data Services: Harvard Dataverse Network Deposit Guidelines. Additional sources of linguistic data are below.

Chinese-English Parallel Translations from TranslateFX
Available in .tsv format. Sources include government legislation, regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, and corporate constitutional documents.

All texts are from the Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong, and Hong Kong government websites.
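Because the translations are distributed as .tsv, they can be read with Python's standard csv module. The following is a minimal sketch only: the two-column layout (Chinese first, English second), the absence of a header row, and the sample rows are all assumptions for illustration, not details confirmed by the data provider.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded .tsv file.
# Assumed layout: column 0 = Chinese, column 1 = English, no header row.
SAMPLE_TSV = "本公司\tthe Company\n董事會\tthe Board of Directors\n"


def read_parallel_tsv(handle):
    """Return a list of (source, target) pairs from a tab-separated stream."""
    # QUOTE_NONE treats quotation marks in legal text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


pairs = read_parallel_tsv(io.StringIO(SAMPLE_TSV))
for zh, en in pairs:
    print(zh, "->", en)
```

For a real file, replace the in-memory sample with something like open(path, encoding="utf-8", newline=""), and check the actual column order first.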
Corpus of Contemporary American English (COCA)
The largest structured corpus of American English, with more than 450 million words in 189,431 texts, including 20 million words for each year from 1990 to 2012. The corpus is divided equally among spoken, fiction, popular magazine, newspaper, and academic texts (2015-08-24).

• Lexicon: includes information on each wordID: word (e.g., walked), lemma (e.g., walk), and part of speech (e.g., vvd).
• Sources: genre or country, source, and title of each text.
• Text: provides a textID for each text, followed by the entire text on the same line, with no annotations.
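The lexicon and text files described above could be combined roughly as follows. This is a sketch under stated assumptions: the tab-separated layout, the field order, and the sample IDs and rows are invented for illustration and may not match the files as delivered.

```python
# Assumed layout (not confirmed by this guide): tab-separated fields.
# Lexicon rows: wordID, word, lemma, part of speech. IDs are invented.
LEXICON_ROWS = "101\twalked\twalk\tvvd\n102\twalk\twalk\tvv0\n"
# Text rows: textID, then the whole unannotated text on the same line.
TEXT_ROWS = "5001\tShe walked home .\n"


def load_lexicon(raw):
    """Map wordID -> (word, lemma, part of speech)."""
    lexicon = {}
    for line in raw.splitlines():
        word_id, word, lemma, pos = line.split("\t")
        lexicon[int(word_id)] = (word, lemma, pos)
    return lexicon


def load_texts(raw):
    """Map textID -> full text of that document."""
    texts = {}
    for line in raw.splitlines():
        text_id, _, body = line.partition("\t")
        texts[int(text_id)] = body
    return texts


lexicon = load_lexicon(LEXICON_ROWS)
texts = load_texts(TEXT_ROWS)

# Look up a lemma for each whitespace-separated token; unknown tokens pass through.
lemma_of = {word: lemma for word, lemma, _ in lexicon.values()}
lemmas = [lemma_of.get(token, token) for token in texts[5001].split()]
print(lemmas)
```

Keying the lookup on the surface word ignores part-of-speech ambiguity; a fuller treatment would need the annotated version of the corpus.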
Corpus of Historical American English (COHA)
COHA comprises more than 400 million words in more than 100,000 individual texts.
Linguistic Data Consortium
MIT Libraries provides access to LDC data corpora published in 2016 and after. See: https://libguides.mit.edu/ldc
Linguistics Data Repositories
From the centralized re3data.org data repository directory.
OPUS
OPUS is a collection of translated texts from the web.

In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package.
Oregon Health and Science University # 0681-G Kids' Speech v1.1 Corpus
Corpus of speech gathered from about 100 children in grades K-10 in Oregon, USA. Developed to facilitate research on characteristics of kids' speech at different ages and train and evaluate recognizers for use in language training and other interactive tasks involving children.
Papuan Malay Data
The data set includes the Papuan Malay word list sound files and database, compiled and recorded in the context of Kluge’s (2014) dissertation ‘A grammar of Papuan Malay’.
SIL International Language & Culture Archives
The Language & Culture Archives exists so that ethnolinguistic minority communities benefit from the preservation of and/or open access to knowledge and resources collected, compiled, or created as the result of SIL's service to these communities in the pursuit of their language development goals.

Information on how to use the Archives
TROLLing Dataverse
The Tromsø Repository of Language and Linguistics

The Tromsø Repository of Language and Linguistics (TROLLing) is designed as an archive of linguistic data and statistical code. The archive is open access, which means that all information is available to everyone. All postings are accompanied by searchable metadata that identify the researchers, the languages and linguistic phenomena involved, the statistical methods applied, and scholarly publications based on the data (where relevant).

Linguists worldwide are invited to post datasets and statistical models used in linguistic research. The TROLLing Steering Committee is responsible for the scientific content of the archive, whereas the University Library provides quality and relevance control, in addition to user management. The University Library also oversees the technical and legal structure of TROLLing.

TROLLing is built on the Dataverse Network, software originating at Harvard University, and developed in coordination with CLARIN (Common Language Resources and Technology Infrastructure, a networked federation of European data repositories).

Tools

Natural Language Toolkit
NLTK is a platform for building Python programs to work with human language data. See also https://www.nltk.org/book/ for the NLTK Book, updated for Python 3.
TextBlob
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
spaCy
spaCy is a Python library for natural language processing.

World Atlas of Language Structures (WALS) and Glottolog

WALS: The first of these links goes to the record describing the 2005 book. The second is the newest edition of this work, an online version, which has an accompanying interactive reference tool that you may wish to download.

Glottolog
Provides comprehensive reference information for the world's languages, especially the lesser-known languages.
Glottolog is an initiative of the Max Planck Institute for Evolutionary Anthropology, Leipzig.
The World Atlas of Language Structures by Bernard Comrie; Hans-Jörg Bibiko (As told to); Hagen Jung (As told to); Claudia Schmidt (As told to); Martin Haspelmath; David Gil; Matthew S. Dryer (Editor)
ISBN: 0199255911

Publication Date: 2005-10-06

The World Atlas of Language Structures is a book and CD combination displaying the structural properties of the world's languages. 142 world maps and numerous regional maps - all in colour - display the geographical distribution of features of pronunciation and grammar, such as number of vowels, tone systems, gender, plurals, tense, word order, and body part terminology. Each world map shows an average of 400 languages and is accompanied by a fully referenced description of the structural feature in question. The CD provides an interactive electronic version of the database which allows the reader to zoom in on or customize the maps, to display bibliographical sources, and to establish correlations between features.

WALS Interactive Reference Tool
The Interactive Reference Tool (available on the first CD-ROM as well as on the web) will allow the atlas user to view the maps in a variety of different forms, as well as to combine features, i.e. to generate compound features and to display these as well.

The interactive database will also contain additional information on languages (genealogical classification, alternative names) and on each language-feature pair (bibliographical reference, example sentence). The interactive maps can be zoomed and panned, dot colors and shapes can be customized, a few map properties (rivers, country names, etc.) are switchable, and languages can be searched by language name, family and genus name, country, and region within country. With the mouse-over effect, the corresponding language name is shown immediately, and with a click the language profile appears in a separate window. The generation of compound features will be very useful for typological research. For example, the user will be able to correlate the existence of a question-word-fronting rule with particular word order types, the existence of tone with the size of the consonant inventory, or the alignment type (accusative, ergative, active-inactive) with the head-dependent marking type. Furthermore, geographical and genealogical information can be included.
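The idea of a compound feature can be illustrated outside the tool with a simple cross-tabulation of two features. This sketch does not use the WALS tool or its data: the language names, field names, and values below are placeholders invented for illustration, so the counts carry no typological meaning.

```python
from collections import Counter

# Placeholder records, NOT real WALS values: each dict pairs a language
# with two hypothetical feature values.
LANGUAGES = [
    {"name": "lang_a", "word_order": "SVO", "wh_fronting": True},
    {"name": "lang_b", "word_order": "SOV", "wh_fronting": False},
    {"name": "lang_c", "word_order": "SVO", "wh_fronting": True},
    {"name": "lang_d", "word_order": "SOV", "wh_fronting": True},
]


def cross_tabulate(records, feature_a, feature_b):
    """Count how many languages share each combination of two feature values."""
    return Counter((rec[feature_a], rec[feature_b]) for rec in records)


table = cross_tabulate(LANGUAGES, "word_order", "wh_fronting")
for (order, fronting), count in sorted(table.items()):
    print(order, fronting, count)
```

Each key of the resulting table is one value of the compound feature, which is the kind of combination the tool plots on a map.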
World Atlas of Language Structures (WALS)
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.

The first version of WALS was published as a book with CD-ROM in 2005 by Oxford University Press. (See link above) The first online version was published in April 2008. The second online version was published in April 2011.

The 2013 edition of WALS corrects a number of coding errors especially in Chapters 1 and 3. A full list of changes is available here.

Linguistics Librarian

Ece Turnator

Contact:

Ece (pronounced AJ)
turnator@mit.edu
Hayden Library Consultation Suite 14S - 2nd floor
617.253.4979
Make an appointment with me
