Persistent Identifier
|
doi:10.7910/DVN/AMUDUW |
Publication Date
|
2015-11-09 |
Title
| Corpus of Contemporary American English (COCA) |
Author
| Davies, Mark (Brigham Young University) |
Point of Contact
|
Jennie Murack |
Description
| The largest structured corpus of American English, composed of more than 450 million words in 189,431 texts, including 20 million words each year from 1990-2012. The corpus is divided roughly equally among spoken, fiction, popular magazine, newspaper, and academic texts. (2015-08-24) |
Subject
| Arts and Humanities; Other |
Keyword
| English language (LCSH) http://id.loc.gov/authorities/subjects.html
Corpora (Linguistics) (LCSH) http://id.loc.gov/authorities/subjects.html
Computational linguistics (LCSH) http://id.loc.gov/authorities/subjects.html |
Notes
| MIT affiliates should access this dataset by logging into Dataverse and selecting Massachusetts Institute of Technology.
The various file formats contain the following types of data:
• Database files: A balanced collection of words pulled from fiction, popular magazines, newspapers, non-fiction books, and spoken word sources.
• Lexicon: Includes information on each wordID: word (e.g., walked), lemma (e.g., walk), and part of speech (e.g., vvd).
• Sources: Genre or country, source, and title of each text.
• Text: Provides a textID for each text, followed by the entire text on the same line. Note: In this format, words are not annotated for part of speech or lemma. In addition, contracted words like “can't” are separated into two parts (ca|n't), and punctuation is separated from words (“eye level . As her”).
• Word Lemma PoS (Part of Speech): Tables that list each word, lemma, and part of speech in vertical format; these can be imported into a database. Note: Word, lemma, and PoS are the three parts included in the lexicon. “Word” is the actual word pulled from the text. “Lemma” is the base form that would appear as the headword in a dictionary; for example, if the word is “running,” the lemma is “run.” “PoS” is the part of speech.
• subgenreCodes: A file explaining the codes used to identify sub-genres, or sub-categories, within each of the major categories of text. See http://corpus.byu.edu/coca/?f=texts_e.
More information on files is available at:
• http://corpus.byu.edu/full-text/formats.asp
• http://corpus.byu.edu/full-text/database.asp |
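As a rough illustration of the Text format described in the notes, the sketch below splits one line into its textID and tokens and then reattaches the separated contraction parts and punctuation. The sample line, the textID value, the function names, and the assumption that the textID is the first whitespace-delimited field are all hypothetical; check the format pages linked above for the actual layout before relying on this.

```python
# Minimal sketch for reading one line of the Text format.
# Assumption (not confirmed by the record): the textID is the first
# whitespace-delimited field and the rest of the line is the text.

# Hypothetical sample line; "1001" is an invented textID.
SAMPLE = "1001 She ca n't see past eye level . As her eyes adjust , she waits ."

# Tokens that attach to the preceding token when rejoining text.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}
PUNCT = {".", ",", ";", ":", "!", "?"}


def parse_text_line(line):
    """Return (text_id, tokens) for one line of the Text format."""
    parts = line.strip().split(None, 1)
    text_id = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    return text_id, body.split()


def detokenize(tokens):
    """Rejoin tokens, gluing clitics and punctuation to the previous token."""
    out = []
    for tok in tokens:
        if out and (tok.lower() in CLITICS or tok in PUNCT):
            out[-1] += tok
        else:
            out.append(tok)
    return " ".join(out)


text_id, tokens = parse_text_line(SAMPLE)
print(text_id, len(tokens))
print(detokenize(tokens))
```

The rejoining step is deliberately simple: it handles only the clitics and punctuation marks listed in the two sets, so quotation marks, brackets, and other symbols would need extra rules.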
Language
| English |
Producer
| Davies, Mark (Brigham Young University) |
Depositor
| McNeill, Katherine |
Deposit Date
| 2015-10-06 |
Time Period
| Start Date: 1990 ; End Date: 2012 |
Data Type
| linguistic corpora |
Related Dataset
| Corpus of Historical American English (COHA) |
Data Source
| The corpus is composed of more than 450 million words in 189,431 texts, including 20 million words each year from 1990-2012. Detailed information on sources is available at: http://corpus.byu.edu/coca/?f=texts_e.
Main sources for each file type are as follows:
• Spoken: (95 million words [95,385,672]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)
• Fiction: (90 million words [90,344,134]) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
• Popular Magazines: (95 million words [95,564,706]) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.
• Newspapers: (92 million words [91,680,966]) Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, and San Francisco Chronicle. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, and financial.
• Academic Journals: (91 million words [91,044,778]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g., a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
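The per-genre word counts listed above can be checked against the "more than 450 million words" figure with a few lines of arithmetic. The sketch below uses only the bracketed counts from this record; the variable names are illustrative. The counts sum to 464,020,256 words, and each genre accounts for roughly a fifth of the total.

```python
# Word counts per genre, copied from the bracketed figures in this record.
GENRE_WORDS = {
    "Spoken": 95_385_672,
    "Fiction": 90_344_134,
    "Popular Magazines": 95_564_706,
    "Newspapers": 91_680_966,
    "Academic Journals": 91_044_778,
}

# Total size of the corpus implied by the per-genre counts.
total = sum(GENRE_WORDS.values())

# Fraction of the corpus contributed by each genre.
shares = {genre: count / total for genre, count in GENRE_WORDS.items()}

for genre, count in GENRE_WORDS.items():
    print(f"{genre:<18}{count:>12,}{shares[genre]:>8.1%}")
print(f"{'Total':<18}{total:>12,}")
```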