Corpus of Contemporary American English (COCA)

Version 2.2

Davies, Mark, 2015, "Corpus of Contemporary American English (COCA)", https://doi.org/10.7910/DVN/AMUDUW, Harvard Dataverse, V2

Dataset Metrics

4,076 Downloads

Description	Largest structured corpus of American English composed of more than 450 million words in 189,431 texts, including 20 million words each year from 1990-2012. The corpus is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts (2015-08-24)
Subject	Arts and Humanities; Other
Keyword	English language, Corpora (Linguistics), Computational linguistics
Notes	MIT affiliates should access this dataset by logging into Dataverse and selecting Massachusetts Institute of Technology. The various file formats contain the following types of data:
• Database files: A balanced collection of words pulled from fiction, popular magazines, newspapers, non-fiction books, and spoken word sources.
• Lexicon: Includes information on each wordID: word (e.g., walked), lemma (e.g., walk), and part of speech (e.g., vvd).
• Sources: Genre or country, source, and title of each text.
• Text: Provides a textID for each text, followed by the entire text on the same line, with no annotations. Note: In this format, words are not annotated for part of speech or lemma. In addition, contracted words such as can't are separated into two parts (ca|n't) and punctuation is separated from words (eye level . As her).
• Word Lemma PoS (Part of Speech): Tables that list each word, lemma, and part of speech in vertical format; can be imported into a database. Note: Word, lemma, and PoS are the three parts that are included in the lexicon. “Word” is the actual word pulled from the text. “Lemma” is the base form that would appear as the headword in a dictionary. For example, if the “word” is “running,” the lemma is “run.” The PoS is the part of speech.
• subgenreCodes: A file explaining the codes used to identify sub-genres, or sub-categories, within each of the major categories of text. See http://corpus.byu.edu/coca/?f=texts_e.
More information on files is available at:
• http://corpus.byu.edu/full-text/formats.asp
• http://corpus.byu.edu/full-text/database.asp
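The relationship between the database files and the lexicon described in the notes can be sketched as a lookup keyed on wordID. This is only an illustration: the column order, the tab delimiter, the sample rows, and the function names load_lexicon and annotate are all assumptions, not the documented layout of the COCA files (see the format pages linked above for the actual layout).

```python
import csv
import io

# Hypothetical tab-separated samples. The real files may use a different
# column order or additional columns; these rows are illustrative only.
LEXICON_TSV = "1\twalked\twalk\tvvd\n2\trunning\trun\tvvg\n"  # wordID, word, lemma, PoS
TOKENS_TSV = "1001\t2\n1001\t1\n"                              # textID, wordID


def load_lexicon(handle):
    """Build a mapping wordID -> (word, lemma, pos) from tab-separated rows."""
    lexicon = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for word_id, word, lemma, pos in reader:
        lexicon[int(word_id)] = (word, lemma, pos)
    return lexicon


def annotate(tokens_handle, lexicon):
    """Resolve each (textID, wordID) row to (textID, word, lemma, pos)."""
    rows = []
    reader = csv.reader(tokens_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for text_id, word_id in reader:
        word, lemma, pos = lexicon[int(word_id)]
        rows.append((int(text_id), word, lemma, pos))
    return rows


if __name__ == "__main__":
    lexicon = load_lexicon(io.StringIO(LEXICON_TSV))
    for row in annotate(io.StringIO(TOKENS_TSV), lexicon):
        print(row)
```

Keeping the lexicon in a dictionary means each token row stores only a numeric wordID, and the word, lemma, and part of speech are resolved at read time.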
License/Data Use Agreement	Custom Dataset Terms

Files (1 to 10 of 18)

File	Type	Size	Published	Downloads	MD5	Category	Access	Description
coca-sources.txt	Plain Text	16.0 MB	Nov 9, 2015	2,227	82df9ce24eeafc1bcc71979c83ed135d	Documentation	Public	Genre or country, source, and title for each text.
db_academic_rpe.zip	ZIP Archive	445.8 MB	Nov 9, 2015	54	1634501502071d66c1d02ffbd12c1d2d	Data	Restricted	Set of database files for academic sources; one file per year.
db_fiction_awq.zip	ZIP Archive	449.2 MB	Nov 9, 2015	54	8d8eab8b9d5161ba55b42a2030b7f587	Data	Restricted	Set of database files for fiction sources; one file per year.
db_magazine_qjg.zip	ZIP Archive	474.4 MB	Nov 9, 2015	53	9e1f41278b373627ddb74520d278636b	Data	Restricted	Set of database files for magazine sources; one file per year.
db_newspaper_lsp.zip	ZIP Archive	463.0 MB	Nov 9, 2015	53	d41c28d81d29d6368e821d1b8df0647b	Data	Restricted	Set of database files for newspaper sources; one file per year.
db_spoken_kde.zip	ZIP Archive	467.9 MB	Nov 9, 2015	53	f80b3c78186da7147ee6f5b06a27ef5c	Data	Restricted	Set of database files for spoken sources; one file per year.
lexicon.txt	Plain Text	155.6 MB	Nov 9, 2015	55	f06d47b6a2ad45899e526bb099bb8eda	Data	Restricted	Contains lexicon information for the data (see notes).
subgenreCodes.txt	Plain Text	745 B	Nov 9, 2015	1,013	3e27b9293c1e9d0b48b0bea0321f5354	Documentation	Public	Explains the codes used to identify sub-genres, or sub-categories, within each of the major categories of text.
text_academic_rpe.zip	ZIP Archive	185.3 MB	Nov 9, 2015	54	37e9eb5f8241d667921942a3be82894e	Data	Restricted	Original text from academic sources; one file per year.
text_fiction_awq.zip	ZIP Archive	177.0 MB	Nov 9, 2015	55	e9f87028e4715c95cbb0e3fc5708bc93	Data	Restricted	Original text from fiction sources; one file per year.
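
Each file in the listing carries an MD5 checksum, which can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and matches_listed_md5 and the local file path in the usage note are illustrative assumptions.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a local file against the MD5 value shown in the file listing."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, assuming coca-sources.txt has been saved to the working directory, matches_listed_md5("coca-sources.txt", "82df9ce24eeafc1bcc71979c83ed135d") should return True for an uncorrupted copy. Reading in chunks matters here because several of the archives exceed 400 MB.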

Citation Metadata

Persistent Identifier	doi:10.7910/DVN/AMUDUW
Publication Date	2015-11-09
Title	Corpus of Contemporary American English (COCA)
Author	Davies, Mark (Brigham Young University)
Point of Contact	Jennie Murack
Keyword	English language (LCSH) http://id.loc.gov/authorities/subjects.html Corpora (Linguistics) (LCSH) http://id.loc.gov/authorities/subjects.html Computational linguistics (LCSH) http://id.loc.gov/authorities/subjects.html
Language	English
Producer	Davies, Mark (Brigham Young University)
Depositor	McNeill, Katherine
Deposit Date	2015-10-06
Time Period	Start Date: 1990 ; End Date: 2012
Data Type	linguistic corpora
Related Dataset	Corpus of Historical American English (COHA)
Data Source	The corpus is composed of more than 450 million words in 189,431 texts, including 20 million words each year from 1990-2012. Detailed information on sources is available at: http://corpus.byu.edu/coca/?f=texts_e. Main sources for each file type are as follows:
• Spoken: (95 million words [95,385,672]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)
• Fiction: (90 million words [90,344,134]) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
• Popular Magazines: (95 million words [95,564,706]) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
• Newspapers: (92 million words [91,680,966]) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
• Academic Journals: (91 million words [91,044,778]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
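
The per-genre word counts given above can be summed to check the "more than 450 million words" figure and to see how close the split is to equal. This is a small arithmetic sketch; the dictionary name GENRE_WORDS and the helper genre_shares are illustrative, and the counts are copied from the bracketed figures in the Data Source field.

```python
# Exact word counts as listed in the Data Source field.
GENRE_WORDS = {
    "Spoken": 95_385_672,
    "Fiction": 90_344_134,
    "Popular Magazines": 95_564_706,
    "Newspapers": 91_680_966,
    "Academic Journals": 91_044_778,
}


def genre_shares(counts):
    """Return each genre's fraction of the total word count."""
    total = sum(counts.values())
    return {genre: words / total for genre, words in counts.items()}


if __name__ == "__main__":
    total_words = sum(GENRE_WORDS.values())
    print(f"Total words: {total_words:,}")  # Total words: 464,020,256
    for genre, share in genre_shares(GENRE_WORDS).items():
        print(f"{genre}: {share:.1%}")
```

The listed counts add up to 464,020,256 words, and each genre accounts for roughly one fifth of that total.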

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

Licensed electronic resources are restricted to members of the MIT community and for the purposes of research, education, and scholarship. Under MIT's licenses for electronic resources, users generally may not:
- redistribute the materials or permit anyone other than a member of the MIT community to use them
- remove, obscure or modify any copyright or other notices included in the materials
- use the materials for commercial purposes.
Users are individually responsible for compliance with these terms.

This data is restricted to members of the MIT community for educational, scholarly, and research purposes. In no case can the data be distributed beyond the MIT community, even in joint research with individuals at other institutions.
1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.
2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc.) just to those from the MIT community. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.
3. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)
4. Any publications or products that are based on the data should contain a reference to the source of the data: http://corpus.byu.edu/full-text.
5. Note that a small, unique change will be made to each set of data, and this will serve as a "fingerprint" to identify you as the source of this data. Automated Google searches are run daily to find copies of the data on the Web. If the data that is sent to you is found outside of your organization, you will make a reasonable effort to contact the administrators for that web page or website, to have the data removed.
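
The "frequency bands" allowance in the terms (reporting a band rather than a raw frequency or exact rank) can be sketched as follows. The band boundaries are the example values quoted in the terms; the function name band_for_rank, the open-ended final band, and the treatment of "about 20" as a hard cap of 20 are assumptions made for illustration, not part of the license text.

```python
from bisect import bisect_left

# Upper bounds of the example bands quoted in the terms: 1-1000, 1001-3000, 3001-10,000.
BAND_UPPER_BOUNDS = [1000, 3000, 10000]

# Assumption: read "not more than about 20 frequency bands" as a cap of 20.
MAX_BANDS = 20


def band_for_rank(rank, upper_bounds=BAND_UPPER_BOUNDS):
    """Map a word's frequency rank (1 = most common) to a band label."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    # The final open-ended band counts toward the total number of bands.
    if len(upper_bounds) + 1 > MAX_BANDS:
        raise ValueError("too many frequency bands")
    index = bisect_left(upper_bounds, rank)
    lower = 1 if index == 0 else upper_bounds[index - 1] + 1
    if index == len(upper_bounds):
        return f"{lower}+"
    return f"{lower}-{upper_bounds[index]}"
```

With these boundaries, band_for_rank(304) returns "1-1000", so an application would expose only the band and not the rank itself.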

Restricted Files + Terms of Access

Restricted Files

There are 16 restricted files in this dataset.

Terms of Access for Restricted Files

MIT affiliates should access this dataset by logging into Dataverse and selecting Massachusetts Institute of Technology.

Request Access

Users may not request access to files.
