Datasets

From Slavko Zitnik's Research Wiki

My Datasets

Slavko Public Facebook

The network contains public data crawled from March to May 2012. It has 51,394,379 nodes and 136,048,906 edges (4,431,920 double edges). There are 373,476 checked nodes and 50,687,612 unchecked nodes (the situation is shown in the tables at the end). The non-anonymized data contains the real Facebook ID, name and username of every user in the network. For more, see the license file next to the network: Anonymized Facebook network. Some basic analysis is available as homework 3 for the Big Networks class, taught by prof. dr. Vladimir Batagelj.

Slovene news

Slovene news v1 is tagged according to the standard BIO scheme. The corpus contains annotated entities (B-PER (131), I-PER (74), B-ORG (162), I-ORG (158), O (5508)), relation descriptors (B-REL (32), I-REL (24), O (5977)) and coreference descriptors (B-COREF (274), I-COREF (249), O (5510)). The dataset was lemmatized and POS-tagged using a Slovene POS tagger. The dataset contains 285 sentences with 6034 tokens.
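As a minimal sketch of how BIO tags of this kind map to entity spans, the following function groups a B- tag and the I- tags that follow it into one span. The function name `bio_to_spans` and the example sentence are illustrative assumptions, not part of the corpus or its tooling.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # B- opens a new span; a stray I- is leniently treated as an opener.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

# Hypothetical example sentence (not taken from the corpus).
tokens = ["Janez", "Novak", "dela", "v", "podjetju", "Acme", "d.o.o.", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "O"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

The same function applies to the relation (B-REL, I-REL) and coreference (B-COREF, I-COREF) descriptor layers, since they follow the same scheme.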

Slovene news v2 is an upgraded version of v1 that contains CoNLL-2012-like coreference tags and documents separated by ###.
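A file with ### document separators could be split roughly as follows. This is only a sketch: it assumes that the separator sits on a line of its own and that every other non-empty line belongs to the current document. The column layout of the v2 files is not described here, so lines are kept as raw strings, and `split_documents` is a hypothetical helper name.

```python
from typing import Iterable, List

def split_documents(lines: Iterable[str], separator: str = "###") -> List[List[str]]:
    """Group lines into documents, starting a new one at each separator line."""
    docs, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip().startswith(separator):
            # Separator line: close the current document if it has content.
            if current:
                docs.append(current)
            current = []
        elif line.strip():
            # Blank lines are skipped in this sketch.
            current.append(line)
    if current:
        docs.append(current)
    return docs

# Hypothetical input; the real column layout may differ.
sample = ["###", "Janez\tB-PER", "dela\tO", "", "###", "Acme\tB-ORG"]
print(len(split_documents(sample)))
```

If blank lines mark sentence boundaries in the actual files, they would need to be preserved rather than skipped.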

The material can be used for research purposes only.

Web pages

Andrew McCallum
Malt
LDC
NLTK Corpora

Datasets

CoNLL '03

The CoNLL '03 dataset concentrates on four types of named entities: persons, locations, organizations and miscellaneous names (those that do not belong to the previous three groups). There are training, development and test splits for German and English. The best F-score achieved in the shared task was 88.76±0.7 (English). To build the dataset you need the Reuters Corpus, which can be obtained free of charge.
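For reference, an F-score of this kind is the harmonic mean of precision and recall over predicted entities. The sketch below assumes exact-match scoring, where an entity counts as correct only if both its type and its boundaries agree with the gold annotation. The function name `precision_recall_f1` and the example spans are illustrative and not part of the shared task tooling.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start token, end token)

def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact-match scoring."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical spans: one exact match, one boundary error, one missed entity.
gold = {("PER", 0, 2), ("ORG", 5, 7), ("LOC", 9, 10)}
predicted = {("PER", 0, 2), ("ORG", 5, 6)}
p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.50 R=0.33 F1=0.40
```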

CoNLL '02

CoNLL '02 defines the same task on different data (Spanish, Dutch), with the same named entity types as CoNLL '03. The organizers were especially interested in methods that can use additional unannotated data to improve performance (for example, co-training). The data contains POS tags and NER annotations.

Cora IE

The Cora Information Extraction dataset contains research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Reuters Corpora

RCV1 and RCV2 (Reuters Corpus Volume 1 and 2) are intended for research and development of natural language processing, information retrieval, and machine learning systems. The corpus is significantly larger than the older, well-known Reuters-21578 collection, which is heavily used in the text classification community.

RCV1 contains about 2.5 GB of uncompressed English news stories from 20 August 1996 to 19 August 1997. RCV2 is a multilingual corpus from the same timeframe.

MUC-6

MUC-6 tasks included named entity recognition, coreference resolution, template elements and scenario templates (traditional information extraction).

CMU seminars

The dataset contains 485 emailed seminar announcements, with labeled segments for speaker, title, time, sentence, header and body. It was labeled by Dayne Freitag.

OntoNotes Release 4.0

The OntoNotes dataset is available through LDC only. It was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

ACE 2004

The objective of the ACE (Automatic Content Extraction) program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

RDC (Relation Detection and Characterization), the task I am interested in, "involves the identification of relations between entities. This task was added in Phase 2 of ACE. The current definition of RDC targets physical relations including Located, Near and Part-Whole; social/personal relations including Business, Family and Other; a range of employment or membership relations; relations between artifacts and agents (including ownership); affiliation-type relations like ethnicity; relationships between persons and GPEs like citizenship; and finally discourse relations. For every relation, annotators identify two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes. Relations that are supported by explicit textual evidence are distinguished from those that depend on contextual inference on the part of the reader."

ACE 2005

ACE 2005 (English SpatialML Annotations Version 2) was developed by researchers at The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events. The corpus contains 210065 total words and 17821 unique words.

MUC-7

In the MUC-7 (Message Understanding Conference 7) datasets, the dry run (and training) data consists of air crash scenarios and the formal run data consists of missile launch scenarios. The final version mainly updates the Template Relations portion of the guidelines.

Brown Corpus

The Corpus contains POS-tagged texts.

IEER

This directory contains the NEWSWIRE development test data for the NIST 1999 IE-ER Evaluation. The files were taken from the subdirectory: /ie_er_99/english/devtest/newswire/*.ref.nwt and filenames were shortened. The dataset contains tagged PERSONS, DATES, ORGANIZATIONS, NUMBERS, LOCATIONS.

European Parliament

This is a sample of the European Parliament Proceedings Parallel Corpus 1996-2006. This sample contains 10 documents for 11 languages: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish. The text is untagged.

IJCNLP 2011

The New York Times data set contains 150 business articles from the New York Times. The articles were crawled from the NYT website between November 2009 and January 2010. After sentence splitting and tokenization, the Stanford NER tagger was used to identify PER and ORG named entities in each sentence. Named entities that contain multiple tokens were concatenated into a single token. We then took each pair of (PER, ORG) entities that occur in the same sentence as a single candidate relation instance, where the PER entity is treated as ARG-1 and the ORG entity is treated as ARG-2.

The Wikipedia data set comes from (link), previously created by Aron Culotta et al. Since the original data set did not contain the annotation information we need, we re-annotated it. Similarly, we performed sentence splitting, tokenization and NER tagging, and took pairs of (PER, PER) entities occurring in the same sentence as candidate relation instances. We always treat the first PER entity as ARG-1 and the second PER entity as ARG-2.

A human annotator manually went through each candidate relation instance to decide (1) whether there is a relation between the two arguments and (2) whether there is an explicit sequence of words describing the relation held by ARG-1 and ARG-2.
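The candidate generation described above for the two data sets can be sketched as follows. This is a minimal illustration, not the authors' code: the `Entity` tuple layout, the function names, the underscore used to join multi-token names and the example sentence are all assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Entity = Tuple[str, str, int]  # (text, NER type, token position in the sentence)

def per_org_candidates(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Every (PER, ORG) pair in one sentence; PER is ARG-1, ORG is ARG-2."""
    pers = [e for e in entities if e[1] == "PER"]
    orgs = [e for e in entities if e[1] == "ORG"]
    return [(p, o) for p in pers for o in orgs]

def per_per_candidates(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Every pair of PER entities in one sentence; the earlier one is ARG-1."""
    pers = sorted((e for e in entities if e[1] == "PER"), key=lambda e: e[2])
    return list(combinations(pers, 2))

# Hypothetical sentence with multi-token names already joined into one token.
entities = [("Tim_Cook", "PER", 0), ("Apple", "ORG", 3), ("Steve_Jobs", "PER", 6)]
print(per_org_candidates(entities))
print(per_per_candidates(entities))
```

Each returned pair would then go to the annotator, who decides whether a relation holds and whether an explicit descriptor is present.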

The dataset can be downloaded here. There are 536 instances (208 P, 328 N) with 140 distinct descriptors in the NYT dataset and 700 instances (122 P, 578 N) with 70 distinct descriptors in the Wikipedia dataset.
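The class balance implied by these counts can be checked with a few lines of arithmetic. The sketch assumes that P and N stand for positive and negative instances and that the second set of counts refers to the Wikipedia data set.

```python
# (positive, negative) instance counts as reported above.
counts = {"NYT": (208, 328), "Wikipedia": (122, 578)}

def positive_rate(pos: int, neg: int) -> float:
    """Fraction of candidate instances labeled as positive."""
    return pos / (pos + neg)

for name, (pos, neg) in counts.items():
    print(f"{name}: {pos + neg} instances, {positive_rate(pos, neg):.1%} positive")
# NYT: 536 instances, 38.8% positive
# Wikipedia: 700 instances, 17.4% positive
```

Both sets are skewed towards negative instances, the Wikipedia set more strongly, which is worth keeping in mind when comparing accuracy figures across the two.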
