Introduction to CoNLL 2003

The CoNLL 2003 NER dataset is a widely used, cited, and benchmarked dataset for Named Entity Recognition. It has four entity types: persons (PER), locations (LOC), organizations (ORG), and names of miscellaneous entities (MISC) that don't belong to the other three groups.

For the rest of this tutorial, we'll use Recon to find and correct errors in the original CoNLL 2003 NER dataset.

The CoNLL 2003 data is publicly available, and we'll use HuggingFace Datasets to download it.

Tip

TL;DR If you're looking for a shorter version of this tutorial, check out the CoNLL 2003 Jupyter notebook in the project's examples folder.

Loading data from HuggingFace Datasets

Recon has a specialized loader for HuggingFace Datasets based on the tabular format of the CoNLL 2003 data. An example row of the raw data looks like this:

id (string):       0
tokens (json):     [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ]
pos_tags (json):   [ 22, 42, 16, 21, 35, 37, 16, 21, 7 ]
chunk_tags (json): [ 11, 21, 11, 12, 21, 22, 11, 12, 0 ]
ner_tags (json):   [ 3, 0, 7, 0, 0, 0, 7, 0, 0 ]

We're primarily interested in the tokens and ner_tags columns. The ner_tags are integer tags from 0 to 8 that index into the BIO label scheme used in the loading script below.
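
To make the tag-to-label mapping concrete, here's a small sketch that decodes the ner_tags from the example row above, using the same label order as the conll_labels list defined in the loading script:

# Integer ner_tags index into this ordered BIO label list
conll_labels = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

for token, tag in zip(tokens, ner_tags):
    print(f"{token}\t{conll_labels[tag]}")

# EU       B-ORG
# rejects  O
# German   B-MISC
# ...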

The data is already split into train/dev/test sets, so we'll load the whole HF dataset into a Recon Corpus to get started.

from datasets import load_dataset

from recon import Corpus, Dataset


def main():
    # Download the raw data using HF Datasets
    conll2003 = load_dataset("conll2003")

    # Define the str BIO labels based on the numerical tags stored in HF Datasets
    conll_labels = [
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ]

    # Use the Dataset.from_hf_dataset method to make an
    # Example from each row and create Recon Datasets for
    # each split in the raw data
    train_ds = Dataset("train").from_hf_dataset(conll2003["train"], labels=conll_labels)
    dev_ds = Dataset("dev").from_hf_dataset(
        conll2003["validation"], labels=conll_labels
    )
    test_ds = Dataset("test").from_hf_dataset(conll2003["test"], labels=conll_labels)

    # Initialize a Recon Corpus from the 3 Datasets
    corpus = Corpus("conll2003", train_ds, dev_ds, test_ds)
    print(corpus)

    corpus.to_disk("./examples/data/conll2003", overwrite=True)


if __name__ == "__main__":
    main()

To run, first make sure you have the datasets library installed.

pip install datasets

If you run the code above, you'll get summary statistics for each data split from Recon:

$ python main.py
Dataset
Name: train
Stats: {
    "n_examples": 14042,
    "n_examples_no_entities": 2910,
    "n_annotations": 23499,
    "n_annotations_per_type": {
        "LOC": 7140,
        "PER": 6600,
        "ORG": 6321,
        "MISC": 3438
    }
}
Dataset
Name: dev
Stats: {
    "n_examples": 3251,
    "n_examples_no_entities": 646,
    "n_annotations": 5942,
    "n_annotations_per_type": {
        "PER": 1842,
        "LOC": 1837,
        "ORG": 1341,
        "MISC": 922
    }
}
Dataset
Name: test
Stats: {
    "n_examples": 3454,
    "n_examples_no_entities": 698,
    "n_annotations": 5648,
    "n_annotations_per_type": {
        "LOC": 1668,
        "ORG": 1661,
        "PER": 1617,
        "MISC": 702
    }
}
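
The corpus is now saved under ./examples/data/conll2003, so later tutorial steps can reload it instead of re-downloading anything. Here's a minimal reload sketch, assuming Corpus.from_disk mirrors to_disk (check the Recon API reference for the exact signature):

from recon import Corpus

# Reload the corpus written by main() above; the exact from_disk
# signature is an assumption here -- see the Recon API docs
corpus = Corpus.from_disk("./examples/data/conll2003")
print(corpus)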

Next Steps

Now that we have the data loaded, let's see what other stats we can get besides the basic ones provided by Corpus.summary.
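
In the meantime, nothing stops you from computing counts by hand. A quick sketch, assuming each Example in a Dataset's data exposes its annotations as spans with a label attribute (as in recon.types):

from collections import Counter

# Count annotations per entity label across the training split; assumes
# train_ds.data is a list of Examples whose .spans items carry a .label
label_counts = Counter(
    span.label for example in train_ds.data for span in example.spans
)
print(label_counts)  # e.g. Counter({"LOC": 7140, "PER": 6600, ...})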