Label Disparities

One of the hardest situations for an NER model is when the same text span is annotated with two different labels in different contexts.

This is also one of the most useful things for a model to learn. For example, many people are named after cities: the city they were born in, or one that holds some significance for their parents. The script below builds a small Dataset with one example of each usage of "Dallas" and runs Recon's tokenization operation over it.

from recon import Dataset
# Wildcard import for its side effect: it registers Recon's tokenization
# operations (including recon.add_tokens.v1) used below.
from recon.operations.tokenization import *  # noqa
from recon.types import Example, Span


def main():
    # "Dallas" annotated as a person's name
    person_example = Example(
        text="My friend is named Dallas.",
        spans=[Span(text="Dallas", start=19, end=25, label="PER")],
    )
    # "Dallas" annotated as a location (CoNLL uses LOC rather than GPE)
    gpe_example = Example(
        text="Dallas is a city in Texas.",
        spans=[Span(text="Dallas", start=0, end=6, label="LOC")],
    )

    # Build a Dataset from the two examples and add token information
    ds = Dataset("DallasExamples", [person_example, gpe_example])
    ds.apply_("recon.add_tokens.v1")

    # Display each example with its annotated spans
    ds.data[0].show()
    ds.data[1].show()


if __name__ == "__main__":
    main()
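
The offsets in these examples are character offsets into text with an exclusive end, consistent with Python slicing (text[19:25] == "Dallas"). As a quick sanity check (plain Python, not part of Recon's API), you can verify each span against the raw text:

from recon.types import Example, Span

example = Example(
    text="My friend is named Dallas.",
    spans=[Span(text="Dallas", start=19, end=25, label="PER")],
)
for span in example.spans:
    # Slicing with the stored offsets should recover the annotated text
    assert example.text[span.start : span.end] == span.text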

"Dallas" is a person's name in the first example and "Dallas" is a location in the second example (according to CONLL annotation guidelines).

The label is correct in both cases, so whichever NER model we train will need to rely on the context of the sentence to figure out which label (if any) to assign to the span "Dallas" in future predictions.

That said, sometimes the distinction between labels is genuinely hard, or annotators simply make mistakes. To find these kinds of mistakes, we can use Recon's get_label_disparities function, which surfaces spans annotated with both of the given labels.

from pprint import pprint

from recon import Corpus
from recon.insights import get_label_disparities


def main():
    # Load the CoNLL 2003 corpus from disk
    corpus = Corpus.from_disk("./examples/data/conll2003", "conll2003")

    # Surface span texts annotated as both LOC and PER in the test split
    test_ld = get_label_disparities(corpus.test, "LOC", "PER")
    pprint(test_ld)


if __name__ == "__main__":
    main()
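
Once you know a span text that shows up with both labels, you can pull the offending examples out of the split and review them by hand. The sketch below is illustrative, not part of Recon's API: it assumes "Dallas" was among the reported span texts (hypothetical) and that corpus.test is a Dataset exposing the same data attribute used in the first script:

from recon import Corpus


def main():
    corpus = Corpus.from_disk("./examples/data/conll2003", "conll2003")

    suspect = "Dallas"  # hypothetical span text reported by get_label_disparities
    hits = [
        example
        for example in corpus.test.data
        if any(span.text == suspect for span in example.spans)
    ]
    for example in hits:
        # Print each matching sentence with its (text, label) span pairs
        print(example.text, [(span.text, span.label) for span in example.spans])


if __name__ == "__main__":
    main()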