# Label Disparities
One of the hardest decisions an NER model has to make comes up when the same text span is annotated with two different labels in different contexts.

This is also one of the most useful things for a model to learn. For example, lots of people are named after cities: the city they were born in, or one that has some significance to their parents.
```python
from recon import Dataset
from recon.operations.tokenization import *  # noqa — imported for its side effects (registers tokenization operations)
from recon.types import Example, Span


def main():
    # The same surface text "Dallas" gets two different labels,
    # and both are correct in context.
    person_example = Example(
        text="My friend is named Dallas.",
        spans=[Span(text="Dallas", start=19, end=25, label="PER")],
    )
    loc_example = Example(
        text="Dallas is a city in Texas.",
        spans=[Span(text="Dallas", start=0, end=6, label="LOC")],
    )

    ds = Dataset("DallasExamples", [person_example, loc_example])
    # Add token annotations to each example in place
    ds.apply_("recon.add_tokens.v1")

    ds.data[0].show()
    ds.data[1].show()


if __name__ == "__main__":
    main()
```
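A quick note on the `Span` fields: `start` and `end` are character offsets into `text`, with `end` exclusive (19 to 25 covers the six characters of "Dallas"). If you build spans by hand, deriving the offsets is less error-prone than counting characters. A minimal plain-Python sketch — the helper name here is ours, not part of Recon's API:

```python
def char_offsets(text: str, phrase: str) -> tuple[int, int]:
    # Returns (start, end) character offsets with end exclusive,
    # matching the Span fields used above.
    # Raises ValueError if phrase is not found in text.
    start = text.index(phrase)
    return start, start + len(phrase)


start, end = char_offsets("My friend is named Dallas.", "Dallas")
assert (start, end) == (19, 25)
```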
"Dallas" is a person's name in the first example and "Dallas" is a location in the second example (according to CONLL annotation guidelines).
The label is correct in both cases and whichever NER model we want to train will need to rely on the context of the sentence to figure out which label (if any) to assign the span "Dallas" with in future predictions.
That being said, sometimes the distinction between two labels is genuinely hard to draw, or annotators just make mistakes. To find these sorts of inconsistencies, we can use Recon's `get_label_disparities` function.
```python
from pprint import pprint

from recon import Corpus
from recon.insights import get_label_disparities


def main():
    # Load the CoNLL 2003 corpus from disk
    corpus = Corpus.from_disk("./examples/data/conll2003", "conll2003")

    # Find span texts annotated as both LOC and PER in the test set
    test_ld = get_label_disparities(corpus.test, "LOC", "PER")
    pprint(test_ld)


if __name__ == "__main__":
    main()
```
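Conceptually, finding disparities boils down to grouping every annotated span text by the set of labels it appears with, then keeping the texts where both labels of interest show up. Here is a rough sketch of that idea over Recon `Example` objects — an illustration of the computation, not Recon's actual implementation:

```python
from collections import defaultdict
from typing import Dict, List, Set

from recon.types import Example


def label_disparities_sketch(
    data: List[Example], label1: str, label2: str
) -> Set[str]:
    # Map each span text to the set of labels it appears with.
    # Lowercasing so "Dallas" and "DALLAS" group together is a
    # choice made for this sketch.
    labels_by_text: Dict[str, Set[str]] = defaultdict(set)
    for example in data:
        for span in example.spans:
            labels_by_text[span.text.lower()].add(span.label)

    # Keep span texts annotated with both labels somewhere in the data.
    return {
        text
        for text, labels in labels_by_text.items()
        if label1 in labels and label2 in labels
    }
```

Each returned text is a candidate worth inspecting: either the contexts legitimately support both labels (as with "Dallas" above), or one of the annotations is a mistake we should correct.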