NER Stats
Getting statistics about your NER data can be extremely helpful throughout the annotation process. It can help you ensure that you're spending time on the right annotations and that you have enough examples for each type as well as enough examples with no entities at all (this is often overlooked but very important to build a model that generalizes well).
Once you have your data loaded, either by itself as a list of `Example`s or as a `Dataset`, you can easily get statistics using the `stats.get_ner_stats` function.

The `stats.get_ner_stats` function expects a `List[Example]` as its input parameter and returns a serializable response with info about your data. Let's see how this works on the provided example data.
Example
Create a file main.py with:

```python
from pathlib import Path

import typer

from recon.loaders import read_jsonl
from recon.stats import get_ner_stats


def main(data_file: Path):
    data = read_jsonl(data_file)
    print(get_ner_stats(data))


if __name__ == "__main__":
    typer.run(main)
```
Run the application with the example data and you should see the following results.
```console
$ python main.py ./examples/data/skills/train.jsonl
{
  "n_examples": 106,
  "n_examples_no_entities": 29,
  "n_annotations": 243,
  "n_annotations_per_type": {
    "SKILL": 197,
    "PRODUCT": 33,
    "JOB_ROLE": 10,
    "skill": 2,
    "product": 1
  }
}
```
Great! We have some basic stats about our data, but we can already see some issues: some of our examples are annotated with lowercase labels. These are obviously mistakes, and we'll see how to fix them shortly.
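You don't even need to eyeball the output to catch this class of mistake. A minimal sketch (using the `n_annotations_per_type` counts printed above, copied into a plain dict) that flags labels differing only in case:

```python
# Annotation counts as reported under "n_annotations_per_type" above.
per_type = {
    "SKILL": 197,
    "PRODUCT": 33,
    "JOB_ROLE": 10,
    "skill": 2,
    "product": 1,
}

# Group labels case-insensitively; any group with more than one
# spelling is a likely annotation mistake.
groups = {}
for label in per_type:
    groups.setdefault(label.lower(), []).append(label)

suspects = {key: labels for key, labels in groups.items() if len(labels) > 1}
print(suspects)
# {'skill': ['SKILL', 'skill'], 'product': ['PRODUCT', 'product']}
```

This is just plain Python over the stats output; Recon's correction operations handle the actual fix, as we'll see in a later step.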
But first, it isn't super helpful to have stats on just your `train` data. And it'd be really annoying to have to call the same function on each list of examples:
```python
train = read_jsonl(train_file)
print(get_ner_stats(train))

dev = read_jsonl(dev_file)
print(get_ner_stats(dev))

test = read_jsonl(test_file)
print(get_ner_stats(test))
```
Next Steps
In the next step of this tutorial we'll introduce the core containers Recon uses for managing examples and state:
- `Dataset` - A `Dataset` has a name and holds a list of examples. It's also responsible for tracking any mutations done to its internal data through Recon operations. (More on this later)
- `Corpus` - A `Corpus` is a wrapper around a set of datasets that represent a typical train/eval or train/dev/test split. Using a `Corpus` allows you to gain insights into how well your train set represents your dev/test sets.
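To make that last point concrete, here is the kind of train-vs-dev comparison a `Corpus` makes easy, sketched with plain dicts in the shape of `get_ner_stats`' `n_annotations_per_type` output (the dev counts here are made up for illustration):

```python
# Hypothetical per-label annotation counts for two splits.
train_counts = {"SKILL": 197, "PRODUCT": 33, "JOB_ROLE": 10}
dev_counts = {"SKILL": 41, "JOB_ROLE": 3}

# Labels annotated in dev but absent from train can't be learned;
# labels absent from dev can't be evaluated.
missing_in_train = set(dev_counts) - set(train_counts)
missing_in_dev = set(train_counts) - set(dev_counts)

print(sorted(missing_in_train))  # []
print(sorted(missing_in_dev))  # ['PRODUCT']
```

With a `Corpus` you get this kind of cross-split view without manually juggling one stats dict per split.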