Using a Corpus
So far, we have been operating on a Dataset, which represents a single split of our data. Recon's Corpus container allows us to work with our full train/dev/test split by managing a separate Dataset for each split. The Corpus handles either 2 (train, dev) or 3 (train, dev, test) Datasets. If you split up your data in some other way, you may need to manage the lower-level Dataset for each split on your own.
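If you do split your data differently, a minimal sketch of managing the splits yourself might look like the following. The file names here are hypothetical, and Dataset.from_disk is an assumption modeled on the Corpus.from_disk call used later in this step, so check the loading helpers in your version of Recon. Dataset.apply works the same way as the Corpus.apply method introduced next.

from pathlib import Path

from recon import Dataset

data_dir = Path("./examples/data")

# Hypothetical file names; point these at wherever your splits live.
# Dataset.from_disk is assumed to mirror Corpus.from_disk in your Recon version.
splits = {
    "train": Dataset.from_disk(data_dir / "train.jsonl"),
    "dev": Dataset.from_disk(data_dir / "dev.jsonl"),
}

# Apply the same stats operation to each split individually
for name, ds in splits.items():
    print(name)
    print(ds.apply("recon.v1.get_ner_stats"))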
Recon's Corpus class provides the same apply method as Dataset, which accepts an Operation.
Update script to use Corpus.apply¶
Let's edit that main.py file you created in the previous step to utilize the Corpus.apply method.
from pathlib import Path

import typer

from recon import Corpus


def main(data_dir: Path):
    # Load the train/dev/test splits from data_dir into a single Corpus
    corpus = Corpus.from_disk(data_dir)

    # Run the NER stats operation over each split (plus the combined "all" data)
    res = corpus.apply("recon.v1.get_ner_stats")

    for name, stats in res.items():
        print(f"{name}")
        print("=" * 50)
        print(stats)


if __name__ == "__main__":
    typer.run(main)
Run the application¶
Now, if you run your script you should get the following output:
$ python main.py ./examples/data
train
==================================================
{
"n_examples":102,
"n_examples_no_entities":29,
"n_annotations_per_type":{
"SKILL":191,
"PRODUCT":34,
"JOB_ROLE":5
}
}
dev
==================================================
{
"n_examples":110,
"n_examples_no_entities":49,
"n_annotations_per_type":{
"SKILL":159,
"PRODUCT":20,
"JOB_ROLE":1
}
}
test
==================================================
{
"n_examples":96,
"n_examples_no_entities":38,
"n_annotations_per_type":{
"PRODUCT":35,
"SKILL":107,
"JOB_ROLE":2
}
}
all
==================================================
{
"n_examples":308,
"n_examples_no_entities":116,
"n_annotations_per_type":{
"SKILL":457,
"PRODUCT":89,
"JOB_ROLE":8
}
}
Analyzing the results¶
Now that we have a good understanding of how labels are distributed across our train/dev/test splits, as well as the combined totals in the "all" data, we can start to see some issues.
1. Not enough JOB_ROLE annotations¶
We clearly don't have enough JOB_ROLE annotations in our data. There's no way an NER model could learn to capture JOB_ROLE in a generic way with only 8 total annotations.
2. Barely enough PRODUCT annotations¶
We're also a little low (though not nearly as much) on our PRODUCT label.
What to do from here¶
We want our final model to be equally good at extracting all 3 labels (SKILL, PRODUCT, and JOB_ROLE), so we now know exactly where to invest more time in our annotation effort: getting more examples of JOB_ROLE and PRODUCT.
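To make the imbalance concrete, here is a quick back-of-the-envelope check using the n_annotations_per_type counts from the "all" output above (plain Python, nothing Recon-specific):

# Counts copied from the "all" stats printed above
counts = {"SKILL": 457, "PRODUCT": 89, "JOB_ROLE": 8}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label:>9}: {n:>4} ({n / total:.1%})")

# SKILL accounts for roughly 82% of all annotations, while JOB_ROLE is under 2%.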
Next Steps¶
We've only scratched the surface of Recon. It's great to have these global stats about our dataset so we can make sure we're trending in the right direction as we annotate more data. But this information doesn't help us debug the data we already have.
For example, 34 of our 191 SKILL annotations in our train set might actually be instances where JOB_ROLE or PRODUCT is more appropriate.
Or, we might have subsets of our data annotated by different people who had a slightly different understanding of the annotation requirements (or just made a couple of mistakes), creating inconsistencies in the final dataset.
In the next step of this tutorial, we'll put away the toy skills dataset and take a look at the widely used CoNLL 2003 benchmark dataset. We'll use Recon to find and correct errors in the original dataset and publish our new and improved CoNLL 2003 dataset.