Stats
calculate_entity_coverage_entropy(entity_coverage)
Use Entropy to calculate a metric for entity coverage.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
entity_coverage | List[EntityCoverage] | List of EntityCoverage from get_entity_coverage | required |
Returns:

Name | Type | Description |
---|---|---|
float | float | Entropy for entity coverage counts |
Source code in recon/stats.py (lines 250–263)
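A minimal usage sketch, assuming `corpus` is an already-loaded recon corpus with a `train` attribute (as in the examples further down this page):

```python
from recon.stats import get_entity_coverage, calculate_entity_coverage_entropy

# Build per-span coverage counts, then score how evenly they are distributed.
coverage = get_entity_coverage(corpus.train)
entropy = calculate_entity_coverage_entropy(coverage)
print(entropy)  # higher entropy = coverage spread more evenly across text/label spans
```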
calculate_entity_coverage_similarity(x, y)
Calculate how well dataset x covers the entities in dataset y. This is typically used to measure how well your train set annotations cover the annotations in your dev/test set.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | List[Example] | Dataset to compare coverage to (usually corpus.train) | required |
y | List[Example] | Dataset to evaluate coverage for (usually corpus.dev or corpus.test) | required |
Returns:

Name | Type | Description |
---|---|---|
EntityCoverageStats | EntityCoverageStats | Stats with 1) the base entity coverage (does each entity in y exist in x) and 2) count coverage (the sum of the EntityCoverage.count property for each EntityCoverage in y, giving a more holistic coverage scaled by how often entities occur in each of x and y) |
Source code in recon/stats.py (lines 148–190)
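A sketch of comparing train-set coverage against the dev set, again assuming a loaded `corpus`:

```python
from recon.stats import calculate_entity_coverage_similarity

# How well do the train annotations cover the entities annotated in dev?
stats = calculate_entity_coverage_similarity(corpus.train, corpus.dev)
print(stats)  # EntityCoverageStats with entity coverage and count coverage
```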
calculate_label_balance_entropy(ner_stats)
Use Entropy to calculate a metric for label balance based on a Stats object.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ner_stats | Stats | Stats for a dataset. | required |
Returns:

Name | Type | Description |
---|---|---|
float | float | Entropy for annotation counts of each label |
Source code in recon/stats.py (lines 236–247)
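A sketch of checking label balance for a train set, assuming a loaded `corpus`:

```python
from recon.stats import get_ner_stats, calculate_label_balance_entropy

ner_stats = get_ner_stats(corpus.train)
balance = calculate_label_balance_entropy(ner_stats)
print(balance)  # higher entropy = annotation counts are more evenly balanced across labels
```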
calculate_label_distribution_similarity(x, y)
Calculate the similarity of the label distributions of two datasets.
For example, this can help you understand how well your train set models your dev and test sets. Empirically, you want a similarity above 0.8 when comparing your train set to each of your dev and test sets.
    calculate_label_distribution_similarity(corpus.train, corpus.dev)
    # 98.57

    calculate_label_distribution_similarity(corpus.train, corpus.test)
    # 73.29 - This is bad, let's investigate our test set more
Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | List[Example] | Dataset | required |
y | List[Example] | Dataset to compare x to | required |
Returns:

Name | Type | Description |
---|---|---|
float | float | Similarity of label distributions |
Source code in recon/stats.py (lines 68–99)
detect_outliers(seq, use_log=False)
Detect outliers in a numerical sequence.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
seq | Sequence[Any] | Sequence of ints or floats | required |
use_log | bool | Use logarithm of seq. | False |
Returns:

Type | Description |
---|---|
Outliers | Tuple[List[int], List[int]]: Tuple of low and high indices |
Source code in recon/stats.py (lines 266–283)
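A small sketch with made-up numbers; per the return description above, the result is assumed to unpack into low and high index lists:

```python
from recon.stats import detect_outliers

token_counts = [12, 14, 13, 15, 11, 90, 13, 12]  # hypothetical per-example token counts
low, high = detect_outliers(token_counts)
print(low, high)  # indices of unusually small and unusually large values
```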
get_entity_coverage(data, sep='||', case_sensitive=False, return_examples=False)
Identify how well your dataset covers an entity type. Get insight into how many times certain text/label span combinations exist across your data so you can better focus your annotation efforts, rather than annotating examples your model already understands well.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | List[Example] | List of examples | required |
sep | str | Separator used in coverage map, only change if \|\| exists in your text or label. | '\|\|' |
case_sensitive | bool | Consider case of text for each annotation | False |
return_examples | bool | Return Examples that contain the entity label annotation. | False |
Returns:

Type | Description |
---|---|
List[EntityCoverage] | List[EntityCoverage]: Sorted List of EntityCoverage objects containing the text, label, count, and an optional list of examples where that text/label annotation exists. |
Source code in recon/stats.py (lines 102–145)
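A sketch of inspecting coverage for a train set, assuming a loaded `corpus`; the field names follow the return description above:

```python
from recon.stats import get_entity_coverage

coverage = get_entity_coverage(corpus.train)
for item in coverage[:10]:  # the list comes back sorted
    print(item.text, item.label, item.count)
```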
get_ner_stats(data, return_examples=False)
Compute statistics for NER data.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | Iterator[Example] | Data as a List of examples | required |
return_examples | bool | Whether to return examples per type | False |
Returns:

Name | Type | Description |
---|---|---|
Stats | Stats | Summary stats from list of Examples |
Source code in recon/stats.py (lines 13–50)
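A minimal sketch, assuming a loaded `corpus`:

```python
from recon.stats import get_ner_stats

train_stats = get_ner_stats(corpus.train)
print(train_stats)  # Stats summary, including annotation counts per type
```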
get_probs_from_counts(seq)
Convert a sequence of counts to a sequence of probabilities by dividing each n by the sum of all n in seq.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
seq | Sequence[int] | Sequence of counts | required |
Returns:

Type | Description |
---|---|
Sequence[float] | Sequence[float]: Sequence of probabilities |
Source code in recon/stats.py (lines 193–203)
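A quick worked example: each count is divided by the total of the sequence.

```python
from recon.stats import get_probs_from_counts

probs = get_probs_from_counts([8, 2, 10])  # total = 20
print(list(probs))  # expected: [0.4, 0.1, 0.5]
```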
get_sorted_type_counts(ner_stats)
Get a list of counts for each type in the n_annotations_per_type property of a Stats object, sorted by type name.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ner_stats | Stats | Dataset stats | required |
Returns:

Type | Description |
---|---|
List[int] | List[int]: List of counts sorted by type name |
Source code in recon/stats.py (lines 53–65)
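A sketch that pairs this with get_ner_stats, assuming a loaded `corpus`; sorting by type name makes the count vectors of two datasets directly comparable:

```python
from recon.stats import get_ner_stats, get_sorted_type_counts

train_counts = get_sorted_type_counts(get_ner_stats(corpus.train))
dev_counts = get_sorted_type_counts(get_ner_stats(corpus.dev))
print(train_counts, dev_counts)  # counts per label, both ordered by label name
```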