Corpus
Container for a full corpus with train, dev, and test splits. Use it to apply core functions to all datasets at once.
Source code in recon/corpus.py
all: List[Example]
property
Return concatenation of train/dev/test datasets
Returns:
Type | Description |
---|---|
List[Example] | All Examples in Corpus |
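As a minimal sketch of what the concatenation means, the stand-in class below joins the three splits into one list. `SketchCorpus` is hypothetical and is not Recon's implementation; plain strings stand in for `Example` objects, and the train, dev, test ordering is an assumption.

```python
from typing import List, Optional


class SketchCorpus:
    """Hypothetical stand-in for Corpus; plain strings stand in for Example objects."""

    def __init__(self, train: List[str], dev: List[str], test: Optional[List[str]] = None):
        self.train = train
        self.dev = dev
        # A missing test split is treated as empty.
        self.test = test if test is not None else []

    @property
    def all(self) -> List[str]:
        # Assumed order: train, then dev, then test.
        return self.train + self.dev + self.test


corpus = SketchCorpus(train=["t1", "t2"], dev=["d1"], test=["x1"])
print(corpus.all)  # ['t1', 't2', 'd1', 'x1']
```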
dev: List[Example]
property
Return dev dataset
Returns:
Type | Description |
---|---|
List[Example] | Dev Examples |
dev_ds: Dataset
property
name: str
property
Get Corpus name
Returns:
Type | Description |
---|---|
str | Corpus name |
test: List[Example]
property
Return test dataset
Returns:
Type | Description |
---|---|
List[Example] | Test Examples |
test_ds: Dataset
property
train: List[Example]
property
Return train dataset
Returns:
Type | Description |
---|---|
List[Example] | Train Examples |
train_ds: Dataset
property
__init__(name, train, dev, test=None, example_store=None)
Initialize a Corpus.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the Corpus | required |
train | Dataset | Dataset containing examples for train set | required |
dev | Dataset | Dataset containing examples for dev set | required |
test | Optional[Dataset] | Optional Dataset containing examples for test set | None |
example_store | Optional[ExampleStore] | Optional ExampleStore | None |
Source code in recon/corpus.py
apply(func, *args, **kwargs)
Apply a function to all datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Union[str, StatsProtocol] | Function that operates on a list of examples and returns some result. Useful for running the same stats operation on each dataset. If a str is provided, the function is resolved from the stat functions registry. | required |
Returns:
Type | Description |
---|---|
CorpusApplyResult | Maps each dataset name to the value returned by func |
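The behavior described above amounts to calling one function per split and collecting the results by dataset name. The sketch below shows that pattern with a plain dict. `apply_to_splits` and `count_longer_than` are hypothetical names, the split keys are assumptions, and resolving a str through the stat functions registry is not shown.

```python
from typing import Any, Callable, Dict, List


def apply_to_splits(
    splits: Dict[str, List[str]],
    func: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: call func once per split and key results by split name."""
    return {name: func(examples, *args, **kwargs) for name, examples in splits.items()}


def count_longer_than(examples: List[str], min_chars: int = 0) -> int:
    """Toy stats function: count examples longer than min_chars characters."""
    return sum(1 for text in examples if len(text) > min_chars)


splits = {
    "train": ["short", "a much longer example"],
    "dev": ["tiny"],
    "test": [],
}
# Extra keyword arguments are forwarded to the stats function for every split.
result = apply_to_splits(splits, count_longer_than, min_chars=4)
print(result)  # {'train': 2, 'dev': 0, 'test': 0}
```

Returning one entry per split keeps the results comparable side by side, which is the main reason to run a stats function at the Corpus level rather than per Dataset.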
Source code in recon/corpus.py
apply_(operation, *args, **kwargs)
Apply an operation to each Dataset via Dataset.apply_
Parameters:
Name | Type | Description | Default |
---|---|---|---|
operation | Union[str, Operation] | An Operation to modify the Dataset with. | required |
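To illustrate the delegation to each Dataset, here is a rough sketch in which a Corpus-level call forwards the same operation to every dataset, and each dataset replaces its own contents. `SketchDataset` and `strip_whitespace` are hypothetical stand-ins, and the in-place semantics are an assumption based on the trailing underscore and the phrase "modify the Dataset".

```python
from typing import Callable, Dict, List


class SketchDataset:
    """Hypothetical stand-in for Dataset; holds a mutable list of text examples."""

    def __init__(self, name: str, data: List[str]):
        self.name = name
        self.data = data

    def apply_(self, operation: Callable[[List[str]], List[str]]) -> None:
        # Replace the stored examples with the operation's output (in place).
        self.data = operation(self.data)


def strip_whitespace(examples: List[str]) -> List[str]:
    """Toy operation: trim leading and trailing whitespace from every example."""
    return [text.strip() for text in examples]


datasets: Dict[str, SketchDataset] = {
    "train": SketchDataset("train", ["  hello ", "world  "]),
    "dev": SketchDataset("dev", [" dev example"]),
}

# Corpus-level apply_: forward the same operation to every dataset.
for ds in datasets.values():
    ds.apply_(strip_whitespace)

print(datasets["train"].data)  # ['hello', 'world']
```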
Source code in recon/corpus.py
from_disk(data_dir, name='corpus', train_name='train', dev_name='dev', test_name='test')
classmethod
Load a Corpus from disk, given a directory containing files named train.jsonl, dev.jsonl, and test.jsonl.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_dir | Path | Directory to load from. | required |
train_name | str | Name of train data under data_dir. | 'train' |
dev_name | str | Name of dev data under data_dir. | 'dev' |
test_name | str | Name of test data under data_dir. | 'test' |
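The directory layout described above can be sketched with the standard library alone. The loader below is hypothetical and is not Recon's implementation: it assumes each split name maps to `<name>.jsonl` under `data_dir`, that each line holds one JSON object, and that a missing file yields an empty split.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_splits(
    data_dir: Path,
    train_name: str = "train",
    dev_name: str = "dev",
    test_name: str = "test",
) -> Dict[str, List[dict]]:
    """Hypothetical loader: read <name>.jsonl for each split; a missing file yields []."""
    splits: Dict[str, List[dict]] = {}
    for split, file_name in (("train", train_name), ("dev", dev_name), ("test", test_name)):
        path = data_dir / f"{file_name}.jsonl"
        if path.exists():
            with path.open(encoding="utf-8") as f:
                # One JSON object per non-empty line.
                splits[split] = [json.loads(line) for line in f if line.strip()]
        else:
            splits[split] = []
    return splits


# Build a throwaway directory so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.jsonl").write_text('{"text": "Apple is a company"}\n', encoding="utf-8")
    (data_dir / "dev.jsonl").write_text('{"text": "Paris is a city"}\n', encoding="utf-8")
    loaded = load_splits(data_dir)

print({name: len(rows) for name, rows in loaded.items()})  # {'train': 1, 'dev': 1, 'test': 0}
```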
Source code in recon/corpus.py
from_prodigy(name, prodigy_train_datasets, prodigy_dev_datasets, prodigy_test_datasets=None)
classmethod
Load a Corpus from three groups of Prodigy datasets, one group each for train, dev, and test.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Corpus name | required |
prodigy_train_datasets | List[str] | Prodigy datasets to load as Recon train dataset | required |
prodigy_dev_datasets | List[str] | Prodigy datasets to load as Recon dev dataset | required |
prodigy_test_datasets | Optional[List[str]] | Prodigy datasets to load as Recon test dataset | None |
Returns:
Type | Description |
---|---|
Corpus | Corpus initialized from Prodigy datasets |
Source code in recon/corpus.py
pipe_(operations)
Run a sequence of operations on each dataset. Calls Dataset.pipe_ for each dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
operations | List[Union[str, OperationState]] | List of operations | required |
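Running a sequence of operations means each dataset passes through every operation in list order, with each operation receiving the previous one's output. The sketch below shows that flow on plain lists of strings. `pipe_all_`, `lowercase`, and `drop_empty` are hypothetical names, and callables stand in for the str and OperationState entries the real method accepts.

```python
from typing import Callable, Dict, List

TextOperation = Callable[[List[str]], List[str]]


def pipe_all_(datasets: Dict[str, List[str]], operations: List[TextOperation]) -> None:
    """Hypothetical helper: run every operation, in order, on each dataset in place."""
    for name in datasets:
        for operation in operations:
            # Each operation sees the output of the previous one.
            datasets[name] = operation(datasets[name])


def lowercase(examples: List[str]) -> List[str]:
    """Toy operation: lowercase every example."""
    return [text.lower() for text in examples]


def drop_empty(examples: List[str]) -> List[str]:
    """Toy operation: remove empty examples."""
    return [text for text in examples if text]


datasets = {"train": ["Hello", "", "WORLD"], "dev": ["", "Dev"]}
pipe_all_(datasets, [lowercase, drop_empty])
print(datasets)  # {'train': ['hello', 'world'], 'dev': ['dev']}
```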
Source code in recon/corpus.py
to_disk(output_dir, overwrite=False)
Save the Corpus to disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_dir | Path | Directory to save data to | required |
overwrite | bool | Force save to directory. Create parent directories and/or overwrite existing data. | False |
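The role of `overwrite` can be sketched as a guard around the write. The saver below is hypothetical and is not Recon's implementation: it assumes one `<split>.jsonl` file per split, assumes an existing directory is refused unless `overwrite=True`, and always creates parent directories, which may differ from the real method.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_splits(splits: Dict[str, List[dict]], output_dir: Path, overwrite: bool = False) -> None:
    """Hypothetical saver: write one <split>.jsonl file per split."""
    if output_dir.exists() and not overwrite:
        # Assumed behavior: refuse to touch an existing directory unless forced.
        raise ValueError(f"{output_dir} already exists; pass overwrite=True to replace it")
    output_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with (output_dir / f"{name}.jsonl").open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")


splits = {"train": [{"text": "Apple is a company"}], "dev": [{"text": "Paris is a city"}]}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "corpus"
    save_splits(splits, out)  # first write succeeds: the directory does not exist yet
    try:
        save_splits(splits, out)  # second write is refused without overwrite
        refused = False
    except ValueError:
        refused = True
    save_splits(splits, out, overwrite=True)  # forced write succeeds
    written = sorted(p.name for p in out.iterdir())

print(refused, written)  # True ['dev.jsonl', 'train.jsonl']
```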
Source code in recon/corpus.py
to_prodigy(name=None, prodigy_train_dataset=None, prodigy_dev_dataset=None, prodigy_test_dataset=None, overwrite=True)
Save a Corpus to three separate Prodigy datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | Optional[str] | Name prefix for datasets in Prodigy | None |
prodigy_train_dataset | Optional[str] | Train dataset name in Prodigy | None |
prodigy_dev_dataset | Optional[str] | Dev dataset name in Prodigy | None |
prodigy_test_dataset | Optional[str] | Test dataset name in Prodigy | None |
Source code in recon/corpus.py