Available Datasets
We provide scripts that generate the following standard datasets.
Dataset | Description | Training/Dev/Test Size | Vocabulary | Download |
---|---|---|---|---|
WMT'16 EN-DE | Data for the WMT'16 Translation Task English to German. Training data is combined from Europarl v7, Common Crawl, and News Commentary v11. Development data sets include newstest[2010-2015]. newstest2016 should serve as test data. All SGM files were converted to plain text. | 4.56M/3K/2.6K | 32k BPE | Generate, Download |
WMT'17 All Pairs | Data for the WMT'17 Translation Task. | Coming soon | Coming soon | Coming soon |
Toy Copy | A toy dataset where the target sequence is equal to the source sequence. The model must learn to copy the source sequence. | 10k/1k/1k | 20 | Generate |
Toy Reverse | A toy dataset where the target sequence is equal to the reversed source sequence. The model must learn to reverse the source sequence. | 10k/1k/1k | 20 | Generate |
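
To illustrate what the two toy datasets contain, here is a minimal sketch that builds copy and reverse pairs in memory. It is not the repository's generation script: the function names `make_toy_pair` and `make_toy_dataset`, the sequence length limits, and the use of integer strings as the 20 vocabulary symbols are all assumptions made for illustration.

```python
import random


def make_toy_pair(rng, vocab_size=20, min_len=1, max_len=10, reverse=False):
    """Return one (source, target) pair as space-separated token strings.

    Tokens are the strings "0".."vocab_size - 1" (an assumed symbol set).
    """
    length = rng.randint(min_len, max_len)
    tokens = [str(rng.randrange(vocab_size)) for _ in range(length)]
    # Copy task: target equals source. Reverse task: target is the source reversed.
    target = tokens[::-1] if reverse else tokens
    return " ".join(tokens), " ".join(target)


def make_toy_dataset(num_examples, seed=0, reverse=False):
    """Build a list of (source, target) pairs with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [make_toy_pair(rng, reverse=reverse) for _ in range(num_examples)]


copy_data = make_toy_dataset(5)
reverse_data = make_toy_dataset(5, reverse=True)
```

Writing the first elements of each pair to one file and the second elements to another, one example per line, gives source and target files in parallel text format.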
Creating your own data
To create your own data, we recommend taking a look at the data generation scripts above. A typical data preprocessing pipeline looks as follows:
- Generate data in parallel text format
- Tokenize your data
- Create fixed vocabularies for your source and target data
- Learn and apply subword units to handle rare and unknown words
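
The tokenization and vocabulary steps above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the project's own tooling: the whitespace tokenizer stands in for a real tokenizer, the helper names `tokenize` and `build_vocab` are hypothetical, and the example sentences are made up. Learning subword units is not covered here.

```python
from collections import Counter


def tokenize(line):
    # Naive whitespace tokenizer (an assumption); a real pipeline would use
    # a proper tokenizer that also handles punctuation.
    return line.strip().split()


def build_vocab(lines, max_size=None, min_count=1):
    """Return tokens sorted by descending frequency, ties broken alphabetically."""
    counts = Counter(tok for line in lines for tok in tokenize(line))
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [tok for tok, count in items if count >= min_count]
    # Truncate to a fixed size if requested.
    return vocab[:max_size] if max_size is not None else vocab


# Parallel text: line i of the source corresponds to line i of the target.
sources = ["the cat sat", "the dog sat"]
targets = ["die Katze saß", "der Hund saß"]

source_vocab = build_vocab(sources)
target_vocab = build_vocab(targets)
```

Building separate vocabularies for source and target keeps each one fixed and independent, which matches the step listed above; tokens outside the truncated vocabulary are the rare and unknown words that subword units are meant to handle.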