Available Datasets

We provide scripts that generate the following standard datasets.

| Dataset | Description | Training/Dev/Test Size | Vocabulary | Download |
| --- | --- | --- | --- | --- |
| WMT'16 EN-DE | Data for the WMT'16 Translation Task English to German. Training data is combined from Europarl v7, Common Crawl, and News Commentary v11. Development data sets include newstest[2010-2015]. newstest2016 should serve as test data. All SGM files were converted to plain text. | 4.56M/3K/2.6K | 32k BPE | Generate, Download |
| WMT'17 All Pairs | Data for the WMT'17 Translation Task. | Coming soon. | Coming soon. | Coming soon |
| Toy Copy | A toy dataset where the target sequence is equal to the source sequence. The model must learn to copy the source sequence. | 10k/1k/1k | 20 | Generate |
| Toy Reverse | A toy dataset where the target sequence is equal to the reversed source sequence. The model must learn to reverse the source sequence. | 10k/1k/1k | 20 | Generate |
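
To make the two toy tasks concrete, here is a minimal sketch of how such source/target pairs could be produced. It is not the provided generation script: the function name `make_toy_pairs`, the integer-string tokens, and the sequence length bounds are assumptions chosen for illustration. Only the vocabulary size of 20 and the copy/reverse relationship come from the descriptions above.

```python
import random


def make_toy_pairs(num_examples, vocab_size=20, min_len=5, max_len=20,
                   reverse=False, seed=0):
    """Return a list of (source, target) strings of space-separated tokens.

    Hypothetical helper: token names and length bounds are assumptions.
    """
    rng = random.Random(seed)
    # Assumed vocabulary: the strings "0" .. "19" for vocab_size=20.
    vocab = [str(i) for i in range(vocab_size)]
    pairs = []
    for _ in range(num_examples):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        source = [rng.choice(vocab) for _ in range(length)]
        # Copy task: target equals source. Reverse task: target is reversed.
        target = source[::-1] if reverse else list(source)
        pairs.append((" ".join(source), " ".join(target)))
    return pairs


copy_pairs = make_toy_pairs(3)
reverse_pairs = make_toy_pairs(3, reverse=True)
```

Calling the function with 10,000, 1,000, and 1,000 examples and different seeds would give splits matching the sizes listed above.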

Creating your own data

To create your own data, we recommend taking a look at the data generation scripts above. A typical data preprocessing pipeline looks as follows:

  1. Generate data in parallel text format
  2. Tokenize your data
  3. Create fixed vocabularies for your source and target data
  4. Learn and apply subword units to handle rare and unknown words
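
The first three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's own tooling: `tokenize` and `build_vocab` are hypothetical helpers, the regex tokenizer is deliberately naive, and the example sentences are made up. Step 4 (subword units such as BPE) is left out because it is normally handled by a dedicated tool.

```python
import collections
import re


def tokenize(line):
    """Naive tokenizer (assumption): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)


def build_vocab(tokenized_lines, max_size=None, min_count=1):
    """Return tokens sorted by descending frequency, ties broken by token."""
    counts = collections.Counter(
        tok for line in tokenized_lines for tok in line)
    items = [(tok, c) for tok, c in counts.items() if c >= min_count]
    items.sort(key=lambda item: (-item[1], item[0]))
    # A slice with max_size=None keeps every token.
    return [tok for tok, _ in items[:max_size]]


# Step 1: parallel text, where line i of the source aligns with line i
# of the target. The sentences here are invented examples.
sources = ["Hello, world!", "How are you?"]
targets = ["Hallo, Welt!", "Wie geht es dir?"]
assert len(sources) == len(targets)

# Step 2: tokenize both sides.
src_tokens = [tokenize(s) for s in sources]
tgt_tokens = [tokenize(t) for t in targets]

# Step 3: build one fixed vocabulary per side.
src_vocab = build_vocab(src_tokens)
tgt_vocab = build_vocab(tgt_tokens)
```

Keeping separate source and target vocabularies is one common choice. A shared vocabulary is also possible, particularly when subword units are learned jointly over both languages.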