Data

Available Datasets

We provide scripts that generate the following standard datasets.

Dataset	Description	Training/Dev/Test Size	Vocabulary	Download
WMT'16 EN-DE	Data for the WMT'16 English-to-German translation task. Training data is combined from Europarl v7, Common Crawl, and News Commentary v11. Development data sets include `newstest[2010-2015]`; `newstest2016` should serve as test data. All SGM files were converted to plain text.	4.56M/3K/2.6K	32k BPE	Generate, Download
WMT'17 All Pairs	Data for the WMT'17 Translation Task.	Coming soon	Coming soon	Coming soon
Toy Copy	A toy dataset where the target sequence is equal to the source sequence. The model must learn to copy the source sequence.	10k/1k/1k	20	Generate
Toy Reverse	A toy dataset where the target sequence is equal to the reversed source sequence. The model must learn to reverse the source sequence.	10k/1k/1k	20	Generate
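
The two toy tasks are simple enough to sketch directly. The following is a minimal illustration of how such data could be produced, not the generation script linked in the table. The function names, the sequence length bounds, and the choice of integer strings as the 20-symbol vocabulary are all assumptions made for this example.

```python
import random


def make_toy_pairs(num_examples, vocab_size=20, min_len=1, max_len=10,
                   reverse=False, seed=0):
    """Return (source, target) pairs for a toy copy or reverse task.

    Hypothetical helper: the length bounds and vocabulary are assumptions,
    not values taken from the project's own scripts.
    """
    rng = random.Random(seed)
    # Assumed vocabulary: the integers 0..vocab_size-1 rendered as strings.
    vocab = [str(i) for i in range(vocab_size)]
    pairs = []
    for _ in range(num_examples):
        length = rng.randint(min_len, max_len)
        source = [rng.choice(vocab) for _ in range(length)]
        # Copy task: target equals source. Reverse task: target is source reversed.
        target = source[::-1] if reverse else list(source)
        pairs.append((" ".join(source), " ".join(target)))
    return pairs


def write_parallel(pairs, source_path, target_path):
    """Write one example per line into parallel source and target files."""
    with open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            src.write(source + "\n")
            tgt.write(target + "\n")
```

To mirror the 10k/1k/1k split listed in the table, one could call `make_toy_pairs` three times with different seeds (for example 10000, 1000, and 1000 examples) and pass each result to `write_parallel` with its own pair of file paths.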

To create your own data, we recommend using the data generation scripts above as a starting point. A typical data preprocessing pipeline looks as follows: