For a concrete example of how to run the training script, refer to the Neural Machine Translation Tutorial.

Configuring Training

Also see Configuration. The configuration for input data, models, and training parameters is done via YAML. You can pass YAML strings directly to the training script, or create configuration files and pass their paths to the script. The two approaches are technically equivalent, but large YAML strings can become difficult to manage, so we recommend the latter. For example, the following two are equivalent:

1. Pass FLAGS directly:

python -m bin.train \
  --model AttentionSeq2Seq \
  --model_params "
    embedding.dim: 256
    encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
    encoder.params:
      rnn_cell:
        cell_class: GRUCell"

2. Define config.yml:

model: AttentionSeq2Seq
model_params:
  embedding.dim: 256
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell

... and pass the file path via the config_paths flag:

python -m bin.train --config_paths config.yml

Multiple configuration files are merged recursively, in the order they are passed. This means you can have separate configuration files for model hyperparameters, input data, and training options, and mix and match as needed.
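For example, you might split your settings across three files and merge them in a single invocation (the file names below are hypothetical):

python -m bin.train \
  --config_paths "model_params.yml,input_data.yml,train_options.yml"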

For concrete examples of configuration files, refer to the example configurations and the Neural Machine Translation Tutorial.

Monitoring Training

In addition to looking at the output of the training script, TensorFlow writes summaries and training logs into the specified output_dir. Use TensorBoard to visualize training progress.

tensorboard --logdir=/path/to/model/dir

Distributed Training

Distributed training is supported out of the box using tf.learn. Cluster configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed TensorFlow Guide for more information.
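As a rough sketch of what this looks like, each process in the cluster is launched with its own TF_CONFIG value. The host names, ports, and task assignments below are placeholders; consult the Distributed TensorFlow Guide for the authoritative JSON schema expected by your TensorFlow version:

# Hypothetical cluster: every process receives the same "cluster"
# block but its own "task" entry identifying its role and index.
TF_CONFIG='{
  "cluster": {
    "master": ["host1.example.com:2222"],
    "worker": ["host2.example.com:2222"],
    "ps": ["host3.example.com:2222"]
  },
  "task": {"type": "worker", "index": 0},
  "environment": "cloud"
}' python -m bin.train --config_paths config.yml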

Training Script Reference

The training script supports many more options:

Argument | Default | Description
config_paths | "" | Path to a YAML configuration file defining FLAG values. Multiple files can be separated by commas. Files are merged recursively. Setting a key in these files is equivalent to setting the FLAG value with the same name.
hooks | "[]" | YAML configuration string for the training hooks to use.
metrics | "[]" | YAML configuration string for the training metrics to use.
model | "" | Name of the model class. Can be either a fully-qualified name, or the name of a class defined in seq2seq.models.
model_params | "{}" | YAML configuration string for the model parameters.
input_pipeline_train | "{}" | YAML configuration string for the training data input pipeline.
input_pipeline_dev | "{}" | YAML configuration string for the development data input pipeline.
buckets | None | Buckets input sequences according to these lengths. A comma-separated list of sequence length buckets, e.g. "10,20,30" results in 4 buckets: <10, 10-20, 20-30, >30. None disables bucketing.
batch_size | 16 | Batch size used for training and evaluation.
output_dir | None | The directory to write model checkpoints and summaries to. If None, a local temporary directory is created.
train_steps | None | Maximum number of training steps to run. If None, train forever.
eval_every_n_steps | 1000 | Run evaluation on validation data every N steps.
tf_random_seed | None | Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.
save_checkpoints_secs | 600 | Save checkpoints every N seconds. Cannot be specified with save_checkpoints_steps.
save_checkpoints_steps | None | Save checkpoints every N steps. Cannot be specified with save_checkpoints_secs.
keep_checkpoint_max | 5 | Maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept.
keep_checkpoint_every_n_hours | 4 | In addition to keeping the most recent checkpoint files, keep one checkpoint file for every N hours of training.
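
To illustrate how these options combine, the following example overrides a few defaults on top of a configuration file (all paths and values are placeholders):

python -m bin.train \
  --config_paths example_config.yml \
  --batch_size 32 \
  --train_steps 100000 \
  --eval_every_n_steps 2000 \
  --output_dir /tmp/seq2seq_model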