Reranker Framework (ReFr)
Reranking framework for structure prediction and discriminative language modeling
hadoop-run.py File Reference

A Python program that trains a reranking model on a Hadoop cluster using the Iterative Parameter Mixtures perceptron training algorithm.

Namespaces

 hadoop-run
 

Variables

tuple hadoop-run.optParse = OptionParser()
 The following arguments are available to hadoop-run.py.
 
string hadoop-run.help = "Location of hadoop installation. If not set, "
 
string hadoop-run.default = ""
 
string hadoop-run.action = "append"
 
 hadoop-run.hadooproot = options.hadooproot
 
 hadoop-run.streamingloc = options.streamingloc
 
string hadoop-run.tmppath = hadooproot+"/contrib/streaming"
 
tuple hadoop-run.streamingjar = glob.glob(tmppath + "/hadoop-streaming*.jar")
 
list hadoop-run.filenames = []
 Collect input filenames.
 
tuple hadoop-run.hdproc
 Create output directory if it does not exist.
 
string hadoop-run.train_map_options = ""
 Configuration for training options. Options passed to the mapper binary.
 
string hadoop-run.train_files = ""
 
tuple hadoop-run.train_map
 
string hadoop-run.extractsym_map = "'"
 Shortcuts to command-line programs.
 
string hadoop-run.compiledata_map = "'"
 
string hadoop-run.train_reduce = options.refrbin+"/model-merge-reducer"
 
string hadoop-run.train_recomb = options.refrbin+"/model-combine-shards"
 
string hadoop-run.symbol_recomb = options.refrbin+"/model-combine-symbols"
 
string hadoop-run.pipeeval_options = ""
 
string hadoop-run.pipeeval = options.refrbin+"/piped-model-evaluator"
 
string hadoop-run.hadoop_inputfiles = ""
 
 hadoop-run.precompdevfile = options.develdata
 Precompilation of string features.
 
string hadoop-run.symbol_dir = options.hdfsinputdir+"/Symbols/"
 
string hadoop-run.precomp_dir = options.hdfsinputdir+"/Precompiled/"
 
string hadoop-run.precompdev_dir = options.hdfsinputdir+"/PrecompiledDev/"
 
string hadoop-run.addl_data = ""
 
string hadoop-run.symfile_name = options.outputdir+"/"
 
 hadoop-run.cur_model = options.inputmodel
 
 hadoop-run.converged = False
 
tuple hadoop-run.iteration = int(options.startiter)
 
int hadoop-run.prev_loss = -9999
 
list hadoop-run.loss_history = []
 
int hadoop-run.num_in_decline = 0
 
int hadoop-run.best_loss_index = 0
 
string hadoop-run.eval_cmd = pipeeval+" -d "
 
tuple hadoop-run.evalio = pyutil.CommandIO(eval_cmd)
 
string hadoop-run.iter_str = "'"
 
string hadoop-run.model_output = options.outputdir+"/"
 
string hadoop-run.proc_cmd = train_recomb+" -o "
 
int hadoop-run.devtest_score = 0
 
float hadoop-run.loss = 0.0
 
list hadoop-run.diff = loss_history[-1]
 

Detailed Description

A Python program that trains a reranking model on a Hadoop cluster using the Iterative Parameter Mixtures perceptron training algorithm.
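
The general idea behind iterative parameter mixing can be sketched as follows. This is a minimal, single-process illustration, not the ReFr implementation: it uses a plain binary perceptron over sparse feature dictionaries instead of a reranker, and the function names (perceptron_epoch, mix, iterative_parameter_mixing) are hypothetical. Each epoch, every shard runs one perceptron pass starting from the current mixed weights, and the per-shard weights are then averaged uniformly.

```python
def perceptron_epoch(weights, shard):
    """Run one perceptron pass over a shard, starting from `weights`.

    Each example is (features, label), where features is a dict of
    feature name -> value and label is +1 or -1.
    """
    w = dict(weights)  # copy so that shards do not share state
    for features, label in shard:
        score = sum(w.get(f, 0.0) * v for f, v in features.items())
        if label * score <= 0:  # mistake (or zero margin): update
            for f, v in features.items():
                w[f] = w.get(f, 0.0) + label * v
    return w


def mix(models):
    """Uniformly average a list of weight dicts."""
    mixed = {}
    for w in models:
        for f, v in w.items():
            mixed[f] = mixed.get(f, 0.0) + v / len(models)
    return mixed


def iterative_parameter_mixing(shards, epochs):
    """Alternate per-shard training (map) and weight mixing (reduce)."""
    weights = {}
    for _ in range(epochs):
        models = [perceptron_epoch(weights, shard) for shard in shards]
        weights = mix(models)
    return weights
```

In a distributed setting the list comprehension corresponds to the map step and mix to the reduce step; here both simply run in one process.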

You must first have a Hadoop account configured. In order to train, you will need to have the following:

The program will attempt to locate the Hadoop binary and the Hadoop streaming library. If this fails, you can specify these via command-line parameters (--hadooproot and --streamingloc).
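
The streaming-library lookup could look roughly like the sketch below. It is based only on the tmppath and streamingjar entries in the variable listing (a glob for hadoop-streaming*.jar under the contrib/streaming directory of the Hadoop root). The helper name find_streaming_jar and the rule that an explicit location takes precedence are assumptions, not the script's confirmed behaviour.

```python
import glob
import os


def find_streaming_jar(hadooproot, streamingloc=""):
    """Return a path to the Hadoop streaming jar, or None if not found.

    `hadooproot` and `streamingloc` mirror the option names listed for
    hadoop-run.py; the function itself is a hypothetical helper.
    """
    # Assumption: an explicitly supplied location wins over any search.
    if streamingloc:
        return streamingloc
    # Otherwise search <hadooproot>/contrib/streaming for the jar.
    tmppath = os.path.join(hadooproot, "contrib", "streaming")
    candidates = sorted(glob.glob(os.path.join(tmppath, "hadoop-streaming*.jar")))
    return candidates[0] if candidates else None
```

A caller would typically report an error and exit when the function returns None, since no streaming job can be launched without the jar.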

Usage: hadoop-run.py --input InputData --hdfsinputdir HDFSInDir \
       --hdfsoutputdir HDFSOutDir --outputdir OutputDir

InputData - A comma-separated list of file globs containing the training data. These must be accessible by the script.
OutputDir - The local directory where the trained model(s) are written. The default model name is 'model'. You can change this using the --modelname command-line parameter.
HDFSInDir - A directory on HDFS where the input data will be copied to.
HDFSOutDir - A directory on HDFS where the temporary data and output data will be written to. The final models are copied to the locally accessible OutputDir.
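
As an illustration of how these parameters might be parsed and how a comma-separated list of globs expands into filenames, here is a short sketch using optparse, which the optParse entry in the variable listing suggests the script uses. The option names come from the usage line; the dest name inputlist, the help strings, and the helper names build_parser and expand_inputs are assumptions made for this example.

```python
import glob
from optparse import OptionParser


def build_parser():
    """Build a parser for the four required options plus --modelname."""
    parser = OptionParser()
    # action="append" lets --input be repeated; dest name is hypothetical.
    parser.add_option("--input", dest="inputlist", action="append",
                      help="Comma-separated list of training file globs.")
    parser.add_option("--hdfsinputdir", dest="hdfsinputdir", default="",
                      help="HDFS directory the input data is copied to.")
    parser.add_option("--hdfsoutputdir", dest="hdfsoutputdir", default="",
                      help="HDFS directory for temporary and output data.")
    parser.add_option("--outputdir", dest="outputdir", default="",
                      help="Local directory for the trained model(s).")
    parser.add_option("--modelname", dest="modelname", default="model",
                      help="Name of the trained model.")
    return parser


def expand_inputs(input_args):
    """Expand a list of comma-separated glob strings into filenames."""
    filenames = []
    for arg in input_args or []:
        for pattern in arg.split(","):
            pattern = pattern.strip()
            if pattern:
                filenames.extend(sorted(glob.glob(pattern)))
    return filenames
```

For example, build_parser().parse_args(["--input", "data/*.gz"]) returns an options object whose inputlist can be passed directly to expand_inputs; the path data/*.gz is only a placeholder.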

Check input command line options.

Author
kbhall@google.com (Keith Hall)

Definition in file hadoop-run.py.