A Python program that trains a reranking model on a Hadoop cluster using the Iterative Parameter Mixtures perceptron training algorithm.

Namespaces
	hadoop-run

Variables
tuple	hadoop-run.optParse = OptionParser()
	The following arguments are available to hadoop-run.py.

string	hadoop-run.help = "Location of hadoop installation. If not set, "

string	hadoop-run.default = ""

string	hadoop-run.action = "append"

	hadoop-run.hadooproot = options.hadooproot

	hadoop-run.streamingloc = options.streamingloc

string	hadoop-run.tmppath = hadooproot+"/contrib/streaming"

tuple	hadoop-run.streamingjar = glob.glob(tmppath + "/hadoop-streaming*.jar")

list	hadoop-run.filenames = []
	Collect input filenames.

tuple	hadoop-run.hdproc
	Create output directory if it does not exist.

string	hadoop-run.train_map_options = ""
	Configuration for training options. Options passed to the mapper binary.

string	hadoop-run.train_files = ""

tuple	hadoop-run.train_map

string	hadoop-run.extractsym_map = "'"
	Shortcuts to command-line programs.

string	hadoop-run.compiledata_map = "'"

string	hadoop-run.train_reduce = options.refrbin+"/model-merge-reducer"

string	hadoop-run.train_recomb = options.refrbin+"/model-combine-shards"

string	hadoop-run.symbol_recomb = options.refrbin+"/model-combine-symbols"

string	hadoop-run.pipeeval_options = ""

string	hadoop-run.pipeeval = options.refrbin+"/piped-model-evaluator"

string	hadoop-run.hadoop_inputfiles = ""

	hadoop-run.precompdevfile = options.develdata
	Precompilation of string features.

string	hadoop-run.symbol_dir = options.hdfsinputdir+"/Symbols/"

string	hadoop-run.precomp_dir = options.hdfsinputdir+"/Precompiled/"

string	hadoop-run.precompdev_dir = options.hdfsinputdir+"/PrecompiledDev/"

string	hadoop-run.addl_data = ""

string	hadoop-run.symfile_name = options.outputdir+"/"

	hadoop-run.cur_model = options.inputmodel

	hadoop-run.converged = False

tuple	hadoop-run.iteration = int(options.startiter)

int	hadoop-run.prev_loss = -9999

list	hadoop-run.loss_history = []

int	hadoop-run.num_in_decline = 0

int	hadoop-run.best_loss_index = 0

string	hadoop-run.eval_cmd = pipeeval+" -d "

tuple	hadoop-run.evalio = pyutil.CommandIO(eval_cmd)

string	hadoop-run.iter_str = "'"

string	hadoop-run.model_output = options.outputdir+"/"

string	hadoop-run.proc_cmd = train_recomb+" -o "

int	hadoop-run.devtest_score = 0

float	hadoop-run.loss = 0.0

list	hadoop-run.diff = loss_history[-1]

Detailed Description

A Python program that trains a reranking model on a Hadoop cluster using the Iterative Parameter Mixtures perceptron training algorithm.
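The general idea of iterative parameter mixing can be sketched in plain Python. This is a minimal single-process illustration, not the code that hadoop-run.py runs on the cluster: the function names (`score`, `perceptron_epoch`, `mix`, `iterative_parameter_mixing`), the data layout (each example is a list of `(features, loss)` candidates) and the uniform averaging are all assumptions made for the example. Each shard is trained for one perceptron epoch starting from the current mixed weights (the map step), and the per-shard models are then averaged (the reduce step).

```python
def score(weights, features):
    """Dot product of a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def perceptron_epoch(weights, shard):
    """One reranking-perceptron pass over a shard, starting from `weights`.

    Assumed layout: each example in the shard is a list of
    (features, loss) candidates; the lowest-loss candidate is the oracle.
    """
    w = dict(weights)  # copy so that shards do not share state
    for candidates in shard:
        oracle = min(candidates, key=lambda c: c[1])
        predicted = max(candidates, key=lambda c: score(w, c[0]))
        if predicted is not oracle:
            # Move the weights toward the oracle and away from the prediction.
            for name, value in oracle[0].items():
                w[name] = w.get(name, 0.0) + value
            for name, value in predicted[0].items():
                w[name] = w.get(name, 0.0) - value
    return w


def mix(models):
    """Uniformly average the per-shard weight dicts."""
    mixed = {}
    for w in models:
        for name, value in w.items():
            mixed[name] = mixed.get(name, 0.0) + value / len(models)
    return mixed


def iterative_parameter_mixing(shards, iterations):
    """Alternate per-shard training and mixing for a fixed number of iterations."""
    weights = {}
    for _ in range(iterations):
        models = [perceptron_epoch(weights, shard) for shard in shards]
        weights = mix(models)
    return weights
```

On a cluster, the list comprehension over shards corresponds to the mapper jobs and `mix` to the reducer. A fixed iteration count is used here only to keep the sketch short.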

You must first have a Hadoop account configured. To train, you will need the following:

Training data that is locally accessible to the script.
A HadoopFS (HDFS) directory with enough space to store the input training data, the intermediate models and the final model.

The program will attempt to locate the Hadoop binary and the Hadoop streaming library. If this fails, you can specify these via command-line parameters (--hadooproot and --streamingloc).
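The streaming-library lookup can be sketched as follows. The `contrib/streaming` subdirectory and the `hadoop-streaming*.jar` pattern come from the `tmppath` and `streamingjar` variables listed above; the function name `find_streaming_jar`, the choice of the first sorted match, and the `None` return value are assumptions made for this illustration.

```python
import glob
import os


def find_streaming_jar(hadooproot, streamingloc=""):
    """Return a path to the Hadoop streaming jar, or None if none is found.

    An explicit streamingloc (as given with --streamingloc) takes precedence.
    """
    if streamingloc:
        return streamingloc
    tmppath = os.path.join(hadooproot, "contrib", "streaming")
    matches = sorted(glob.glob(os.path.join(tmppath, "hadoop-streaming*.jar")))
    return matches[0] if matches else None
```

A caller would typically stop with an error message asking for --streamingloc when the function returns `None`.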

Usage: hadoop-run.py --input InputData --hdfsinputdir HDFSInDir \
         --hdfsoutputdir HDFSOutDir --outputdir OutputDir

InputData - A comma-separated list of file globs containing the training data. These must be accessible by the script.
OutputDir - The local directory where the trained model(s) are written. The default model name is 'model'. You can change this using the --modelname command-line parameter.
HDFSInDir - A directory on HDFS where the input data will be copied to.
HDFSOutDir - A directory on HDFS where the temporary data and output data will be written to. The final models are copied to the locally-accessible OutputDir.
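The options described above could be declared with `optparse`, which the script uses according to the `optParse` variable. The sketch below is an assumption-laden illustration: the option names and the 'model' default come from this page, but the help strings, the `build_parser` and `collect_filenames` helpers, and the handling of the comma-separated globs are hypothetical and may differ from the real script.

```python
import glob
from optparse import OptionParser


def build_parser():
    """Build a parser for the options named on this page (help text is illustrative)."""
    parser = OptionParser()
    parser.add_option("--input", dest="input", default="",
                      help="Comma-separated list of file globs with training data.")
    parser.add_option("--hdfsinputdir", dest="hdfsinputdir", default="",
                      help="HDFS directory that receives the input data.")
    parser.add_option("--hdfsoutputdir", dest="hdfsoutputdir", default="",
                      help="HDFS directory for temporary and output data.")
    parser.add_option("--outputdir", dest="outputdir", default="",
                      help="Local directory for the trained model(s).")
    parser.add_option("--modelname", dest="modelname", default="model",
                      help="Name of the output model.")
    parser.add_option("--hadooproot", dest="hadooproot", default="",
                      help="Location of the Hadoop installation.")
    parser.add_option("--streamingloc", dest="streamingloc", default="",
                      help="Location of the Hadoop streaming jar.")
    return parser


def collect_filenames(input_spec):
    """Expand a comma-separated list of globs into a list of existing files."""
    filenames = []
    for pattern in input_spec.split(","):
        if pattern:
            filenames.extend(sorted(glob.glob(pattern)))
    return filenames
```

For example, `build_parser().parse_args(["--input", "data/*.gz", "--outputdir", "out"])` returns an `(options, args)` pair in which `options.modelname` keeps its default of `"model"`.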

Check input command line options.

Author: kbhal.nosp@m.l@go.nosp@m.ogle..nosp@m.com (Keith Hall)

Definition in file hadoop-run.py.
