Reranker Framework (ReFr)
Reranking framework for structure prediction and discriminative language modeling
run-model.C
1 // Copyright 2012, Google Inc.
2 // All rights reserved.
3 //
4 // Redistribution and use in source and binary forms, with or without
5 // modification, are permitted provided that the following conditions are
6 // met:
7 //
8 // * Redistributions of source code must retain the above copyright
9 // notice, this list of conditions and the following disclaimer.
10 // * Redistributions in binary form must reproduce the above
11 // copyright notice, this list of conditions and the following disclaimer
12 // in the documentation and/or other materials provided with the
13 // distribution.
14 // * Neither the name of Google Inc. nor the names of its
15 // contributors may be used to endorse or promote products derived from
16 // this software without specific prior written permission.
17 //
18 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
19 // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
20 // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
21 // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
22 // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
23 // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
24 // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
25 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
26 // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
27 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
28 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
29 // -----------------------------------------------------------------------------
30 //
31 //
40 
41 #include <iostream>
42 #include <fstream>
43 #include <string>
44 #include <cstdlib>
45 #include <memory>
46 #include <vector>
47 
48 #include "candidate.H"
49 #include "candidate-set.H"
50 #include "candidate-set-reader.H"
51 #include "candidate-set-writer.H"
53 #include "interpreter.H"
54 #include "model.H"
55 #include "model-merge-reducer.H"
56 #include "model-reader.H"
57 #include "model-proto-writer.H"
58 #include "perceptron-model.H"
59 #include "symbol-table.H"
60 
61 #define DEBUG 0
62 
63 #define PROG_NAME "run-model"
64 
65 #define DEFAULT_MAX_EXAMPLES -1
66 #define DEFAULT_MAX_CANDIDATES -1
67 #define DEFAULT_MODEL_CONFIG "PerceptronModel(name(\"MyPerceptronModel\"))"
68 #define DEFAULT_REPORTING_INTERVAL 1000
69 #define DEFAULT_COMPACTIFY_INTERVAL 10000
70 #define DEFAULT_USE_WEIGHTED_LOSS true
71 
72 // We use two levels of macros to get the string version of an int constant.
75 #define XSTR(arg) STR(arg)
76 #define STR(arg) #arg
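// For instance, XSTR(DEFAULT_MAX_EXAMPLES) expands DEFAULT_MAX_EXAMPLES to -1
// before stringizing, yielding "-1", whereas applying STR directly would yield
// the literal text "DEFAULT_MAX_EXAMPLES".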
78 
79 using namespace std;
80 using namespace reranker;
81 
82 const char *usage_msg[] = {
83  "Usage:\n",
84  PROG_NAME " --config <master config file>\n",
85  "\t-m|--model-file <model file> [--model-config <model config>]\n"
86  "\t[-t|--train <training input file>+ [-i <input model file>] [--mapper] ]\n",
87  "\t-d|--devtest <devtest input file>+\n",
88  "\t[-o|--output <candidate set output file>]\n",
89  "\t[-h <hyp output file>] [--scores <score output file>]\n",
90  "\t[--train-config <training feature extractor config file>]\n",
91  "\t[--dev-config <devtest feature extractor config file>]\n",
92  "\t[--compactify-feature-uids]\n",
93  "\t[-s|--streaming [--compactify-interval <interval>] ] [-u]\n",
94  "\t[--no-base64]\n",
95  "\t[--min-epochs <min epochs>] [--max-epochs <max epochs>]\n",
96  "\t[--max-examples <max num examples>]\n",
97  "\t[--max-candidates <max num candidates>]\n",
98  "\t[-r <reporting interval>] [ --use-weighted-loss[=][true|false] ]\n",
99  "where\n",
100  "\t<master config file> is a file in the interpreted factory language\n",
101  "\t\tcapable of specifying all options to this executable (see\n",
102  "\t\tconfig/default.infact for an example of the default options)\n",
103  "\t<model file> is the name of the file to which to write out a\n",
104  "\t\tnewly-trained model when training (one or more\n",
105  "\t\t<training input file>'s specified), or the name of a file\n",
106  "\t\tfrom which to load a serialized model when decoding\n",
107  "\t<input model file> is an optional input model file as a starting\n",
108  "\t\tmodel when training\n",
109  "\t<model config> is the optional configuration string for constructing\n",
110  "\t\ta new Model instance\n",
111  "\t\t(defaults to \"" DEFAULT_MODEL_CONFIG "\")\n",
112  "\t<training input file> is the name of a stream of serialized\n",
113  "\t\tCandidateSet instances, or \"-\" for input from standard input\n",
114  "\t--mapper specifies to train a single epoch and output features to\n",
115  "\t\tstandard output\n",
116  "\t<devtest input file> is the name of a stream of serialized\n",
117  "\t\tCandidateSet instances, or \"-\" for input from standard input\n",
118  "\t\t(required unless training in mapper mode)\n",
119  "\t<candidate set output file> is the name of the file to which to output\n",
120  "\t\tcandidate sets that have been scored by the model (in\n",
121  "\t\tdecoding mode)\n",
122  "\t<training feature extractor config file> is the name of a configuration\n",
123  "\t\tfile to be read by the ExecutiveFeatureExtractor instance\n"
124  "\t\textracting features on training examples\n",
125  "\t<devtest feature extractor config file> is the name of a configuration\n",
126  "\t\tfile to be read by the ExecutiveFeatureExtractor instance\n",
127  "\t\textracting features on devtest examples\n",
128  "\t--compactify-feature-uids specifies to re-map all feature uids to the\n",
129  "\t\t[0,n-1] interval, where n is the number of non-zero features\n",
130  "\t--streaming specifies to train in streaming mode (i.e., do not\n",
131  "\t\tread in all training instances into memory)\n",
132  "\t--compactify-interval specifies the interval after which to compactify\n",
133  "\t\tfeature uid's and remove unused symbols (only available when\n",
134  "\t\ttraining in streaming mode; defaults to "
136  "\t-u specifies that the input files are uncompressed\n",
137  "\t--no-base64 specifies not to use base64 encoding/decoding\n",
138  "\t--max-examples specifies the maximum number of examples to read from\n",
139  "\t\tany input file (defaults to " XSTR(DEFAULT_MAX_EXAMPLES) ")\n",
140  "\t--max-candidates specifies the maximum number of candidates to read\n",
141  "\t\tfor any candidate set (defaults to " XSTR(DEFAULT_MAX_CANDIDATES) ")\n",
142  "\t-r specifies the interval at which the CandidateSetReader reports how\n",
143  "\t\tmany candidate sets it has read (defaults to "
145  "\t--use-weighted-loss specifies whether to weight losses on devtest\n",
146  "\t\texamples by the number of tokens in the reference, where, e.g.,\n",
147  "\t\tweighted loss is appropriate for computing WER, but not BLEU\n",
148  "\t\t(defaults to " XSTR(DEFAULT_USE_WEIGHTED_LOSS) ")\n"
149 };
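// To illustrate how the options above combine (file names here are purely
// illustrative), a typical in-memory training run might look like
//   run-model -m my.model -t train.gz -d dev.gz --max-epochs 20
// while decoding with a previously trained model and writing scored candidate
// sets might look like
//   run-model -m my.model -d dev.gz -o dev.scored.gz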
150 
153 void usage() {
154  int usage_msg_len = sizeof(usage_msg)/sizeof(const char *);
155  for (int i = 0; i < usage_msg_len; ++i) {
156  cout << usage_msg[i];
157  }
158  cout.flush();
159 }
160 
161 bool check_for_required_arg(int argc, int i, string err_msg) {
162  if (i + 1 >= argc) {
163  cerr << PROG_NAME << ": error: " << err_msg << endl;
164  usage();
165  return false;
166  } else {
167  return true;
168  }
169 }
170 
171 void read_and_extract_features(const vector<string> &files,
172  CandidateSetReader &csr,
173  bool compressed,
174  bool use_base64,
175  shared_ptr<ExecutiveFeatureExtractor> efe,
176  vector<shared_ptr<CandidateSet> > &examples) {
177  bool reset_counters = true;
178  for (vector<string>::const_iterator file_it = files.begin();
179  file_it != files.end();
180  ++file_it) {
181  csr.Read(*file_it, compressed, use_base64, reset_counters, examples);
182  }
183  if (efe.get() != NULL) {
184  // Extract features for CandidateSet instances in situ.
185  for (vector<shared_ptr<CandidateSet> >::iterator it = examples.begin();
186  it != examples.end();
187  ++it) {
188  efe->Extract(*(*it));
189  }
190  }
191 }
192 
193 int
194 main(int argc, char **argv) {
195  // Master configuration file.
196  string master_config_file;
197  // Required parameters.
198  string model_file;
199  string input_model_file;
200  string model_config = DEFAULT_MODEL_CONFIG;
201  vector<string> training_files;
202  vector<string> devtest_files;
203  bool mapper_mode = false;
204  string output_file;
205  string hyp_output_file;
206  string score_output_file;
207  string training_feature_extractor_config_file;
208  string devtest_feature_extractor_config_file;
209  bool compressed = true;
210  bool use_base64 = true;
211  bool streaming = false;
212  bool use_weighted_loss = DEFAULT_USE_WEIGHTED_LOSS;
213  string use_weighted_loss_arg_prefix = "--use-weighted-loss";
214  size_t use_weighted_loss_arg_prefix_len =
215  use_weighted_loss_arg_prefix.length();
216  bool compactify_feature_uids = false;
217  int compactify_interval = DEFAULT_COMPACTIFY_INTERVAL;
218  int min_epochs = -1;
219  int max_epochs = -1;
220  int max_examples = DEFAULT_MAX_EXAMPLES;
221  int max_candidates = DEFAULT_MAX_CANDIDATES;
222  int reporting_interval = DEFAULT_REPORTING_INTERVAL;
223 
224  shared_ptr<Model> model;
225  shared_ptr<ExecutiveFeatureExtractor> training_efe;
226  shared_ptr<ExecutiveFeatureExtractor> devtest_efe;
227 
228  // Preprocess options, looking for the --config option which specifies
229  // a master configuration file. This file should be used before
230  // any other command line options, which may be used to override anything
231  // set in the master configuration file.
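  // For example, given "run-model --config config/default.infact -m my.model",
  // the assignments in config/default.infact are evaluated first and the -m
  // option then overrides any model_file value set there (my.model is an
  // illustrative name).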
232  int master_config_arg_idx = -1;
233  for (int i = 1; i < argc; ++i) {
234  string arg = argv[i];
235  if (arg == "--config") {
236  string err_msg =
237  string("no master configuration file specified with ") + arg;
238  if (!check_for_required_arg(argc, i, err_msg)) {
239  return -1;
240  }
241  master_config_file = argv[++i];
242  master_config_arg_idx = i - 1;
243  }
244  }
245  if (master_config_file != "") {
246  Interpreter i;
247  cerr << "Reading options from \"" << master_config_file << "\"." << endl;
248  i.Eval(master_config_file);
249  // Now, grab all variables that could be set and assign them to local
250  // variables. Note that the Interpreter::Get method only assigns a value
251  // to its second argument if it exists in the interpreter's environment.
252  i.Get("model_file", &model_file);
253  i.Get("model", &model);
254  i.Get("mapper_mode", &mapper_mode);
255  i.Get("training_files", &training_files);
256  i.Get("devtest_files", &devtest_files);
257  i.Get("output_file", &output_file);
258  i.Get("hyp_output_file", &hyp_output_file);
259  i.Get("training_efe", &training_efe);
260  i.Get("devtest_efe", &devtest_efe);
261  i.Get("compactify_feature_uids", &compactify_feature_uids);
262  i.Get("compactify_interval", &compactify_interval);
263  i.Get("streaming", &streaming);
264  i.Get("compressed", &compressed);
265  i.Get("use_base64", &use_base64);
266  i.Get("min_epochs", &min_epochs);
267  i.Get("max_epochs", &max_epochs);
268  i.Get("max_examples", &max_examples);
269  i.Get("max_candidates", &max_candidates);
270  i.Get("reporting_interval", &reporting_interval);
271  i.Get("use_weighted_loss", &use_weighted_loss);
272  }
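  // As a rough sketch only (the exact syntax is defined by the Interpreter and
  // by config/default.infact, so treat this as an assumption rather than a
  // specification), a master configuration file might assign the variables
  // read above, e.g.:
  //   model_file = "my.model";
  //   model = PerceptronModel(name("MyPerceptronModel"));
  //   use_weighted_loss = false;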
273 
274  // Process options. The majority of code in this file is devoted to this.
275  for (int i = 1; i < argc; ++i) {
276  if (i == master_config_arg_idx) {
277  ++i;
278  continue;
279  }
280  string arg = argv[i];
281  if (arg == "-m" || arg == "-model" || arg == "--model") {
282  string err_msg = string("no model file specified with ") + arg;
283  if (!check_for_required_arg(argc, i, err_msg)) {
284  return -1;
285  }
286  model_file = argv[++i];
287  } else if (arg == "-i" || arg == "--i") {
288  string err_msg = string("no input model file specified with ") + arg;
289  if (!check_for_required_arg(argc, i, err_msg)) {
290  return -1;
291  }
292  input_model_file = argv[++i];
293  } else if (arg == "-model-config" || arg == "--model-config") {
294  string err_msg =
295  string("no model configuration string specified with ") + arg;
296  if (!check_for_required_arg(argc, i, err_msg)) {
297  return -1;
298  }
299  model_config = argv[++i];
300  } else if (arg == "-t" || arg == "-train" || arg == "--train") {
301  string err_msg = string("no input files specified with ") + arg;
302  if (!check_for_required_arg(argc, i, err_msg)) {
303  return -1;
304  }
305  // Keep reading args until next option or until no more args.
306  ++i;
307  for ( ; i < argc; ++i) {
308  if (argv[i][0] == '-' && strlen(argv[i]) > 1) {
309  --i;
310  break;
311  }
312  training_files.push_back(argv[i]);
313  }
314  } else if (arg == "-mapper" || arg == "--mapper") {
315  mapper_mode = true;
316  } else if (arg == "-d" || arg == "-devtest" || arg == "--devtest") {
317  string err_msg = string("no input files specified with ") + arg;
318  if (!check_for_required_arg(argc, i, err_msg)) {
319  return -1;
320  }
321  // Keep reading args until next option or until no more args.
322  ++i;
323  for ( ; i < argc; ++i) {
324  if (argv[i][0] == '-' && strlen(argv[i]) > 1) {
325  --i;
326  break;
327  }
328  devtest_files.push_back(argv[i]);
329  }
330  } else if (arg == "-o" || arg == "-output" || arg == "--output") {
331  string err_msg = string("no output file specified with ") + arg;
332  if (!check_for_required_arg(argc, i, err_msg)) {
333  return -1;
334  }
335  output_file = argv[++i];
336  } else if (arg == "-h") {
337  string err_msg =
338  string("no hypothesis output file specified with ") + arg;
339  if (!check_for_required_arg(argc, i, err_msg)) {
340  return -1;
341  }
342  hyp_output_file = argv[++i];
343  } else if (arg == "-scores" || arg == "--scores") {
344  string err_msg =
345  string("no score output file specified with ") + arg;
346  if (!check_for_required_arg(argc, i, err_msg)) {
347  return -1;
348  }
349  score_output_file = argv[++i];
350  } else if (arg == "-train-config" || arg == "--train-config") {
351  string err_msg =
352  string("no feature extractor config file specified with ") + arg;
353  if (!check_for_required_arg(argc, i, err_msg)) {
354  return -1;
355  }
356  training_feature_extractor_config_file = argv[++i];
357  } else if (arg == "-dev-config" || arg == "--dev-config") {
358  string err_msg =
359  string("no feature extractor config file specified with ") + arg;
360  if (!check_for_required_arg(argc, i, err_msg)) {
361  return -1;
362  }
363  devtest_feature_extractor_config_file = argv[++i];
364  } else if (arg == "-compactify-feature-uids" ||
365  arg == "--compactify-feature-uids") {
366  compactify_feature_uids = true;
367  } else if (arg == "-s" || arg == "-streaming" || arg == "--streaming") {
368  streaming = true;
369  } else if (arg == "--compactify-interval") {
370  string err_msg = string("no interval specified with ") + arg;
371  if (!check_for_required_arg(argc, i, err_msg)) {
372  return -1;
373  }
374  compactify_interval = atoi(argv[++i]);
375  } else if (arg == "-u") {
376  compressed = false;
377  } else if (arg == "--no-base64") {
378  use_base64 = false;
379  } else if (arg == "-min-epochs" || arg == "--min-epochs") {
380  string err_msg = string("no arg specified with ") + arg;
381  if (!check_for_required_arg(argc, i, err_msg)) {
382  return -1;
383  }
384  min_epochs = atoi(argv[++i]);
385  } else if (arg == "-max-epochs" || arg == "--max-epochs") {
386  string err_msg = string("no arg specified with ") + arg;
387  if (!check_for_required_arg(argc, i, err_msg)) {
388  return -1;
389  }
390  max_epochs = atoi(argv[++i]);
391  } else if (arg == "-max-examples" || arg == "--max-examples") {
392  string err_msg = string("no arg specified with ") + arg;
393  if (!check_for_required_arg(argc, i, err_msg)) {
394  return -1;
395  }
396  max_examples = atoi(argv[++i]);
397  } else if (arg == "-max-candidates" || arg == "--max-candidates") {
398  string err_msg = string("no arg specified with ") + arg;
399  if (!check_for_required_arg(argc, i, err_msg)) {
400  return -1;
401  }
402  max_candidates = atoi(argv[++i]);
403  } else if (arg == "-r") {
404  string err_msg = string("no arg specified with ") + arg;
405  if (!check_for_required_arg(argc, i, err_msg)) {
406  return -1;
407  }
408  reporting_interval = atoi(argv[++i]);
409  } else if (arg.substr(0, use_weighted_loss_arg_prefix_len) ==
410  use_weighted_loss_arg_prefix) {
411  string use_weighted_loss_str;
412  if (arg.length() > use_weighted_loss_arg_prefix_len &&
413  arg[use_weighted_loss_arg_prefix_len] == '=') {
414  use_weighted_loss_str =
415  arg.substr(use_weighted_loss_arg_prefix_len + 1);
416  } else {
417  string err_msg =
418  string("no \"true\" or \"false\" arg specified with ") + arg;
419  if (!check_for_required_arg(argc, i, err_msg)) {
420  return -1;
421  }
422  use_weighted_loss_str = argv[++i];
423  }
424  if (use_weighted_loss_str != "true" &&
425  use_weighted_loss_str != "false") {
426  cerr << PROG_NAME << ": error: must specify \"true\" or \"false\""
427  << " with --use-weighted-loss" << endl;
428  usage();
429  return -1;
430  }
431  if (use_weighted_loss_str != "true") {
432  use_weighted_loss = false;
433  }
434  } else if (arg.size() > 0 && arg[0] == '-') {
435  cerr << PROG_NAME << ": error: unrecognized option: " << arg << endl;
436  usage();
437  return -1;
438  }
439  }
440 
441  bool training = training_files.size() > 0;
442 
443  // Check that user specified required args.
444  if (model_file == "") {
445  cerr << PROG_NAME << ": error: must specify model file" << endl;
446  usage();
447  return -1;
448  }
449  if (!mapper_mode && devtest_files.size() == 0) {
450  cerr << PROG_NAME << ": error: must specify devtest input files when "
451  << "not in mapper mode" << endl;
452  usage();
453  return -1;
454  }
455  if (output_file != "" && training) {
456  cerr << PROG_NAME << ": error: cannot specify output file when training"
457  << endl;
458  usage();
459  return -1;
460  }
461  if (hyp_output_file != "" && training) {
462  cerr << PROG_NAME
463  << ": error: cannot specify hypothesis output file when training"
464  << endl;
465  usage();
466  return -1;
467  }
468  bool reading_from_stdin = false;
469  for (vector<string>::const_iterator training_file_it = training_files.begin();
470  training_file_it != training_files.end();
471  ++training_file_it) {
472  if (*training_file_it == "-") {
473  reading_from_stdin = true;
474  break;
475  }
476  }
477  if (training_files.size() > 1 && reading_from_stdin) {
478  cerr << PROG_NAME << ": error: cannot read from standard input and "
479  << "specify other training files" << endl;
480  usage();
481  return -1;
482  }
483  if (!training && input_model_file != "") {
484  cerr << PROG_NAME << ": error: can only specify <input model file> "
485  << "when in training mode" << endl;
486  usage();
487  return -1;
488  }
489 
490  // Now, we finally get to the meat of the code for this executable.
491  if (training_feature_extractor_config_file != "") {
492  training_efe = ExecutiveFeatureExtractor::InitFromSpec(
493  training_feature_extractor_config_file);
494  }
495  if (devtest_feature_extractor_config_file != "") {
496  devtest_efe = ExecutiveFeatureExtractor::InitFromSpec(
497  devtest_feature_extractor_config_file);
498  }
499 
500  CandidateSetReader csr(max_examples, max_candidates, reporting_interval);
501  csr.set_verbosity(1);
502 
503  Factory<Model> model_factory;
504 
505  if (!training || input_model_file != "") {
506  // We're here because we're not training, or else we are training and
507  // the user specified an input model file.
508  string model_file_to_load = training ? input_model_file : model_file;
509 
510  ModelReader model_reader(1);
511  model = model_reader.Read(model_file_to_load, compressed, use_base64);
512  } else {
513  // First, see if model_config is the name of a file.
514  ifstream model_config_is(model_config.c_str());
515  if (model_config_is) {
516  cerr << "Reading model config from file \"" << model_config << "\"."
517  << endl;
518  }
519 
520  StreamTokenizer *st = model_config_is.good() ?
521  new StreamTokenizer(model_config_is) :
522  new StreamTokenizer(model_config);
523  model = model_factory.CreateOrDie(*st);
524  delete st;
525  }
526  if (model.get() == NULL) {
527  return -1;
528  }
529 
530  Factory<ModelProtoWriter> proto_writer_factory;
531  shared_ptr<ModelProtoWriter> model_writer =
532  proto_writer_factory.CreateOrDie(model->proto_writer_spec(),
533  "model proto writer");
534  if (model_writer.get() == NULL) {
535  return -1;
536  }
537 
538  if (!mapper_mode) {
539  model->set_end_of_epoch_hook(new EndOfEpochModelWriter(model_file,
540  model_writer,
541  compressed,
542  use_base64));
543  }
544  model->set_use_weighted_loss(use_weighted_loss);
545  model->set_min_epochs(min_epochs);
546  model->set_max_epochs(max_epochs);
547 
548  vector<shared_ptr<CandidateSet> > training_examples;
549  vector<shared_ptr<CandidateSet> > devtest_examples;
550  if (!streaming && !mapper_mode) {
551  cerr << "Loading devtest examples." << endl;
552  read_and_extract_features(devtest_files, csr, compressed, use_base64,
553  devtest_efe, devtest_examples);
554  if (devtest_examples.size() == 0) {
555  cerr << "Could not read any devtest examples. Exiting." << endl;
556  return -1;
557  }
558  }
559 
560  typedef CollectionCandidateSetIterator<vector<shared_ptr<CandidateSet> > >
561      CandidateSetVectorIt;
562 
563  CandidateSetIterator *training_it;
564  CandidateSetIterator *devtest_it;
565 
566  if (training_files.size() > 0) {
567  cerr << "Training." << endl;
568  if (streaming) {
569  training_it = new MultiFileCandidateSetIterator(training_files,
570  training_efe,
571  max_examples,
572  max_candidates,
573  reporting_interval,
574  1,
575  compressed, use_base64);
576  devtest_it = new MultiFileCandidateSetIterator(devtest_files,
577  devtest_efe,
578  max_examples,
579  max_candidates,
580  reporting_interval,
581  1,
582  compressed, use_base64);
583  // TODO(dbikel): Make sure to add setter method to Model and
584  // PerceptronModel to tell model to invoke its
585  // CompactifyFeatureUids method after a specified
586  // interval. This new setter method should only
587  // be invoked here, when in streaming mode.
588  } else {
589  // Regular, in-memory, non-streaming training.
590  read_and_extract_features(training_files, csr, compressed, use_base64,
591  training_efe, training_examples);
592  if (training_examples.size() == 0) {
593  cerr << "Could not read any training examples from training files."
594  << " Exiting." << endl;
595  return -1;
596  }
597  csr.ClearStrings();
598 
599  training_it = new CandidateSetVectorIt(training_examples);
600  devtest_it = new CandidateSetVectorIt(devtest_examples);
601  }
602 
603  if (mapper_mode) {
604  // In mapper mode, train a single epoch, then write out features
605  // to stdout, and serialize model.
606  model->NewEpoch();
607  model->TrainOneEpoch(*training_it);
608  } else {
609  model->Train(*training_it, *devtest_it);
610  delete training_it;
611  delete devtest_it;
612  }
613 
614  if (compactify_feature_uids) {
615  cerr << "Compactifying feature uid's...";
616  cerr.flush();
617  model->CompactifyFeatureUids();
618  cerr << "done." << endl;
619  }
620 
621  // Serialize model.
622  cerr << "Writing out model to file \"" << model_file << "\"...";
623  cerr.flush();
624  confusion_learning::ModelMessage model_message;
625  model_writer->Write(model.get(), &model_message, false);
626 
627  ConfusionProtoIO* proto_writer;
628  if (mapper_mode) {
629  cerr << "Writing ModelMessage (without features) and FeatureMessage "
630  << "instances to standard output." << endl;
631  proto_writer = new ConfusionProtoIO(model_file, ConfusionProtoIO::WRITESTD,
632  false, use_base64);
633  cout << ModelInfoReducer::kModelMessageFeatureName << "\t";
634  } else {
635  proto_writer = new ConfusionProtoIO(model_file, ConfusionProtoIO::WRITE,
636  compressed, use_base64);
637  }
638  proto_writer->Write(model_message);
639  // Write out features.
640  bool output_best_epoch = !mapper_mode;
641  bool output_key = mapper_mode;
642  model_writer->WriteFeatures(model.get(),
643  *(proto_writer->outputstream()),
644  output_best_epoch,
645  model->num_training_errors(),
646  output_key);
647  delete proto_writer;
648  cerr << "done." << endl;
649  } else {
650  CandidateSetVectorIt devtest_examples_it(devtest_examples);
651  model->NewEpoch(); // sets epoch to 0
652  model->Evaluate(devtest_examples_it);
653 
654  if (output_file != "") {
655  CandidateSetWriter csw;
656  csw.set_verbosity(1);
657  csw.Write(devtest_examples, output_file, compressed, use_base64);
658  }
659  bool output_hyps = hyp_output_file != "";
660  bool output_scores = score_output_file != "";
661  if (output_hyps || output_scores) {
662  ofstream hyp_os(hyp_output_file.c_str());
663  ofstream score_os(score_output_file.c_str());
664  devtest_examples_it.Reset();
665  while (devtest_examples_it.HasNext()) {
666  CandidateSet &candidate_set = devtest_examples_it.Next();
667  if (output_hyps) {
668  hyp_os << candidate_set.GetBestScoring().raw_data() << "\n";
669  }
670  if (output_scores) {
671  for (CandidateSet::const_iterator cand_it = candidate_set.begin();
672  cand_it != candidate_set.end();
673  ++cand_it) {
674  score_os << (*cand_it)->score() << "\n";
675  }
676  }
677  }
678  if (output_hyps) {
679  hyp_os.flush();
680  }
681  if (output_scores) {
682  score_os.flush();
683  }
684  }
685  }
686  TearDown();
687  google::protobuf::ShutdownProtobufLibrary();
688 }
689 