Welcome to the Reranker Framework!

This package provides ways to train and use discriminative models in a reranking framework. There is some special handling for building discriminative language models.

Building and installation

Quick start

To build and install, run the following command sequence:

./configure ; make ; make install

Detailed instructions

Requirements:

autoconf 2.68 or higher
automake 1.11 or higher

Additional requirements are checked by the supplied configure script; they comprise:

a recent version of Python (v2.4 or higher)
Google Protocol Buffers
pkg-config (a requirement of Google Protocol Buffers)

Please make sure you have at least the preceding three packages installed before building ReFr.

To build the Reranker Framework package, you must first run the supplied configure script. Please run

./configure --help

to see common options. In particular, you can use the --prefix option to specify the installation directory, which defaults to /usr/local/.

After running ./configure with any desired options, you can build the entire package by simply issuing the make command:

make

Installation of the package is completed by running

make install

Finally, there are a number of additional make targets supplied “for free” with the GNU autoconf build system, the most useful of which are

make clean

which cleans the build directory, and

make distclean

which cleans everything, including files auto-generated by the configure script.

What’s in the installation directory

Executables are in the bin subdirectory, and a library is built in the lib subdirectory. There are many unit test executables, but the two “real” binaries you’ll care about are:

bin/extract-features

and

bin/run-model

Extracting features

When dealing with feature extraction, you can work in an off-line mode, where you read in candidate hypotheses, extract features for those hypotheses and then write those back out to a file. You can also work in an on-line mode, where you train a model by reading in sets of candidates (hereafter referred to as “candidate sets”) where features for each candidate in each set are extracted “on the fly”. Finally, you can mix these two ways of working, as we’ll see below.

I/O

The Reranker framework uses protocol buffers for all low-level I/O. (See http://code.google.com/p/protobuf/ for more information.) In short, protocol buffers provide a way to serialize and de-serialize things that look a lot like C structs. You specify a protocol buffer in a format very familiar to C/C++/Java programmers. The protocol buffer definitions for the Reranker framework are all specified in two files:

src/proto/data.proto

and

src/proto/model.proto

While you might be interested in perusing these files for your own edification, the Reranker framework has reader and writer classes that abstract away from this low-level representation.

Creating protocol buffer files

You do need to get some candidate sets into this protocol buffer format to begin working with the Reranker framework, and so to bootstrap the process, you can use the executables in the src/dataconvert directory (see the README file in that directory for example usage). The asr_nbest_proto executable can read sets of files in Brian Roark’s format, and the mt_nbest_proto can read files in Philipp Koehn’s format.

What if you have files that are not in either of those two formats? The answer is that you can easily construct your own CandidateSet instances in memory and write them out to file, serialized using their protocol buffer equivalent, CandidateSetMessage. The only requirements are that each CandidateSet needs to have a reference string, and each Candidate needs to have a baseline score, a loss value, a string consisting of its tokens and the number of its tokens. Here are the two methods:

Method 1: Batch

This method creates a sequence of CandidateSet instances in memory, pushing each into an STL std::vector, and then writes that vector out to disk. Below is a rough idea of what your code would look like. The following invents methods/functions for grabbing the data for each new CandidateSet and Candidate instance, and assumes you want to output to the file named by the variable filename.

#include <vector>
#include <memory>
#include <string>
#include "candidate-set.H"
#include "candidate-set-writer.H"
...
using std::vector;
using std::shared_ptr;
using std::string;
...
vector<shared_ptr<CandidateSet> > candidate_sets;
while (there_are_more_candidate_sets()) {
   string reference = get_candidate_set_reference_string();
   shared_ptr<CandidateSet> candidate_set(new CandidateSet());
   // Attach the reference string to candidate_set here (call omitted in this sketch).
   for (int i = 0; i < number_of_candidates; ++i) {
     // Assemble the data for current Candidate, build it and add it.
     double loss = get_curr_candidate_loss();
     double baseline_score = get_curr_candidate_baseline_score();
     int num_tokens = get_curr_candidate_num_tokens();
     string raw_data = get_curr_candidate_string();
     shared_ptr<Candidate> candidate(new Candidate(i, loss, baseline_score,
                                                   num_tokens, raw_data));
     candidate_set->AddCandidate(candidate);
   }
   candidate_sets.push_back(candidate_set);
}
// Finally, write out entire vector of CandidateSet instances.
CandidateSetWriter candidate_set_writer;
bool compressed = true;
bool use_base64 = true;
candidate_set_writer.Write(candidate_sets, filename, compressed, use_base64);

Method 2: Serial

This method is nearly identical to Method 1, but does not try to assemble all CandidateSet instances into a single std::vector before writing them out to disk. Instead, each CandidateSet is serialized as soon as it has been constructed.

#include <memory>
#include <string>
#include "candidate-set.H"
#include "candidate-set-writer.H"
...
using std::shared_ptr;
using std::string;
...
// Set up CandidateSetWriter to begin serial writing to file.
bool compressed = true;
bool use_base64 = true;
CandidateSetWriter candidate_set_writer;
candidate_set_writer.Open(filename, compressed, use_base64);
while (there_are_more_candidate_sets()) {
   string reference = get_candidate_set_reference_string();
   CandidateSet candidate_set;
   // Attach the reference string to candidate_set here (call omitted in this sketch).
   for (int i = 0; i < number_of_candidates; ++i) {
     // Assemble the data for current Candidate, build it and add it.
     double loss = get_curr_candidate_loss();
     double baseline_score = get_curr_candidate_baseline_score();
     int num_tokens = get_curr_candidate_num_tokens();
     string raw_data = get_curr_candidate_string();
     shared_ptr<Candidate> candidate(new Candidate(i, loss, baseline_score,
                                                   num_tokens, raw_data));
     candidate_set.AddCandidate(candidate);
   }
   // Serialize this newly constructed CandidateSet to file.
   candidate_set_writer.WriteNext(candidate_set);
}
candidate_set_writer.Close();

There’s a third, secret method for reading in candidate sets from arbitrary formats. You can build an implementation of the rather simple CandidateSetIterator interface, which is what the Model interface uses to iterate over a sequence of candidate sets during training or decoding. With this approach, your data never gets stored as protocol buffer messages. Given the utility of storing information in protocol buffers, however, we strongly advise against using this method.

Classes

If you want to extract features, there are just four classes in the Reranker framework you’ll want to know about:

Class name	Brief description
Candidate	Describes a candidate hypothesis put forth by a baseline model for some problem instance (e.g., a sentence in the case of MT or an utterance in the case of speech recognition).
CandidateSet	A set of candidate hypotheses for a single problem instance.
FeatureVector	A mapping from feature uids (either strings or ints) to their values (doubles).
FeatureExtractor	An interface/abstract base class that you will extend to write your own feature extractors.

Building a FeatureExtractor

To build your own FeatureExtractor, follow these steps:

1. Create a class that derives from FeatureExtractor.
2. (Optional) Override the FeatureExtractor::RegisterInitializers method in case your FeatureExtractor needs to set certain of its data members when constructed by a Factory. Also, one may override the FeatureExtractor::Init method if the feature extractor requires more object initialization after its data members have been initialized. See Appendix: Dynamic object instantiation for more information about how various objects are constructed via Factory instances.
3. Register your FeatureExtractor using the REGISTER_FEATURE_EXTRACTOR macro. This is also required to be able to construct your FeatureExtractor by the Factory class.
4. Implement either the FeatureExtractor::Extract or the FeatureExtractor::ExtractSymbolic method. See below for more information about implementing these methods.

See example-feature-extractor.H for a fully-functional (albeit boring) FeatureExtractor implementation. Please note that normally, one would use the REGISTER_FEATURE_EXTRACTOR macro in one’s feature extractor’s .C file, but for the ExampleFeatureExtractor this is done in the .H to keep things simple.

As mentioned in Step 4 above, in most cases you’ll want to provide an implementation for either the FeatureExtractor::Extract or the FeatureExtractor::ExtractSymbolic method, but not both. In fact, you’ll most likely just want to implement ExtractSymbolic, which allows you to extract features that are strings that map to double values. (Since both methods are pure virtual in the FeatureExtractor definition, you’ll have to implement both, but either, or both, can be implemented to do nothing.)

When implementing the FeatureExtractor::ExtractSymbolic method, you will normally modify just the second parameter, which is a reference to a FeatureVector<string, double>. You’ll typically modify it using the FeatureVector::IncrementWeight method. (There’s also a FeatureVector::SetWeight method, but that will blow away any existing weight for the specified feature, and so it should not normally be used.)
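To make the difference between incrementing and setting a weight concrete, here is a minimal, self-contained sketch. Note that SymbolicFeatureVector and ExtractUnigramCounts are hypothetical stand-ins invented for this illustration; they are not ReFr’s actual FeatureVector class or ExtractSymbolic signature, and only mimic the behavior described above.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Simplified stand-in for a FeatureVector<string, double> (assumption: this is
// NOT the real ReFr class, just enough to show the two update styles).
class SymbolicFeatureVector {
 public:
  // Adds delta to the feature's current weight (treated as 0.0 if absent).
  void IncrementWeight(const std::string &uid, double delta) {
    weights_[uid] += delta;
  }
  // Overwrites whatever weight the feature already had.
  void SetWeight(const std::string &uid, double weight) {
    weights_[uid] = weight;
  }
  // Returns the feature's weight, or 0.0 if the feature is absent.
  double GetWeight(const std::string &uid) const {
    auto it = weights_.find(uid);
    return it == weights_.end() ? 0.0 : it->second;
  }

 private:
  std::unordered_map<std::string, double> weights_;
};

// Hypothetical symbolic extractor: one count feature per whitespace-separated
// token of the candidate's raw string.
void ExtractUnigramCounts(const std::string &raw_data,
                          SymbolicFeatureVector &symbolic_features) {
  std::istringstream tokens(raw_data);
  std::string token;
  while (tokens >> token) {
    symbolic_features.IncrementWeight("unigram:" + token, 1.0);
  }
}
```

Because the sketch increments rather than sets, a token that occurs twice ends up with weight 2.0, and features written by an earlier extractor under the same name are accumulated rather than lost.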

Extracting Features (Finally!)

If you want to extract features for an existing file containing serialized CandidateSet instances (again, for now, created using the tools in src/dataconvert), you can write a short configuration file that specifies the construction of an ExecutiveFeatureExtractor instance. This Factory-constructible object contains a single data member, extractors, that is initialized with a vector of FeatureExtractor implementations. Each FeatureExtractor will be executed on each Candidate of each CandidateSet.

An example of such a configuration file is test-fe.config in the directory src/reranker/config. The format of a feature extractor configuration file should be a specification string for constructing an ExecutiveFeatureExtractor instance, with a sequence of specification strings for each of its wrapped FeatureExtractor instances. Please see Appendix: Dynamic object instantiation for more details on factories and the ability to construct objects from specification strings. (For a formal, BNF description of the format of a specification string, please see the documentation for the reranker::Factory::CreateOrDie method.)

You can then pass this configuration file, along with one or more input files and an output directory, to the bin/extract-features executable. The executable will read the CandidateSet instances from the input files, and, for each Candidate in each CandidateSet, will run the FeatureExtractor instances specified in the config file, in order, on that Candidate. The “in order” part is significant if, e.g., you have a FeatureExtractor implementation that expressly uses features generated by a previously-run FeatureExtractor.
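The following sketch illustrates why execution order can matter. It is only an illustration under stated assumptions: the Features and Extractor types and the three functions are hypothetical and far simpler than the ExecutiveFeatureExtractor machinery, but they show a second extractor that only fires when an earlier one has already produced the feature it reads.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified types for this sketch only.
typedef std::map<std::string, double> Features;
typedef std::function<void(const std::string &, Features &)> Extractor;

// Runs every extractor, in the order given, on one candidate's text.
void RunExtractors(const std::vector<Extractor> &extractors,
                   const std::string &candidate_text, Features &features) {
  for (const Extractor &extract : extractors) {
    extract(candidate_text, features);
  }
}

// First extractor: records the candidate's length in characters.
void ExtractLength(const std::string &text, Features &features) {
  features["length"] += static_cast<double>(text.size());
}

// Second extractor: relies on "length" having been produced earlier.
void ExtractIsLong(const std::string &, Features &features) {
  auto it = features.find("length");
  if (it != features.end() && it->second > 10.0) {
    features["is_long"] += 1.0;
  }
}
```

Running ExtractLength before ExtractIsLong yields an is_long feature for a long candidate; reversing the order silently yields none, which is the kind of mistake the ordering in the configuration file is there to prevent.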

You can execute extract-features with no arguments to get the usage. Here’s what your command will look like:

extract-features -c <config file> -i <input file>+ -o <output directory>

Your input files will each be read in, new features will be extracted and then the resulting, modified streams of CandidateSet objects will be written out to a file of the same name in the <output directory>.

Bonus subsection: extracting features already sitting in a file

If you generate features offline as a text file, where each line corresponds to the features for a single candidate hypothesis, you’re in luck. There’s an abstract base class called AbstractFileBackedFeatureExtractor that you can extend to implement a FeatureExtractor that doesn’t do any real work, but rather uses whatever it finds in its “backing file”. In fact, there's already a concrete implementation in the form of BasicFileBackedFeatureExtractor, so you might be able to use that class “as is”.

Since the feature extractor configuration file lets you specify any sequence of FeatureExtractor instances, you can mix and match, using some FeatureExtractor instances that truly compute feature functions “on the fly”, and others that simply read in pre-computed features sitting in a file.

Training a model

To train a model, you’ll run the bin/run-model executable, which does both model training and inference on test data.

Here’s a sample command:

bin/run-model -t train_file1.gz train_file2.gz -d dev_file1.gz dev_file2.gz -m model_output_file.gz

This builds a model based on the serialized CandidateSet instances in train_file1.gz and train_file2.gz, using dev_file1.gz and dev_file2.gz for held-out evaluation (as a stopping criterion for the perceptron), outputting the model to model_output_file.gz. As with the bin/extract-features executable, running bin/run-model with no arguments prints out a detailed usage message.

Two options that are common to both the bin/extract-features and bin/run-model executables are worth mentioning:

The --max-examples option specifies the maximum number of training examples (i.e., CandidateSet instances) to be read from each input file.
The --max-candidates option specifies the maximum number of candidates to be read per CandidateSet. So, even if your input file contains, say, 1000-best hypothesis sets, you can effectively turn them into 100-best hypothesis sets by specifying

--max-candidates 100

on the command line.

Running a model

So you’ve trained a model and saved it to a file. Now what? To run a model on some data (i.e. to do inference), use the bin/run-model executable and supply the same command-line arguments as you would for training, except omit the training files that you would specify with the -t flag. In this mode, the model file specified with the -m flag is the name of the file from which to load a model that had been trained previously. That model will then be run on the “dev” data files you supply with the -d flag.

A command might look like this:

bin/run-model -d dev_file1.gz dev_file2.gz -m model_input_file.gz

Appendix: Dynamic object instantiation

There’s a famous quotation of Philip Greenspun known as Greenspun’s Tenth Rule:

Greenspun’s Tenth Rule: Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

This statement is remarkably true in practice, and no less so here. C++ lacks convenient support for dynamic object instantiation, but the Reranker Framework uses it extensively via a Factory class and a C++-style (yet simple) syntax.

An example: The way C++ does it

To motivate the C++-style syntax used by the Reranker Framework’s Factory class, let’s look at a simple example of a C++ class Person and its constructor:

// A class to represent a date in the standard Gregorian calendar.
class Date {
 public:
   Date(int year, int month, int day) :
     year_(year), month_(month), day_(day) { }
 private:
   int year_;
   int month_;
   int day_;
};
// A class to represent a few facts about a person.
class Person {
 public:
   Person(const string &name, int cm_height, const Date &birthday) :
     name_(name), cm_height_(cm_height), birthday_(birthday) { }
 private:
  string name_;
  int cm_height_;
  Date birthday_;
};

As you can see, the Person class has three data members, one of which happens to be an instance of another class called Date. In this case, all of the initialization of a Person happens in the initialization phase of the constructor: the part after the colon but before the constructor’s body block (the “declaration phase”). By convention, each parameter to the constructor has a name nearly identical to the data member that will be initialized from it. If we wanted to construct a Person instance for someone named “Fred” who was 180 cm tall and was born January 10th, 1990, we could write the following:

Person fred("Fred", 180, Date(1990, 1, 10));

If Person were a Factory-constructible type in the Reranker Framework, we would be able to specify the following as a specification string to tell the Factory how to construct a Person instance for Fred:

Person(name("Fred"), cm_height(180), birthday(Date(year(1990), month(1), day(10))))

As you can see, the syntax is very similar to that of C++. It’s kind of a combination of the parameter list and the initialization phase of a C++ constructor. Unfortunately, we can’t get this kind of dynamic instantiation in C++ for free; we need some help from the programmer. However, we’ve tried to make the burden on the programmer fairly low, using just a couple of macros to help declare a Factory for an abstract base class, as well as to make it easy to make that Factory aware of the concrete subtypes of that base class that it can construct.

Some nitty gritty details: declaring factories for abstract types and registering concrete subtypes

Every Factory-constructible abstract type needs to declare its factory via the IMPLEMENT_FACTORY macro. For example, since the Reranker Framework uses a Factory to construct concrete instances of the abstract type FeatureExtractor, the line

IMPLEMENT_FACTORY(FeatureExtractor)

appears in the file feature-extractor.C. (It is unfortunate that we have to resort to using macros, but the point is that the burden on the programmer to create a factory is extremely low, and therefore so is the risk of introducing bugs.)

By convention every Factory-constructible abstract type defines one or two macros in terms of the REGISTER_NAMED macro defined in factory.H to allow concrete subtypes to register themselves with the Factory, so that they may be instantiated. For example, since the FeatureExtractor class is an abstract base class in the Reranker Framework that has a Factory, in feature-extractor.H you can find the declaration of two macros, REGISTER_NAMED_FEATURE_EXTRACTOR and REGISTER_FEATURE_EXTRACTOR. The NgramFeatureExtractor class is a concrete subclass of FeatureExtractor, and so it registers itself with Factory<FeatureExtractor> by having

REGISTER_FEATURE_EXTRACTOR(NgramFeatureExtractor)

in ngram-feature-extractor.C. That macro expands to

REGISTER_NAMED(NgramFeatureExtractor, NgramFeatureExtractor, FeatureExtractor)

which tells the Factory<FeatureExtractor> that there is a class NgramFeatureExtractor whose “factory name” (the string that can appear in specification strings; more on these in a moment) is "NgramFeatureExtractor" and that the class NgramFeatureExtractor is a concrete subclass of FeatureExtractor, i.e., that it can be constructed by Factory<FeatureExtractor>, as opposed to some other Factory for a different abstract base class.
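For readers curious how a registration macro of this kind can work at all, here is a minimal sketch of the common “self-registering factory” pattern. It is an assumption-laden illustration, not the implementation in factory.H: ToyExtractor, ToyExtractorFactory, ToyExtractorRegistrar and the two REGISTER_..._TOY_EXTRACTOR macros are all invented names, and the real REGISTER_NAMED macro may be built quite differently.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical abstract base class standing in for something like
// FeatureExtractor.
class ToyExtractor {
 public:
  virtual ~ToyExtractor() {}
  virtual std::string Name() const = 0;
};

// A minimal registry that maps factory names to constructor functions.
class ToyExtractorFactory {
 public:
  typedef std::function<std::unique_ptr<ToyExtractor>()> Creator;

  // A function-local static sidesteps static-initialization-order problems.
  static std::map<std::string, Creator> &Registry() {
    static std::map<std::string, Creator> registry;
    return registry;
  }
  // Returns a new instance, or null if the name was never registered.
  static std::unique_ptr<ToyExtractor> Create(const std::string &name) {
    auto it = Registry().find(name);
    if (it == Registry().end()) return std::unique_ptr<ToyExtractor>();
    return it->second();
  }
};

// Constructing one of these at namespace scope registers T under a name
// before main() starts running.
template <typename T>
struct ToyExtractorRegistrar {
  explicit ToyExtractorRegistrar(const std::string &name) {
    ToyExtractorFactory::Registry()[name] = [] {
      return std::unique_ptr<ToyExtractor>(new T());
    };
  }
};

#define REGISTER_NAMED_TOY_EXTRACTOR(TYPE, NAME) \
  static ToyExtractorRegistrar<TYPE> registrar_##NAME(#NAME);
#define REGISTER_TOY_EXTRACTOR(TYPE) REGISTER_NAMED_TOY_EXTRACTOR(TYPE, TYPE)

// A concrete subtype that registers itself with the factory.
class ToyNgramExtractor : public ToyExtractor {
 public:
  std::string Name() const override { return "ToyNgramExtractor"; }
};
REGISTER_TOY_EXTRACTOR(ToyNgramExtractor)
```

With this in place, ToyExtractorFactory::Create("ToyNgramExtractor") returns a fresh instance without any code having to mention the concrete class by name at the call site, which is the property the specification strings rely on.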

Every Factory-constructible abstract type must also specify two methods, a RegisterInitializers(Initializers&) method and an Init(const string&) method. Both methods are guaranteed to be invoked, in order, just after construction of every object by the Factory. To reduce the burden on the programmer, you can derive your abstract class from FactoryConstructible, which implements both methods to do nothing. (All of the abstract base classes that can be constructed via Factory in the Reranker Framework already do this.) For most concrete subtypes, most of the work of initialization is done inside the factory to initialize registered data members, handled by the class’s RegisterInitializers(Initializers&) method. The implementation of this method generally contains a set of invocations to the various Add methods of the Initializers class, “registering” each variable with a name that will be recognized by the Factory when it parses the specification string. When member initializations are added to an Initializers instance, they are optional by default. By passing true as a third argument, one may mark a member as required, meaning that its initialization must appear within the specification string. If the specification string omits a required member, a runtime error will be raised.

For completeness, post–member-initialization may be performed by the class’s Init(const string &) method, which is guaranteed to be invoked with the complete string that was parsed by the Factory. The code executed by a class’s Init(const string &) method is very much akin to the declaration phase of a C++ constructor, because it is the code that gets executed just after the members have been initialized.

For example, FeatureExtractor instances are Factory-constructible, and so the FeatureExtractor class ensures its concrete subclasses have a RegisterInitializers method and an Init method by being a subclass of reranker::FactoryConstructible. As we saw above, NgramFeatureExtractor is a concrete subtype of FeatureExtractor. That class has two data members that can be initialized by a factory, one required and one optional. To show you how easy it is to “declare” data members that need initialization, here is the exact code from the NgramFeatureExtractor::RegisterInitializers method:

virtual void RegisterInitializers(Initializers &initializers) {
  bool required = true;
  initializers.Add("n",      &n_, required);
  initializers.Add("prefix", &prefix_);
}

The above code says that the NgramFeatureExtractor has a data member n_, which happens to be an int, that is required to be initialized when an NgramFeatureExtractor instance is constructed by a Factory, and that the name of this variable will be "n" as far as the factory is concerned. It also says that it has a data member prefix_, which happens to be of type string, whose factory name will be "prefix", and that is not required to be present in a specification string for an NgramFeatureExtractor.
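To give a feel for what might sit behind calls like these, here is a toy sketch of a registry with required and optional members. Everything in it is an assumption made for illustration: ToyInitializers and ToyNgramConfig are invented names, only int and string members are supported, and the real reranker::Initializers class works on a parsed token stream rather than a map of strings.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for an Initializers-like class (NOT the real ReFr class).
class ToyInitializers {
 public:
  // Registers an int member under a factory name; optional by default.
  void Add(const std::string &name, int *member, bool required = false) {
    ints_[name] = member;
    if (required) required_.insert(name);
  }
  // Registers a string member under a factory name; optional by default.
  void Add(const std::string &name, std::string *member,
           bool required = false) {
    strings_[name] = member;
    if (required) required_.insert(name);
  }

  // Assigns already-parsed name/value pairs to the registered members.
  // Throws if a name is unknown, an int value is malformed, or a required
  // member is missing.
  void Initialize(const std::map<std::string, std::string> &values) const {
    for (const auto &entry : values) {
      auto int_it = ints_.find(entry.first);
      auto str_it = strings_.find(entry.first);
      if (int_it != ints_.end()) {
        *int_it->second = std::stoi(entry.second);
      } else if (str_it != strings_.end()) {
        *str_it->second = entry.second;
      } else {
        throw std::runtime_error("unknown member: " + entry.first);
      }
    }
    for (const std::string &name : required_) {
      if (values.count(name) == 0) {
        throw std::runtime_error("missing required member: " + name);
      }
    }
  }

 private:
  std::map<std::string, int *> ints_;
  std::map<std::string, std::string *> strings_;
  std::set<std::string> required_;
};

// Hypothetical class mirroring the n/prefix registration shown above.
struct ToyNgramConfig {
  int n = 0;
  std::string prefix;
  void RegisterInitializers(ToyInitializers &initializers) {
    bool required = true;
    initializers.Add("n", &n, required);
    initializers.Add("prefix", &prefix);
  }
};
```

In this sketch, supplying only n succeeds and leaves prefix empty, while supplying only prefix throws, which matches the required/optional behavior described above.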

The Factory language

As we’ve seen, the language used to instantiate objects is quite simple. An object is constructed via a specification string of the following form:

RegisteredClassName(member1(init1), member2(init2), ...)

where RegisteredClassName is the concrete subtype’s name specified with the REGISTER_NAMED macro (or, more likely, one of the convenience macros that is “implemented” in terms of the REGISTER_NAMED macro, such as REGISTER_MODEL or REGISTER_FEATURE_EXTRACTOR). The comma-separated list inside the outermost set of parentheses is the set of member initializations, which looks, as we saw above, intentionally similar to the format of a C++ constructor’s initialization phase. The names of class members that can be initialized are specified via repeated invocations of the various overloaded reranker::Initializers Add methods. There is essentially one Add method per primitive C++ type, as well as an Add method for Factory-constructible types.

If you love Backus-Naur Form specifications, please see the documentation for the Factory::CreateOrDie method for the formal description of the grammar for specification strings.

To continue our example with NgramFeatureExtractor, the following are all legal specification strings for constructing NgramFeatureExtractor instances:

NgramFeatureExtractor(n(3))
NgramFeatureExtractor(n(2), prefix("foo:"))
NgramFeatureExtractor(prefix("bar"), n(4))
NgramFeatureExtractor(n(2),)

As you can see, the order of member initializers is not important (because each has a unique name), and you can optionally put a comma after the last initializer. The following are illegal specification strings for NgramFeatureExtractor instances:

// Illegal specification strings:
NgramFeatureExtractor(prefix("foo"))
NgramFeatureExtractor()
NgramFeatureExtractor(n(3), prefix(4))

In the first two cases, the specification strings are missing the required variable n, and in the final case, the optional prefix member is being initialized, but with an int literal instead of a string literal.

In most cases, you will never need to directly use a Factory instance, but they are often at work behind the scenes. For example, every Model instance uses a factory to construct its internal Candidate::Comparator instances that it uses to determine the “gold” and top-scoring candidates when training. In fact, the specification strings for constructing Model instances reveal how an init_string can itself contain other specification strings. For example, to construct a PerceptronModel instance with a DirectLossScoreComparator, you’d use the following specification string:

PerceptronModel(name("MyPerceptronModel"), score_comparator(DirectLossScoreComparator()))

The first member initialization, for the member called name, specifies the unique name you can give to each Model instance (which is strictly for human consumption). The second member initialization, for the member called score_comparator, overrides the default Candidate::Comparator used to compare candidate scores, and illustrates how this simple language is recursive, in that specification strings may contain other specification strings for other Factory-constructible objects.
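The recursive structure is easy to see in a small recursive-descent parser. The sketch below is a toy written for this guide under stated assumptions: ToySpec, ToyMember and ToySpecParser are invented names, the grammar it accepts is a simplification (non-negative int literals, string literals without escapes, and nested specification strings), and it only builds a parse tree rather than constructing objects. The authoritative grammar is the one documented with Factory::CreateOrDie.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct ToySpec;

// One member initializer: name(literal) or name(NestedType(...)).
struct ToyMember {
  std::string name;
  std::string literal;              // Set for int and string literals.
  std::shared_ptr<ToySpec> nested;  // Set for a nested specification string.
};

// A parsed specification string: TypeName(member1(...), member2(...), ...).
struct ToySpec {
  std::string type_name;
  std::vector<ToyMember> members;
};

class ToySpecParser {
 public:
  explicit ToySpecParser(const std::string &text) : text_(text), pos_(0) {}

  // Parses the whole input as a single specification string.
  ToySpec Parse() {
    ToySpec spec = ParseSpec();
    SkipSpaces();
    if (!AtEnd()) throw std::runtime_error("trailing characters");
    return spec;
  }

 private:
  bool AtEnd() const { return pos_ >= text_.size(); }
  unsigned char Current() const {
    return static_cast<unsigned char>(text_[pos_]);
  }
  void SkipSpaces() {
    while (!AtEnd() && std::isspace(Current())) ++pos_;
  }
  // Skips whitespace, then reports whether the next character is c.
  bool Peek(char c) {
    SkipSpaces();
    return !AtEnd() && text_[pos_] == c;
  }
  void Expect(char c) {
    if (!Peek(c)) throw std::runtime_error(std::string("expected '") + c + "'");
    ++pos_;
  }
  std::string ParseIdentifier() {
    SkipSpaces();
    std::size_t start = pos_;
    while (!AtEnd() && (std::isalnum(Current()) || text_[pos_] == '_')) ++pos_;
    if (pos_ == start) throw std::runtime_error("expected identifier");
    return text_.substr(start, pos_ - start);
  }
  ToySpec ParseSpec() {
    ToySpec spec;
    spec.type_name = ParseIdentifier();
    Expect('(');
    while (!Peek(')')) {
      spec.members.push_back(ParseMember());
      if (!Peek(',')) break;
      ++pos_;  // Consume the comma; a trailing comma is allowed.
    }
    Expect(')');
    return spec;
  }
  ToyMember ParseMember() {
    ToyMember member;
    member.name = ParseIdentifier();
    Expect('(');
    if (Peek('"')) {
      std::size_t close = text_.find('"', pos_ + 1);
      if (close == std::string::npos) {
        throw std::runtime_error("unterminated string literal");
      }
      member.literal = text_.substr(pos_ + 1, close - pos_ - 1);
      pos_ = close + 1;
    } else if (!AtEnd() && std::isdigit(Current())) {
      std::size_t start = pos_;
      while (!AtEnd() && std::isdigit(Current())) ++pos_;
      member.literal = text_.substr(start, pos_ - start);
    } else {
      // Recursion: the initializer is itself a specification string.
      member.nested = std::make_shared<ToySpec>(ParseSpec());
    }
    Expect(')');
    return member;
  }

  std::string text_;
  std::size_t pos_;
};
```

Parsing the PerceptronModel string above with this sketch yields a ToySpec with two members, the second of which holds a nested ToySpec whose type name is DirectLossScoreComparator and whose member list is empty.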

Putting it all together

Here is a template illustrating how one creates a Factory for an abstract base class called “Abby” and declares a concrete subtype “Concky” to that Factory. Most users of the Reranker Framework are likely only to build concrete subtypes of abstract classes that already have factories, and so those users can safely ignore the abby.H and abby.C files.

abby.H:

#include "factory.H"

class Abby : public FactoryConstructible {
  // ... the code for Abby ...
};

#define REGISTER_NAMED_ABBY(TYPE,NAME) REGISTER_NAMED(TYPE,NAME,Abby)
#define REGISTER_ABBY(TYPE) REGISTER_NAMED_ABBY(TYPE,TYPE)

abby.C:

IMPLEMENT_FACTORY(Abby)

concky.H:

#include "abby.H"

class Concky : public Abby {
 public:
  virtual void RegisterInitializers(Initializers &initializers) {
    // various calls to the overloaded Initializers::Add methods,
    // one per data member that the Factory can initialize
  }
};

concky.C:

REGISTER_ABBY(Concky)

So what about Greenspun’s Tenth Rule? Well, the idea that initialization strings can themselves contain specification strings suggests that there is a full-blown language being interpreted here, complete with a proper tokenizer and a recursive-descent parser. There is. It is a simple language, and one that is formally specified. To the extent that it mirrors the way C++ does things, it is not quite ad hoc; rather, it is (close to being) an exceedingly small subset of C++ that can be executed dynamically. We hope it is not bug-ridden, but we’ll let you, the user, be the judge of that.