Reranker Framework (ReFr)
Reranking framework for structure prediction and discriminative language modeling
This package provides ways to train and use discriminative models in a reranking framework. There is some special handling for building discriminative language models.
To build and install, run the following command sequence: ./configure, then make, then make install.
Requirements:
- autoconf 2.68 or higher
- automake 1.11 or higher
Additional requirements are checked by the supplied configure script; they comprise:
- pkg-config (a requirement of Google Protocol Buffers)
Please make sure you have at least the preceding three packages installed prior to building ReFr.
To build the Reranker Framework package, you must first run the supplied configure script. Please run ./configure --help to see common options. In particular, you can use the --prefix option to specify the installation directory, which defaults to /usr/local/.
After running ./configure with any desired options, you can build the entire package by simply issuing the make command. Installation of the package is completed by running make install. Finally, there are a number of additional make targets supplied “for free” with the GNU autoconf build system, the most useful of which are make clean, which cleans the build directory, and make distclean, which cleans everything, including files auto-generated by the configure script.
Executables are in the bin subdirectory, and a library is built in the lib subdirectory. There are many unit test executables, but the two “real” binaries you’ll care about are bin/extract-features and bin/run-model.
When dealing with feature extraction, you can work in an off-line mode, where you read in candidate hypotheses, extract features for those hypotheses and then write those back out to a file. You can also work in an on-line mode, where you train a model by reading in sets of candidates (hereafter referred to as “candidate sets”) where features for each candidate in each set are extracted “on the fly”. Finally, you can mix these two ways of working, as we’ll see below.
The Reranker framework uses protocol buffers for all low-level I/O. (See http://code.google.com/p/protobuf/ for more information.) In short, protocol buffers provide a way to serialize and de-serialize things that look a lot like C structs. You specify a protocol buffer in a format very familiar to C/C++/Java programmers. The protocol buffer definitions for the Reranker framework are all specified in two files. While you might be interested in perusing these files for your own edification, the Reranker framework has reader and writer classes that abstract away from this low-level representation.
You do need to get some candidate sets into this protocol buffer format to begin working with the Reranker framework, and so to bootstrap the process, you can use the executables in the src/dataconvert directory (see the README file in that directory for example usage). The asr_nbest_proto executable can read sets of files in Brian Roark’s format, and the mt_nbest_proto executable can read files in Philipp Koehn’s format.
What if you have files that are not in either of those two formats? The answer is that you can easily construct your own CandidateSet instances in memory and write them out to file, serialized using their protocol buffer equivalent, CandidateSetMessage. The only requirements are that each CandidateSet needs to have a reference string, and each Candidate needs to have a baseline score, a loss value, a string consisting of its tokens and the number of its tokens. Here are the two methods:
Method 1 creates a sequence of CandidateSet instances in memory, pushing each into an STL std::vector, and then writes that vector out to disk. Your code would supply its own methods/functions for grabbing the data for each new CandidateSet and Candidate instance, and would write the output to the file named by a variable such as filename.
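The shape of Method 1 can be sketched as follows. This is only an illustration under loud assumptions: ToyCandidate, ToyCandidateSet, MakeCandidate, WriteAll and WriteAllToFile are hypothetical stand-ins, not the ReFr classes, and plain tab-separated text stands in for the protocol buffer serialization that the real writer classes perform.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in types for illustration only; these are NOT the ReFr classes.
struct ToyCandidate {
  double baseline_score;   // score assigned by the baseline model
  double loss;             // loss relative to the reference
  std::string tokens;      // whitespace-separated tokens
  std::size_t num_tokens;  // number of tokens in `tokens`
};

struct ToyCandidateSet {
  std::string reference;                 // the required reference string
  std::vector<ToyCandidate> candidates;  // the n-best list
};

// Counts whitespace-separated tokens.
std::size_t CountTokens(const std::string &text) {
  std::istringstream in(text);
  std::string token;
  std::size_t n = 0;
  while (in >> token) ++n;
  return n;
}

// Builds a candidate that carries the four required fields.
ToyCandidate MakeCandidate(double baseline_score, double loss,
                           const std::string &tokens) {
  return ToyCandidate{baseline_score, loss, tokens, CountTokens(tokens)};
}

// Method 1: every set has already been collected in a vector, so the whole
// vector is written in one pass. Text output stands in for protocol buffers.
void WriteAll(const std::vector<ToyCandidateSet> &sets, std::ostream &out) {
  for (const ToyCandidateSet &set : sets) {
    out << "set\t" << set.reference << '\n';
    for (const ToyCandidate &c : set.candidates) {
      out << "cand\t" << c.baseline_score << '\t' << c.loss << '\t'
          << c.num_tokens << '\t' << c.tokens << '\n';
    }
  }
}

// Opens `filename` and writes every set; returns false on any I/O failure.
bool WriteAllToFile(const std::vector<ToyCandidateSet> &sets,
                    const std::string &filename) {
  std::ofstream out(filename.c_str());
  if (!out) return false;
  WriteAll(sets, out);
  return static_cast<bool>(out);
}
```

A caller would push one ToyCandidateSet per problem instance into a std::vector and then call WriteAllToFile(sets, filename) once at the end.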
Method 2 is nearly identical to Method 1, but does not try to assemble all CandidateSet instances into a single std::vector before writing them all out to disk.
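A minimal sketch of the streaming variant follows, again with hypothetical stand-in types (ToyCandidate, ToyCandidateSet and ToyStreamingWriter are invented for illustration and are not the ReFr reader or writer classes). The point is only that each set is serialized as soon as it is built, so memory use does not grow with the number of sets.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in types for illustration only; these are NOT the ReFr classes.
struct ToyCandidate {
  double baseline_score;
  double loss;
  std::string tokens;
  int num_tokens;
};

struct ToyCandidateSet {
  std::string reference;
  std::vector<ToyCandidate> candidates;
};

// Hypothetical streaming writer: serializes one set per call, so the caller
// never has to hold more than one set in memory. Text output stands in for
// the protocol buffer serialization.
class ToyStreamingWriter {
 public:
  explicit ToyStreamingWriter(std::ostream &out) : out_(out), num_written_(0) {}

  void Write(const ToyCandidateSet &set) {
    out_ << "set\t" << set.reference << '\t' << set.candidates.size() << '\n';
    for (const ToyCandidate &c : set.candidates) {
      out_ << "cand\t" << c.baseline_score << '\t' << c.loss << '\t'
           << c.num_tokens << '\t' << c.tokens << '\n';
    }
    ++num_written_;
  }

  int num_written() const { return num_written_; }

 private:
  std::ostream &out_;
  int num_written_;
};
```

In use, the loop that grabs data for each new set calls Write on it immediately and then lets the set go out of scope.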
There’s a third, secret method for reading in candidate sets from arbitrary formats. You can build an implementation of the rather simple CandidateSetIterator interface, which is what the Model interface uses to iterate over a sequence of candidate sets during training or decoding. With this approach, your data never gets stored as protocol buffer messages. Given the utility of storing information in protocol buffers, however, we strongly advise against using this method.
If you want to extract features, there are just four classes in the Reranker framework you’ll want to know about:
Class name | Brief description |
---|---|
Candidate | Describes a candidate hypothesis put forth by a baseline model for some problem instance (e.g., a sentence in the case of MT or an utterance in the case of speech recognition). |
CandidateSet | A set of candidate hypotheses for a single problem instance. |
FeatureVector | A mapping from feature uids (either strings or ints) to their values (doubles). |
FeatureExtractor | An interface/abstract base class that you will extend to write your own feature extractors. |
To build your own FeatureExtractor, follow these steps:
See example-feature-extractor.H for a fully-functional (albeit boring) FeatureExtractor implementation. Please note that normally, one would use the REGISTER_FEATURE_EXTRACTOR macro in one’s feature extractor’s .C file, but for the ExampleFeatureExtractor this is done in the .H to keep things simple.
As mentioned in Step 4 above, in most cases you’ll want to provide a real implementation for either the FeatureExtractor::Extract method or the FeatureExtractor::ExtractSymbolic method, but not both. In fact, you’ll most likely just want to implement ExtractSymbolic, which allows you to extract features that are strings that map to double values. (Since both methods are pure virtual in the FeatureExtractor definition, you’ll have to implement both, but either, or both, can be implemented to do nothing.)
When implementing the FeatureExtractor::ExtractSymbolic method, you will normally modify just the second parameter, which is a reference to a FeatureVector<string, double>. You’ll typically modify it using the FeatureVector::IncrementWeight method. (There’s also a FeatureVector::SetWeight method, but that will blow away any existing weight for the specified feature, and so it should not normally be used.)
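The difference between incrementing and setting a weight can be seen in a small self-contained sketch. ToyFeatureVector and ExtractUnigramCounts are hypothetical stand-ins written for this illustration; they only mimic the behavior described above and do not reproduce the signatures of the ReFr FeatureVector or FeatureExtractor classes.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Stand-in for a string-keyed feature vector; NOT the ReFr FeatureVector.
class ToyFeatureVector {
 public:
  // Adds `by` to the current weight; a feature not yet present starts at 0.
  double IncrementWeight(const std::string &uid, double by) {
    return weights_[uid] += by;
  }

  // Overwrites any existing weight for `uid`.
  void SetWeight(const std::string &uid, double weight) {
    weights_[uid] = weight;
  }

  // Returns the stored weight, or 0 for a feature that was never touched.
  double GetWeight(const std::string &uid) const {
    std::unordered_map<std::string, double>::const_iterator it =
        weights_.find(uid);
    return it == weights_.end() ? 0.0 : it->second;
  }

 private:
  std::unordered_map<std::string, double> weights_;
};

// Hypothetical symbolic extractor: one count feature per token. Because it
// increments, repeated tokens accumulate and weights produced by earlier
// extractors under the same name are preserved.
void ExtractUnigramCounts(const std::string &tokens,
                          ToyFeatureVector &features) {
  std::istringstream in(tokens);
  std::string token;
  while (in >> token) features.IncrementWeight("unigram:" + token, 1.0);
}
```

Had the extractor called SetWeight instead, the second occurrence of a token would simply overwrite the first rather than raise the count to 2.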
If you want to extract features for an existing file containing serialized CandidateSet instances (again, for now, created using the tools in src/dataconvert), you can write a short configuration file that specifies the construction of an ExecutiveFeatureExtractor instance. This Factory-constructible object contains a single data member, extractors, that is initialized with a vector of FeatureExtractor implementations. Each FeatureExtractor will be executed on each Candidate of each CandidateSet.
An example of such a configuration file is test-fe.config in the directory src/reranker/config. The format of a feature extractor configuration file should be a specification string for constructing an ExecutiveFeatureExtractor instance, with a sequence of specification strings for each of its wrapped FeatureExtractor instances. Please see Appendix: Dynamic object instantiation for more details on factories and the ability to construct objects from specification strings. (For a formal, BNF description of the format of a specification string, please see the documentation for the reranker::Factory::CreateOrDie method.)
You can then pass this configuration file, along with one or more input files and an output directory, to the bin/extract-features executable. The executable will read the CandidateSet instances from the input files and, for each Candidate in each CandidateSet, will run the FeatureExtractor instances specified in the config file, in order, on that Candidate. The “in order” part is significant if, e.g., you have a FeatureExtractor implementation that expressly uses features generated by a previously-run FeatureExtractor.
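Why the order matters can be shown with a toy pipeline. Everything in this sketch is hypothetical (ToyCandidate, ToyExtractor, RunExtractors and the two extractors are invented for illustration and are not the ReFr ExecutiveFeatureExtractor); it assumes only what the text states, namely that each extractor runs on each candidate in the configured order.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using ToyFeatures = std::map<std::string, double>;

// Stand-in candidate; NOT the ReFr Candidate class.
struct ToyCandidate {
  std::string tokens;
  ToyFeatures features;
};

using ToyExtractor = std::function<void(ToyCandidate &)>;

// Runs each extractor, in the order given, on every candidate.
void RunExtractors(const std::vector<ToyExtractor> &extractors,
                   std::vector<ToyCandidate> &candidates) {
  for (ToyCandidate &candidate : candidates) {
    for (const ToyExtractor &extract : extractors) extract(candidate);
  }
}

// First extractor: records the length of the token string in characters.
void LengthExtractor(ToyCandidate &c) {
  c.features["length"] += static_cast<double>(c.tokens.size());
}

// Second extractor: depends on the "length" feature, so it only fires if
// LengthExtractor has already run on this candidate.
void LongFlagExtractor(ToyCandidate &c) {
  ToyFeatures::const_iterator it = c.features.find("length");
  if (it != c.features.end() && it->second > 10.0) {
    c.features["is_long"] += 1.0;
  }
}
```

Running LengthExtractor before LongFlagExtractor produces the is_long feature for a long hypothesis; swapping the two silently drops it, which is the kind of dependency the ordering guarantee exists to support.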
You can execute extract-features with no arguments to get the usage. Your input files will each be read in, new features will be extracted and then the resulting, modified streams of CandidateSet objects will be written out to a file of the same name in the output directory.
If you generate features offline as a text file, where each line corresponds to the features for a single candidate hypothesis, you’re in luck. There’s an abstract base class called AbstractFileBackedFeatureExtractor that you can extend to implement a FeatureExtractor that doesn’t do any real work, but rather uses whatever it finds in its “backing file”. In fact, there's already a concrete implementation in the form of BasicFileBackedFeatureExtractor, so you might be able to use that class “as is”.
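The idea of a file-backed extractor can be sketched as below. The backing-file format used here (one line per candidate, whitespace-separated name=value pairs) is an assumption made purely for illustration, and ParseFeatureLine and ToyFileBackedExtractor are hypothetical names; the format actually expected by BasicFileBackedFeatureExtractor should be checked in its documentation.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

using ToyFeatures = std::map<std::string, double>;

// Parses one line of a hypothetical backing file: "name=value name=value ...".
// Tokens without '=', with an empty name, or whose value does not begin with
// a number are skipped.
ToyFeatures ParseFeatureLine(const std::string &line) {
  ToyFeatures features;
  std::istringstream in(line);
  std::string token;
  while (in >> token) {
    const std::string::size_type eq = token.rfind('=');
    if (eq == std::string::npos || eq == 0) continue;
    std::istringstream value_in(token.substr(eq + 1));
    double value;
    if (value_in >> value) features[token.substr(0, eq)] += value;
  }
  return features;
}

// Stand-in "file-backed" extractor: it does no real feature computation and
// simply consumes the next line of its backing stream for each candidate.
class ToyFileBackedExtractor {
 public:
  explicit ToyFileBackedExtractor(std::istream &backing) : backing_(backing) {}

  // Adds the next line's features to `features`. Returns false when the
  // backing stream has run out of lines.
  bool ExtractNext(ToyFeatures &features) {
    std::string line;
    if (!std::getline(backing_, line)) return false;
    const ToyFeatures parsed = ParseFeatureLine(line);
    for (const auto &entry : parsed) features[entry.first] += entry.second;
    return true;
  }

 private:
  std::istream &backing_;
};
```

The extractor must be called once per candidate, in the same order in which the lines of the backing file were generated.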
Since the feature extractor configuration file lets you specify any sequence of FeatureExtractor instances, you can mix and match, using some FeatureExtractor’s that are truly computing feature functions “on the fly”, and others that are simply reading in pre-computed features sitting in a file.
To train a model, you’ll run the bin/run-model
executable, which does both model training and inference on test data.
A sample command names the training files with the -t flag, the held-out (“dev”) files with the -d flag and the model file with the -m flag. For example, such a command can build a model based on the serialized CandidateSet instances in train_file1.gz and train_file2.gz, using dev_file1.gz and dev_file2.gz for held-out evaluation (as a stopping criterion for the perceptron), outputting the model to model.gz. As with the bin/extract-features executable, running bin/run-model with no arguments prints out a detailed usage message.
Two options that are common to both the bin/extract-features and bin/run-model executables are worth mentioning:
- The --max-examples option specifies the maximum number of training examples (i.e., CandidateSet instances) to be read from each input file.
- The --max-candidates option specifies the maximum number of candidates to be read per CandidateSet. So, even if your input file contains, say, 1000-best hypothesis sets, you can effectively turn them into 100-best hypothesis sets by specifying a --max-candidates value of 100.
So you’ve trained a model and saved it to a file. Now what? To run a model on some data (i.e., to do inference), use the bin/run-model executable and supply the same command-line arguments as you would for training, except omit the training files that you would specify with the -t flag. In this mode, the model file specified with the -m flag is the name of the file from which to load a model that had been trained previously. That model will then be run on the “dev” data files you supply with the -d flag.
Such a command therefore carries the -m and -d flags but no -t flag.
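There’s a famous quotation of Philip Greenspun known as Greenspun’s Tenth Rule:
“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”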
This statement is remarkably true in practice, and no less so here. C++ lacks convenient support for dynamic object instantiation, but the Reranker Framework uses it extensively via a Factory class and a C++-style (yet simple) syntax.
To motivate the C++-style syntax used by the Reranker Framework’s Factory class, consider a simple C++ class Person and its constructor. The Person class has three data members, one of which happens to be an instance of another class called Date. In this case, all of the initialization of a Person happens in the initialization phase of the constructor, that is, the part after the colon but before the declaration phase block. By convention, each parameter to the constructor has a name nearly identical to the data member that will be initialized from it. To construct a Person instance for someone named “Fred” who was 180 cm tall and was born January 10th, 1990, we would pass the name, the height and a Date built from the birthday to that constructor.
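A sketch of such a class, consistent with the description above, might look like this. The member and parameter names (name, cm_height, birthday, and year, month, day for Date) and the helper MakeFred are assumptions chosen for illustration; the text only fixes that Person has three data members, one of them a Date.

```cpp
#include <cassert>
#include <string>

// A date in the Gregorian calendar.
class Date {
 public:
  Date(int year, int month, int day)
      : year_(year), month_(month), day_(day) {}

  int year() const { return year_; }
  int month() const { return month_; }
  int day() const { return day_; }

 private:
  int year_;
  int month_;
  int day_;
};

// A few facts about a person. All initialization happens in the
// initialization phase (after the colon); the declaration phase block
// (between the braces) is empty.
class Person {
 public:
  Person(const std::string &name, int cm_height, const Date &birthday)
      : name_(name), cm_height_(cm_height), birthday_(birthday) {}

  const std::string &name() const { return name_; }
  int cm_height() const { return cm_height_; }
  const Date &birthday() const { return birthday_; }

 private:
  std::string name_;
  int cm_height_;
  Date birthday_;
};

// Fred: 180 cm tall, born January 10th, 1990.
Person MakeFred() { return Person("Fred", 180, Date(1990, 1, 10)); }
```

Note how each constructor parameter differs from the data member it initializes only by the trailing underscore, which is the naming convention the specification-string syntax builds on.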
If Person were a Factory-constructible type in the Reranker Framework, we would be able to tell the Factory how to construct a Person instance for Fred with a specification string that names the Person class and, inside parentheses, initializes each member by name, with the Date member initialized in turn by a nested specification for a Date.
The syntax is very similar to that of C++. It’s kind of a combination of the parameter list and the initialization phase of a C++ constructor. Unfortunately, we can’t get this kind of dynamic instantiation in C++ for free; we need some help from the programmer. However, we’ve tried to make the burden on the programmer fairly low, using just a couple of macros to help declare a Factory for an abstract base class, as well as to make it easy to make that Factory aware of the concrete subtypes of that base class that it can construct.
Every Factory-constructible abstract type needs to declare its factory via the IMPLEMENT_FACTORY macro. For example, since the Reranker Framework uses a Factory to construct concrete instances of the abstract type FeatureExtractor, the line IMPLEMENT_FACTORY(FeatureExtractor) appears in the file feature-extractor.C. (It is unfortunate that we have to resort to using macros, but the point is that the burden on the programmer to create a factory is extremely low, and therefore so is the risk of introducing bugs.)
By convention, every Factory-constructible abstract type defines one or two macros in terms of the REGISTER_NAMED macro defined in factory.H to allow concrete subtypes to register themselves with the Factory, so that they may be instantiated. For example, since the FeatureExtractor class is an abstract base class in the Reranker Framework that has a Factory, in feature-extractor.H you can find the declaration of two macros, REGISTER_NAMED_FEATURE_EXTRACTOR and REGISTER_FEATURE_EXTRACTOR. The NgramFeatureExtractor class is a concrete subclass of FeatureExtractor, and so it registers itself with Factory<FeatureExtractor> by having REGISTER_FEATURE_EXTRACTOR(NgramFeatureExtractor) in ngram-feature-extractor.C. That macro expands to an invocation of REGISTER_NAMED, which tells the Factory<FeatureExtractor> that there is a class NgramFeatureExtractor whose “factory name” (the string that can appear in specification strings; more on these in a moment) is "NgramFeatureExtractor" and that the class NgramFeatureExtractor is a concrete subclass of FeatureExtractor, i.e., that it can be constructed by Factory<FeatureExtractor>, as opposed to some other Factory for a different abstract base class.
Every Factory-constructible abstract type must also specify two methods, a RegisterInitializers(Initializers&) method and an Init(const string&) method. Both methods are guaranteed to be invoked, in order, just after construction of every object by the Factory. To reduce the burden on the programmer, you can derive your abstract class from FactoryConstructible, which implements both methods to do nothing. (All of the abstract base classes that can be constructed via Factory in the Reranker Framework already do this.)
For most concrete subtypes, most of the work of initialization is done inside the factory to initialize registered data members, handled by the class’s RegisterInitializers(Initializers&) method. The implementation of this method generally contains a set of invocations to the various Add methods of the Initializers class, “registering” each variable with a name that will be recognized by the Factory when it parses the specification string. When member initializations are added to an Initializers instance, they are optional by default. By including a third argument that is true, one may specify a member whose initialization string must appear within the specification. If the specification does not contain it, a runtime error will be raised.
For completeness, post-member-initialization work may be performed by the class’s Init(const string&) method, which is guaranteed to be invoked with the complete string that was parsed by the Factory. The code executed by a class’s Init(const string&) method is very much akin to the declaration phase of a C++ constructor, because it is the code that gets executed just after the members have been initialized.
For example, FeatureExtractor instances are Factory-constructible, and so the FeatureExtractor class ensures its concrete subclasses have a RegisterInitializers method and an Init method by being a subclass of reranker::FactoryConstructible. As we saw above, NgramFeatureExtractor is a concrete subtype of FeatureExtractor. That class has two data members that can be initialized by a factory, one required and one optional, and “declaring” them is the job of the NgramFeatureExtractor::RegisterInitializers method.
The code in that method says that the NgramFeatureExtractor has a data member n_, which happens to be an int, that is required to be initialized when an NgramFeatureExtractor instance is constructed by a Factory, and that the name of this variable will be "n" as far as the factory is concerned. It also says that it has a data member prefix_, which happens to be of type string, whose factory name will be "prefix", and that is not required to be present in a specification string for an NgramFeatureExtractor.
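The registration pattern just described can be sketched in a self-contained way. ToyInitializers and ToyNgramExtractor are hypothetical stand-ins: they follow the text (an Add overload per member type, with an optional third argument marking a member as required), but the real reranker::Initializers interface may differ in its details.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for a registry of member initializers; NOT the ReFr
// Initializers class, whose real interface may differ.
class ToyInitializers {
 public:
  // Registers an int member under `name`; optional unless `required` is true.
  void Add(const std::string &name, int *member, bool required = false) {
    ints_[name] = member;
    if (required) required_.insert(name);
  }
  // Registers a string member under `name`.
  void Add(const std::string &name, std::string *member,
           bool required = false) {
    strings_[name] = member;
    if (required) required_.insert(name);
  }
  // Assigns through the registered pointer. Returns false if `name` was not
  // registered as an int member (unknown name or wrong type).
  bool SetInt(const std::string &name, int value) {
    std::map<std::string, int *>::iterator it = ints_.find(name);
    if (it == ints_.end()) return false;
    *it->second = value;
    initialized_.insert(name);
    return true;
  }
  // Same as SetInt, but for string members.
  bool SetString(const std::string &name, const std::string &value) {
    std::map<std::string, std::string *>::iterator it = strings_.find(name);
    if (it == strings_.end()) return false;
    *it->second = value;
    initialized_.insert(name);
    return true;
  }
  // True once every required member has been assigned.
  bool AllRequiredSet() const {
    for (const std::string &name : required_) {
      if (initialized_.count(name) == 0) return false;
    }
    return true;
  }

 private:
  std::map<std::string, int *> ints_;
  std::map<std::string, std::string *> strings_;
  std::set<std::string> required_;
  std::set<std::string> initialized_;
};

// Toy extractor with one required member (n_) and one optional (prefix_).
class ToyNgramExtractor {
 public:
  ToyNgramExtractor() : n_(0) {}
  void RegisterInitializers(ToyInitializers &initializers) {
    const bool required = true;
    initializers.Add("n", &n_, required);
    initializers.Add("prefix", &prefix_);
  }
  int n() const { return n_; }
  const std::string &prefix() const { return prefix_; }

 private:
  int n_;
  std::string prefix_;
};
```

A factory built on this idea would parse the specification string, call SetInt or SetString for each named member it finds, and raise an error if AllRequiredSet is still false afterwards.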
As we’ve seen, the language used to instantiate objects is quite simple. An object is constructed via a specification string of the following form: RegisteredClassName(member1(init1), member2(init2), ...), where RegisteredClassName is the concrete subtype’s name specified with the REGISTER_NAMED macro (or, more likely, one of the convenience macros that is “implemented” in terms of the REGISTER_NAMED macro, such as REGISTER_MODEL or REGISTER_FEATURE_EXTRACTOR). The comma-separated list inside the outermost set of parentheses is the set of member initializations, which looks, as we saw above, intentionally similar to the format of a C++ constructor’s initialization phase. The names of class members that can be initialized are specified via repeated invocations of the various overloaded reranker::Initializers Add methods. There is essentially one Add method per primitive C++ type, as well as an Add method for Factory-constructible types.
If you love Backus-Naur Form specifications, please see the documentation for the Factory::CreateOrDie method for the formal description of the grammar for specification strings.
To continue our example with NgramFeatureExtractor, legal specification strings for constructing NgramFeatureExtractor instances may list the member initializers in any order (because each has a unique name), and may optionally put a comma after the last initializer. A specification string for an NgramFeatureExtractor is illegal if it is missing the required variable n, or if it initializes the optional prefix member with an int literal instead of a string literal.
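These rules can be made concrete with a toy recursive-descent checker. This is a simplified sketch and not the grammar implemented by reranker::Factory: ToySpecParser and IsLegalNgramSpec are hypothetical, identifiers are limited to letters and underscores, int literals are unsigned, string escapes are not handled, and nested specifications are validated but only the kinds of the top-level members are recorded.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Kind of value a member was initialized with.
enum ToyKind { kInt, kString, kSpec };
using ToyMembers = std::map<std::string, ToyKind>;

// Toy parser for: Name(member(value), member(value), ...) where a value is
// an int literal, a string literal or a nested specification.
class ToySpecParser {
 public:
  explicit ToySpecParser(const std::string &input) : pos_(0) {
    bool quoted = false;
    for (char c : input) {  // drop whitespace outside string literals
      if (c == '"') quoted = !quoted;
      if (quoted || !std::isspace(static_cast<unsigned char>(c))) in_ += c;
    }
  }

  // Parses the whole input as one specification. On success, fills in the
  // class name and the kinds of its top-level members.
  bool Parse(std::string *class_name, ToyMembers *members) {
    return ParseSpec(class_name, members) && pos_ == in_.size();
  }

 private:
  bool At(char c) const { return pos_ < in_.size() && in_[pos_] == c; }
  bool Eat(char c) {
    if (!At(c)) return false;
    ++pos_;
    return true;
  }
  // Reads letters and underscores; returns "" if there are none.
  std::string Ident() {
    const std::size_t start = pos_;
    while (pos_ < in_.size() &&
           (std::isalpha(static_cast<unsigned char>(in_[pos_])) ||
            in_[pos_] == '_')) ++pos_;
    return in_.substr(start, pos_ - start);
  }
  bool ParseValue(ToyKind *kind) {
    if (Eat('"')) {  // string literal: skip to the closing quote
      *kind = kString;
      const std::size_t close = in_.find('"', pos_);
      if (close == std::string::npos) return false;
      pos_ = close + 1;
      return true;
    }
    const std::size_t start = pos_;
    while (pos_ < in_.size() &&
           std::isdigit(static_cast<unsigned char>(in_[pos_]))) ++pos_;
    if (pos_ > start) {  // unsigned int literal
      *kind = kInt;
      return true;
    }
    *kind = kSpec;  // otherwise it must be a nested specification
    std::string nested_name;
    ToyMembers nested_members;
    return ParseSpec(&nested_name, &nested_members);
  }
  bool ParseSpec(std::string *class_name, ToyMembers *members) {
    *class_name = Ident();
    if (class_name->empty() || !Eat('(')) return false;
    while (pos_ < in_.size() && !At(')')) {
      const std::string member = Ident();
      ToyKind kind = kInt;
      if (member.empty() || !Eat('(') || !ParseValue(&kind) || !Eat(')') ||
          !members->emplace(member, kind).second) return false;  // bad or dup
      if (!Eat(',') && !At(')')) return false;  // trailing comma is allowed
    }
    return Eat(')');
  }

  std::string in_;
  std::size_t pos_;
};

// Applies the rules stated above: "n" is a required int, "prefix" is an
// optional string, and no other members are accepted.
bool IsLegalNgramSpec(const std::string &text) {
  std::string class_name;
  ToyMembers members;
  if (!ToySpecParser(text).Parse(&class_name, &members)) return false;
  if (class_name != "NgramFeatureExtractor") return false;
  for (const auto &member : members) {
    const bool known = (member.first == "n" && member.second == kInt) ||
                       (member.first == "prefix" && member.second == kString);
    if (!known) return false;
  }
  return members.count("n") == 1;
}
```

Under these assumptions, a string that initializes n with an int and prefix with a string is accepted in either order and with or without a trailing comma, while a string with an empty member list, or one that initializes only prefix, is rejected because n is missing.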
In most cases, you will never need to directly use a Factory instance, but they are often at work behind the scenes. For example, every Model instance uses a factory to construct its internal Candidate::Comparator instances that it uses to determine the “gold” and top-scoring candidates when training. In fact, the specification strings for constructing Model instances reveal how an init_string can itself contain other specification strings. For example, to construct a PerceptronModel instance with a DirectLossScoreComparator, you’d use a specification string with two member initializations. The first member initialization, for the member called name, specifies the unique name you can give to each Model instance (which is strictly for human consumption). The second member initialization, for the member called score_comparator, overrides the default Candidate::Comparator used to compare candidate scores by giving a DirectLossScoreComparator specification string as its value, and illustrates how this simple language is recursive, in that specification strings may contain other specification strings for other Factory-constructible objects.
A template for creating a Factory for an abstract base class called “Abby” and declaring a concrete subtype “Concky” to that Factory consists of four files: abby.H, abby.C, concky.H and concky.C. Most users of the Reranker Framework are likely only to build concrete subtypes of abstract classes that already have factories, and so those users can safely ignore the abby.H and abby.C files.
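The mechanism behind such a template can be sketched in a single file. This is a toy under stated assumptions, not the contents of the four files named above: ToyAbbyFactory and the TOY_REGISTER_ABBY macro are hypothetical and deliberately named differently from the ReFr IMPLEMENT_FACTORY and REGISTER_NAMED macros, whose expansions are not reproduced here. Only the class names Abby and Concky come from the text.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Toy abstract base class (the role played by abby.H in the text).
class Abby {
 public:
  virtual ~Abby() {}
  virtual std::string Describe() const = 0;
};

// Toy factory: maps a registered name to a function that builds an instance.
class ToyAbbyFactory {
 public:
  using Maker = std::function<std::unique_ptr<Abby>()>;

  // Records `maker` under `name`. Returns true so that the call can be used
  // to initialize a static variable at program start-up.
  static bool Register(const std::string &name, Maker maker) {
    Registry()[name] = maker;
    return true;
  }

  // Returns null if no concrete type was registered under `name`.
  static std::unique_ptr<Abby> Create(const std::string &name) {
    std::map<std::string, Maker>::const_iterator it = Registry().find(name);
    if (it == Registry().end()) return std::unique_ptr<Abby>();
    return it->second();
  }

 private:
  // A function-local static avoids static initialization order problems.
  static std::map<std::string, Maker> &Registry() {
    static std::map<std::string, Maker> registry;
    return registry;
  }
};

// Hypothetical registration macro: defines a static bool whose initializer
// registers TYPE under its own name before main() runs.
#define TOY_REGISTER_ABBY(TYPE)                                   \
  static const bool TYPE##_registered = ToyAbbyFactory::Register( \
      #TYPE, [] { return std::unique_ptr<Abby>(new TYPE()); })

// Toy concrete subtype (the role played by concky.H and concky.C).
class Concky : public Abby {
 public:
  std::string Describe() const override { return "Concky instance"; }
};

TOY_REGISTER_ABBY(Concky);
```

With this in place, ToyAbbyFactory::Create("Concky") returns a new Concky without the calling code ever naming the Concky type, which is the essence of constructing objects from a name that appears in a specification string.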
So what about Greenspun’s Tenth Rule? Well, the idea that initialization strings can themselves contain specification strings suggests that there is a full-blown language being interpreted here, complete with a proper tokenizer and a recursive-descent parser. There is. It is a simple language, and one that is formally specified. To the extent that it mirrors the way C++ does things, it is not quite ad hoc; rather, it is (close to being) an exceedingly small subset of C++ that can be executed dynamically. We hope it is not bug-ridden, but we’ll let you, the user, be the judge of that.