Sep 09, 2010 21:38
Possibly because so little natural language research is done in the context of real computational applications, it is not universally accepted that a language model must be able to both parse and generate utterances. When you set out to program a system that includes both a parsing module and a generation module, however, there is a strong incentive to design a single language model rather than two: the former approach would appear to involve less work.
Parsing and generation are, however, very different tasks. Generation is heavily constrained, in the sense that the system definitely has access to all of the information that needs to be communicated. The system's extensively detailed understanding of what needs to be communicated (the semantic instance) has to be abstracted and pruned down to a relatively compact form (the utterance). Although there may be a large or even infinite number of potential utterances that would satisfy the system's functional requirements for a given semantic instance, the system only has to generate one of them.
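To make this concrete, here is a minimal sketch of template-based generation, assuming a semantic instance is just a flat dictionary of attributes; the frame names, slots, and values are all illustrative, not drawn from any particular system:

    # Minimal sketch: a semantic instance is a flat dict of attributes,
    # and each template is a Python format string keyed by frame name.
    TEMPLATES = {
        "flight_status": "Flight {flight} from {origin} is {status}.",
        "flight_delay": "Flight {flight} is running {minutes} minutes late.",
    }

    def generate(semantics):
        """Map one exact semantic instance onto a single utterance."""
        return TEMPLATES[semantics["frame"]].format(**semantics)

    # Many utterances could express this instance; the generator only
    # needs to produce one of them.
    print(generate({"frame": "flight_status", "flight": "UA90",
                    "origin": "Newark", "status": "on time"}))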
By contrast, the task of parsing frequently appears under-constrained. No matter how sophisticated the system is, its designer cannot guarantee that users of the system will always provide input that unambiguously maps to a relevant semantic instance. Moreover, the range of potential utterances that the system must be able to parse may be far greater than some minimal set required to express all potential semantic instances.
When generating, a system maps an exact semantic instance onto a small set of utterances. When parsing, a system maps from a large domain of utterances onto a probability distribution over semantic instances. This asymmetry has been used to justify the split between generation and parsing modules: generation for many applications can be accomplished by a strict logical system of utterance templates, while parsing is an open problem that almost certainly requires sophisticated machine learning techniques. These differences, however, rather than obstructing the development of a unified language model, may actually assist it.
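That asymmetry can be seen directly in the two function signatures. The following toy sketch reuses the generate function from above, scoring each known semantic instance by word overlap with the utterance it would generate and normalizing the scores into a distribution; a real parser would be vastly more sophisticated, but the shape of the mapping is the point:

    def parse(utterance, known_instances):
        """Map one utterance onto a probability distribution over
        candidate semantic instances (toy bag-of-words scoring)."""
        words = set(utterance.lower().split())
        scored = [(sem, len(words & set(generate(sem).lower().split())))
                  for sem in known_instances]
        total = sum(score for _, score in scored) or 1  # avoid divide-by-zero
        return [(sem, score / total) for sem, score in scored]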
Consider an application that requires a natural language interface. The most laborious aspect of implementing this interface is not designing the data structures and algorithms required for the language model, but rather populating the mapping between utterances and semantic instances. In the case of generation by templates, all of the templates must be identified and entered into the system. In the case of statistical parsing, the system must be trained on a corpus of labeled training data (and, most concerning, someone needs to create that labeled data). Each training instance may include not only an utterance and a semantic instance, but also the derivation steps linking the two.
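One plausible shape for such a training instance, assuming the derivation is recorded as the sequence of template and slot-filling choices linking the semantics to the surface form (the field names here are invented for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class TrainingInstance:
        utterance: str            # the surface form
        semantics: dict           # the semantic instance
        derivation: list = field(default_factory=list)  # steps linking the two

    example = TrainingInstance(
        utterance="Flight UA90 from Newark is on time.",
        semantics={"frame": "flight_status", "flight": "UA90",
                   "origin": "Newark", "status": "on time"},
        derivation=["choose_template:flight_status", "fill:flight=UA90",
                    "fill:origin=Newark", "fill:status=on time"],
    )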
Assuming that you have a set of generation templates, perhaps it is not all that difficult to bootstrap the parsing model without manually labeling training data: have the generation model generate a random corpus (with derivations), then train the parsing model on that. Although the resulting parsing model will be most accurate on utterances covered by the generation model, one can hope that small deviations from this restricted domain of utterances will also be parsable.
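A sketch of that bootstrapping loop, again building on the toy generator above; train_parser stands in for whatever statistical learner the application actually uses, and the value lists are invented:

    import random

    FLIGHTS = ["UA90", "DL14", "AA7"]
    ORIGINS = ["Newark", "Atlanta", "Dallas"]
    STATUSES = ["on time", "delayed", "cancelled"]

    def random_semantics():
        """Sample one semantic instance from the toy domain."""
        return {"frame": "flight_status",
                "flight": random.choice(FLIGHTS),
                "origin": random.choice(ORIGINS),
                "status": random.choice(STATUSES)}

    def synthetic_corpus(n):
        """Generate n (utterance, semantics) pairs for parser training."""
        pairs = []
        for _ in range(n):
            semantics = random_semantics()
            pairs.append((generate(semantics), semantics))
        return pairs

    # parser = train_parser(synthetic_corpus(10000))  # hypothetical learner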
Once the parsing model is trained, it may be possible to make incremental extensions to the generation templates without hand-coding them. Provide the parser with semantic instance/utterance pairs that are almost covered by the generator. The parser, being statistical, will likely find a derivation for the utterance that matches a slightly different semantic instance. If the semantic differences are small enough, the system may be able to infer that one of the derivation steps could be replaced with a similar derivation step having a different semantic mapping. This new template could be added to the generator, extending both the generator's capabilities and the random corpus used for parser training.
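Here is a rough sketch of that inference step, under the simplifying assumption that "small enough" means the intended and parsed semantic instances differ in exactly one attribute; the differing value then becomes a candidate new filler for that slot:

    def induce_filler(intended, parsed):
        """If the two instances differ in exactly one attribute, return
        (attribute, new_value) as a candidate generator extension."""
        diffs = [key for key in intended if intended.get(key) != parsed.get(key)]
        if len(diffs) == 1:
            return diffs[0], intended[diffs[0]]
        return None  # too different to learn from safely

    # e.g. an intended status of "diverted" parsed as "delayed" suggests
    # that "diverted" is a new legal filler for the status slot.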
Although it is likely no faster to hand-craft a semantic instance/utterance pair than to hand-craft a template, it seems more plausible that a system could encounter such a pair at run time (and learn from it) than that it could encounter a fully formed generation template. This kind of co-training could form the basis for a language model that is defined in broad strokes by system designers and then incrementally refined after deployment.