In this blog post I present an idea for how human evaluation of translation quality can be incorporated into a method for determining the quality of systems that automatically translate from one (human) language to another.
Before getting started, it is appropriate to issue a spoiler alert! This post does not present any concrete results. The main point of this blog post is to present an idea that I (or someone else) will verify through experiments. In fact, the idea is not presented as an algorithm but as a method which can be implemented in a variety of ways.
The Problem
When evaluating the quality of automatically translated text it is not so easy to algorithmically specify what it means for a translation to be good or bad or maybe just OK. Yes, there are numerous measures that can be concocted that will indicate that a translation is in some sense close to a reference text. But, being close based on some measure such as edit distance or precision does not say much about the quality of the translation.
In this short blog post I outline a simple method that should generate quality scores that are comparable to quality scores created by human evaluation. I say "should" since I have only done a few experiments with a small amount of data (while at the same time learning about TensorFlow and neural networks).
There are a number of measurements that can be used to evaluate if a sentence is close to a reference sentence. Some measurements are:
- edit distance
- longest common subsequence
There are of course numerous other measures (or combinations of measures) that can be used to indicate whether a sentence is in one way or another close to another sentence.
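To make the two measures listed above concrete, here is a minimal sketch in plain Python. The function names `edit_distance` and `lcs_length` and the example sentences are my own choices for illustration. Both functions accept any sequence, so the same code gives a word-level measure when passed a list of words and a character-level measure when passed a string.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Made-up example sentences, split into words for word-level measures.
reference = "the cat sat on the mat".split()
translation = "the cat is on the mat".split()

word_distance = edit_distance(reference, translation)   # one word substituted
word_lcs = lcs_length(reference, translation)           # five words in common order
char_distance = edit_distance("kitten", "sitting")      # character-level use
```

Note that both numbers only say how similar the two word sequences are. Neither says whether the substituted word changed the meaning.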
The problem with such technical measures (or features) is that they say little about the quality of a translation. This should not come as a surprise since there is no algorithmic definition of what constitutes a good translation. From my experience working with computational linguists, the only method that can generate a reliable quality measure is to have a human evaluate the quality using his or her language skills.
I'll give an example that illustrates the problem with not being able to reliably evaluate translation quality. When comparing SMT (statistical machine translation) systems with NNT (neural network translation) systems (or any other type of system), a precision-based score called BLEU is often used as a quality measure. From computational linguists I have heard comments like: the NNT translation is much better but has about the same BLEU score as the SMT translation. Surprisingly, even though it is patently clear that BLEU is not a reliable measure of quality, it is still being used - mostly because it is the best there is.
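A small sketch can show why a precision-based score may say little about quality. The code below is not BLEU itself (BLEU combines several n-gram precisions with a brevity penalty). It only computes a clipped unigram precision, which is one of BLEU's ingredients. The function name and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def clipped_unigram_precision(reference, candidate):
    """Fraction of candidate words that also occur in the reference.

    Each candidate word is counted at most as often as it occurs
    in the reference (the "clipping" step).
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return matched / total


# Same words, reversed meaning: the score is still perfect.
reversed_meaning = clipped_unigram_precision("the dog bit the man",
                                             "the man bit the dog")

# Clipping stops a repeated word from being counted more than the
# reference allows ("the" occurs twice in the reference).
repeated_word = clipped_unigram_precision("the cat sat on the mat",
                                          "the the the the")
```

The first example gets a perfect score even though the translation says the opposite of the reference, which is exactly the kind of mismatch between score and quality described above.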
The Idea
Since we don't have an algorithmic (or formulaic) description of what constitutes translation quality, it's clear that human evaluation must play a part if we want to create a quality measure that makes sense. The question is: how can human evaluation be used in an automated method that calculates a translation quality score? Certainly, we do not want a human to evaluate every single translated sentence.
The solution is conceptually quite simple. Given a reference sentence and a translated sentence we evaluate a number of measurements (or features) over the sentence pair. The sentence pair is also evaluated and scored by a human. For example, the score could be a number pulled from a range - say 1 ... 10. The feature values together with the human evaluated score can now serve as training data for a neural network.
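As a rough sketch of what such training data could look like, the code below turns human-scored sentence pairs into feature vectors and target values. Everything specific here is an assumption for illustration: the two toy features (word overlap and length ratio), the function name `extract_features`, the example sentences, and the human scores. A real setup would use richer features such as edit distance.

```python
def extract_features(reference, translation):
    """Return a small feature vector for one sentence pair (toy features)."""
    ref_words = reference.split()
    hyp_words = translation.split()
    shared = len(set(ref_words) & set(hyp_words))
    # Share of distinct translated words that also appear in the reference.
    overlap = shared / len(set(hyp_words)) if hyp_words else 0.0
    # Translation length relative to reference length.
    length_ratio = len(hyp_words) / len(ref_words) if ref_words else 0.0
    return [overlap, length_ratio]


# (reference, translation, human score in 1..10) - made-up examples.
scored_pairs = [
    ("the cat sat on the mat", "the cat sat on the mat", 10),
    ("the cat sat on the mat", "a cat is on the mat", 7),
    ("the cat sat on the mat", "mat the on", 2),
]

# Network inputs: one feature vector per sentence pair.
inputs = [extract_features(ref, hyp) for ref, hyp, _ in scored_pairs]

# Network targets: human scores rescaled from 1..10 to 0..1.
targets = [(score - 1) / 9 for _, _, score in scored_pairs]
```

Rescaling the scores to the range 0..1 is a design choice rather than a requirement. It simply keeps the targets in a range that small networks tend to fit easily.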
Once the network has been trained, it can be used to evaluate the translation quality of one or more sentences (as long as the translated sentences have corresponding reference sentences). A set of features is first evaluated on the translated sentences and the reference sentences, after which the feature values are fed into the neural network. The output from the neural network is a quality score which should be comparable to a human evaluation of the translation.
Why should this approach work? The reasoning goes along the following lines. Each feature should in one way or another state something about translation quality (or lack thereof). It seems reasonable that some (non-linear) combination of the calculated features should capture what constitutes translation quality. The problem is that we don't know how to combine features so that we can calculate a meaningful quality score. A simple brute force way of creating a non-linear combination of features is to train a neural network so that it learns what translation quality means in terms of features.
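To illustrate what "learning a non-linear combination of features" can mean in practice, here is a minimal sketch of a network with one hidden layer, trained with plain gradient descent on squared error. It is written in pure Python so that it runs without any libraries, and it is a stand-in rather than the TensorFlow network from my experiment. The function name `train_network`, the hyperparameters, and the tiny data set (two features per sentence pair, scores scaled to 0..1) are all assumptions made for illustration.

```python
import math
import random


def train_network(inputs, targets, hidden=4, epochs=2000, rate=0.1, seed=0):
    """Train a one-hidden-layer network and return a prediction function."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Hidden layer with tanh activation, followed by a linear output unit.
        h = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
             for j in range(hidden)]
        y = sum(w2[j] * h[j] for j in range(hidden)) + b2
        return h, y

    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            h, y = forward(x)
            err = y - t  # gradient of 0.5 * (y - t)**2 with respect to y
            for j in range(hidden):
                # Backpropagate through tanh before w2[j] is updated.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= rate * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= rate * grad_h * x[i]
                b1[j] -= rate * grad_h
            b2 -= rate * err

    return lambda x: forward(x)[1]


# Made-up feature vectors and human scores (scaled to 0..1).
inputs = [[1.0, 1.0], [0.6, 1.0], [0.2, 0.5], [0.0, 0.3]]
targets = [1.0, 0.7, 0.2, 0.0]

predict = train_network(inputs, targets)
mse = sum((predict(x) - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)
```

A low error on four training examples shows only that the mechanics work. Whether such a network generalizes to unseen sentence pairs is exactly the open question of this post.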
Now, this may or may not work. Clearly we must include the right features, and deciding what the right features are is probably not a trivial task. However, it should not hurt to include features which are not essential, since such features should be deemed irrelevant during the training of the neural network. If we miss some essential features, it should show up as poor test results for the neural network.
What I have done so far
As mentioned earlier, this post is mostly about the idea or method - not about presenting the result of a full-blown experiment. Due to lack of time, I have only implemented a small experimental 3-layer neural network using TensorFlow, where the input was defined by the following 8 features:
- edit distance (word and character)
- longest common subsequence (word and character)
- longest common sequence (word and character)
- precision
- accuracy
In total I manually evaluated 100 sentence pairs, assigning a quality score to each one (which is probably not enough training data, not by a long shot). The quality scores together with the calculated features were used to train the neural network. The network was then used (in an ad-hoc, non-systematic fashion) to evaluate segments generated by a translation system against their corresponding reference sentences. Manual inspection suggested that the scores produced by the network were relatively close to the scores I assigned.
Unfortunately I have not had the time to gather more training data and do a proper, objective, systematic evaluation of the results. At this stage I consider the project a cool idea that needs further work.
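One possible shape for a more systematic evaluation is to hold back some human-scored sentence pairs, let the trained network score them, and compare the two sets of scores. The sketch below computes the mean absolute error and the Pearson correlation for that comparison. The score lists are made up for illustration and are not results from my experiment. The function names are my own.

```python
import math


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical held-out data: human scores and network scores (1..10).
human_scores = [9, 7, 3, 5, 8]
network_scores = [8.5, 6.0, 4.0, 5.5, 8.0]

mae = mean_absolute_error(human_scores, network_scores)
correlation = pearson(human_scores, network_scores)
```

The two numbers answer different questions: the error says how far off the scores are on average, while the correlation says whether the network ranks translations in roughly the same order as the human does.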
Hopefully I will have time to work on the idea presented here in the following weeks. I also hope to be able to present a systematic objective result from a trained quality evaluation network. Until then, if someone has ideas for why the method described here will or will not work, feel free to let me know.
A few random observations
A potential drawback of the method is that each language requires its own quality evaluation network. In practice, this means that training data for each language must be manually prepared, which involves a non-negligible amount of work. However, once a network has been trained, it should be usable when evaluating any automatic translation system.
A computational linguist colleague of mine suggested that it might be possible to use a quality evaluation network as an error function when training a translation system (SMT or NNT). If the method presented in this blog post works OK, the idea would be worth investigating.
Why did I choose a 3-layer design for the neural network - why not 2? The reason was rather simple: the idea was that the middle layer would create abstractions aligned with human-related language qualities - for example fluency and similarity. I don't have enough knowledge of neural network design to really say whether the idea is correct or not. Maybe after understanding more about neural networks I will be able to verify it.