Google’s BLEURT is BERT for Evaluating Natural Language Generation Models
The new method uses BERT pretrained models to evaluate the quality of the output of NLG models.
Natural language generation (NLG) is one of the fastest-growing areas of research in deep learning. NLG applications are all around us, in areas such as text summarization, question answering, translation and many others. One of the recurring challenges in the NLG space is evaluating the quality of models. Most methods today rely on human evaluation, which has obvious subjectivity and scale limitations. In a recent paper, Google Research proposed BLEURT, a transfer learning model that can achieve human-quality levels in the scoring of NLG systems.
The idea of BLEURT is to address some of the limitations of human evaluation of NLG systems while helping improve NLG models. Transformer architectures like Google's BERT achieved record levels in different natural language understanding (NLU) tasks. To do that, BERT had to build implicit knowledge about text quality. BLEURT tries to leverage these capabilities of BERT to develop a scoring method for NLG systems that matches human performance.
What Makes a Good NLG Sentence?
To understand the magnitude of the challenge BLEURT is trying to address, it might be helpful to first develop criteria for evaluating the quality of an NLG sentence. When presented with a specific text, how do we judge its quality? Many aspects come into consideration: fluency, clarity of the main idea, expressiveness, syntactic and semantic correctness, and dozens of others. Many of these criteria are inherently subjective, but it is still possible to develop quantitative metrics that approximate human judgment.
The rapid evolution of NLU systems prompted the creation of different metrics to evaluate their performance. In the machine translation space, one of the metrics that has seen the widest adoption is the bilingual evaluation understudy (BLEU). Conceptually, BLEU tries to quantify the quality of a text translated from one language to another. BLEU works by comparing individual translated segments with reference translations and averaging the results over the whole corpus for a final score. The result is a value from 0–1 (or 1–100) where a higher score represents higher similarity to the reference text. The following illustrates a sample BLEU evaluation. In that result, we can see that the first sentence clearly obtains the highest BLEU score but, interestingly enough, the third sentence obtains a higher score than the second one due to a better syntactic structure.
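To make the mechanics concrete, here is a toy Python sketch of BLEU's core idea: modified n-gram precision combined with a brevity penalty. It is deliberately simplified — standard BLEU uses up to 4-grams, multiple references and smoothing — and the sentences are made-up examples, not from the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, max_n=2):
    """Toy BLEU: geometric mean of modified n-gram precisions,
    scaled by a brevity penalty. Real BLEU goes up to 4-grams,
    supports multiple references, and applies smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

reference = "the cat sat on the mat"
exact_score = toy_bleu("the cat sat on the mat", reference)    # identical text
partial_score = toy_bleu("the cat is on the mat", reference)   # partial overlap
print(exact_score, partial_score)
```

An identical candidate scores 1.0, while a partially overlapping one lands strictly between 0 and 1, which is the behavior the averaged corpus-level score builds on.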
In addition to metrics like BLEU, the machine learning community has produced many reference datasets that can be used to evaluate the performance of NLG systems. The WMT Metrics Shared Task dataset is one of the most popular sources for benchmarking the performance of NLG tasks.
Now we have an idea of how human evaluation works for NLG systems. Can it be recreated with machine learning methods?
The fundamental challenge of creating a quality metric for NLG systems is that it should not only match human judgement but do so across all sorts of conversational domains. This challenge is even more relevant if we consider the small amount of training data available. Fortunately, Google has been at the forefront of some of the most impressive breakthroughs in language understanding and representation. BERT sparked a new wave of innovation in the NLU space with a Transformer-based architecture that was able to achieve state-of-the-art performance across different NLP tasks. BERT seems like an ideal candidate for leveraging unsupervised representations that could mitigate the absence of large training datasets for NLG evaluation tasks.
Architecturally, BLEURT leverages transfer learning from BERT pretrained models. Specifically, BLEURT relies on a technique to “warm up” BERT using a large number of synthetic sentence pairs before fine-tuning on human rating data. The BERT pretraining scheme has three key requirements:
1) The set of reference sentences should be large and diverse, so that the scoring metric is applicable to diverse NLG tasks.
2) The sentence pairs should be diverse in syntactic structure.
3) The pretraining objectives should be able to capture scenarios such as phrase omissions, substitutions or noise, which are common in NLG texts.
Following those requirements, the BLEURT model can be broken down into four fundamental steps. The initial step is the standard pretraining of BERT, followed by a second pretraining phase based on the synthetic pairs mentioned previously. The first pretraining cycle targets general language modeling objectives, while the second specifically targets NLG evaluation objectives.
After that, BLEURT is fine-tuned on a collection of publicly available human ratings. This is the source of BLEURT's biggest innovation. First, BLEURT executes this step after pretraining the BERT model, which has already developed a representation of the text. Second, instead of relying exclusively on collected human ratings like most NLG-scoring methods, BLEURT's pretraining also leans on automatic metrics such as BLEU, which can be easily applied across all sorts of NLG tasks.
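Conceptually, this fine-tuning step amounts to training a regression head on top of BERT's pooled sentence-pair representation, minimizing squared error against human ratings. The sketch below illustrates just that regression idea with plain SGD; the feature vectors and "ratings" are synthetic stand-ins for BERT embeddings and real human scores, which are not reproduced here.

```python
import random

random.seed(0)

DIM = 4  # stand-in for BERT's hidden size (768 in the real model)

# Toy training data: (feature_vector, rating) pairs. In BLEURT the
# features come from BERT's pooled representation of the
# (reference, candidate) pair; here we fabricate a linear target.
data = []
for _ in range(200):
    x = [random.uniform(-1, 1) for _ in range(DIM)]
    y = 0.5 * x[0] - 0.3 * x[1] + 0.1  # fabricated "human rating"
    data.append((x, y))

# Linear rating head: prediction = w . x + b, trained with SGD on MSE.
w = [0.0] * DIM
b = 0.0
lr = 0.1
for epoch in range(50):
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

mse = sum((sum(wi * xi for wi, xi in zip(w, x)) + b - y) ** 2
          for x, y in data) / len(data)
print(f"training MSE: {mse:.5f}")
```

In the real model the gradients also flow back into BERT itself, so the encoder's representation is adapted to the rating task rather than frozen as it is in this sketch.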
The final challenge is the training dataset itself. Instead of relying on the small datasets available, BLEURT generates a training dataset by introducing small perturbations into Wikipedia sentences and scoring the resulting pairs with signals such as BLEU.
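As an illustration of this idea, the sketch below produces synthetic sentence pairs by randomly dropping and swapping words. The actual paper uses richer perturbations (BERT mask-filling and backtranslation among them), and the probabilities and example sentence here are invented.

```python
import random

random.seed(7)

def perturb(sentence, drop_prob=0.15, swap_prob=0.1):
    """Generate a synthetic 'candidate' by lightly corrupting a
    reference sentence. Illustrative only: real BLEURT pretraining
    also uses mask-filling and backtranslation perturbations."""
    tokens = sentence.split()
    # Randomly drop tokens (simulates phrase omissions).
    tokens = [t for t in tokens if random.random() > drop_prob]
    # Randomly swap adjacent tokens (simulates word-order noise).
    for i in range(len(tokens) - 1):
        if random.random() < swap_prob:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

sentence = "BLEURT generates synthetic pairs by perturbing Wikipedia text"
pairs = [(sentence, perturb(sentence)) for _ in range(3)]
for ref, cand in pairs:
    print(cand)
```

Each (reference, perturbed) pair can then be scored with automatic metrics like BLEU, giving cheap pretraining signals at a scale no human-rated dataset could match.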
Google benchmarked BLEURT against alternative methods using the aforementioned WMT Metrics Shared Task dataset. The results showed that BLEURT achieved higher levels of performance as well as stronger correlation with human ratings.
Additionally, Google evaluated BLEURT on the well-known WebNLG challenge, which aims to generate text from RDF triples. In that scenario, BLEURT also outperformed alternative models in achieving high correlation with human ratings.
BLEURT is an interesting method that shows deep learning models can achieve human-level performance in highly subjective tasks such as evaluating the quality of text. Even more important is the fact that BLEURT might help accelerate the training and evaluation of a new generation of NLG methods without requiring large amounts of human ratings.