GPT2-Large Decoding Methods: Top-K and Top-p Sampling

計算巖土力學 (Computational Geomechanics)

August 20, 2021, 10:10

1 Introduction

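The earlier post "Comparing Decoding Methods for the GPT2-Large Model" showed that beam search works better than greedy search. This post goes on to compare two further decoding methods: Top-K sampling and Top-p sampling.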

2 Top-K sampling

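Fan et al. (2018) at Facebook introduced a simple but very effective sampling scheme, called Top-K sampling, in their paper "Hierarchical Neural Story Generation". In Top-K sampling, the K most likely next words are selected first, and the probability mass is then redistributed among just those K words. GPT2 adopted this sampling scheme, which is one of the reasons for its success in story generation. In practice, K is usually set to 40 or 50; this experiment uses top_k=50. With "landslide produced by earthquakes" again as the prompt, the generated result is as follows: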

landslide produced by earthquakes. The two parts of the landslide might be linked. "There was a huge change because of the tsunami," says Richard Zimba, a volcanologist at the Woods Hole Oceanographic Institution. "The sea level was a thousand meters higher than it is today. The whole of southern Utah was totally flooded, and the islands were lost." The landslide also set off a series of subsequent landslides and rockfalls that led back to its source. After the tsunami left much of the land in ruin, the landslide's debris washed out to sea. The watery disaster was triggered by the collapse of a massive tectonic plate, a gigantic column of crust that lies beneath the continental United States, Canada, Mexico, and Central America. "This is a classic example of a catastrophic collision," says Michael H. Hodge, a volcano expert at the British Columbia-based National Research Council.

The experiment suggests that, with Top-K sampling, leaving no_repeat_ngram_size unset may give more coherent content.
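The Top-K procedure described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the code GPT2 or any library actually runs: the function names `top_k_filter` and `sample_top_k` and the five-word toy distribution are made up for the example, and a real model works on logits over a vocabulary of tens of thousands of tokens.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest probabilities (ties go to the lower index).
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample_top_k(probs, k, rng=random):
    """Draw one token index from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical next-token distribution over a 5-word vocabulary.
toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(toy_probs, 2))   # only tokens 0 and 1 survive
print(sample_top_k(toy_probs, 2))   # a random draw, always 0 or 1
```

Note that the size of the candidate set is fixed at K regardless of how peaked or flat the distribution is, which is the limitation Top-p sampling addresses.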

3 Top-p sampling

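Top-p sampling builds on Top-K sampling. Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds p, and then redistributes the probability mass within that set. The size of the set can therefore grow and shrink dynamically with the next-word probability distribution. In practice, top_p is generally set to 0.95 or higher; this experiment uses top_p = 0.97. The result is as follows: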

landslide produced by earthquakes. The two parts of the landslide might be linked. "There was a huge change in the number of earthquakes, particularly on the coast, that produced an enormous amount of volcanic rock. And the amount of molten rock was a huge increase, that was going from the seafloor and then going towards the crust. Then suddenly, there were massive and sudden changes in the type of eruptions and other things. So that there is another link between the two. And if we can figure out exactly what that link is, we can come up with a way of getting rid of the massive amounts of volcanic rock." Another mystery is how the lava was brought to the surface. "When these giant flows start, they actually melt away under pressure, so they can't be brought to the surface. They just rise into the air in the night – this is why the Icelandic glacier is huge.
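To make the "dynamic set size" of Top-p sampling concrete, here is a small plain-Python sketch. The function name `top_p_filter` and both toy distributions are hypothetical, and the cutoff is taken as "cumulative probability reaches p", which is an assumption about the exact boundary convention.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches p, then renormalize that set to sum to 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Two hypothetical next-token distributions over the same 5-word vocabulary.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]   # the model is confident
flat = [0.2, 0.2, 0.2, 0.2, 0.2]          # the model is unsure

print(len(top_p_filter(peaked, 0.9)))  # 1
print(len(top_p_filter(flat, 0.9)))    # 5
```

With the same p, the confident distribution keeps a single candidate while the uncertain one keeps the whole vocabulary, whereas Top-K would keep exactly K candidates in both cases.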

4 Which method is better

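In theory, the decoding methods rank in quality as Top-p > Top-K > Beam > Greedy. In practice, however, Top-p can be combined with Top-K, which avoids very low-ranked words while still allowing some dynamic selection.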

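# Combine Top-K and Top-p sampling in a single generate() call.
# model and input_ids are assumed to be defined as in the earlier experiments.
outputs = model.generate(
    input_ids,
    max_length=200,
    early_stopping=True,  # only relevant to beam search; kept from the original call
    top_k=50,
    top_p=0.90,
    do_sample=True,
)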
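How the two filters interact when used together can be illustrated with a toy sketch. It assumes Top-K is applied first and Top-p is then applied to the renormalized remainder; that ordering is an assumption made for illustration, and the library's internals may differ. The function name `top_k_top_p_filter` and the six-token distribution are hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Apply top-k truncation, then top-p truncation, then renormalize."""
    if not 1 <= top_k <= len(probs):
        raise ValueError("top_k must be between 1 and the vocabulary size")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Step 1: top-k caps the candidate set at top_k tokens.
    order = order[:top_k]
    k_total = sum(probs[i] for i in order)
    # Step 2: top-p keeps the shortest prefix whose renormalized
    # cumulative probability reaches top_p.
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical 6-token distribution with a low-probability tail.
toy_probs = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
print(sorted(top_k_top_p_filter(toy_probs, top_k=4, top_p=0.8)))   # [0, 1, 2]
print(sorted(top_k_top_p_filter(toy_probs, top_k=2, top_p=0.99)))  # [0, 1]
```

In the first call Top-p does the trimming inside the Top-K set; in the second, Top-K acts as a hard cap even though p is close to 1. This is the sense in which the combination excludes very low-ranked words while still letting the set size vary.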

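That said, this conclusion is not absolute. Welleck et al. (2019), in their paper "Neural Text Generation with Unlikelihood Training", argue that the tendency of greedy search and beam search to produce repeated word sequences is caused by how the model is trained rather than by the decoding method. They also report that, in human evaluations of generated sentences, beam search can produce more fluent text than Top-p once the model's training objective is adjusted. Our own experiments are consistent with this. Moreover, fluency is necessary, but it is the content of the text that should matter most.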

5 Closing remarks

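Open-ended language generation is a fast-moving research field. Because language generation is probabilistic at its core, no single model or decoding method works everywhere; each problem has to be analyzed on its own terms. Let me close this note with a poem: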

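Autumn is the season of harvest

Hand in hand we will build a beautiful tomorrow

Let us join hands together

And walk together toward a bright and sunny spring

Our dreams will surely come true

May dreams stay forever by your side and mine

May the flowers of love bloom forever in our hearts

Note: this poem was generated by the gpt2-chinese-lyric model and is given here in English translation.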
