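Improvements to the Markov-chain (markovify) text-generation code
1 Introduction
Automatic text generation is a particularly interesting research area in natural language processing. There are currently two main ways to do it. The first is deep learning; a typical example is the "text-generation" pipeline in Transformers. Its theoretical basis is causal language modeling, its default model is GPT-2, and it uses Top-K sampling (see the note "Open-Ended Text Generation"). aitextgen builds on this and is somewhat more capable, but it did not seem possible to train it on our own data on a local machine, for reasons that remain unclear, so only Colab could be used. The second approach is the Markov chain (see the note "Generating new documents at random with a Markov chain"). This note briefly records the improvements made to geotech-markovify-text-generation.py, which raise the quality of the generated sentences.
2 Improvements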
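Although the deep-learning Transformers approach uses a large model, GPT-2, tests show that it does not give satisfactory results for our specialised domain. The main reason is that these models contain no domain knowledge base, so the generated sentences are disordered and illogical. This is the main reason we put effort into reworking the Markov chain instead. At the same time, a large, heterogeneous dataset cannot produce sentences that are both reasonable and strongly logical, whereas a dataset with a clear topic produces meaningful sentences more easily. The first improvement is therefore to merge in the algorithm from geotech-flashtext-passages.py: topic keywords are used to aggregate a small dataset, and this dataset becomes the input file for the Markov chain.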
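The keyword-driven aggregation step can be sketched roughly as follows. This is a minimal illustration, not the algorithm in geotech-flashtext-passages.py: it swaps in plain case-insensitive substring matching from the standard library in place of flashtext, and the function name select_passages, the min_hits parameter and the sample text are all hypothetical.

```python
import re

def select_passages(text, keywords, min_hits=1):
    """Return the paragraphs of `text` that mention at least `min_hits` keywords."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    selected = []
    for para in paragraphs:
        body = para.lower()
        # Count how many distinct keywords occur in this paragraph.
        hits = sum(1 for k in lowered if k in body)
        if hits >= min_hits:
            selected.append(para)
    return selected

# Hypothetical sample corpus with one off-topic paragraph.
sample = (
    "Rock slope failure often follows a step-path through intact rock bridges.\n\n"
    "The canteen opens at noon on weekdays.\n\n"
    "Wedge failure is a common rock slope failure mode in open pits."
)
corpus = select_passages(sample, ["rock slope failure", "step-path"])
print("\n\n".join(corpus))
```

The selected paragraphs would then be written to a file and passed to the Markov chain as its training text.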
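The second improvement is a text-cleaning routine that removes clutter from the file, including blank lines, meaningless characters, and sentences shorter than a given length.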
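A cleaning routine of this kind might look like the sketch below. The exact rules in the original code are not given, so the choices here are assumptions: anything outside printable ASCII is treated as noise (reasonable only for an English corpus), sentences are split on terminal punctuation, and min_words is a hypothetical length threshold.

```python
import re

def clean_text(raw, min_words=5):
    """Drop blank lines, strip stray characters, and discard short sentences."""
    kept = []
    for line in raw.splitlines():
        # Assumption: the corpus is English, so non-printable-ASCII is noise.
        line = re.sub(r"[^\x20-\x7E]", " ", line)
        # Collapse the whitespace runs left behind by the substitution.
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue  # skip blank lines
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", line):
            if len(sentence.split()) >= min_words:
                kept.append(sentence)
    return "\n".join(kept)

# Hypothetical input: a caption stub, a blank line, bullet debris, real text.
raw = "Fig. 3\n\n\u2022\u2022\u2022\nThe slope failed along a persistent joint set. See above.\n"
cleaned = clean_text(raw)
print(cleaned)
```

A naive punctuation split also breaks on abbreviations such as "Fig.", which is harmless here because the resulting fragments fall below the length threshold.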
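The third improvement adds two classes to the code, POSifiedText_Spacy and POSifiedText_NLTK, which refine the current markovify.Text method. POSifiedText_Spacy uses the latest en_core_web_lg model. The advantage of this change is a large improvement in the quality of the generated sentences. The drawback is that run time increases on large datasets, especially with POSifiedText_Spacy: in a test on a 40M dataset, training took close to 50 minutes.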
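Why part-of-speech tagging helps can be shown with a small standard-library sketch. It does not use markovify, spaCy or NLTK, and it is not the author's implementation: toy_tag is a hypothetical stand-in for a real tagger, and the chain is a hand-rolled first-order model. The word_split and word_join names mirror the two methods that markovify documents as the ones to override in a markovify.Text subclass. The point is that each state becomes "word::TAG", so a word used as a noun and the same word used as a verb no longer share transitions.

```python
import random
from collections import defaultdict

DETERMINERS = {"the", "a", "an"}

def toy_tag(words):
    """Hypothetical rule-based tagger standing in for spaCy or NLTK."""
    tags = []
    for i, word in enumerate(words):
        lower = word.lower()
        if lower in DETERMINERS:
            tags.append("DET")
        elif i > 0 and words[i - 1].lower() in DETERMINERS:
            tags.append("NOUN")  # crude rule: word right after a determiner
        elif lower.endswith("s") or lower.endswith("ed"):
            tags.append("VERB")  # crude rule: inflected ending
        else:
            tags.append("X")
    return tags

def word_split(sentence):
    """Turn a sentence into 'word::TAG' tokens."""
    words = sentence.split()
    return [f"{w}::{t}" for w, t in zip(words, toy_tag(words))]

def word_join(tokens):
    """Strip the tags to recover plain text."""
    return " ".join(t.split("::")[0] for t in tokens)

def build_chain(sentences):
    """First-order chain over tagged tokens with start and end markers."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + word_split(sentence) + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_words=30):
    """Random walk from the start marker until the end marker."""
    state, out = "<s>", []
    while len(out) < max_words:
        state = rng.choice(chain[state])
        if state == "</s>":
            break
        out.append(state)
    return word_join(out)

training = ["the slopes failed quickly", "the ground slopes away"]
chain = build_chain(training)
print(chain["slopes::NOUN"], chain["slopes::VERB"])
print(generate(chain, random.Random(0)))
```

Without the tags, "slopes" would be a single state followed by either "failed" or "away", which allows output such as "the ground slopes failed quickly". With tags the two uses stay separate. Running a real tagger over every training sentence is also a plausible source of the longer training time noted above.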
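The current code therefore contains three sentence-generation methods. If each method is set to produce 5 sentences, a single run produces 15 sentences.
3 Test example
As a test example, a small dataset is first aggregated for the topic "rock slope failure", and then geotech-markovify-text-generation.py is run. One small improvement is that, given a word, the code can list every word that directly follows it in the dataset. For example, the nouns that follow "stability" are: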
analysis
issues
assessment
conditions
evaluation
curves
problem
prospective
approaches
field
charts
calculations
computations.
models
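The next-word lookup shown in the list above can be sketched with a bigram successor table. This is an illustrative standard-library version, not the original code: build_successors and next_words are hypothetical names, and unlike the list above it returns all following words, not only nouns, because no part-of-speech filter is applied.

```python
import re
from collections import Counter, defaultdict

def build_successors(text):
    """Map each word to a Counter of the words that directly follow it."""
    successors = defaultdict(Counter)
    # Split on sentence-ending punctuation so pairs never span two sentences.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z][a-z\-]*", sentence.lower())
        for current, following in zip(words, words[1:]):
            successors[current][following] += 1
    return successors

def next_words(successors, word):
    """Return the followers of `word`, most frequent first."""
    return [w for w, _ in successors.get(word.lower(), Counter()).most_common()]

# Hypothetical sample text.
sample = (
    "Stability analysis was performed. The stability charts were used. "
    "Stability analysis is required."
)
successors = build_successors(sample)
print(next_words(successors, "stability"))
```

Sorting by frequency puts the most typical collocations first, which suits the teaching and writing uses mentioned in the text.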
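This feature can support teaching as well as paper writing. Next, sentences related to "stability analysis" are generated. Each method is set to produce 5 sentences, so 15 sentences are generated in total.
According to the word cloud, the top keywords of this dataset are: step-path, rock bridge, path failure, rock slope, rock mass, failure mode, intact rock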
[1] stability analysis through consideration of the role of stress-induced damage on slope performance.
[2] stability analysis is a conceptual illustration of possible rock slope investigations and finds application in a 3D to a stress-dependent failure mechanism is of great interest in rock slopes --- As large open pit mine.
[3] stability analysis is statically indeterminate and the overall block stability was assessed for 12 metre bench heights using planar and wedge failures.
[4] stability analysis of rock slopes, it is becoming increasingly necessary to consider the interaction between intact rock bridge content or percentage remains one of the open pit, notably in terms of expected breakback angles using a stiff modular applied static loading to fulfill visual excavation to the unfavourable orientations of discontinuities.
[5] stability analysis of planar, wedge and stepped path failures were presented in terms of these limitations with respect to block forming potential and kinematics.
[6] stability analysis , performed using the hybrid FDEM code , ELFEN with fracture mechanics criteria , is moving under the assumption of fully continuous lateral releases , or whether the planes are located so that they actually intersect behind the slope along the line of intersection .
[7] stability analysis used the overlay linear - element process based on the determination of relationships between tension cracks on the stability of rock mass was relatively poor , the dip , dip direction , nature and type of joint coalescence is considered conservative compared to the intersection of the current geographic condition the stability of rock slope instability provided enough block size in the orthogneiss rock unit Two new stereographic projection methods in the model simulations .
[8] stability analysis --- The importance of 3D step - path discontinuities and intact rock fractures and step - path failure are presented in Chapter 7 where step - path failure are important for the East Wall are going to be performed .
[9] stability analysis package using the limit equilibrium methods exist incorporating step - path failure .
[10] stability analysis for the idealised slope geometry .
[11] stability analysis is statically indeterminate and the collapse manifold was planar or wedge failure.
[12] stability analysis of rock slopes---Field data collection of the slope because, for example, a wedge resting on two intersecting discontinuities is of great interest in rock slopes---Wedge analyses for sandstones and quartzites have also been carried out using the SWEDGE software, allowed the identification of fracture propagation using both the hybrid code ELFEN in modelling and highly affects the stability of the rock mass dilation in facilitating the slope toe leading to failure of rock slopes.
[13] stability analysis package using the wedge stability with wedge failure and should be aware of these methods, and rock mass through fracture initiation, propagation and coalescence.
[14] stability analysis and slope monitoring data emphasising the control of fracture initiation and propagation.
[15] stability analysis tool Universal Distinct Element Code Visage is a conceptual large open pit slopes.
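4 Concluding remarks
This note records the main improvements made to geotech-markovify-text-generation.py. The quality of the generated sentences is considerably better than with the methods used before, but the algorithm still needs further work, for example automatically identifying the grammatical relations in each generated sentence and correcting those that are wrong.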