Keyword Extraction: How PyTextRank and spaCy Work

Computational Geomechanics

July 26, 2021, 13:58

1 Introduction

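Because I have a research report to write, most of this summer's posts will be about natural language processing. PyTextRank (see the earlier post "PyTextRank: Automatic Extraction of Text Keywords") is an extension to the spaCy pipeline. It is used for graph-based natural language processing, for building knowledge graphs in practice, and for extracting key phrases and summaries. Its basic procedure is to use spaCy to extract the noun phrases in a text and then rank those phrases with the TextRank algorithm. This note examines the results produced by PyTextRank and by spaCy to decide whether spaCy still needs to be called separately in the keyword-extraction step, with the aim of streamlining the code. The libraries and models used in the test are: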

pytextrank V3.1.1

spaCy V3.0.6 (latest version: V3.1.1)

en_core_web_md (V3.0.0 7/23/2021)

en_core_web_lg (V3.0.0 7/23/2021)
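The two-step procedure described in the introduction (candidate phrases first, then graph-based ranking) can be illustrated without any NLP library. The sketch below is a simplified stand-in and not PyTextRank's implementation: it assumes the candidate phrases are already given, links words that occur close together (the window size is an arbitrary assumption), runs a plain power-iteration PageRank, and scores each phrase by the summed rank of its words. All function names and parameters are illustrative.

```python
from collections import defaultdict


def build_graph(words, window=3):
    """Link every pair of distinct words that occur fewer than `window` positions apart."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != w:
                graph[w].add(other)
                graph[other].add(w)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected, unweighted graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour m passes an equal share of its rank to n.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def rank_phrases(phrases, words, window=3):
    """Order candidate phrases by the summed PageRank of their words (highest first)."""
    rank = pagerank(build_graph(words, window))
    scored = [(sum(rank.get(w, 0.0) for w in p.split()), p) for p in phrases]
    return [p for _, p in sorted(scored, reverse=True)]
```

Words that co-occur with many other words collect more rank, so phrases built from such words move toward the top of the list.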

[Figure 1]

2 Text Preparation

This test uses the following English text.

text="Analyses by numerical methods are performed using the Fast Langrangian Analysis of Continua (FLAC), FLAC3D, Universal Distinct Element Code (UDEC), and 3DEC computer codes. From 1994 to 1997, FLAC was the most commonly used software for slope-stability analysis. In order to achieve a better representation of the real conditions, it was necessary to include explicitly in the model numerous major structures with several intersections. As the number of these explicit structures and their intersections increased, it was more and more difficult to construct the model. Due to this and the need to include explicitly all major structures, in 1998 the numerical analyses began to be done using UDEC, which allows an easier “handling” of the structures. In certain special cases, three-dimensional numerical models are used. Due to the larger engineering resources required by these three-dimensional models, their use is less frequent than the two-dimensional models. In 1998, 3DEC was used to develop a three-dimensional model of the southern sector of the Chuquicamata Mine. This was used, together with two-dimensional models and in situ observations, to predict the evolution of the subsidence that will affect the sector from 1999 to 2008."

3 PyTextRank Results

This test (geotech-PyTextRank.py) used the en_core_web_lg model (741 MB). A total of 25 key phrases were extracted; the 10 highest-ranked phrases are:

numerous major structures
several intersections
numerical methods
situ observations
the Chuquicamata Mine
three-dimensional numerical models
two-dimensional models
all major structures
the southern sector
slope-stability analysis

The results from en_core_web_sm and en_core_web_lg were also compared, and no major differences were found.
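The script geotech-PyTextRank.py is not reproduced in this post, so the following is only a minimal sketch of how such a run could look, assuming pytextrank 3.x, where importing the package registers a "textrank" pipeline component and the ranked phrases are exposed as doc._.phrases. The function names extract_key_phrases and top_phrases are hypothetical helpers, not part of either library.

```python
def extract_key_phrases(text, model_name="en_core_web_lg"):
    """Return the ranked phrase objects that PyTextRank attaches to the Doc."""
    import spacy       # imported lazily so top_phrases() works without spaCy
    import pytextrank  # noqa: F401  (the import registers the "textrank" factory)

    nlp = spacy.load(model_name)
    nlp.add_pipe("textrank")  # append PyTextRank to the spaCy pipeline
    doc = nlp(text)
    return doc._.phrases      # phrase objects, highest rank first


def top_phrases(phrases, n=10):
    """Format the first n phrases as (text, rank rounded to 4 decimals)."""
    return [(p.text, round(p.rank, 4)) for p in list(phrases)[:n]]
```

Calling top_phrases(extract_key_phrases(text)) would list the ten highest-ranked phrases. The exact ranks, and possibly the order, depend on the library and model versions.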

4 spaCy Results

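Loading the same model with spaCy alone yields the same noun phrases as PyTextRank, which shows that PyTextRank does not process the phrases produced by spaCy any further, apart from ranking them. spaCy extracts these phrases with doc.noun_chunks, which works as follows: it iterates over the base noun phrases in the document and, if the document has been syntactically parsed, yields each base noun phrase as a Span object. A base noun phrase, or "NP chunk", is a noun phrase that does not allow other NPs to be nested inside it, so it contains no NP-level coordination, no prepositional phrases, and no relative clauses.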
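The comparison described above could be sketched as follows. This is an illustrative reconstruction rather than the author's script: noun_chunk_texts and compare_phrase_sets are hypothetical helper names, and only spacy.load and doc.noun_chunks are actual spaCy calls.

```python
def noun_chunk_texts(text, model_name="en_core_web_lg"):
    """Return the text of every base noun phrase spaCy finds in `text`."""
    import spacy  # lazy import: only needed when this function is called

    nlp = spacy.load(model_name)
    doc = nlp(text)  # noun_chunks relies on the dependency parse
    return [chunk.text for chunk in doc.noun_chunks]


def compare_phrase_sets(pytextrank_phrases, spacy_chunks):
    """Return (only_in_pytextrank, only_in_spacy), compared case-insensitively."""
    a = {p.lower() for p in pytextrank_phrases}
    b = {c.lower() for c in spacy_chunks}
    return sorted(a - b), sorted(b - a)
```

If both returned lists are empty, the two tools produced the same set of phrases. Whether that holds exactly may depend on the versions and settings in use.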

doc = nlp(text)

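The Doc class is a container for accessing linguistic annotations. The Token class, in turn, exposes part-of-speech tags: filtering on token.pos_ == "VERB" gives the list of unique verbs in this text (shown in base form): ['achieve', 'affect', 'allow', 'begin', 'construct', 'develop', 'do', 'include', 'increase', 'perform', 'predict', 'require', 'use'].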
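A verb list of this kind could be produced along the following lines. The post does not show the code, so this sketch assumes the list was built from token.lemma_ (the entries are base forms such as 'begin' and 'do'); the helper name unique_verb_lemmas is hypothetical.

```python
def unique_verb_lemmas(tokens):
    """Collect the sorted, de-duplicated lemmas of all tokens tagged VERB."""
    return sorted({tok.lemma_ for tok in tokens if tok.pos_ == "VERB"})


# Usage with a parsed document (a spaCy Doc iterates over its Token objects):
#   doc = nlp(text)
#   print(unique_verb_lemmas(doc))
```

Collecting the lemmas into a set removes repeats such as "used" and "using", and sorting gives a stable alphabetical order.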

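spaCy's entity recognition (doc.ents) labels several of the software names as ORG and mislabels others, as the output below shows. This can be addressed by defining custom entity labels in code, which will be covered in a later post.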

the Fast Langrangian Analysis ORG

Continua PERSON

FLAC3D CARDINAL

UDEC ORG

3DEC NORP

1994 to 1997 DATE

1998 DATE

UDEC ORG

three CARDINAL

two CARDINAL

1998 DATE
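One possible way to define such custom entities is spaCy's rule-based EntityRuler. The sketch below is an assumption about how this could be done, not the approach the author later describes: the label "SOFTWARE" and both helper names are hypothetical, and the pipeline is assumed to contain an "ner" component, as the en_core_web models do.

```python
def software_patterns(names, label="SOFTWARE"):
    """Build one exact-phrase EntityRuler pattern per software name."""
    return [{"label": label, "pattern": name} for name in names]


def add_software_ruler(nlp, names):
    """Insert an EntityRuler ahead of the statistical NER component."""
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns(software_patterns(names))
    return nlp


# Usage, assuming `nlp` was created with spacy.load("en_core_web_lg"):
#   add_software_ruler(nlp, ["FLAC", "FLAC3D", "UDEC", "3DEC"])
#   for ent in nlp(text).ents:
#       print(ent.text, ent.label_)
```

Placing the ruler before "ner" means the rule-based matches are set first, and the statistical recognizer then works around the entities that are already marked.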