
Web Page Parsing: A Concise Guide to BeautifulSoup

博集華仿

October 11, 2019, 15:14


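Abstract: After fetching a page's HTML document, you have to parse it before you can extract the text you need. In the author's opinion, BeautifulSoup is the most convenient tool for parsing web pages.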

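00 Installing the bs4 library

pip install beautifulsoup4 lxml

The example in the next section also imports requests and chardet, and it uses the lxml parser, so those packages must be installed as well.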

01 Parsing HTML


import requests
import chardet
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozillaxxxxxxxxxx'}
link = 'https://xxxxxxxxxxxxxx'
res = requests.get(link, headers=headers, timeout=10)
# Detect the encoding from the raw bytes so that res.text decodes correctly
res.encoding = chardet.detect(res.content)['encoding']
soup = BeautifulSoup(res.text, 'lxml')  # parse the response text with BeautifulSoup

Take a look at soup:

print(soup)


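The output looks much like the HTML you see when viewing the page source in a browser. For more readable indentation, prettify() is normally used: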

print(soup.prettify())
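
To see the difference without fetching a real page, the following minimal sketch parses a tiny hand-written snippet. The snippet and the variable names are assumptions made for illustration, and the standard-library html.parser is used here instead of lxml so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML snippet, not taken from the article
html = "<p><b>Hi</b></p>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # markup on a single line, as parsed
pretty = soup.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)
```

prettify() is meant for reading, not for further processing, because it inserts whitespace that was not in the original document.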


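What BeautifulSoup really does is convert the HTML document into a tree structure. The tree contains four kinds of objects: Tag, NavigableString, Comment, and BeautifulSoup. A Tag object corresponds to a tag in the original HTML; a NavigableString object is a piece of text in the original HTML; a Comment object is a special type of NavigableString; and the BeautifulSoup object represents the document as a whole. The two most important are Tag and NavigableString.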
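
The four object types can be checked directly on a small document. This is a sketch under assumptions: the HTML string and the variable names are made up for illustration, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet containing a tag, some text, and a comment
html = "<p class='title'><b>Hello</b><!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # Tag: the <p> element
text = soup.b.string          # NavigableString: the text inside <b>
comment = soup.p.contents[1]  # Comment: the second child of <p>

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(isinstance(comment, NavigableString))  # True: Comment is a subclass
```

Because a Comment is also a NavigableString, code that collects text usually needs an explicit isinstance(node, Comment) check to skip comments.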

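02 Locating elements (traversal)

Navigating by tag name alone locates only the first matching element:

soup.body.a

Locate all direct children (.contents is a list; .children is an iterator over the same nodes):

soup.body.contents
soup.body.children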

To pick a single child by position, index into the list: soup.body.contents[2] (.children is an iterator and cannot be indexed).

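Locate all descendants (children, their children, and so on):

soup.body.descendants

Locate the parent node: soup.body.parent

Locate all ancestors: soup.body.parents

Locate siblings: next_sibling, previous_sibling, next_siblings, previous_siblings

Locate the next or previous object in parse order: next_element, next_elements, previous_element, previous_elements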
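
The traversal attributes above can be tried on a small document. The following is a minimal sketch: the HTML is a hypothetical example written without whitespace between tags (so that no whitespace-only text nodes appear among the children), and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical document, written compactly to avoid whitespace text nodes
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p id='first'><a href='/one'>One</a><a href='/two'>Two</a></p>"
    "<p id='second'>Tail</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

first_link = body.a              # first <a> anywhere under <body>
children = body.contents         # list of direct children: the two <p> tags
second_child = body.contents[1]  # index the list, not the .children iterator

# Tags have a name; text nodes have name None and are skipped here
descendant_names = [n.name for n in body.descendants if n.name]

parent_name = first_link.parent.name
ancestor_names = [t.name for t in first_link.parents]

next_link = first_link.next_sibling  # the sibling <a href='/two'>
next_elem = first_link.next_element  # the text "One" inside the first <a>

print(descendant_names)  # ['p', 'a', 'a', 'p']
print(ancestor_names)    # ['p', 'body', 'html', '[document]']
```

Note the difference between next_sibling and next_element: the former stays on the same level of the tree, while the latter follows parse order and therefore steps into the tag's own content first.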

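03 Locating elements (searching)

This article covers soup.find_all(); the other search methods work analogously. The regular-expression examples require import re.

By tag name: soup.find_all('b')

By a regular expression on the tag name: soup.find_all(re.compile("^b"))

By a list of tag names, matching any of them: soup.find_all(["a", "b"])

With True, matching every tag: soup.find_all(True)

By attribute: soup.find_all(id='link2'); soup.find_all(attrs={"data-foo": "value"})

By a regular expression on an attribute: soup.find_all(href=re.compile("elsie"))

With True for an attribute, matching any tag that has it: soup.find_all(class_=True)

By several attributes at once: soup.find_all(href=re.compile("elsie"), id='link1')

By tag name and attribute together: soup.find_all("a", class_="sister")

By text content: soup.find_all(string="Elsie")

By a regular expression on the text: soup.find_all(string=re.compile("Dormouse"))

By a list of texts: soup.find_all(string=["Tillie", "Elsie", "Lacie"])

With limit, capping the number of results: soup.find_all("a", limit=2)

With recursive=False, searching direct children only: soup.html.find_all("title", recursive=False)

The difference between soup.find() and find_all(): find() returns only the first match (or None if there is none), whereas find_all() returns a list of all matches.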
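
The calls above can be run against a sample document. The sketch below assumes a shortened version of the "three sisters" sample from the official Beautiful Soup documentation, which the ids and names used in this section appear to come from; the document text and variable names are assumptions, not part of the original article, and html.parser is used instead of lxml.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample document (shortened from the Beautiful Soup docs)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Tag-name filters: plain string, regular expression, list
bold = soup.find_all("b")
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
links_and_bold = soup.find_all(["a", "b"])

# Attribute filters
by_id = soup.find_all(id="link2")
by_href = soup.find_all(href=re.compile("elsie"))
sisters = soup.find_all("a", class_="sister")

# Text filters return NavigableString objects, not tags
names = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
dormouse = soup.find_all(string=re.compile("Dormouse"))

# limit caps the result count; recursive=False looks at direct children only
first_two = soup.find_all("a", limit=2)
direct_titles = soup.html.find_all("title", recursive=False)

# find() returns a single element, or None when nothing matches
first_link = soup.find("a")
missing = soup.find("table")

print(starts_with_b)                # ['body', 'b']
print([a["id"] for a in sisters])   # ['link1', 'link2', 'link3']
print(direct_titles, missing)       # [] None
```

Results come back in document order regardless of the order of the filter list, and direct_titles is empty because <title> sits inside <head> rather than directly under <html>.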

04 Locating elements (CSS selectors)

By tag: soup.select("body"); soup.select("p > a")

By position among siblings of the same type: soup.select("p > a:nth-of-type(2)")

By #id: soup.select("#link1")

By .class: soup.select(".sister")

Several selectors at once: soup.select("#link1,#link2")

Combined selectors: soup.select("p > #link1"); soup.select("a#link2"); soup.select('a[href]')

By attribute value (prefix, substring, suffix):

soup.select('a[href^="http://example.com/"]')
soup.select('a[href*=".com/"]')
soup.select('a[href$="tillie"]')

First match only: soup.select_one(".sister")
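
The selectors above can be exercised on a compact sample. This is a sketch under assumptions: the three-link paragraph is a hypothetical reduction of the sample used in the Beautiful Soup documentation, the variable names are illustrative, and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup

# Assumed sample: one paragraph holding three links
html = (
    '<body><p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Return the id attribute of each tag, for compact printing."""
    return [t["id"] for t in tags]

child_links = ids(soup.select("p > a"))               # direct children of <p>
second = ids(soup.select("p > a:nth-of-type(2)"))     # second <a> sibling
by_class = ids(soup.select(".sister"))
either = ids(soup.select("#link1,#link2"))            # comma means "or"
prefix = ids(soup.select('a[href^="http://example.com/"]'))
suffix = ids(soup.select('a[href$="tillie"]'))

# select() always returns a list; select_one() returns one tag or None
first = soup.select_one(".sister")
nothing = soup.select_one(".brother")

print(second)       # ['link2']
print(either)       # ['link1', 'link2']
print(suffix)       # ['link3']
print(first["id"])  # link1
```

select() and select_one() mirror find_all() and find(): the first returns a list (possibly empty), the second a single tag or None.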
