BeautifulSoup tutorial

This is an introductory tutorial to the BeautifulSoup Python library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.

BeautifulSoup

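BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as Tag, NavigableString, or Comment.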
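To make the object tree concrete, the following minimal sketch parses a one-line snippet and inspects the types of the resulting nodes. The snippet is a made-up example, not the tutorial's index.html, and it uses Python's built-in html.parser (an assumption made so the sketch runs without lxml).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup, chosen to contain one tag, one string and one comment.
html = "<p>Hello<!-- a note --></p>"

# html.parser ships with Python; the tutorial's own examples use lxml.
soup = BeautifulSoup(html, "html.parser")

p = soup.p
text_node, comment_node = p.contents

print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment
```

Comment is a subclass of NavigableString, so code that handles strings generically should check for Comment first when the two need different treatment.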

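Installing BeautifulSoup

We use the pip3 command to install the necessary modules.

    $ sudo pip3 install lxml

This installs the lxml module, which BeautifulSoup uses as its parser in these examples.

    $ sudo pip3 install bs4

The above command installs BeautifulSoup.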

HTML file

In the examples, we will use the following HTML file:

index.html

    <!DOCTYPE html>
    <html>
        <head>
            <title>Header</title>
            <meta charset="utf-8">
        </head>

        <body>
            <h2>Operating systems</h2>

            <ul id="mylist" style="width:150px">
                <li>Solaris</li>
                <li>FreeBSD</li>
                <li>Debian</li>
                <li>NetBSD</li>
                <li>Windows</li>
            </ul>

            <p>
              FreeBSD is an advanced computer operating system used to
              power modern servers, desktops, and embedded platforms.
            </p>

            <p>
              Debian is a Unix-like computer operating system that is
              composed entirely of free software.
            </p>
        </body>
    </html>

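BeautifulSoup simple example

In the first example, we use the BeautifulSoup module to get three tags.

simple.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    # Read the whole HTML file into a string.
    with open("index.html", "r") as f:
        contents = f.read()

    # Parse the markup with the lxml parser.
    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)

The code example prints the HTML code of three tags.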

    from bs4 import BeautifulSoup

We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class that does the work.

    with open("index.html", "r") as f:
        contents = f.read()

We open the index.html file and read its contents with the read() method.

    soup = BeautifulSoup(contents, 'lxml')

A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument specifies the parser.
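The parser name is just a string, so it can be chosen at run time. The sketch below is one possible way to prefer lxml and fall back to Python's built-in html.parser when lxml is not installed. The helper name make_soup is hypothetical; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser is part of the standard library.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<h2>Operating systems</h2>")
print(soup.h2.text)  # Operating systems
```

The two parsers can build slightly different trees for incomplete or broken markup, so a fallback like this is only a convenience for well-formed input.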

    print(soup.h2)
    print(soup.head)

Here we print the HTML code of two tags: h2 and head.

    print(soup.li)

There are multiple li elements; the line prints the first one.

    $ ./simple.py
    <h2>Operating systems</h2>
    <head>
    <title>Header</title>
    <meta charset="utf-8"/>
    </head>
    <li>Solaris</li>

This is the output.

BeautifulSoup tags, name, text

The name attribute of a tag gives its name, and the text attribute gives its text content.

tags_names.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print("HTML: {0}, name: {1}, text: {2}".format(soup.h2,
        soup.h2.name, soup.h2.text))

The code example prints the HTML code, the name, and the text of the h2 tag.

    $ ./tags_names.py
    HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems

This is the output.
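Besides name and text, a tag also exposes its HTML attributes. The following sketch goes slightly beyond the example above and reads the attributes of the ul element; the markup is a shortened, hypothetical version of index.html, and html.parser is used so that the sketch has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the ul element of index.html.
html = '<ul id="mylist" style="width:150px"><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul

print(ul.name)           # ul
print(ul["id"])          # mylist
print(ul.get("class"))   # None, because the attribute is absent
print(ul.attrs)          # {'id': 'mylist', 'style': 'width:150px'}
print(ul.get_text(" "))  # Solaris FreeBSD
```

Indexing with ul["id"] raises KeyError for a missing attribute, while ul.get() returns None, which is usually the safer choice on scraped pages.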

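BeautifulSoup traverse tags

With the recursiveChildGenerator() method, we traverse the HTML document. It is a legacy name kept for backward compatibility; the descendants attribute yields the same nodes.

traverse_tree.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for child in soup.recursiveChildGenerator():

        # Text nodes have no name; print tags only.
        if child.name:
            print(child.name)

The example traverses the document tree and prints the names of all HTML tags.

    $ ./traverse_tree.py
    html
    head
    title
    meta
    body
    h2
    ul
    li
    li
    li
    li
    li
    p
    p

These are the tags in the HTML document.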
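The same traversal can be sketched with the descendants attribute and an explicit type check. The helper name tag_names and the compact markup are assumptions made for illustration, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact, hypothetical document; no whitespace between tags.
html = ("<html><head><title>Header</title></head>"
        "<body><h2>Operating systems</h2></body></html>")
soup = BeautifulSoup(html, "html.parser")


def tag_names(root):
    """Return the names of all tags below root, in document order."""
    return [node.name for node in root.descendants if isinstance(node, Tag)]


names = tag_names(soup)
print(names)  # ['html', 'head', 'title', 'body', 'h2']
```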

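BeautifulSoup element children

With the children attribute, we can get the children of a tag.

get_children.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.html

    # Skip text nodes (their name is None) and keep tag names only.
    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)

The example retrieves the children of the html tag, places them into a Python list, and prints them to the console. Since the children attribute also returns the whitespace between the tags, we add a condition to include only the tag names.

    $ ./get_children.py
    ['head', 'body']

The html tag has two children: head and body.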
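The whitespace nodes mentioned above are easy to see on a small list. This sketch counts the raw children of a ul element and then keeps only the tags; the markup is a hypothetical two-item list and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup

# Line breaks between the tags become text nodes in the tree.
html = """<ul>
<li>Solaris</li>
<li>FreeBSD</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

raw_children = list(soup.ul.children)
print(len(raw_children))  # 5: three newline strings and two li tags

# find_all(True, recursive=False) returns direct child tags only.
child_tags = soup.ul.find_all(True, recursive=False)
print([tag.text for tag in child_tags])  # ['Solaris', 'FreeBSD']
```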

BeautifulSoup element descendants

With the descendants attribute, we get all descendants (children of all levels) of a tag.

get_descendants.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.body

    # Skip text nodes (their name is None) and keep tag names only.
    root_childs = [e.name for e in root.descendants if e.name is not None]
    print(root_childs)

The example retrieves all descendants of the body tag.

    $ ./get_descendants.py
    ['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']

These are all the descendants of the body tag.
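The difference between children and descendants shows up as soon as tags are nested. The sketch below compares the two on a shortened, hypothetical body element, again with html.parser as a stand-in for lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ("<body><h2>Operating systems</h2>"
        "<ul><li>Solaris</li><li>FreeBSD</li></ul></body>")
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only: one level below body.
child_names = [n.name for n in body.children if isinstance(n, Tag)]

# All levels below body, in document order.
descendant_names = [n.name for n in body.descendants if isinstance(n, Tag)]

print(child_names)       # ['h2', 'ul']
print(descendant_names)  # ['h2', 'ul', 'li', 'li']
```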

BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing Web resources via HTTP.

scraping.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup
    import requests as req

    # Download the page; resp.text holds the decoded HTML.
    resp = req.get("http://www.something.com")

    soup = BeautifulSoup(resp.text, 'lxml')

    print(soup.title)
    print(soup.title.text)
    print(soup.title.parent)

The example retrieves the title of a simple web page. It also prints its parent.

    resp = req.get("http://www.something.com")

    soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the page.

    print(soup.title)
    print(soup.title.text)
    print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.

    $ ./scraping.py
    <title>Something.</title>
    Something.
    <head><title>Something.</title></head>

This is the output.
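The scraping example assumes that the page has a title; on a page without one, soup.title is None and soup.title.text raises AttributeError. The following sketch shows one way to guard against that. The helper name extract_title is hypothetical, the HTML strings are stand-ins for a downloaded resp.text so that no network access is needed, and html.parser replaces lxml.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(html: str) -> Optional[str]:
    """Return the stripped page title, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


with_title = extract_title("<html><head><title> Something. </title></head></html>")
without_title = extract_title("<html><body>No title here</body></html>")

print(with_title)     # Something.
print(without_title)  # None
```

In a real scraper the argument would typically be the text of a response object, checked for an HTTP error status before parsing.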

BeautifulSoup prettify code

With the prettify() method, we can make the HTML code look better.

prettify.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup
    import requests as req

    resp = req.get("http://www.something.com")

    soup = BeautifulSoup(resp.text, 'lxml')

    # prettify() returns the markup as an indented string.
    print(soup.prettify())

We prettify the HTML code of a simple web page.

    $ ./prettify.py
    <html>
     <head>
      <title>
       Something.
      </title>
     </head>
     <body>
      Something.
     </body>
    </html>

This is the output.

BeautifulSoup find elements by Id

With the find() method, we can find elements by various means, including the element id.

find_by_id.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # print(soup.find("ul", attrs={"id": "mylist"}))
    print(soup.find("ul", id="mylist"))

The code example finds the ul tag that has the mylist id. The commented line is an alternative way of doing the same task.
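The sketch below puts the two lookups side by side and also shows what find() returns when nothing matches. The markup is a shortened, hypothetical version of index.html and the id nolist is made up; html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("ul", id="mylist")
by_attrs = soup.find("ul", attrs={"id": "mylist"})
by_id_only = soup.find(id="mylist")   # the tag name can be left out

# Both spellings return the very same Tag object.
print(by_keyword is by_attrs)    # True
print(by_keyword is by_id_only)  # True

# find() returns None when there is no match, so check before using it.
missing = soup.find("ul", id="nolist")
print(missing)  # None
```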

BeautifulSoup find all tags

With the find_all() method, we can find all elements that meet some criteria.

find_all.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for tag in soup.find_all("li"):
        print("{0}: {1}".format(tag.name, tag.text))

The code example finds and prints all li tags.

    $ ./find_all.py
    li: Solaris
    li: FreeBSD
    li: Debian
    li: NetBSD
    li: Windows

This is the output.

The find_all() method can also take a list of element names to search for.

find_all2.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(['h2', 'p'])

    for tag in tags:
        # Collapse line breaks and runs of spaces into single spaces.
        print(" ".join(tag.text.split()))

The example finds all h2 and p elements and prints their text.
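Results of a list search come back in document order, not grouped by tag name. The sketch below illustrates this on a shortened, hypothetical document and also shows the limit argument of find_all(), which stops the search after the given number of matches. html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup

html = ("<h2>Operating systems</h2><p>FreeBSD</p>"
        "<ul><li>Solaris</li><li>Debian</li></ul><p>Debian</p>")
soup = BeautifulSoup(html, "html.parser")

# Matches are returned in the order they appear in the document.
names = [tag.name for tag in soup.find_all(["h2", "p"])]
print(names)  # ['h2', 'p', 'p']

# limit caps the number of results.
first_items = soup.find_all("li", limit=1)
print([tag.text for tag in first_items])  # ['Solaris']
```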

The find_all() method can also take a function that determines which elements should be returned.

find_by_fun.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    def myfun(tag):
        # True for void elements such as meta or br.
        return tag.is_empty_element

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(myfun)
    print(tags)

The example prints empty elements.

    $ ./find_by_fun.py
    [<meta charset="utf-8"/>]

The only empty element in the document is meta.
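Any function that takes a tag and returns a truthy or falsy value can serve as a filter. As a further illustration, the sketch below selects list items whose text ends with "BSD"; the function name is_bsd_item and the shortened markup are assumptions, and html.parser replaces lxml.

```python
from bs4 import BeautifulSoup

html = ("<ul><li>Solaris</li><li>FreeBSD</li>"
        "<li>Debian</li><li>NetBSD</li></ul>")
soup = BeautifulSoup(html, "html.parser")


def is_bsd_item(tag):
    """Match li tags whose text ends with 'BSD'."""
    return tag.name == "li" and tag.get_text().endswith("BSD")


matches = [tag.text for tag in soup.find_all(is_bsd_item)]
print(matches)  # ['FreeBSD', 'NetBSD']
```

The function is called once for every tag in the document, so it should be cheap and must not assume a particular tag name.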

It is also possible to find elements by using regular expressions.

regex.py

    #!/usr/bin/python3

    import re

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # With string=..., find_all() returns the matching text nodes.
    strings = soup.find_all(string=re.compile('BSD'))

    for txt in strings:
        print(" ".join(txt.split()))

The example prints the content of elements that contain the 'BSD' string.

    $ ./regex.py
    FreeBSD
    NetBSD
    FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.

This is the output.
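A search with only string= returns text nodes rather than tags. The sketch below shows two ways to get back to the enclosing tags: through the parent attribute, or by combining a tag name with the string filter. The markup is a shortened, hypothetical version of index.html and html.parser is used in place of lxml.

```python
import re

from bs4 import BeautifulSoup

html = ("<ul><li>Solaris</li><li>FreeBSD</li><li>NetBSD</li></ul>"
        "<p>FreeBSD is an operating system.</p>")
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("BSD")

# Text nodes that match the pattern anywhere in the document.
strings = soup.find_all(string=pattern)
parents = [s.parent.name for s in strings]
print(parents)  # ['li', 'li', 'p']

# Restrict the search to li tags whose string matches the pattern.
li_tags = soup.find_all("li", string=pattern)
print([tag.text for tag in li_tags])  # ['FreeBSD', 'NetBSD']
```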

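BeautifulSoup CSS selectors

With the select() and select_one() methods, we can use some CSS selectors to find elements.

select_nth_tag.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # select() returns a list of all matching tags.
    print(soup.select("li:nth-of-type(3)"))

The example uses a CSS selector to print the HTML code of the third li element. Because select() returns a list, the element is printed inside square brackets.

    $ ./select_nth_tag.py
    [<li>Debian</li>]

This is the third li element.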

In CSS, the # character is used to select tags by their id attribute.

select_by_id.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # select_one() returns the first match only.
    print(soup.select_one("#mylist"))

The example prints the element that has the mylist id.
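The two selector methods differ in what they return: select() gives a list of all matches, select_one() gives the first match or None. The sketch below shows both on a shortened, hypothetical version of index.html, using html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

html = ('<ul id="mylist"><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>'
        "<p>Some text</p>")
soup = BeautifulSoup(html, "html.parser")

# Child combinator: li elements directly inside the element with id mylist.
items = soup.select("#mylist > li")
print([li.text for li in items])  # ['Solaris', 'FreeBSD', 'Debian']

# select_one() returns a single tag instead of a list.
third = soup.select_one("li:nth-of-type(3)")
print(third)  # <li>Debian</li>

# No ol element exists, so the result is None.
missing = soup.select_one("ol")
print(missing)  # None
```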

BeautifulSoup append element

The append() method appends a new tag to the HTML document.

append_tag.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # Create a new li tag and set its text.
    newtag = soup.new_tag('li')
    newtag.string = 'OpenBSD'

    # Add the new tag as the last child of ul.
    ultag = soup.ul
    ultag.append(newtag)

    print(ultag.prettify())

The example appends a new li tag.

    newtag = soup.new_tag('li')
    newtag.string = 'OpenBSD'

First, we create a new tag with the new_tag() method and set its text.

    ultag = soup.ul

We get a reference to the ul tag.

    ultag.append(newtag)

We append the newly created tag to the ul tag.

    print(ultag.prettify())

We print the ul tag in a neat format.

BeautifulSoup insert element

The insert() method inserts a tag at the specified position.

insert_tag.py

The example inserts an li tag at the third position into the ul tag.
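A minimal sketch of such an insertion follows. It is an illustration under stated assumptions, not the tutorial's own listing: the markup is a compact, hypothetical list without whitespace between the tags, and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup

# No whitespace between the tags, so positions 0, 1, 2 are the li elements.
html = "<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>"
soup = BeautifulSoup(html, "html.parser")

newtag = soup.new_tag("li")
newtag.string = "OpenBSD"

# Positions are zero-based, so index 2 is the third child of ul.
soup.ul.insert(2, newtag)

items = [li.text for li in soup.ul.find_all("li")]
print(items)  # ['Solaris', 'FreeBSD', 'OpenBSD', 'Debian']
```

The position counts all children of the tag, including whitespace text nodes. In a document formatted with line breaks, such as index.html, the same index would therefore land at a different place among the li elements.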

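BeautifulSoup replace text

The replace_with() method replaces an element, here a text node, with new content.

replace_text.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # string= is the current name of the older text= argument.
    tag = soup.find(string="Windows")
    tag.replace_with("OpenBSD")

    print(soup.ul.prettify())

The example finds a specific text node with the find() method and replaces it with the replace_with() method.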
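replace_with() works on whole tags as well as on text nodes. The sketch below does both on a two-item, hypothetical list; html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Solaris</li><li>Windows</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Replace only the text inside the second li.
text_node = soup.find(string="Windows")
text_node.replace_with("OpenBSD")

# Replace the whole first li tag with a newly built one.
new_li = soup.new_tag("li")
new_li.string = "NetBSD"
old_li = soup.find("li", string="Solaris")
old_li.replace_with(new_li)

result = str(soup)
print(result)  # <ul><li>NetBSD</li><li>OpenBSD</li></ul>
```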

BeautifulSoup remove element

The decompose() method removes a tag from the tree and destroys it.

decompose_tag.py

    #!/usr/bin/python3

    from bs4 import BeautifulSoup

    with open("index.html", "r") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # Select the second p element and remove it from the tree.
    ptag2 = soup.select_one("p:nth-of-type(2)")
    ptag2.decompose()

    print(soup.body.prettify())

The example removes the second p element.
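When the removed element is still needed afterwards, the extract() method is an alternative: it also takes the element out of the tree, but returns it instead of destroying it. The sketch below contrasts the two on a hypothetical two-paragraph body, using html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

html = "<body><p>FreeBSD</p><p>Debian</p></body>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

# extract() removes the tag from the tree and hands it back.
removed = first.extract()
print(removed.text)  # FreeBSD

# decompose() removes the tag and destroys it; do not use it afterwards.
second.decompose()

remaining = str(soup)
print(remaining)  # <body></body>
```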

In this tutorial, we have worked with the Python BeautifulSoup library.
