Python BeautifulSoup: Getting Started with beautifulsoup4

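钢铁知识库 (Steel Knowledge Base) is a knowledge base for learning Python web scraping and data analysis. Life is short, use Python.
The previous chapter covered scraping page content from structured HTML and XML data with XPath. This chapter looks at another efficient tool: Beautiful Soup 4. Compared with parsing page source with traditional regular expressions, it is much simpler. Practice is the only real test, so let's skip the preamble and try it out.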

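Introduction to Beautiful Soup
First, what is BeautifulSoup? Put simply, it is a Python library for parsing HTML or XML, and it makes extracting data from web pages convenient. The official description reads: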

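BeautifulSoup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case you only have to state the original encoding. BeautifulSoup has become a Python parsing library as well regarded as lxml and html5lib, and it lets users choose between different parsing strategies or raw speed.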

In short, it removes much of the tedious extraction work and makes parsing more efficient.
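The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original article: the GBK-encoded sample bytes are an assumption made up for the demo, and the standard-library `html.parser` is used so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Assumption for illustration: a page fragment delivered as GBK bytes
# with no encoding declaration inside the markup itself.
raw = "<p>你好</p>".encode("gbk")

# Because the document does not declare its encoding, we state the
# original encoding explicitly via from_encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string      # decoded to a Unicode string
utf8_out = soup.encode()  # serialized back to bytes, UTF-8 by default

print(text)
print(soup.original_encoding)
print(utf8_out)
```

If the markup did declare its encoding (for example in a `<meta>` tag), the `from_encoding` argument would normally be unnecessary.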
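Installing BeautifulSoup
BeautifulSoup 3 is no longer being developed, so BeautifulSoup 4 is recommended. It is distributed as the bs4 package, which means the import is written against bs4 (for example, from bs4 import BeautifulSoup).
Before starting, make sure beautifulsoup4 and lxml are installed correctly. The pip commands are:
pip install beautifulsoup4
pip install lxml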
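A quick way to confirm the installation is to import the package and parse a trivial snippet. The sketch below is only a sanity check under the assumption that `beautifulsoup4` is installed; it detects lxml with the standard-library `importlib` rather than failing if lxml is missing.

```python
import importlib.util

import bs4
from bs4 import BeautifulSoup

# Version string of the installed beautifulsoup4 package.
print("bs4 version:", bs4.__version__)

# Check whether lxml can be imported, without actually importing it.
lxml_available = importlib.util.find_spec("lxml") is not None
print("lxml installed:", lxml_available)

# Parse a trivial document with the standard-library parser as a smoke test.
check = BeautifulSoup("<p>ok</p>", "html.parser")
print(check.p.string)
```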
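Parsers
BeautifulSoup relies on an underlying parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports several third-party parsers. If no third-party parser is installed, the default parser from the Python standard library is used.
The parsers supported by BeautifulSoup are listed below: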

[Image: table of the parsers supported by BeautifulSoup]
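As the table shows, lxml can parse both HTML and XML. Compared with the default HTML parser it is more capable, faster, and more tolerant of malformed markup.
It is the recommended choice, and the examples below all use lxml. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object.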
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
'''
Hello
'''
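Since the parser is chosen by the second argument, one way to keep a script working on machines without lxml is to fall back to the standard-library parser. The helper name `make_soup` below is hypothetical (it is not part of bs4); the sketch relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser if it is
    installed, otherwise fall back to Python's built-in html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when no installed parser provides the requested feature.
        return BeautifulSoup(markup, "html.parser")


# The closing </p> is deliberately missing; both parsers repair it.
fallback_soup = make_soup("<p>Hello")
print(fallback_soup.p.string)
```

Different parsers can build slightly different trees from badly broken markup, so naming the parser explicitly keeps results reproducible across machines.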
Basic usage
Here is an example that shows the basic usage of BeautifulSoup:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well....
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # initialize
print(soup.prettify())
print(soup.title.string)
The output is shown below. You can also paste the code above into an editor and run it yourself:
<html>
 <head>
  <title>The Dormouse's story</title>
 </head>
 <body>
  The Dormouse's story
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1"></a>,
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  and
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
  and they lived at the bottom of a well....
 </body>
</html>
The Dormouse's story
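Beyond printing the whole tree, the same kind of soup object can be queried for individual pieces. The sketch below is an illustrative extension rather than the article's own code: it uses a shortened copy of the sample markup and the standard-library `html.parser` so that it runs without lxml, then collects the `id`, `href`, and text of each link with `find_all`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, kept small for illustration.
sample = """<html><head><title>The Dormouse's story</title></head>
<body>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""

doc = BeautifulSoup(sample, "html.parser")

# Text of the <title> tag.
title = doc.title.string

# One (id, href, text) tuple per <a> tag, in document order.
links = [(a.get("id"), a.get("href"), a.get_text()) for a in doc.find_all("a")]

print(title)
for link_id, href, text in links:
    print(link_id, href, text)
```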