BeautifulSoup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages.
Parser comparison
Parsers supported by BeautifulSoup, with their pros and cons:

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; good document error tolerance | Poor document error tolerance in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good document error tolerance | Requires the lxml C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the lxml C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the way a browser does; produces valid HTML5 | Very slow; requires an external Python dependency |
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
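The choice of parser only changes how the markup is interpreted; the BeautifulSoup API stays the same. A minimal sketch comparing parsers on deliberately broken HTML (html.parser ships with Python; 'lxml' and 'html5lib' only work if those packages are installed):

from bs4 import BeautifulSoup

broken_html = '<p>Unclosed paragraph<li>and a stray list item'
# Each parser repairs broken markup in its own way; lxml and html5lib
# must be installed separately (pip install lxml html5lib).
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        soup = BeautifulSoup(broken_html, parser)
        print(parser, '->', soup.prettify())
    except Exception:
        print(parser, 'is not available on this machine')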
Usage
Basic usage
Sample HTML:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
# Node selectors only return the first matching node (the first p here)
print(soup.p)
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
Relational selection
Child and descendant nodes
To get the direct children of a node, call the contents attribute.
Example: get the child p nodes of the body node
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.body.contents)
Output (a list containing both Tag objects and text nodes such as '\n'):
['\n', <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>, '\n', <p class="story">...</p>, '\n']
You can also call the children attribute, which returns an iterator over the same direct children:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.body.children)
for i, child in enumerate(soup.body.children):
    print(i, child)
Output:
<list_iterator object at 0x00000217D33CD048>
0 

1 <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> and they lived at the bottom of a well. </p>
2 

3 <p class="story">...</p>
4 
To get all descendant nodes (children, grandchildren, and so on), call the descendants attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.body.descendants)
for i, child in enumerate(soup.body.descendants):
    print(i, child)
Output:
<generator object descendants at 0x0000014D106353B8>
0 

1 <p class="story"> Once upon a time there were three little sisters; and their names were ...
...
Parent and ancestor nodes
To get the parent of a node, call the parent attribute.
Example: get the p node that is the parent of the a node, together with its contents
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
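parent only returns the direct parent. To walk every ancestor, use the parents attribute, which returns a generator; a minimal sketch reusing the html sample above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# parents yields each ancestor of the a node, from its direct parent
# up to the top-level document object
for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)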
html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
Hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print('next sibling:',soup.a.next_sibling)
print('previous sibling:',soup.a.previous_sibling)
print("next siblings:",list(soup.a.next_siblings))
print("previouos siblings:",list(soup.a.previous_siblings))
Output:
next sibling:
Hello
...
Extracting information
The relationship attributes return nodes, and from those nodes you can extract information such as text and attributes:
html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
Output:
Next Sibling:
<class 'bs4.element.Tag'>
...
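Besides navigating to related nodes, each Tag exposes its attributes through the attrs dictionary (or direct indexing) and its text through string. A minimal sketch using the same html sample:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
a = soup.a
# attrs is a dict of all attributes of the node
print(a.attrs)
# a single attribute can also be read by indexing the Tag
print(a['href'])
# string returns the direct text of the node
print(a.string)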
Method selectors
The most commonly used query methods are find_all() and find().
find_all()
Queries all elements that match the given conditions.
Signature:
find_all(name , attrs , recursive , text , **kwargs)
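The sample HTML for the following find_all() and select() examples is not included above; the snippets assume a stand-in like the one below, with two ul lists whose ids match the select() output shown later:

html = '''
<div class="panel">
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''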
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# print(soup.find_all(name='ul'))
# print(type(soup.find_all(name='ul')[0]))
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
html='''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Find all strings whose text contains 'link'
print(soup.find_all(text=re.compile('link')))
Output:
['Hello, this is a link', 'Hello, this is a link, too']
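find_all() can also filter on attributes, either through the attrs dict or directly as keyword arguments (class is a Python keyword, so BeautifulSoup uses class_). A short sketch, assuming html is the ul/li sample shown above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# attribute filter passed as a dict
print(soup.find_all(attrs={'id': 'list-1'}))
# the same filter as a keyword argument
print(soup.find_all(id='list-1'))
# class_ filters on the class attribute
print(soup.find_all(class_='element'))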
find()
Unlike find_all(), find() returns only the first matching element:
html='''
<div class="panel">
<div class="panel-body">
<a class='element'>Hello, this is a link</a>
<a class='element'>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find(name='a'))
print(soup.find(attrs={'class':'element'}))
print(soup.find(class_='element'))
print(type(soup.find(name='a')))
Output (the return type is <class 'bs4.element.Tag'>):
<a class="element">Hello, this is a link</a>
<a class="element">Hello, this is a link</a>
<a class="element">Hello, this is a link</a>
<class 'bs4.element.Tag'>
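Unlike find_all(), which returns a list (empty when nothing matches), find() returns a single Tag or None. A quick sketch with the html sample above, which contains no ul nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))      # None: no ul in this sample
print(soup.find_all(name='ul'))  # []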
CSS selectors
BeautifulSoup also supports CSS selectors through the select() method. Node attributes can then be read by indexing the Tag or through its attrs dictionary. Assuming html is the ul/li sample shown earlier:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
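select() also works on any Tag, so selections can be nested; a short sketch with the same ul/li sample:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# calling select() on a Tag searches only inside that node
for ul in soup.select('ul'):
    print(ul.select('li'))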
Getting text
To get the text of a node you can use the string attribute or the get_text() method. get_text() concatenates all descendant text, while string returns the node's single child string (or None if the node has more than one child).
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print('GET TEXT:', li.get_text())
    print('STRING:', li.string)
For comparison, pyquery (another parsing library) provides a similar text() method, which returns the text of a node including its descendants:
html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
Output:
Hello, World
This is a paragraph.
Example: remove the p node before extracting the text
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
wrap.find('p').remove()
print(wrap.text())
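With the p node removed from the selection, wrap.text() should now print only Hello, World.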