1. BeautifulSoup objects

BeautifulSoup objects fall into four types (see the official documentation):

- Tag: bs4.element.Tag
- NavigableString: bs4.element.NavigableString
- BeautifulSoup
- Comment: bs4.element.Comment
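As a quick illustration of the four types, the sketch below parses a tiny hand-written snippet and checks the type of each node. The HTML string and the variable names are made up for this example, and it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup: one tag, one text node, one comment.
html = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p            # a Tag
text = p.contents[0]  # a NavigableString
note = p.contents[1]  # a Comment

for node in (soup, p, text, note):
    print(type(node).__name__)

# Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag.
print(isinstance(note, NavigableString))
print(isinstance(soup, Tag))
```

Because Comment inherits from NavigableString, an isinstance check against NavigableString alone does not tell comments apart from ordinary text.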

2. Traversing the document

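The .contents and .children attributes contain only a tag's direct children. For example, the <head> tag has a single direct child, <title>. The <title> tag in turn has a child of its own: the string “The Dormouse’s story”. That string is therefore a descendant of <head>, but not one of its direct children. The .descendants attribute iterates recursively over all of a tag's descendants.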
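The difference can be seen on the <head> example itself. This is a minimal sketch with a hand-written HTML string, assuming the built-in "html.parser" backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup matching the <head>/<title> example.
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# .contents is a list and .children an iterator over the same direct children.
direct = list(head.children)
print(direct)

# .descendants walks the whole subtree: the <title> tag, then its string.
everything = list(head.descendants)
print(everything)
```

Here `direct` holds only the <title> tag, while `everything` holds the <title> tag followed by the string inside it.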

The code below traverses a document that has already been parsed from HTML with BeautifulSoup:

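from bs4 import BeautifulSoup, element

def get_content(soup):
    """Collect image URLs and non-empty text, in document order, one per line."""
    lines = []
    for d in soup.descendants:
        # Image tag: record its URL (skip <img> tags without a src attribute).
        if d.name == "img":
            imgurl = (d.get("src") or "").strip()
            if imgurl:
                lines.append(imgurl)
            continue
        # Any other tag: skip it; its children are visited later in the loop.
        if not isinstance(d, element.NavigableString):
            continue
        # Only strings reach this point: keep the non-empty ones.
        txt = d.get_text().strip()
        if not txt:
            continue
        lines.append(txt)

    return '\n'.join(lines)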

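If the node is an img Tag, its image URL is extracted.
If the node is not a string, it is a tag whose children .descendants will visit on their own, so the parent node itself can be skipped.
After this filtering only NavigableString nodes remain, and their text is extracted.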
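The same three rules can also be written as a generator that excludes comments explicitly, since Comment is a subclass of NavigableString. This is only a sketch under stated assumptions: the name iter_text_and_images and the sample HTML are hypothetical, and the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

def iter_text_and_images(soup):
    """Yield image URLs and non-empty text nodes in document order."""
    for node in soup.descendants:
        if isinstance(node, NavigableString):
            # Comments are NavigableString subclasses; leave them out.
            if isinstance(node, Comment):
                continue
            text = str(node).strip()
            if text:
                yield text
        elif node.name == "img" and node.get("src"):
            # Only <img> tags that actually carry a src attribute.
            yield node["src"].strip()
        # Any other tag is skipped; its children come up later in the walk.

# Hypothetical sample document for illustration.
sample_html = """
<div>
  <p>First paragraph</p>
  <!-- editor note -->
  <img src="/static/cat.png">
  <p>Second <b>bold</b> paragraph</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
result = list(iter_text_and_images(soup))
print("\n".join(result))
```

Note that text split by inline tags, such as the second paragraph here, comes out as separate pieces, because each string node is visited on its own.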
