Extracting text from doc and docx files with Python

Created: 2024-09-06 18:29 | Author: 风波 | Views: 15 | Category: Python

Using python-docx

Source: https://stackoverflow.com/questions/42482/best-way-to-extract-text-from-a-word-doc-without-using-com-automation

pip install python-docx

from docx import Document

# Open the .docx file (python-docx does not read legacy .doc files)
document = Document('Hello world.docx')

# Body paragraphs are where most document content lives
# (text inside tables, headers, and footers is not included here)
paragraphs = [p.text for p in document.paragraphs]

# Extract all text
print('\n'.join(paragraphs))
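
A .docx file is a ZIP archive, and its main text is stored in word/document.xml under /w:document/w:body. If installing python-docx is not an option, the same text can probably be pulled out with the standard library alone. The following is only a minimal sketch: the function name docx_text is made up for illustration, and it assumes a simple document (tabs, line breaks, headers, footers, and text boxes are not handled).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return ""

    lines = []
    # Every w:p element is a paragraph; its visible text sits in w:t elements.
    for paragraph in body.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Usage would look like print(docx_text('Hello world.docx')). Unlike document.paragraphs in python-docx, this walk also visits paragraphs nested inside table cells.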

Using antiword

apt install antiword
antiword -m UTF-8 abc.doc
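
To call antiword from Python instead of the shell, one option is to wrap the same command with subprocess. This is a rough sketch, not a tested integration: the helper names antiword_command and extract_doc_text are invented here, and it assumes antiword is on PATH and writes UTF-8 to stdout when given the mapping shown above.

```python
import shutil
import subprocess


def antiword_command(path, mapping="UTF-8"):
    """Build the argument list for the antiword call shown above (hypothetical helper)."""
    return ["antiword", "-m", mapping, path]


def extract_doc_text(path, mapping="UTF-8"):
    """Run antiword on a legacy .doc file and return its text output."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        antiword_command(path, mapping),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the output is UTF-8 because of the -m mapping.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, text = extract_doc_text('abc.doc'). Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.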