- A docx file is a zip archive.
- A doc file is not a standard zip archive (it is a legacy binary format), so it has to be processed with other tools.
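
Because a docx file is a zip archive, its text can in principle be read with the Python standard library alone. The following is a minimal sketch under some assumptions: the main body lives in the `word/document.xml` member, paragraphs are `w:p` elements, and text runs are `w:t` elements. The helper name `docx_text` is hypothetical, and headers, footers, footnotes and nested text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads only word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join the text of every <w:t> run inside this paragraph
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

`zipfile.is_zipfile(path)` can be called first to tell the two formats apart: it returns False for a legacy doc file, which then needs a separate tool.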
Using python-docx (older releases exposed this as opendocx)
Sources:
- https://stackoverflow.com/questions/42482/best-way-to-extract-text-from-a-word-doc-without-using-com-automation
- https://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python
pip install python-docx
from docx import Document

# Open the file with the current python-docx API
document = Document('Hello world.docx')
# This location is where most document content lives (the <w:body> element)
docbody = document.element.body
# Extract all text, one line per paragraph
print('\n'.join(paragraph.text for paragraph in document.paragraphs))
Using antiword
apt install antiword
antiword -m UTF-8 abc.doc
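
To call antiword from Python, one option is to wrap the command above with `subprocess`. This is only a sketch: the function names `antiword_command` and `doc_text` are hypothetical, the arguments simply mirror the command line shown above, and it assumes antiword is installed and writes UTF-8 text to standard output.

```python
import shutil
import subprocess


def antiword_command(path, encoding="UTF-8"):
    """Build the argument list, mirroring: antiword -m UTF-8 abc.doc"""
    return ["antiword", "-m", encoding, path]


def doc_text(path):
    """Run antiword on a .doc file and return its output as a string."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        antiword_command(path), capture_output=True, check=True
    )
    # Assumption: with the mapping above, the output is UTF-8 encoded
    return result.stdout.decode("utf-8")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.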