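How to extract text from an existing docx file using python-docx

I'm trying to use the python-docx module (pip install python-docx), but it seems very confusing: in the GitHub repo's test samples they use the opendocx function, while on Read the Docs they use the Document class. And even then, they only show how to add text to a docx file, not how to read an existing one.

The first one (opendocx) doesn't work and may be deprecated. For the second case, I tried: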

from docx import Document


document = Document('test_doc.docx')
print(document.paragraphs)

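It returned a list of <docx.text.Paragraph object at 0x... > objects.

Then I did:

for p in document.paragraphs:
    print(p.text)

It returned all the text, but something was missing: none of the URLs (CTRL + CLICK to follow the URL) appeared as text in the console.

What is the problem? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?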


There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it.

UPDATE: Hyperlink support was added subsequent to this answer.
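For readers on a release that includes hyperlink support, reading link text and targets might look roughly like the sketch below. It assumes python-docx 1.0 or later, where each paragraph exposes a `hyperlinks` collection whose items have `text` and `address` attributes. The helper names `collect_links` and `links_in_file` are hypothetical and not part of the library.

```python
def collect_links(document):
    """Return (text, address) pairs for every hyperlink in the body paragraphs.

    Assumes each paragraph exposes `.hyperlinks` (python-docx >= 1.0).
    Paragraphs inside tables, headers and footers are not visited.
    """
    links = []
    for paragraph in document.paragraphs:
        for link in paragraph.hyperlinks:
            links.append((link.text, link.address))
    return links


def links_in_file(path):
    """Open a .docx file with python-docx and collect its hyperlinks."""
    # Imported lazily so collect_links() can be used without python-docx.
    from docx import Document

    return collect_links(Document(path))
```

Calling `links_in_file('test_doc.docx')` on the file from the question should then list each link's visible text next to its URL.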

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
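A minimal usage sketch, assuming the package is installed with `pip install docx2txt` and that its `process` function takes the file path and an optional image directory. The wrapper name `extract_with_docx2txt` is hypothetical.

```python
def extract_with_docx2txt(path, image_dir=None):
    """Return the document text; optionally write embedded images to image_dir."""
    # Imported lazily; requires `pip install docx2txt`.
    import docx2txt

    if image_dir is None:
        return docx2txt.process(path)
    # With a directory given, images found in the file are saved there.
    return docx2txt.process(path, image_dir)
```

For the file in the question, `print(extract_with_docx2txt('test_doc.docx'))` would print the whole text in one call, which is close to the open().read() style the question asks about.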

You can try this:

import docx


def getText(filename):
    # Read every body paragraph and join them with newlines.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Without Installing python-docx

A docx file is basically a zip file with several folders and files inside it. At the link below you can find a simple function that extracts the text from a docx file without relying on python-docx or lxml, the latter sometimes being hard to install:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
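To illustrate the idea (this is not the code from the linked post), here is a minimal sketch that uses only the standard library. It assumes the main text lives in word/document.xml inside the archive and ignores headers, footers, tabs and line breaks. The function name `docx_to_text` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Collect every w:t descendant, so text nested inside
        # w:hyperlink or w:smartTag elements is included too.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Because it gathers all w:t descendants of each paragraph, the visible text of hyperlinks comes along as well; the link targets themselves are stored separately in the archive's relationship files and are not read here.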

Using python-docx, as @Chinmoy Panda's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328). You should use the following function instead:

def para2text(p):
    # Gather every w:t descendant, including those nested in w:smartTag.
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])

It seems that there is no official solution for this problem, but there is a workaround posted here https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649

Just update this file: "...\site-packages\docx\oxml\__init__.py"

# add
import re
import sys


# add
def remove_hyperlink_tags(xml):
    # Strip the <w:hyperlink> wrappers so their runs are read as normal text.
    if sys.version_info > (3, 0):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if sys.version_info > (3, 0):
        xml = xml.encode('utf-8')
    return xml


# update (etree and oxml_parser are already defined in this module)
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element


And of course, don't forget to mention in your documentation that you are changing the official library.