Processing Word Documents with Python

To read a Word document, we use the docx module, which is provided by the python-docx package. First, install the package as shown below. Then, write a program that uses the docx module to read the entire document paragraph by paragraph.

We use the following command to install the package into our environment.

pip install python-docx
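The library is published on PyPI as python-docx but is imported under the name docx, so it can be worth confirming that the import name resolves before running the examples. The following is a minimal sketch using only the standard library; the helper name `docx_is_installed` is made up for illustration.

```python
import importlib.util

def docx_is_installed():
    """Return True if a module named 'docx' can be found on the import path."""
    return importlib.util.find_spec('docx') is not None

if docx_is_installed():
    print('docx module found')
else:
    print('docx module missing; try: pip install python-docx')
```

This only checks that some module named docx is importable; it does not verify which package provides it.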

In the following example, we read the contents of a Word document by appending the text of each paragraph to a list, and finally print the text of all paragraphs joined by newlines.

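import docx

def readtxt(filename):
    # Open the document and collect the text of every paragraph
    doc = docx.Document(filename)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    # Join the paragraphs with newline characters, one paragraph per line
    return '\n'.join(full_text)

# 'path' is a placeholder for the folder that contains the file
print(readtxt('path/Tutorialspoint.docx'))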

When we run the above program, we get the following output:

Tutorials Point originated from the idea that there exists a class of readers who respond
better to online content and prefer to learn new skills at their own pace from the comforts
of their drawing rooms.

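The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository
which now proudly flaunts a wealth of tutorials and allied articles on topics
ranging from programming languages to web designing to academics and much more.
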
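Word documents often contain empty paragraphs that are used only for spacing, and these show up as blank lines when every paragraph is joined. The sketch below shows one way to skip them. The helper name `join_nonempty_text` is hypothetical, and the stand-in objects only mimic the `.text` attribute of real paragraphs so that the example runs without a .docx file.

```python
from types import SimpleNamespace

def join_nonempty_text(paragraphs):
    """Join the text of all paragraphs that contain more than whitespace."""
    return '\n'.join(p.text for p in paragraphs if p.text.strip())

# Stand-ins for python-docx paragraph objects (assumption: only .text is needed)
sample = [
    SimpleNamespace(text='First paragraph.'),
    SimpleNamespace(text=''),
    SimpleNamespace(text='Second paragraph.'),
]

print(join_nonempty_text(sample))
```

With a real document, the same function can be called as `join_nonempty_text(docx.Document(filename).paragraphs)`.
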
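Reading a Single Paragraph

We can use the paragraphs property to read specific paragraphs from a Word document. The property behaves like a list, so indexes start at 0 and empty paragraphs count as entries. In the following example, we read only the second paragraph of text, which in this sample document is found at index 2.
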
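import docx

# 'path' is a placeholder for the folder that contains the file
doc = docx.Document('path/Tutorialspoint.docx')
# Print the total number of paragraphs in the document
print(len(doc.paragraphs))

# Indexes start at 0, so index 2 selects the third paragraph object
print(doc.paragraphs[2].text)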

Running the above program prints the number of paragraphs, followed by the text of the selected paragraph. The paragraph text is shown below:

The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository
which now proudly flaunts a wealth of tutorials and allied articles on topics
ranging from programming languages to web designing to academics and much more.
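
Because `doc.paragraphs` also includes empty paragraphs, a fixed index may not match the paragraph numbering a reader sees on the page. The following sketch picks the n-th paragraph that actually contains text. The function name `nth_text_paragraph` and the 1-based numbering are choices made for illustration, and the stand-in objects only mimic the `.text` attribute of real paragraphs.

```python
from types import SimpleNamespace

def nth_text_paragraph(paragraphs, n):
    """Return the text of the n-th non-empty paragraph (1-based), or None."""
    if n < 1:
        raise ValueError('n must be 1 or greater')
    texts = [p.text for p in paragraphs if p.text.strip()]
    # Return None instead of raising IndexError when n is out of range
    return texts[n - 1] if n <= len(texts) else None

# Stand-ins for python-docx paragraph objects (assumption: only .text is needed)
sample = [
    SimpleNamespace(text='First paragraph.'),
    SimpleNamespace(text=''),
    SimpleNamespace(text='Second paragraph.'),
]

print(nth_text_paragraph(sample, 2))
```

With a real document, `nth_text_paragraph(doc.paragraphs, 2)` would return the second paragraph that contains text, regardless of how many empty paragraphs precede it.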
