How to extract the url in hyperlinks from a docx file using python

user2968505 · Nov 7, 2016 · Viewed 7.4k times

I've been trying to find out how to get the URLs from a docx file using Python, but I couldn't find anything. I've tried python-docx and python-docx2txt: python-docx only seems to extract the text, while python-docx2txt can extract the text of a hyperlink but not the URL itself.

Answer

Lapyiu · Dec 11, 2018

I am a beginner in Python and had an assignment to change each hyperlink in a .docx document using Python. Kiran's code gave me enough hints that, after some guessing and trial and error, I finally got it working. Here is my code, which I'd like to share with other beginners.

# Python script to change the URLs of hyperlinks in a .docx file
# See: https://stackoverflow.com/questions/40475757/how-to-extract-the-url-in-hyperlinks-from-a-docx-file-using-python

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

print(" This program changes the hyperlinks detected in a word .docx file \n")

docx_file=input(" Pls input docx filename (without .docx): ")

document = Document(docx_file + ".docx")

rels = document.part.rels

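for rel in rels:
    # Only hyperlink relationships carry a URL; skip styles, images, etc.
    if rels[rel].reltype == RT.HYPERLINK:
        print("\n Original link id -", rel, "with detected URL: ", rels[rel]._target)
        new_url = input(" Pls input new URL: ")
        # _target is a private attribute of python-docx's relationship object
        rels[rel]._target = new_url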

out_file=docx_file + "-out.docx"

document.save(out_file)

print("\n File saved to: ", out_file)
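For the original question of only reading the URLs without changing them, a standard-library sketch is also possible. It relies on the fact that a .docx file is a ZIP archive whose main-document relationships live in word/_rels/document.xml.rels. The function name extract_hyperlink_urls and the constant names are my own, not part of any library, and this only looks at the main document part (hyperlinks in headers or footers are stored in separate .rels files that this sketch does not read).

```python
import zipfile
import xml.etree.ElementTree as ET

# Location of the main document's relationships inside the .docx ZIP archive
RELS_PATH = "word/_rels/document.xml.rels"
# XML namespace of the <Relationship> elements
PKG_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Relationship type that marks a hyperlink
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def extract_hyperlink_urls(docx_path):
    """Return {relationship id: URL} for hyperlinks in the main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        if RELS_PATH not in archive.namelist():
            # No relationships file means no hyperlinks to report
            return {}
        root = ET.fromstring(archive.read(RELS_PATH))
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(PKG_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }
```

Calling extract_hyperlink_urls("myfile.docx") should give a dictionary keyed by the same link ids that the script above prints (for example rId5), so the two approaches can be cross-checked against each other.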

Thank you, Lapyiu Ho