Replace SRC of all IMG elements using Parser

CPCase · Oct 16, 2009 · Viewed 8.4k times

I am looking for a way to replace the SRC attribute of all IMG tags without using regular expressions. (I would like to use an out-of-the-box HTML parser included with the default Python install.) I need to reduce the source, whatever it may be, to:

<img src="cid:imagename">

I am trying to point every src attribute at the cid of an attachment for an HTML email, so I also need to reduce whatever the source is to just the file name, without the path or extension.

Answer

Lukáš Lalinský · Oct 16, 2009

There is an HTML parser in the Python standard library (htmllib), but it's not very useful and it has been deprecated since Python 2.6. Doing this kind of thing with BeautifulSoup is really easy:

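from os.path import basename, splitext
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, a third-party package

soup = BeautifulSoup(my_html_string)  # my_html_string holds the HTML of the email body
# src=True skips any <img> tag that has no src attribute
for img in soup.findAll('img', src=True):
    # Keep only the file name, without its path or extension
    img['src'] = 'cid:' + splitext(basename(img['src']))[0]
my_html_string = str(soup)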
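
If you would rather stay within the standard library, as the question asks, something along these lines might work on Python 3 with html.parser. This is only a rough sketch, not a drop-in replacement for the BeautifulSoup version: the names ImgSrcRewriter and rewrite_img_src are made up for illustration, posixpath is used on the assumption that the src values are URL-style paths with forward slashes, and the output is re-serialized, so tag and attribute names come back lowercased and rarer constructs such as CDATA sections are not handled.

```python
from html import escape
from html.parser import HTMLParser
from posixpath import basename, splitext


class ImgSrcRewriter(HTMLParser):
    """Re-emit the parsed HTML, changing only the src of <img> tags."""

    def __init__(self):
        # convert_charrefs=False so entities in text reach the handlers as-is
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if tag == "img" and name == "src" and value:
                # Drop the path and the extension, keep the bare file name
                value = "cid:" + splitext(basename(value))[0]
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                # The parser unescapes attribute values, so escape them again
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_img_src(html_text):
    """Return html_text with every <img> src replaced by cid:<file name>."""
    parser = ImgSrcRewriter()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

With this sketch, rewrite_img_src('<img src="/static/images/logo.png" alt="Logo">') should give <img src="cid:logo" alt="Logo">. For messy real-world HTML, BeautifulSoup is likely to be the more forgiving choice.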