Can I have a non-greedy regex with dotall?

rich tier · Feb 29, 2012

I would like to match dotall and non-greedy. This is what I have:

img(.*?)(onmouseover)+?(.*?)a

However, this is not being non-greedy. This data is not matching as I expected:

<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to 
describe a range of nouns, followed by writing a postcard to describe a 
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: <a href="http://www.siteItem.co.uk/index.asp?CurrMenu=searchresults&amp;tag=326" title="Resources to help work">Drafting </a></td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><a href="javascript:requiresLevel0(350,350);"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"></a><br>790.0 k<br>

and I cannot understand why.

What I think I am stating in the above regex is:

start with "img", then allow 0 or more of any character (including newlines), then look for at least 1 "onmouseover", then allow 0 or more of any character (including newlines), then an "a"

Why doesn't this work as I expected?

KEY POINT: dotall must be enabled

Answer

stranac · Mar 1, 2012

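It is being non-greedy. It is your understanding of non-greedy that is not correct.

A regex will always try to match, starting at the leftmost position where a match is possible. A non-greedy quantifier only keeps each repetition as short as it can be while still letting the rest of the pattern succeed; it does not look for the shortest match overall. In your case the match starts at the first "img", and the first (.*?) keeps expanding, across tags and newlines, until it reaches an "onmouseover".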

Let me show a simplified example of what non-greedy actually means (as suggested by a comment):

re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)

This will match:

  • as few repetitions of 'a' as possible (in this case 2, because the match starts at the first 'a' and a 'b' has to follow)
  • followed by a 'b'
  • and as few repetitions of 'c' as possible (in this case 0)

so the only match is 'aab'.
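The same effect can be reproduced on HTML like yours. The following is only a sketch: the `html` string is a shortened, made-up stand-in for the data in the question, and the `scoped` pattern is one possible way to keep the match inside a single tag, not the only one.

```python
import re

# Shortened, hypothetical stand-in for the HTML in the question.
html = (
    '<img src="icon.gif" border="0"></a><br>\n'
    '<a href="#"><img name="x.pdf" onmouseover="Roll(this)" border="0"></a>'
)

# The pattern from the question. The match begins at the FIRST "img";
# the lazy group then grows until an "onmouseover" is finally reachable,
# which only happens inside the second tag.
original = re.compile(r'img(.*?)(onmouseover)+?(.*?)a', re.DOTALL)
m = original.search(html)
print(m.start())            # index of the first "img"
print('<img' in m.group(1)) # group 1 has swallowed the start of the second tag

# One possible alternative: do not let the group cross a ">" so the
# match cannot leave the tag it started in. A negated character class
# matches newlines on its own, so re.DOTALL is not needed here.
scoped = re.compile(r'img([^>]*?)onmouseover')
m2 = scoped.search(html)
print(repr(m2.group(1)))    # attributes between "img" and "onmouseover"
```

With the original pattern the match is still "non-greedy", it just starts earlier than you wanted. Restricting what the lazy group may consume is what changes the result.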

And just to conclude:

Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
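As a rough illustration of the parser route, here is a minimal sketch using the standard library's `html.parser`. The class name `RolloverImgFinder`, the helper `find_rollover_images`, and the `sample` string are made up for this example; third-party HTML parsers would work just as well.

```python
from html.parser import HTMLParser


class RolloverImgFinder(HTMLParser):
    """Collects the attributes of every <img> tag that has an onmouseover handler."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with lowercased names.
        attr_map = dict(attrs)
        if tag == "img" and "onmouseover" in attr_map:
            self.found.append(attr_map)


def find_rollover_images(html):
    """Return one attribute dict per <img> that carries onmouseover."""
    parser = RolloverImgFinder()
    parser.feed(html)
    parser.close()
    return parser.found


# Shortened, hypothetical stand-in for the HTML in the question.
sample = (
    '<img src="icon.gif" border="0"></a><br>\n'
    '<a href="#"><img name="x.pdf" onmouseover="Roll(this)" border="0"></a>'
)
images = find_rollover_images(sample)
print([img["name"] for img in images])
```

Because the parser hands over one tag at a time, there is no way for a match to leak from the first img into the second.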