Why does re.sub replace the entire pattern, not just a capturing group within it?

Nick · Feb 8, 2017 · Viewed 21.1k times

re.sub('a(b)','d','abc') yields dc, not adc.

Why does re.sub replace the entire match, instead of just the capturing group '(b)'?

Answer

yeputons · Feb 8, 2017

Because it's supposed to replace the whole occurrence of the pattern:

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.

If it were to replace only some subgroup, then complex regexes with several groups wouldn't work. There are several possible solutions:

  1. Specify pattern in full: re.sub('ab', 'ad', 'abc') - my favorite, as it's very readable and explicit.
  2. Capture the groups you want to preserve and then refer to them in the replacement (note that it should be a raw string so the backslash does not need escaping): re.sub('(a)b', r'\1d', 'abc')
  3. Similar to the previous option: provide a callback function as the repl argument; it receives the Match object and returns the replacement string.
  4. Use lookbehinds/lookaheads, which are not included in the match but affect matching: re.sub('(?<=a)b', r'd', 'abxb') yields adxb. The ?<= at the beginning of the group says "it's a lookbehind".
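
The four options can be compared side by side in a minimal sketch that uses only the standard re module. The helper name keep_prefix and the variable names are made up for illustration; they do not come from the question or the answer.

```python
import re

# The behaviour from the question: the whole match 'ab' is replaced,
# so the 'a' outside the group is lost as well.
whole_match = re.sub(r'a(b)', 'd', 'abc')

# 1. Spell out the full pattern and the full replacement.
explicit = re.sub(r'ab', 'ad', 'abc')

# 2. Capture the part to keep and put it back with a backreference.
backref = re.sub(r'(a)b', r'\1d', 'abc')

# 3. Pass a callback as repl; it receives the Match object
#    and returns the replacement string.
def keep_prefix(match):  # hypothetical helper name
    return match.group(1) + 'd'

via_callback = re.sub(r'(a)b', keep_prefix, 'abc')

# 4. Use a lookbehind: 'a' must precede 'b' but is not part of the match,
#    so only the 'b' is replaced. The second 'b' follows 'x' and is untouched.
lookbehind = re.sub(r'(?<=a)b', 'd', 'abxb')

print(whole_match, explicit, backref, via_callback, lookbehind)
```

Options 2 and 3 both rely on the same capture group. The callback form is mostly worth it when the replacement has to be computed from the matched text rather than written as a fixed template.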