How to strip color codes used by mIRC users?

daniels · Jun 9, 2009 · Viewed 8k times

I'm writing an IRC bot in Python using irclib and I'm trying to log the messages on certain channels.
The issue is that some mIRC users and some bots write using color codes.
Any idea how I could strip those parts and leave only the plain ASCII text of the message?

Answer

Smerity · Jun 9, 2009

Regular expressions are your cleanest bet in my opinion. If you haven't used them before, this is a good resource. For the full details on Python's regex library, go here.

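import re
# Raw string: the re module itself interprets \x03 (^C) and \d, so Python never sees an invalid "\d" escape.
regex = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?", re.UNICODE)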

The regex searches for ^C (\x03 in ASCII; you can confirm this by evaluating chr(3) at the Python prompt), then optionally one or two [0-9] characters, optionally followed by a comma and another one or two [0-9] characters.

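(?: ... ) is a non-capturing group: it groups the pattern without storing what was matched inside the parentheses (we don't need to backreference it). ? means match 0 or 1 times, and {n,m} means match between n and m repetitions of the preceding item. Finally, \d matches a digit, [0-9] (with re.UNICODE it also accepts other Unicode decimal digits).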
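To make the breakdown above easier to follow, here is a sketch of the same pattern written with re.VERBOSE so each piece carries its own comment. The name COLOR_CODE is only illustrative, and this assumes Python 3, where str patterns are Unicode-aware by default (so re.UNICODE is left out).

```python
import re

# Same pattern as the one-liner, spread out so every part can be annotated.
# COLOR_CODE is an illustrative name, not something irclib provides.
COLOR_CODE = re.compile(r"""
    \x03              # the ^C control character that starts a colour code
    (?:               # non-capturing group: the optional colour numbers
        \d{1,2}       # foreground colour, one or two digits
        (?:           # non-capturing group: the optional background part
            ,\d{1,2}  # a comma, then one or two digits for the background
        )?            # the background part may be absent
    )?                # the numbers may be absent, so a bare ^C also matches
""", re.VERBOSE)

# At most two digits are consumed after ^C; a third digit is left as text.
leftover = COLOR_CODE.sub("", "\x03123")
```

Because the outer group is optional, a lone ^C is still matched and removed, which is what takes care of the code that closes a coloured run.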

The rest can be decoded using the links I refer to above.

>>> regex.sub("", "blabla \x035,12to be colored text and background\x03 blabla")
'blabla to be colored text and background blabla'
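For logging, it may be convenient to wrap the substitution in a small helper and call it on each message before writing it out. This is a minimal sketch: strip_colors and the sample lines are made up for illustration, and how you obtain the message text from irclib is left out on purpose.

```python
import re

# Compiled once at import time and reused for every message.
_COLOR_RE = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?")


def strip_colors(message):
    """Return message with mIRC colour codes (^C plus optional numbers) removed."""
    return _COLOR_RE.sub("", message)


# Hypothetical sample messages, standing in for text received from a channel.
lines = ["\x034red\x03 plain", "\x0312,4blue on red", "no codes here"]
cleaned = [strip_colors(line) for line in lines]
```

Note that this only handles colour codes; other control characters a client might send are left untouched.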

chaos' solution is similar, but it may end up eating more than the maximum of two digits, and it will not remove any loose ^C characters that may be hanging about (such as the one that closes the colour command).
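The two pitfalls can be seen by comparing the bounded pattern with a looser one. The loose pattern below is a hypothetical stand-in written for this comparison, not chaos' actual code, and the message is an invented example where the coloured text itself starts with digits.

```python
import re

# Bounded pattern from this answer: at most two digits per colour number.
strict = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?")

# Hypothetical looser pattern (an assumption for illustration only):
# unbounded digits, and at least one digit required after ^C.
loose = re.compile(r"\x03\d+(?:,\d+)?")

# ^C, colour "04", then the text "42 is the answer", then a closing bare ^C.
msg = "\x030442 is the answer\x03"

strict_result = strict.sub("", msg)  # keeps the "42", drops both ^C
loose_result = loose.sub("", msg)    # swallows "0442" and leaves the trailing ^C
```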