I would like to write a script that searches a file system for Personally Identifiable Information (PII), such as credit card numbers, and reports what it finds. It should cover plain text (.txt) files as well as Excel (.xls), Word, and PDF files.
Any starting tips or library recommendations are welcome.
I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.
Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0
Or you can write your own and use a common regular expression dataset like https://github.com/madisonmay/CommonRegex
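
If you go the "write your own" route, here is a minimal sketch of what that could look like for plain text files. It uses only the standard `re` module rather than CommonRegex, and the names `CARD_RE`, `luhn_ok`, `scan_text_file`, and `scan_tree` are made up for illustration. The pattern is an assumption about what a card number looks like (13 to 19 digits, optionally separated by single spaces or hyphens), and the Luhn checksum is there to cut down on false positives. Files are read line by line, so a large file is never loaded into memory all at once.

```python
import os
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens, and not embedded in a longer run of digits.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text_file(path):
    """Yield (line_number, matched_text) for Luhn-valid candidates."""
    # errors="replace" keeps the scan going on files with bad encoding.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            for m in CARD_RE.finditer(line):
                digits = re.sub(r"[ -]", "", m.group())
                if luhn_ok(digits):
                    yield lineno, m.group()


def scan_tree(root, extensions=(".txt", ".csv", ".log")):
    """Walk a directory tree and yield (path, line_number, matched_text)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                for lineno, hit in scan_text_file(path):
                    yield path, lineno, hit


if __name__ == "__main__":
    for path, lineno, hit in scan_tree("."):
        print(f"{path}:{lineno}: possible card number {hit}")
```

This only handles files that are already plain text. For xls, Word, and PDF files you would first need to extract the text with a format-specific library and then run the same pattern over the extracted text. Line-by-line reading also assumes reasonably short lines; a file with one enormous line would need to be read in fixed-size chunks with some overlap between them.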