Publicly Available Spam Filter Training Set

JeremyKun · Jan 20, 2011 · Viewed 56.1k times

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).

I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

Answer

JeremyKun · Jan 22, 2011

Here is what I was looking for: http://untroubled.org/spam/

This archive has around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need non-spam email, so I'll pull that from my own Gmail account using the getmail program and the tutorial at mattcutts.com.
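For anyone following the same route, here is a rough sketch of the naive Bayes part once you have both kinds of mail in hand. It is only an illustration, not a tuned filter: the function names (`tokenize`, `train`, `classify`), the "spam"/"ham" labels, and the four toy messages are all made up for this example, and the toy list stands in for text you would load from the spam archive and from your own mailbox. It uses word counts with add-one (Laplace) smoothing and compares log posteriors.

```python
import math
import re
from collections import Counter

LABELS = ("spam", "ham")  # hypothetical label names for this sketch


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labeled_messages):
    """Count words per label and messages per label from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the label with the highest log posterior (add-one smoothing)."""
    vocabulary = set()
    for label in LABELS:
        vocabulary.update(word_counts[label])
    total_messages = sum(message_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training messages carrying this label.
        # Assumes every label appears at least once in the training data.
        score = math.log(message_counts[label] / total_messages)
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the product.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data standing in for the real spam archive and personal mail.
toy_training_set = [
    ("Cheap pills, buy now, limited offer", "spam"),
    ("You won a free prize, click here now", "spam"),
    ("Meeting moved to Tuesday, see agenda attached", "ham"),
    ("Can you review my pull request today", "ham"),
]
word_counts, message_counts = train(toy_training_set)
print(classify("free pills offer now", word_counts, message_counts))
print(classify("review the agenda today", word_counts, message_counts))
```

With real mail you would probably want to parse headers and bodies separately (Python's standard `email` package can do that) and hold out part of each corpus to measure accuracy. Keep in mind that spam from a public archive and ham from one personal inbox differ in more than just "spamminess", so the classifier may pick up on those incidental differences.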