Let's say you have access to an email account with the history of received emails from the last few years (~10k emails), classified into 2 groups (spam and not spam).
How would you approach the task of creating a neural network solution that could be used for spam detection, basically classifying any email as either spam or not spam?
Let's assume that the email fetching is already in place and we need to focus on the classification part only.
The main points which I would hope to get answered would be:
Also, any resource recommendations or existing implementations (preferably in C#) are more than welcome.
Thank you
EDIT
If you insist on NNs... I would calculate some features for every email.
Character-based, word-based, and vocabulary features (about 97 by my count):
You could also add more features based on the formatting used: colors, fonts, sizes, and so on.
Most of these measures can be found online, in papers, or even on Wikipedia (they're all simple calculations, probably based on the other features).
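To make the idea of per-email features concrete, here is a minimal sketch in Python (chosen for brevity; the same logic ports directly to C#). The function name `extract_features` and the seven features it computes are illustrative assumptions, not the list of ~97 features mentioned above.

```python
import string


def extract_features(text: str) -> list[float]:
    """Return a small, illustrative feature vector for one email body."""
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    # Guard against division by zero for empty emails.
    safe_chars = max(n_chars, 1)
    safe_words = max(n_words, 1)

    # Character-based features: ratios relative to the total character count.
    digit_ratio = sum(c.isdigit() for c in text) / safe_chars
    upper_ratio = sum(c.isupper() for c in text) / safe_chars
    punct_ratio = sum(c in string.punctuation for c in text) / safe_chars

    # Word-based feature: average word length.
    avg_word_len = sum(len(w) for w in words) / safe_words

    # Vocabulary feature: distinct words / total words (type-token ratio).
    type_token_ratio = len({w.lower() for w in words}) / safe_words

    return [
        float(n_chars),
        float(n_words),
        digit_ratio,
        upper_ratio,
        punct_ratio,
        avg_word_len,
        type_token_ratio,
    ]
```

Each email then becomes one fixed-length vector, which is what the network's input layer expects.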
So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
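A network of that shape can be sketched in a few lines of plain Python. This is only an illustration under assumptions of my own: the class name `TinyNet`, the hidden layer size of 10, sigmoid activations, a cross-entropy loss, and plain stochastic gradient descent are not prescribed above. In practice you would more likely use an existing library.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """One hidden layer, one sigmoid output (estimated probability of spam)."""

    def __init__(self, n_inputs: int = 100, n_hidden: int = 10, seed: int = 0):
        rng = random.Random(seed)
        # Each row holds the input weights of one hidden node plus a bias (last).
        self.w_hidden = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
            for _ in range(n_hidden)
        ]
        # Output node: one weight per hidden node plus a bias (last).
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # zip stops at len(x), so the trailing bias is added separately.
        hidden = [
            sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
            for row in self.w_hidden
        ]
        out = sigmoid(
            sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1]
        )
        return hidden, out

    def predict(self, x) -> float:
        return self.forward(x)[1]

    def train_step(self, x, target: float, lr: float = 0.1) -> None:
        """One gradient-descent step on a single (features, label) pair."""
        hidden, out = self.forward(x)
        # For a sigmoid output with cross-entropy loss, dLoss/dz_out = out - target.
        delta_out = out - target
        # Backpropagate through the hidden sigmoids using the current weights.
        delta_hidden = [
            delta_out * self.w_out[j] * h * (1.0 - h)
            for j, h in enumerate(hidden)
        ]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]
```

Training would loop `train_step` over the training emails for several passes, with `target` set to 1.0 for spam and 0.0 for non-spam; an output above 0.5 is then read as spam.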
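The inputs would need to be normalized using statistics (for example, each feature's minimum and maximum, or its mean and standard deviation) computed from your current pre-classified corpus.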
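One common way to do this is min-max scaling; the sketch below assumes that choice, and the helper names `fit_min_max` and `apply_min_max` are made up for illustration. The ranges are learned once from the corpus and then reused unchanged for every new email.

```python
def fit_min_max(rows):
    """Learn per-feature minimum and maximum from the labelled corpus."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return mins, maxs


def apply_min_max(row, mins, maxs):
    """Scale one feature vector into [0, 1] using the learned ranges."""
    scaled = []
    for value, lo, hi in zip(row, mins, maxs):
        if hi == lo:
            # Constant feature in the corpus: it carries no information.
            scaled.append(0.0)
        else:
            # Clamp so values outside the corpus range stay within [0, 1].
            scaled.append(min(max((value - lo) / (hi - lo), 0.0), 1.0))
    return scaled
```

If you also hold out part of the corpus for testing, it is safer to fit the ranges on the training part only, so nothing about the test emails leaks into training.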
I'd split the corpus into two groups, one for training and one for testing, and never mix them. Maybe use a 50/50 train/test ratio, with similar spam/non-spam ratios in each group.
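A split that keeps the spam/non-spam ratio similar in both groups (a stratified split) could look roughly like this. The function name `stratified_split` and its parameters are assumptions for the sake of the example.

```python
import random


def stratified_split(labels, train_fraction=0.5, seed=0):
    """Return (train_indices, test_indices) with similar class ratios in each."""
    rng = random.Random(seed)
    # Group email indices by class so each class is split separately.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    # The two index lists are disjoint, so the groups never mix.
    return sorted(train), sorted(test)
```

With `labels` holding 1 for spam and 0 for non-spam, the returned index lists select the training and testing emails. Fixing the seed makes the split reproducible between runs.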