Detecting 'unusual behavior' using machine learning with CouchDB and Python?

sullivanmatt picture sullivanmatt · Oct 31, 2013 · Viewed 7.1k times · Source

I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.

I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.

Any insight would be much appreciated. Thanks!

Answer

danodonovan picture danodonovan · Nov 4, 2013

You're correct in assuming that this is a problem ideally suited to Machine Learning, and scikit-learn.org is my preferred library for these types of problems. Don't worry about specifics - (couchdb cloudant) for now, lets get your problem into a state where it can be solved.

If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.

For example, squeeze each log-in detail into one dimension, and the create a log-in detail vector for each user (there is significant room for improving this digest of log-in information);

  • log-in time (modulo 24 hrs)
  • location (maybe an array of integer locations, each integer representing a different country)
  • user-agent (a similar array of integer user-agents)

and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.

So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this users log-in pattern:

from sklearn import svm

# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()

# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)

and then, every time a new log-in even occurs, create a single log-in detail array and pass that past the SVM

if clf.predict(log_in_data) < 0:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')

if the SVM finds the new data point to be significantly different from it's training set then it will fire the alarm.

My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.

There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean different log-ins and use that! This sounds like great fun, good luck!