Can I use autoencoder for clustering?

forever · Nov 24, 2016 · Viewed 12.8k times

In the code at the link below, they use an autoencoder for supervised clustering or classification, because they have data labels: http://amunategui.github.io/anomaly-detection-h2o/ But can I use an autoencoder to cluster data if I do not have labels? Regards

Answer

Darren Cook · Nov 25, 2016

The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.

The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.

m <- h2o.deeplearning(
  x = 2:564, training_frame = tfidf,
  hidden = c(2),        # squeeze the 563 input features down to 2 hidden nodes
  autoencoder = TRUE,   # train as an autoencoder (learn to reconstruct the input)
  activation = "Tanh"
  )
f <- h2o.deepfeatures(m, tfidf, layer = 1)  # per-row activations of that 2-node layer

The second command there extracts the hidden node activations (the "deep features") for each row. f is a data frame with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:

[Plot: simplest autoencoder output]

Results will change on each run. You can (maybe) get better results with stacked autoencoders, or by using more hidden nodes (but then you cannot plot them); see the sketch just below. Here I felt the results were limited by the data.
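For example, a deeper "stacked" layout, squeezing through a narrow middle layer, might look something like this. This is only a sketch: the 32-2-32 layer sizes are illustrative choices, not tuned values from the book.

# Deeper autoencoder: 563 -> 32 -> 2 -> 32 -> 563 (layer sizes are illustrative)
m2 <- h2o.deeplearning(
  x = 2:564, training_frame = tfidf,
  hidden = c(32, 2, 32),
  autoencoder = TRUE,
  activation = "Tanh"
  )
# Take the activations of the middle (2-node) layer as the low-dimensional encoding
f2 <- h2o.deepfeatures(m2, tfidf, layer = 2)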

BTW, I made the above plot with this code:

d <- as.matrix(f[1:30, ])  # Just the first 30 rows, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17)  # pch = 17 is a filled triangle
text(d, labels = labels, pos = 3)  # pos = 3 puts each label above its point
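If you want actual cluster assignments rather than just the picture, one option is to feed the deep features into k-means. This is a hedged sketch, not code from the book; k = 4 is an arbitrary illustrative choice.

# Cluster the 2-D deep features with k-means (k = 4 is arbitrary here)
km <- h2o.kmeans(training_frame = f, x = 1:2, k = 4)
clusters <- h2o.predict(km, f)  # one cluster id per row of tfidf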

(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)