I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.
I can't figure out how to run my own program within Mahout. However, the command-line interface might be enough.
To run the sample program I use:
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
The dataset is one large CSV file. Each line is a record, the features are comma-separated, and the first field is an ID. Because of this input format I cannot use seqdirectory right away. I'm trying to implement the answer to this similar question, "How to perform k-means clustering in mahout with vector data stored as CSV?", but I still have 2 questions:
For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.
Strategy 1: Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it reads your CSV file, turning each row into a DenseVector. I've never used this, but saw it in the API. It looks straightforward enough if you're OK with DenseVectors.
Strategy 2: Write your own parser. This is really easy, since you just split each line on "," and parse the resulting fields into an array of doubles. For each array of values in each line, you instantiate a vector using something like this:
new DenseVector(<your array here>);
and add it to a List (for example).
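As a rough illustration of the parsing step in Strategy 2, here is a minimal sketch in plain Java with no Mahout dependency. The class and method names (CsvRecordParser, Record, parseLine, parseAll) are hypothetical, and it assumes what the question describes: the first field is an ID and every other field is numeric. It also assumes there is no header row and no quoted fields containing commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvRecordParser {

    /** One CSV record: the ID from the first field plus the numeric features. */
    public static final class Record {
        public final String id;
        public final double[] features;

        Record(String id, double[] features) {
            this.id = id;
            this.features = features;
        }
    }

    /** Splits one line on commas; field 0 is the ID, the rest are parsed as doubles. */
    public static Record parseLine(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected an ID and at least one feature: " + line);
        }
        double[] features = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            features[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new Record(fields[0].trim(), features);
    }

    /** Parses every non-blank line and collects the records in order. */
    public static List<Record> parseAll(Iterable<String> lines) {
        List<Record> records = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                records.add(parseLine(line));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to exercise the parser.
        List<Record> records = parseAll(Arrays.asList("10001,5,0,1", "10002,6,1,3"));
        for (Record r : records) {
            // With Mahout on the classpath, this is where you would build
            // new NamedVector(new DenseVector(r.features), r.id).
            System.out.println(r.id + " -> " + r.features.length + " features");
        }
    }
}
```

Keeping the ID separate from the features means it can become the NamedVector's name rather than being clustered as if it were a feature.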
Then, once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in the code below):
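import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

FileSystem fs = null;
SequenceFile.Writer writer = null;
Configuration conf = new Configuration();
List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;
// Write the data to SequenceFile
try {
    fs = FileSystem.get(conf);
    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {
        // Key is the vector's name; value is the vector itself
        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);
    }
} catch (Exception e) {
    System.out.println("ERROR: " + e);
} finally {
    // Close the writer even if an append fails
    IOUtils.closeStream(writer);
}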
Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.
Anyway, that's the general idea. There are probably other approaches as well.