I am really confused how to compute precision and recall in clustering applications.
I have the following situation:
Given two sets A and B. By using a unique key for each element I can determine which of the elements of A and B match. I want to cluster those elements based on features (not using the unique key of course).
I am doing the clustering but I am not sure how to compute precision and recall. The formulas,according to the paper "Extended Performance Graphs for Cluster Retrieval" (http://staff.science.uva.nl/~nicu/publications/CVPR01_nies.pdf) are:
p = precision = relevant retrieved items/retrieved items and r = recall = relevant retrieved items/relevant items
I really do not get what elements fall under which category.
What I did so far is, I checked within the clusters how many matching pairs I have (using the unique key). Is that already one of precision or recall? And if so, which one is it and how can I compute the other one?
Update: I just found another paper with the title "An F-Measure for Evaluation of Unsupervised Clustering with Non-Determined Number of Clusters" at http://mtg.upf.edu/files/publications/unsuperf.pdf.
I think you'll find wikipedia has a helpful article on precision and recall. In short:
Precision = true positives / (true positives + false positives)
Recall = true positives /( true positivies + false negatives)