I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. They both give some sort of binary string and an integer for each unique token. For example:
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
What do the binary string and the integer mean?
From the first link, the binary is known as a bit-string, see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/
But how do I tell from the output that dog, mouse and cat are in one cluster, and that the and chased are not in that cluster?
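If I understand correctly, the algorithm gives you a tree and you need to truncate it at some level to get clusters. Each bit string is the path from the root of that tree to the word, so words that share a prefix sit in the same subtree. In the case of those bit strings, you just take the first L characters.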
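For example, cutting at the second character gives you two clusters among the words with longer bit strings (the, whose bit string 0 is only one character long, stays in a cluster of its own):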
10 chased
11 dog
11 mouse
11 cat
At the third character, dog splits off from mouse and cat:
110 dog
111 mouse
111 cat
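The truncation described here can be sketched in a few lines of Python. This is only an illustration, not code from either repository: the function name clusters_at_depth and the SAMPLE string are made up for the example, the input is assumed to be whitespace-separated lines of bit string, token and integer as in the question, and the third column is assumed to be the token's count in the corpus (it is not needed for the grouping itself).

```python
from collections import defaultdict

# Sample output copied from the question: bit string, token, integer.
# Assumption: the third column is the token's count in the corpus.
SAMPLE = """\
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
"""


def clusters_at_depth(lines, depth):
    """Group tokens by the first `depth` characters of their bit string."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        bits, token = parts[0], parts[1]
        # Bit strings shorter than `depth` are kept whole by the slice.
        clusters[bits[:depth]].append(token)
    return dict(clusters)


for depth in (2, 3):
    print(depth, clusters_at_depth(SAMPLE.splitlines(), depth))
```

Tokens whose bit string is shorter than the chosen depth (here the) simply keep their full bit string as the cluster label.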
The cutting strategy is a different subject though.