I've a relatively large graph with Vertices: 524 Edges: 1125, of real world transactions. The edges are directed and have a weight (inclusion is optional). I'm trying investigate the various communities within the graph and essentially need a method which:
-Calculates all possible communities
-Calculates the optimum number of communities
-Returns the members/# of members of each (optimum) community
So far I've managed to pull together the following code which plots a color coded graph corresponding to the various communities, however I've no idea how to control the number of communities (i.e plot the top 5 communities with the highest membership) or list the members of a particular community.
library(igraph)
edges <- read.csv('http://dl.dropbox.com/u/23776534/Facebook%20%5BEdges%5D.csv')
all<-graph.data.frame(edges)
summary(all)
all_eb <- edge.betweenness.community(all)
mods <- sapply(0:ecount(all), function(i) {
all2 <- delete.edges(all, all_eb$removed.edges[seq(length=i)])
cl <- clusters(all2)$membership
modularity(all, cl)
})
plot(mods, type="l")
all2<-delete.edges(all, all_eb$removed.edges[seq(length=which.max(mods)-1)])
V(all)$color=clusters(all2)$membership
all$layout <- layout.fruchterman.reingold(all,weight=V(all)$weigth)
plot(all, vertex.size=4, vertex.label=NA, vertex.frame.color="black", edge.color="grey",
edge.arrow.size=0.1,rescale=TRUE,vertex.label=NA, edge.width=.1,vertex.label.font=NA)
Because the edge betweenness method performed so poorly I tried again using the walktrap method:
all_wt<- walktrap.community(all, steps=6,modularity=TRUE,labels=TRUE)
all_wt_memb <- community.to.membership(all, all_wt$merges, steps=which.max(all_wt$modularity)-1)
colbar <- rainbow(20)
col_wt<- colbar[all_wt_memb$membership+1]
l <- layout.fruchterman.reingold(all, niter=100)
plot(all, layout=l, vertex.size=3, vertex.color=col_wt, vertex.label=NA,edge.arrow.size=0.01,
main="Walktrap Method")
all_wt_memb$csize
[1] 176 13 204 24 9 263 16 2 8 4 12 8 9 19 15 3 6 2 1
19 clusters - Much better!
Now say I had a "known cluster" with a list of its members and and wanted to check each of the observed clusters for the presence of members from the "known cluster". Returning the percentage of members found. Unable to finish the following??
list<-read.csv("http://dl.dropbox.com/u/23776534/knownlist.csv")
ength(all_wt_memb$csize) #19
for(i in 1:length(all_wt_memb$csize))
{
match((V(all)[all_wt_memb$membership== i]),list)
}
A couple of these questions can be discovered by closely looking at the documentation of the functions you're using. For instance, the documentation of clusters
, in the "Values" section, describes what will be returned from the function, a couple of which answer your questions. Documentation aside, you can always use the str
function to analyze the make-up of any particular object.
That being said, to get the members or numbers of members in a particular community, you can look at the membership
object returned by the clusters
function (which you're already using to assign color). So something like:
summary(clusters(all2)$membership)
would describe the IDs of the clusters that are being used. In the case of your sample data, it looks like you have clusters with the IDs ranging from 0 to 585, for 586 clusters in total. (Note that you won't be able to display those very accurately using the coloring scheme you're currently using.)
To determine the number of vertices in each cluster, you can look at the csize
component also returned by clusters
. In this case, it's a vector of length 586, storing one size for each cluster calculated. So you can use
clusters(all2)$csize
to get the list of sizes of your clusters. Be warned that your clusterIDs, as previously mentioned, start from 0 ("zero-indexed") whereas R vectors start from 1 ("one-indexed"), so you'll need to shift these indices by one. For instance, clusters(all2)$csize[5]
returns the size of the cluster with the ID of 4.
To list the vertices in any cluster, you just want to find which IDs in the membership
component previously mentioned match up to the cluster in question. So if I want to find the vertices in cluster #128 (there are 21 of these, according to clusters(all2)$csize[129]
), I could use:
which(clusters(all2)$membership == 128)
length(which(clusters(all2)$membership == 128)) #21
and to retrieve the vertices in that cluster, I can use the V
function and pass in the indices which I just computed which are a member of that cluster:
> V(all2)[clusters(all2)$membership == 128]
Vertex sequence:
[1] "625591221 - Clare Clancy"
[2] "100000283016052 - Podge Mooney"
[3] "100000036003966 - Jennifer Cleary"
[4] "100000248002190 - Sarah Dowd"
[5] "100001269231766 - LirChild Surfwear"
[6] "100000112732723 - Stephen Howard"
[7] "100000136545396 - Ciaran O Hanlon"
[8] "1666181940 - Evion Grizewald"
[9] "100000079324233 - Johanna Delaney"
[10] "100000097126561 - Órlaith Murphy"
[11] "100000130390840 - Julieann Evans"
[12] "100000216769732 - Steffan Ashe"
[13] "100000245018012 - Tom Feehan"
[14] "100000004970313 - Rob Sheahan"
[15] "1841747558 - Laura Comber"
[16] "1846686377 - Karen Ni Fhailliun"
[17] "100000312579635 - Anne Rutherford"
[18] "100000572764945 - Lit Đ Jsociety"
[19] "100003033618584 - Fall Ball"
[20] "100000293776067 - James O'Sullivan"
[21] "100000104657411 - David Conway"
That would cover the basic igraph questions you had. The other questions are more graph-theory related. I don't know of a way to supervise the number of clusters to be created using iGraph, but someone may be able to point you to a package which is able to do that. You may have more success posting that as a separate question, either here or in another venue.
Regarding your first points of wanting to iterate through all possible communities, I think you'll find that to be unfeasible for a graph of significant size. The number of possible arrangements of the membership
vector for 5 different clusters would be 5^n, where n is the size of the graph. If you want to find "all possible communities", that number will actually be O(n^n), if my mental math is correct. Essentially, it would be impossible to calculate that exhaustively over any reasonably size network, even given massive computational resources. So I think you'll be better off using some sort of intelligence/optimization for determining the number of communities represented in your graph, as the clusters
function does.