I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.
I am attracted to NetworkX because it has a nice api, good documentation, and has been under active development for years. Plus because it is in python, it should be quick to develop with.
In a recent presentation (the slides are available on github here), it was claimed that:
Unlike many other tools, NX is designed to handle data on a scale relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.
The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.
However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py
uses numpy to calculate eigenvectors.
Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?
Reply from NetworkX Lead Programmer I posed this question on the NetworkX mailing list, and Aric Hagberg replied:
The data structures used in NetworkX are appropriate for scaling to large problems (e.g. the data structure is an adjacency list). The algorithms have various scaling properties but some of the ones you mention are usable (e.g. PageRank, connected components, are linear complexity in the number of edges).
At this point NetworkX is pure Python code. The adjacency structure is encoded with Python dictionaries which provides great flexibility at the expense of memory and computational speed. Large graphs will take a lot of memory and you will eventually run out.
NetworkX does use NumPy and SciPy for algorithms that are primarily based on linear algebra. In that case the graph is represented (copied) as an adjacency matrix using either NumPy matrices or SciPy sparse matrices. Those algorithms can benefit from the legacy C and FORTRAN code that is used under the hood in NumPy and SciPY.
This is an old question, but I think it is worth mentioning that graph-tool has a very similar functionality to NetworkX, but it is implemented in C++ with templates (using the Boost Graph Library), and hence is much faster (up to two orders of magnitude) and uses much less memory.
Disclaimer: I'm the author of graph-tool.