I am thinking of training word2vec on a huge, large-scale dataset (more than 10 TB) built from a web crawl dump.
I personally trained the C implementation on the GoogleNews-2012 dump (1.5 GB) on my iMac; it took about 3 hours to train and generate the vectors (I was impressed with the speed). I did not try the Python implementation, though :( I read somewhere that generating vectors of length 300 on the wiki dump (11 GB) takes about 9 days.
How can I speed up word2vec? Do I need to use distributed models, or what type of hardware do I need to do it within 2-3 days? I have an iMac with 8 GB of RAM.
Which one is faster, the Gensim Python implementation or the C implementation?
I see that the word2vec implementation does not support GPU training.
There are a number of ways to create Word2Vec models at scale. As you pointed out, the candidate solutions are distributed (and/or multi-threaded) or GPU-based. This is not an exhaustive list, but it should give you some ideas for how to proceed.
Distributed / multi-threading options:

- Gensim's Word2Vec uses optimized Cython routines and trains with multiple worker threads on a single machine, which is usually far faster than a naive pure-Python loop.
- Apache Spark MLlib ships a distributed Word2Vec that spreads training across a cluster, which is the more realistic route for a 10 TB corpus (see the sketch at the end of this answer).
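As a concrete starting point, here is a minimal sketch of multi-threaded training with gensim. The corpus path and parameter values are assumptions for illustration; parameter names follow gensim 4.x (older versions call `vector_size` simply `size`):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Streams one tokenized sentence per line from disk, so the corpus
# never has to fit in RAM. "corpus.txt" is a hypothetical path to
# your preprocessed crawl data.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # same dimensionality as the wiki-dump example above
    window=5,
    min_count=5,
    workers=8,        # one worker thread per CPU core
)

# Save in the same text format the original C tool produces.
model.wv.save_word2vec_format("vectors.txt")
```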
A number of Word2Vec GPU implementations exist. Given the large dataset size and limited GPU memory, you may have to consider a clustering strategy.
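To see why a single GPU is unlikely to be enough, here is a rough back-of-envelope calculation. The 10M-word vocabulary is an assumption; your actual vocabulary size depends on the min-count cutoff you apply to the crawl:

```python
# Why the model alone may not fit on one GPU.
vocab_size = 10_000_000   # assumed vocabulary for a web-scale crawl
dim = 300                 # embedding dimensionality
bytes_per_float = 4       # float32
matrices = 2              # word2vec keeps input and output embedding matrices

model_bytes = vocab_size * dim * bytes_per_float * matrices
print(f"{model_bytes / 2**30:.1f} GiB")  # ~22.4 GiB, beyond most single GPUs
```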
There are a number of other CUDA implementations of Word2Vec, with varying degrees of maturity and support.
I believe the SparkML team has recently started work on a prototype cuBLAS-based Word2Vec implementation; you may want to investigate it.
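If you want to try the Spark route today, the released CPU-based MLlib Word2Vec looks roughly like this (the toy DataFrame stands in for your tokenized crawl, which in practice you would read from distributed storage; the cuBLAS prototype is not part of the public API as far as I know):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("w2v").getOrCreate()

# Each row holds one tokenized sentence. Replace this toy data with a
# DataFrame read from HDFS/S3 and tokenized with your own pipeline.
df = spark.createDataFrame(
    [(["the", "quick", "brown", "fox"],),
     (["jumps", "over", "the", "lazy", "dog"],)],
    ["text"],
)

# minCount=0 only so the toy words survive; raise it for real data.
w2v = Word2Vec(vectorSize=300, minCount=0, inputCol="text", outputCol="vecs")
model = w2v.fit(df)

model.getVectors().show()  # one learned vector per vocabulary word
```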