created a collection in MongoDB consisting of 11446615 documents.
Each document has the following form:
{
"_id" : ObjectId("4e03dec7c3c365f574820835"),
"httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
"words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
"howMany" : 3
}
httpReferer: just an url
words: words parsed from the url above. Size of the list is between 15 and 90.
I am planning to use this database to obtain list of webpages which have similar content.
I 'll by querying this collection using words field so I created (or rather started creating) index on this field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes very long time. I tried two approaches (tests below were done on my laptop):
My main focus it to decrease time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be light-fast.
Now, time for a question:
I have only one machine with one disk were I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per mongodb server. Creating multiple servers will release a server from one another's locks.
If you run a multiple core machine with seperate NUMAs, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk insertion benchmark program and spin up a various number of MongoDB server shards. I have a 16 core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy write database. I'm finding that my two NUMAs are my bottleneck.