MongoDB: Sharding on single machine. Does it make sense?

whysoserious picture whysoserious · Jun 25, 2011 · Viewed 8.7k times · Source

created a collection in MongoDB consisting of 11446615 documents.

Each document has the following form:

{ 
 "_id" : ObjectId("4e03dec7c3c365f574820835"), 
 "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1", 
 "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],     
 "howMany" : 3 
}

httpReferer: just an url

words: words parsed from the url above. Size of the list is between 15 and 90.

I am planning to use this database to obtain list of webpages which have similar content.

I 'll by querying this collection using words field so I created (or rather started creating) index on this field:

db.my_coll.ensureIndex({words: 1})

Creating this collection takes very long time. I tried two approaches (tests below were done on my laptop):

  1. Inserting and indexing Inserting took 5.5 hours mainly due to cpu intensive preprocessing of data. Indexing took 30 hours.
  2. Indexing before inserting It would take a few days to insert all data to collection.

My main focus it to decrease time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be light-fast.

Now, time for a question:

I have only one machine with one disk were I can run my app. Does it make sense to run more than one instance of the database and split my data between them?

Answer

EhevuTov picture EhevuTov · Feb 22, 2012

Yes, it does make sense to shard on a single server.

  1. At this time, MongoDB still uses a global lock per mongodb server. Creating multiple servers will release a server from one another's locks.

  2. If you run a multiple core machine with seperate NUMAs, this can also increase performance.

  3. If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.

Machines vary. I suggest writing your own bulk insertion benchmark program and spin up a various number of MongoDB server shards. I have a 16 core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy write database. I'm finding that my two NUMAs are my bottleneck.