I'm doing a project that deals with structured document database. I have a tree of categories (~1000 categories, up to ~50 categories on each level), each category contains several thousands (up to, say, ~10000) of structured documents. Each document is several kilobytes of data in some structured form (I'd prefer YAML, but it may just as well be JSON or XML).
Users of this systems do several types of operations:
Of course, the traditional solution would be using some sort of document database (such as CouchDB or Mongo) for this problem - however, this version control (history) thing tempted me to a wild idea - why shouldn't I use git
repository as a database backend for this application?
On the first glance, it could be solved like this:
Are there any other common pitfalls in this solution? Have anyone tried to implement such backend already (i.e. for any popular frameworks - RoR, node.js, Django, CakePHP)? Does this solution have any possible implications on performance or reliability - i.e. is it proven that git would be much slower than traditional database solutions or there would be any scalability/reliability pitfalls? I presume that a cluster of such servers that push/pull each other's repository should be fairly robust & reliable.
Basically, tell me if this solution will work and why it will or won't do?
Answering my own question is not the best thing to do, but, as I ultimately dropped the idea, I'd like to share on the rationale that worked in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to architect to decide.
Generally, the first main point my question misses is that I'm dealing with multi-user system that work in parallel, concurrently, using my server with a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this one, but all of them are either too hard on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation stuff to git in the first place):
"Blunt" approach: 1 user = 1 state = 1 full working copy of a repository that server maintains for user. Even if we're talking about fairly small document database (for example, 100s MiBs) with ~100K of users, maintaining full repository clone for all of them makes disc usage run through the roof (i.e. 100K of users times 100MiB ~ 10 TiB). What's even worse, cloning 100 MiB repository each time takes several seconds of time, even if done in fairly effective maneer (i.e. not using by git and unpacking-repacking stuff), which is non acceptable, IMO. And even worse — every edit that we apply to a main tree should be pulled to every user's repository, which is (1) resource hog, (2) might lead to unresolved edit conflicts in general case.
Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.
"Only active users" approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:
Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than number of total users, but it makes logging in/out more complicated and slower, as it involves cloning of a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops 10 TiBs of disc usage down to 10..100 GiBs in my case, that might be acceptable, but, yet again, we're now talking about fairly small database of 100 MiBs.
"Sparse checkout" approach: making "sparse checkout" instead of full-blown repo clone per active user doesn't help a lot. It might save ~10x of disc space usage, but at expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.
"Workers pool" approach: instead of doing full-blown clones every time for active person, we might keep a pool of "worker" clones, ready to be used. This way, every time a users logs in, he occupies one "worker", pulling there his branch from main repo, and, as he logs out, he frees the "worker", which does clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. Does not help much with disc usage (it's still pretty high — only full clone per active user), but at least it makes logging in/out faster, as expense of even more complexity.
That said, note that I intentionally calculated numbers of fairly small database and user base: 100K users, 1K active users, 100 MiBs total database + history of edits, 10 MiBs of working copy. If you'd look at more prominent crowd-sourcing projects, there are much higher numbers there:
│ │ Users │ Active users │ DB+edits │ DB only │
├──────────────┼───────┼──────────────┼──────────┼─────────┤
│ MusicBrainz │ 1.2M │ 1K/week │ 30 GiB │ 20 GiB │
│ en.wikipedia │ 21.5M │ 133K/month │ 3 TiB │ 44 GiB │
│ OSM │ 1.7M │ 21K/month │ 726 GiB │ 480 GiB │
Obviously, for that amounts of data/activity, this approach would be utterly unacceptable.
Generally, it would have worked, if one could use web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on client's side, not on the server's side.
There are also other points that I've missed, but they're not that bad compared to the first one:
So, bottom line: it is possible, but for most current usecases it won't be anywhere near the optimal solution. Rolling up your own document-edit-history-to-SQL implementation or trying to use any existing document database would be probably a better alternative.