I have two servers, Production and Development. On the Production server there are two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server; developers work only with this server and have no access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and growing) and I need to distribute them to developers for testing as easily as possible.
After a backup script dumps each database to a single file, run a script that pushes each dump to its own branch. A developer pulls one of these branches whenever he wants to update his local copy.
This approach was found not to work.
Another idea: a cron job on the production server saves the binary logs every day and pushes them into the branch of that database. So in the branch there are files with daily changes, and a developer pulls only the files he doesn't have yet. The current SQL dump would be sent to the developer another way. And when the size of the repository becomes too large, we would send a full dump to the developers, flush all the data in the repository, and start over.
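For illustration only, here is a minimal Python sketch of what such a cron-driven job could look like. The paths, database names, and the branch-per-database layout are assumptions, not a tested setup; it also assumes mysqlbinlog is available and that a local clone of the distribution repository with one branch per database already exists.

```python
#!/usr/bin/env python3
# Hypothetical daily job: extract each database's changes from the MySQL
# binary logs and push them to that database's branch of a distribution repo.
import subprocess
from pathlib import Path

BINLOG_DIR = Path("/var/lib/mysql")      # assumed location of mysql-bin.* files
REPO = Path("/srv/db-dist")              # assumed local clone with one branch per DB
DATABASES = ["app1_db", "app2_db"]       # placeholder database names

def git(*args):
    subprocess.run(["git", *args], cwd=REPO, check=True)

for db in DATABASES:
    git("checkout", db)                  # each database has its own branch
    out_dir = REPO / "changes"
    out_dir.mkdir(exist_ok=True)
    # (a real job would FLUSH LOGS first and skip the log still being written)
    for binlog in sorted(BINLOG_DIR.glob("mysql-bin.[0-9]*")):
        target = out_dir / (binlog.name + ".sql")
        if target.exists():              # only convert logs not yet in the branch
            continue
        with open(target, "w") as fh:    # filter the log down to this database
            subprocess.run(["mysqlbinlog", f"--database={db}", str(binlog)],
                           stdout=fh, check=True)
    git("add", "changes")
    git("commit", "-m", f"daily changes for {db}", "--allow-empty")
    git("push", "origin", db)
```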
Update 2017:
Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(i.e. the Windows code base, which is approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300GB, and produces 1,760 daily “lab builds” across 440 branches, in addition to thousands of pull request validation builds)
GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.
Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.
Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)
Using git-lfs (see git-lfs.github.com) and a server supporting it, such as lfs-test-server, you can store only the metadata in the git repo, and the large files elsewhere. Maximum of 2 GB per commit.
git lfs track '*.bin'
git add .gitattributes "*.bin"
git commit -m "Track .bin files"
Original answer:
Regarding git's limitations with large files, you can consider bup (presented in detail in GitMinutes #24).
The design of bup highlights the three issues that limit a git repo:
- huge files (xdelta works in memory only, which is not good for large files)
- a huge number of files, which means one file per blob, and a slow git gc to generate one packfile at a time
- huge packfiles, with a packfile index that is inefficient for retrieving data from the (huge) packfile

xdelta
The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space, and git is well known for its amazingly efficient repository format.
Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, ie. managing your source code, this isn't a problem.
What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.
rollsum seems to do pretty well at its job; you can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
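To make the mechanism concrete, here is a toy hashsplitter in Python. It is not bup's actual rollsum code (that lives in bupsplit.c); it only demonstrates the same idea under simplified assumptions: a plain additive checksum over a 128-byte window, and a chunk boundary whenever the lowest 13 bits of that checksum are all 1s.

```python
# Toy hashsplitter -- same idea as bup's, but with a simple additive
# rolling sum instead of the real rollsum algorithm.
WINDOW = 128                     # rolling checksum covers the last 128 bytes
MASK = (1 << 13) - 1             # boundary when the lowest 13 bits are all 1s

def hashsplit(data: bytes):
    window = bytearray(WINDOW)   # circular buffer of the last WINDOW bytes
    rollsum = 0                  # sum of the bytes currently in the window
    start = 0
    for i, byte in enumerate(data):
        rollsum += byte - window[i % WINDOW]   # slide: add new byte, drop oldest
        window[i % WINDOW] = byte
        if (rollsum & MASK) == MASK:           # ~once every 2**13 = 8192 bytes
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]       # whatever is left after the last boundary
```

Because each boundary decision depends only on the previous 128 bytes, an edit in the middle of a file can only disturb the boundaries near the edit; once the window has slid past it, the checksums, and therefore the chunks, are the same as before, which is the stability property described next.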
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.
With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.
The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.
We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.
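As a rough illustration of the fanout rule (a simplification, not bup's exact scheme, which chooses its own number of bits per level): once the mandatory 13 low bits are set, each additional consecutive set bit above them promotes the boundary one level higher, and higher-level boundaries are exponentially rarer, which is what produces a shallow, wide tree of chunk lists.

```python
# Simplified fanout: bits 0..12 set => an ordinary chunk boundary (level 1);
# every extra consecutive set bit above bit 12 raises the boundary one level,
# so each level occurs about half as often as the one below it (the real
# fanout may use more bits per level; one bit is assumed here for clarity).
def split_level(rollsum: int, base_bits: int = 13) -> int:
    base_mask = (1 << base_bits) - 1
    if (rollsum & base_mask) != base_mask:
        return 0                       # not a boundary at all
    level, bit = 1, base_bits
    while rollsum & (1 << bit):        # count extra consecutive 1-bits
        level += 1
        bit += 1
    return level
```

A level-2 boundary closes not just a chunk but the current list of level-1 chunks, a level-3 boundary closes the list of level-2 lists, and so on; each of those lists maps naturally onto a git tree object.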
git gc
git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.
The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc' and combine those files into a single file (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).
'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.
bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.
Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.
The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile index (.idx) files.
Each packfile (*.pack) in git has an associated idx file (*.idx) that's a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.
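In spirit, that lookup is just a binary search over a sorted array of (hash, offset) pairs. The following is a minimal Python sketch of the idea, not git's actual .idx format or parsing code:

```python
import bisect

# Toy stand-in for one .idx file: object hashes sorted ascending, each with
# the byte offset of that object inside the companion .pack file.
class PackIndex:
    def __init__(self, entries):              # entries: [(sha1_hex, offset), ...]
        entries = sorted(entries)              # .idx files are stored sorted by hash
        self.hashes = [h for h, _ in entries]
        self.offsets = [o for _, o in entries]

    def find_offset(self, sha1_hex):
        i = bisect.bisect_left(self.hashes, sha1_hex)   # O(log n) binary search
        if i < len(self.hashes) and self.hashes[i] == sha1_hex:
            return self.offsets[i]             # seek here in the .pack and read
        return None                            # not in this pack: try the next one

idx = PackIndex([("9d5c", 4096), ("1a2b", 0), ("f00d", 73219)])  # fake short hashes
print(idx.find_offset("1a2b"))                 # -> 0
```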
The performance of the binary search is about O(log n) with the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that somewhat improves it to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.
To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.
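Conceptually a midx is just the merge of several sorted idx lists into one sorted list, so one binary search replaces one search per pack. A rough sketch, reusing the toy PackIndex above (again an illustration of the idea, not bup's midx file format):

```python
import bisect
import heapq

# Toy multi-index: merge the already-sorted hash lists of several PackIndex
# objects into a single sorted list, remembering which pack each hash came
# from, so that one binary search covers all the packs at once.
class MultiIndex:
    def __init__(self, indexes):
        runs = (((h, pack_no, off) for h, off in zip(ix.hashes, ix.offsets))
                for pack_no, ix in enumerate(indexes))
        self.entries = list(heapq.merge(*runs))   # sorted merge of sorted runs
        self.hashes = [h for h, _, _ in self.entries]

    def find(self, sha1_hex):
        i = bisect.bisect_left(self.hashes, sha1_hex)
        if i < len(self.hashes) and self.hashes[i] == sha1_hex:
            _, pack_no, offset = self.entries[i]
            return pack_no, offset          # which pack to open, and where to seek
        return None                         # the object is in none of these packs
```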