Git with large files

Jakub Riedl · Jul 26, 2013 · Viewed 57.3k times

Situation

I have two servers, Production and Development. The Production server hosts two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server; developers work only with this server and have no access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and growing) and I need to distribute them to developers for testing as easily as possible.

Possible solutions

  • After the backup script dumps each database to a single file, a second script pushes each database to its own branch. A developer pulls one of these branches when he wants to update his local copy.

    This one was found not to work.

  • A cron job on the production server saves the binary logs every day and pushes them to that database's branch. So the branch contains files with the daily changes, and a developer pulls only the files he doesn't have yet. The current SQL dump is sent to the developer another way. When the repository becomes too large, we send a full dump to the developers, flush all data in the repository and start from the beginning.

Questions

  • Is the solution possible?
  • If git is pushing/pulling to/from the repository, does it upload/download whole files, or just the changes in them (i.e. new lines or edits to existing ones)?
  • Can Git manage such large files? No.
  • How do I set how many revisions are preserved in a repository? Doesn't matter with the new solution.
  • Is there any better solution? I don't want to force the developers to download such large files over FTP or anything similar.

Answer

VonC · Oct 21, 2013

Update 2017:

Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(i.e. the Windows code base, which is approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300GB, and produces 1,760 daily "lab builds" across 440 branches in addition to thousands of pull request validation builds)

GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.

Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.


Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)

Using git-lfs (see git-lfs.github.com) and a server supporting it (lfs-test-server), you can store only the metadata in the git repo, and the large files elsewhere. There is a maximum of 2 GB per commit.

(Animated demo of the git-lfs workflow: https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif)

See git-lfs/wiki/Tutorial:

# Tell git-lfs to manage all *.bin files (recorded in .gitattributes)
git lfs track '*.bin'
# Stage the updated .gitattributes along with the tracked binaries
git add .gitattributes "*.bin"
git commit -m "Track .bin files"

Original answer:

Regarding git's limitations with large files, you can consider bup (presented in detail in GitMinutes #24).

The design of bup highlights the three issues that limit a git repo:

  • huge files (the xdelta for packfiles is in memory only, which isn't good with large files)
  • a huge number of files, which means one file per blob and a slow git gc to combine them into a single packfile
  • huge packfiles, with a packfile index that is inefficient at retrieving data from the (huge) packfile

Handling huge files and xdelta

The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space, and git is well known for its amazingly efficient repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, i.e. managing your source code, this isn't a problem.

What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.

rollsum seems to do pretty well at its job. You can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.
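
To make the idea concrete, here is a minimal Python sketch of hashsplitting. It is not bup's actual code: the rollsum below is a deliberately simple stand-in for the fast incremental checksum in bupsplit.c, and the constants just mirror the numbers quoted above.

import hashlib

WINDOW = 128        # rolling-checksum window, as described above
SPLIT_BITS = 13     # lowest 13 bits all 1s => chunk boundary (avg ~8192 bytes)
SPLIT_MASK = (1 << SPLIT_BITS) - 1

def rollsum(window: bytes) -> int:
    # Toy stand-in: any mapping from the last WINDOW bytes to a 32-bit
    # integer illustrates the idea; the real rollsum is incremental and fast.
    return int.from_bytes(hashlib.md5(window).digest()[:4], "little")

def hashsplit(data: bytes):
    # Yield (sha1_hex, chunk) pairs, cutting wherever the low 13 bits
    # of the rolling checksum over the last 128 bytes are all 1s.
    start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        if (rollsum(window) & SPLIT_MASK) == SPLIT_MASK:
            chunk = data[start:i + 1]
            yield hashlib.sha1(chunk).hexdigest(), chunk
            start = i + 1
    if start < len(data):
        chunk = data[start:]    # trailing partial chunk
        yield hashlib.sha1(chunk).hexdigest(), chunk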

With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.
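
Continuing the sketch above, you can observe that stability directly: insert a byte in the middle of a buffer and only the chunk(s) around the edit change; everything else re-chunks to the same sha1s. (This reuses the hashsplit function from the previous sketch.)

import os

original = os.urandom(1 << 18)                           # 256 KiB of random data
edited = original[:128_000] + b"X" + original[128_000:]  # one byte inserted mid-file

chunks_a = {h for h, _ in hashsplit(original)}
chunks_b = {h for h, _ in hashsplit(edited)}

# Typically only one or two chunk hashes differ; all chunks before and
# after the affected region come out identical.
print(len(chunks_a - chunks_b), "chunks changed out of", len(chunks_a))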

The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.
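
The arithmetic behind that figure, assuming the average 8192-byte chunk and a 20-byte sha1 per chunk:

overhead_ratio = 20 / 8192                 # ≈ 0.0024, i.e. about 0.25 %
file_size = 200 * 10**9                    # a 200 GB file
print(file_size * overhead_ratio / 10**6)  # ≈ 488 MB of raw sha1 list data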

We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.
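
A sketch of the fanout idea (the constants and helper below are illustrative only; bup's real scheme differs in its details): a boundary whose checksum has extra low-order 1-bits beyond the first 13 also closes one or more tree levels, so runs of chunks get grouped under intermediate git tree objects.

SPLIT_BITS = 13     # as in the sketch above
FANOUT_BITS = 4     # hypothetical: each extra group of 4 one-bits closes a tree level

def split_level(checksum: int) -> int:
    # 0 = ordinary chunk boundary; 1, 2, ... = a boundary that also ends
    # one or more higher levels of the chunk tree.
    level = 0
    checksum >>= SPLIT_BITS
    while (checksum & ((1 << FANOUT_BITS) - 1)) == (1 << FANOUT_BITS) - 1:
        level += 1
        checksum >>= FANOUT_BITS
    return level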

Handling huge numbers of files and git gc

git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.

The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc', which combines those files into a single packfile (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).
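
"One file per blob" is literal: a loose object is a single file named after the sha1 of a small header plus the content. A sketch of where such a file lands, following git's standard layout (the helper name is ours, not a git API):

import hashlib
import os

def loose_object_path(repo: str, data: bytes) -> str:
    # git hashes "blob <size>\0" + content and stores the (zlib-compressed)
    # result as .git/objects/<first 2 hex chars>/<remaining 38 chars>.
    sha1 = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
    return os.path.join(repo, ".git", "objects", sha1[:2], sha1[2:])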

'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.

bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.

Handling huge repositories (meaning huge numbers of huge packfiles)

Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile index (.idx) files.

Each packfile (*.pack) in git has an associated idx (*.idx) file that's a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.
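
In essence, each .idx lookup is a binary search over a sorted list of (sha1, offset) pairs. A minimal sketch of that lookup (an in-memory stand-in, not the on-disk .idx format):

import bisect

def find_in_pack(idx_entries, want_sha1):
    # idx_entries: list of (sha1_bytes, pack_offset) tuples sorted by sha1,
    # which is the essence of what a .idx file stores.
    hashes = [h for h, _ in idx_entries]
    i = bisect.bisect_left(hashes, want_sha1)
    if i < len(hashes) and hashes[i] == want_sha1:
        return idx_entries[i][1]    # seek to this offset in the .pack file
    return None                     # not here: git must try the next pack's .idx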

The performance of the binary search is about O(log n) with the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that somewhat improves it to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.

To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.
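
The gain is easy to see in a sketch: merging the per-pack entries into one sorted list turns "one binary search per pack" into a single binary search overall. The helper below is illustrative only and ignores the fanout table that the real midx format also stores.

def build_midx(packs):
    # packs: a list of per-pack idx_entries, as used by find_in_pack() above.
    merged = []
    for pack_id, idx_entries in enumerate(packs):
        merged.extend((sha1, (pack_id, offset)) for sha1, offset in idx_entries)
    merged.sort(key=lambda entry: entry[0])
    return merged   # one sorted list: a single binary search locates any object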