I have a codebase that (until now) used git to store its dependencies. The repository itself is available here (warning: it's HUGE). Needless to say, I need to remove the dependencies from the repository history in order to cut it down to a reasonable size.
I started by using David Underhill's instructions to remove the lib directory from the history. Even after doing this, however, the repository is still over 300M. Issuing git prune and git repack helps, but it's still over 180M.
In an attempt to find any bloated blobs, I issued
git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head
with these results:
105526b5d3d398b9989d88c2f9fc2d1dc96a85b8 blob 35685609 33600527 31978828
d296935e6ac5f3f58b50c789394c9769116e9c34 blob 35658016 33593241 112485744
50636f931180a32764edadd854968a971a083f8a blob 28360290 25897864 233390
b9e4dd37428e879a258f297b7f5bcfb9ba869695 blob 13108002 11640713 66661788
08d2720b2414aa07ce419b17d5f80c333c7313b7 blob 12551621 11124009 89231035
6197a478a461275a0396f20c28487e9ae619a5f9 blob 11975135 11058259 148211988 1 50636f931180a32764edadd854968a971a083f8a
549eb0c73776fd0ede27a2fcb03366f76f45a13c blob 9136086 8166649 166451273
5bc0a0f04a7004bc16cfab1c091c6b369fb74049 blob 9072616 8270262 80951514
741480238a6a6ce612cf089245dd46d6890fba9f blob 8858569 8080252 101294029
744226651c55b14c1aa8affb78fba4fdf02b577c blob 7412220 6766404 186825167
This is where I'm stuck. I can git show these blobs and see that they look very much like jar files, but I can't figure out why they're still in the repo.
Various attempts to find their filenames failed. git repack -a, git repack -ad, and git repack -Ad all seem to have no effect.
--prune=now on git gc
Although you'd successfully written your unwanted objects out of history, it looks like those unwanted objects were not being pruned because they were too young to be pruned by default (see the configuration docs on git gc for a bit more detail). Using git gc --prune=now should handle that, or you could see this answer for a more nuclear option.
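The effect of the default grace period can be seen in a throwaway repository. The following is only a sketch under stated assumptions: the temporary directory, the file names and the 100 kB blob are made up for the demonstration, and the reflog expiry step is an extra precaution that is not part of the advice above.

```shell
#!/bin/sh
set -e

# Work in a scratch repository so nothing real is touched (hypothetical setup).
demo=$(mktemp -d)
cd "$demo"
git init -q .
echo "hello" > README
git add README
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "initial commit"

# Write a blob into the object store that no commit references.
head -c 100000 /dev/urandom > big.bin
blob=$(git hash-object -w big.bin)
rm big.bin

# A plain gc only prunes unreachable objects older than gc.pruneExpire
# (two weeks by default), so the fresh blob survives.
git gc -q
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Expire reflog entries (assumed extra step) and prune without a grace period.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi

echo "before=$before after=$after"
```

On a typical git installation this should report the blob as present after the plain gc and gone after the second one.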
Although that should fix your final problem, an underlying problem was the difficulty of finding big blobs in order to remove them using git filter-branch - to which I would say: git filter-branch is painful to use for a task like this, and there's a much better, less well-known tool called The BFG, specifically designed for removing Large Files from Git repos.
The core command to remove big files looks just like this:
$ bfg --strip-blobs-bigger-than 10MB my-repo.git
Any blob over 10MB in size (that isn't in your latest commit) will be totally removed from your repository's history - you don't have to manually find the files yourself, and files in protected commits are safe.
You can then use git gc to clean away the dead data:
$ git gc --prune=now --aggressive
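If you do want to see which paths the biggest blobs belong to, before or after a cleanup, something along these lines may help. It is a rough sketch rather than part of the BFG workflow: the scratch repository and the file name lib.jar are invented for illustration, and only the pipeline in the middle would be run against a real repository.

```shell
#!/bin/sh
set -e

# Scratch repository with one large and one small file (hypothetical names).
demo=$(mktemp -d)
cd "$demo"
git init -q .
head -c 200000 /dev/urandom > lib.jar
echo "hello" > README
git add .
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "add files"

# List every object reachable from any ref together with its path,
# ask cat-file for type and size, keep blobs only, largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
  awk '$1 == "blob"' |
  sort -k2,2nr |
  head -n 10)

echo "$largest"
```

Each output line has the form "blob SIZE SHA PATH". Note that this only covers objects still reachable from a ref, so blobs that show up in git verify-pack but not here are unreachable leftovers waiting to be pruned.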
The BFG is typically hundreds of times faster than running git-filter-branch on a big repo, and its options are tailored around two common use-cases: removing very large files, and removing passwords, credentials and other private data.
Full disclosure: I'm the author of the BFG Repo-Cleaner.