git-subtree is not retaining history so I cannot push subtree changes, how can I fix this/avoid this issue in the future?

Screndib · Apr 22, 2011

I've been using the git-subtree extension (https://github.com/apenwarr/git-subtree) to manage sub-projects within our main project. It's doing exactly what I want other than the fact that it fails when I try to split out changes made to a sub-project from our main project.

For example, earlier on I had run

git subtree add -P Some/Sub/Dir --squash git@gitserver:lib.git master

to bring the library code into Some/Sub/Dir in our main project. Everything went well, so I pushed my changes to our central bare git repo for the main project. I then made a change to my local version of the lib in Some/Sub/Dir, committed it, and split it out to push it back to the lib.git repo:

git subtree split -P Some/Sub/Dir -b some_branch

Everything worked as expected. No longer needing the local copy of the repo, I deleted it.

After cloning a new copy of the repo from our central repo, I made some changes to the lib in Some/Sub/Dir and decided I wanted to split those changes out and push them back to the lib.git repository. I attempted to use the same subtree split command as before; however, this time I ended up with the following output:

1/      3 (0)
2/      3 (1)
3/      3 (1)
fatal: bad object d76a03f0ec7e20724bcfa253e6a03683211a7bb1

d76a03f0ec7e20724bcfa253e6a03683211a7bb1 comes from when I added the subtree:

commit 43b3eb7d69d5eb64241eddb12e5bd74fd0215083
Author: Ian Bond <[email protected]>
Date:   Fri Apr 22 15:06:50 2011 -0400

    Squashed 'Subtree/librepoLib/' content from commit d76a03f

    git-subtree-dir: Subtree/librepoLib
    git-subtree-split: d76a03f0ec7e20724bcfa253e6a03683211a7bb1

which actually refers to a commit in the lib.git repo.


What I've been able to piece together (I'm a git noob, so I may be wrong, overlooking something, or using incorrect terminology) is that 'git subtree add --squash' fetches the entire history of the remote lib.git repo into the current repo, squashes it down into a single separate commit, then adds that commit to the working branch. The lib.git commits remain in the current repo, but they are dangling: nothing references them except the text of the squash commit message. As long as those dangling commits remain, git-subtree can use them to perform splits. However, a push or pull does not transfer dangling objects, and a gc that fully prunes dangling objects deletes them, so in either case those commits are lost and git-subtree no longer has the information it needs to perform the split.
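One way to confirm this diagnosis is to ask git whether the commit named in the git-subtree-split line still exists in the local object database. The following is a minimal sketch using only core git commands; the helper name have_commit is made up for illustration and is not part of git or git-subtree.

```shell
#!/bin/sh
# have_commit REPO SHA: succeed if SHA names a commit object present in REPO.
# (have_commit is a hypothetical helper name, not part of git or git-subtree.)
have_commit() {
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Example, run from the top of the main project:
#   have_commit . d76a03f0ec7e20724bcfa253e6a03683211a7bb1 ||
#       echo "split base is missing; fetch lib.git before splitting"
```

If the check fails in a fresh clone but succeeds in the repo where the subtree was originally added, that would be consistent with the objects not having been transferred.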

I've added a script that will fully reproduce the issues I've been having.


My questions are:

1) What can I do to handle the existing situation, where I now have subtrees that I want to merge back to their origin repo but no longer have any history that links them together? My current thought is to do something like:

git subtree split -P Some/Sub/Dir 43b3eb7^.. --ignore-joins -b splitBranch

to split out all of the history since the 'git subtree add' and merge it back into the origin repo (which thankfully has not had any changes since the add). Is this the best way to go? Any recommendations for how I should perform the merge?

2) Is there anything I can do to make git-subtree work as expected? I believe if I omit the --squash parameter on 'git subtree add' then everything will work, however that causes a bunch of unrelated history to be injected into my repo. Is there some way to keep the needed commits around (preferably without keeping the entire history of the library around)?

Answer

Chris Johnsen · Apr 23, 2011

The purpose of git subtree split is to create some new commits (representing “local” changes originally made in the subtree’s local directory) on top of the subtree’s original history. Since it directly involves the subtree’s original history (as the parent commit of the first rewritten local commit that touches the subtree), the split operation cannot be done unless the subtree’s original history is itself present.

Think about what you will be doing with the history that git subtree split generates. You will probably want to push it to a repository where you can merge it into the rest of the “upstream” history. In order for this merge operation to make sense, the split history needs to be based on the original history itself [1].

Probably the most reliable way to arrange for users to have the subtree’s original history is to publish the URL for the subtree’s upstream repository in your documentation and have them define a remote for it (it is perfectly fine to have “unrelated” remotes in a single repository). E.g.

If you need to work with the “upstream” of Some/Sub/Dir (to pull in external changes or push out local changes), please define and update a remote for the library’s repository before using git subtree:

git remote add lib git@host:the-lib-repository &&
git fetch lib

You would need to do something like this even if you were not using --squash since users would need to know where to get new upstream commits (and where (ultimately) to push new split-generated commits).
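To decide whether a given clone actually needs that fetch, one could read the commit ids that git subtree add --squash recorded in its commit messages and check which of them are absent locally. This is only a sketch: recorded_splits and missing_splits are hypothetical helper names, the remote name lib is the one from the example above, and the helpers do not distinguish between multiple subtrees in one repository.

```shell
#!/bin/sh
# recorded_splits REPO: print every commit id that appears on a
# "git-subtree-split:" line in the commit messages reachable from HEAD.
# (Hypothetical helper; it only reads commit messages.)
recorded_splits() {
    git -C "$1" log --format=%B | sed -n 's/^git-subtree-split: //p'
}

# missing_splits REPO: print the recorded ids whose commit objects are
# not present in REPO's object database.
missing_splits() {
    recorded_splits "$1" | while read -r id; do
        git -C "$1" cat-file -e "$id^{commit}" 2>/dev/null || echo "$id"
    done
}

# Example: fetch the library remote only when something is missing.
#   [ -z "$(missing_splits .)" ] || git fetch lib
```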

Using --squash gives you a “clean” history in your main project and means that only those users that need to deal with the subtree’s “upstream” actually have to have its objects in their repositories.


It seems like you have a good understanding of the object model. You are correct that the history that git subtree add --squash pulls in will become dangling [2], but that git subtree split can still use it until it is pruned away.

(With reference to your reproduction script.)
You are able to split successfully in your repoMainClone only because local clones automatically hardlink (or copy) all the files in .git/objects/, which gives the clone access to repoMain’s copies of the dangling (or nearly dangling [2]) objects from repoLib. The usual “pack protocol” transport would instead limit the transferred objects to those needed for the transferred refs, i.e. it would omit everything from repoLib. Your repoMainPull is effectively equivalent to cloning file://"$(pwd)"/repoMain repoMainCloneFile (the file:// URL forces local clones to use pack-based transfers instead of just linking/copying everything).
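The difference between the two clone styles can be reproduced in isolation. The sketch below builds a throwaway repository containing one unreferenced blob (standing in for the leftover repoLib objects), clones it once by plain path and once through a file:// URL, and reports where the blob is still available. The function name and the directory names src, by-path and by-url are invented for this illustration.

```shell
#!/bin/sh
# compare_clone_transports WORKDIR: WORKDIR must be an existing, empty,
# absolute directory (the file:// URL is built from it).
# (Hypothetical demo function; nothing is read from your real repos.)
compare_clone_transports() {
    work=$1
    git init -q "$work/src"
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "initial commit"
    # Write a blob that no ref or commit points at.
    blob=$(echo "unreferenced" | git -C "$work/src" hash-object -w --stdin)

    # Plain path: the local shortcut links or copies .git/objects/ wholesale.
    git clone -q "$work/src" "$work/by-path"
    # file:// URL: pack-based transfer of the objects needed for the refs.
    git clone -q "file://$work/src" "$work/by-url"

    for clone in by-path by-url; do
        if git -C "$work/$clone" cat-file -e "$blob" 2>/dev/null; then
            echo "$clone: present"
        else
            echo "$clone: missing"
        fi
    done
}

# Example:
#   compare_clone_transports "$(mktemp -d)"
```

Given the behaviour described above, the path clone should report the blob as present and the file:// clone should report it as missing.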


[1] Actually, you can directly merge unrelated histories, but you lose the ability to do three-way merges (since there is no common ancestor). This would be quite a sacrifice.

Your proposed git subtree split -P Some/Sub/Dir 43b3eb7^.. --ignore-joins … (where 43b3eb7 is the synthetic commit that resulted from git subtree add --squash …) would generate an unrelated history, except that it needs to be 43b3eb7.. since 43b3eb7^ means “the first parent of 43b3eb7” and 43b3eb7 has no parents. I am not sure that git subtree split was designed to take ranges like this, though. The documentation for git subtree split just says <commit>, but never really mentions its purpose. Reading the code shows that it defaults to HEAD, which might indicate that it is intended to be a single commit specifying the “tip” of the history that should be processed for splitting. Also, turning on the debug output shows a message incorrect order:, which might indicate that using a range argument puts the split operation in an unexpected situation: it expects to have processed all of the parents of a commit before processing the commit itself, but the range ensures that 43b3eb7 (which is the parent of the subtree merge commit) is never processed. I think you can just use --ignore-joins and leave off the range if you want to generate “unrelated” history and try to use it in some way: git subtree split -P Some/Sub/Dir --ignore-joins ….
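Before trying to merge a branch produced this way, it may be worth checking whether it shares any ancestor with the upstream branch, since that determines whether a three-way merge is possible at all. A minimal sketch using git merge-base follows; shares_history is a hypothetical helper name, and the branch names in the example comment are placeholders.

```shell
#!/bin/sh
# shares_history REPO A B: succeed if commits A and B have at least one
# common ancestor in REPO (git merge-base exits non-zero when there is none).
shares_history() {
    git -C "$1" merge-base "$2" "$3" >/dev/null 2>&1
}

# Example, with splitBranch from the question and a fetched upstream branch:
#   shares_history . splitBranch lib/master ||
#       echo "unrelated histories: no common ancestor for a three-way merge"
```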

[2] They are not actually dangling immediately after git subtree add --squash because they are still referenced by FETCH_HEAD. Once an unrelated fetch is done, however, they will become truly dangling.