Creating a GitHub repository with only a subset of a local repository's history

Seth Johnson picture Seth Johnson · Apr 20, 2011 · Viewed 7.4k times · Source

The background: I'm moving closer to open sourcing a personal research code I've been working on for more than two years. It started life as an SVN repository, but I moved to Git about a year ago, and I'd like to share the code on GitHub. However, it accumulated a lot of cruft over the years, and I'd prefer that the public version begin its life at its current status. However, I'd still like to contribute to it and incorporate other people's potential contributions.

The question: is there a way to "fork" a Git repository such that no history is retained on the fork (which lives on GitHub), but that my local repository still has a complete history, and I can pull/push to GitHub?

I don't have any experience in the administrating end of large repositories, so detail is very much appreciated.

Answer

Brian Campbell picture Brian Campbell · Apr 20, 2011

You can create a new, fresh history quite easily in Git. Let’s say you want your master branch to be the one that you will push to GitHub, and your full history to be stored in old-master. You can just move your master branch to old-master, and then start a fresh new branch with no history using git checkout --orphan:

git branch -m master old-master
git checkout --orphan master
git commit -m "Import clean version of my code"

Now you have a new master branch with no history, which you can push to GitHub. But, as you say, you would like to be able to see all of the old history in your local repository; and would probably like for it to not be disconnected.

You can do this using git replace. A replacement ref is a way of specifying an alternate commit any time Git looks at a given commit. So you can tell Git to look at the last commit of your old branch, instead of the first commit of your new branch, when looking at history. In order to do this, you need to bring in the disconnected history from the old repository.

git replace master old-master

Now you have your new branch, in which you can see all of your history, but the actual commit objects are disconnected from the old history, and so you can push the new commits to GitHub without the old commits coming along. Push your master branch to GitHub, and only the new commits will go to GitHub. But take a look at the history in gitk or git log, and you'll see the full history.

git push github master:master
gitk --all

Gotchas

If you ever base any new branches on the old commits, you will have to be careful to keep the history separate; otherwise, new commits on those branches will really have the old commits in their history, and so you'll pull the whole history along if you push it up to GitHub. As long as you keep all of your new commits based on your new master, though, you'll be fine.

If you ever run git push --tags github, that will push all of your tags, including old ones, which will cause all of your old history to be pulled along with it. You could deal with this by deleting all of your old tags (git tag -d $(git tag -l)), or by never using git push --tags but only ever pushing tags manually, or by using two repositories as described below.

The basic problem underlying both of these gotchas is that if you ever push any ref which connects to any of the old history (other than via the replaced commits), you will push up all of the old history. Probably the best way of avoiding this is by using two repositories, one which contains only the new commits, and one which contains both the old and new history, for the purpose of inspecting the full history. You do all of your work, your committing, your pushing and pulling from GitHub, in the repository with just the new commits; that way, you can't possibly accidentally push your old commits up.

You then pull all of your new commits into your repository that has the full history, whenever you need to look at the entire thing. You can either pull from GitHub or your other local repository, whichever is more convenient. It will be your archive, but to avoid accidentally publishing your old history, you don't ever push to GitHub from it. Here's how you can set it up:

~$ mkdir newrepo
~$ cd newrepo
newrepo$ git init
newrepo$ git pull ~/oldrepo master
# Now newrepo has just the new history; we can set up oldrepo to pull from it
newrepo$ cd ~/oldrepo
oldrepo$ git remote add newrepo ~/newrepo
oldrepo$ git remote update
oldrepo$ git branch --set-upstream master newrepo/master
# ... do work in newrepo, commit, push to GitHub, etc.
# Now if we want to look at the full history in oldrepo:
oldrepo$ git pull

If you're on Git older than 1.7.2

You don't have git checkout --orphan, so you'll have to do it manually by creating a fresh repository from the current revision of your existing repository, and then pulling in your old disconnected history. You can do this with, for example:

oldrepo$ mkdir ~/newrepo
oldrepo$ cp $(git ls-files) ~/newrepo
oldrepo$ cd ~/newrepo
newrepo$ git init
newrepo$ git add .
newrepo$ git commit -m "Import clean version of my code"
newrepo$ git fetch ~/oldrepo master:old-master

If you're on Git older than 1.6.5

git replace and replace refs were added in 1.6.5, so you'll have to use an older, somewhat less flexible mechanism known as grafts, which allow you to specify alternate parents for a given commit. Instead of the git replace command, run:

echo $(git rev-parse master) $(git rev-parse old-master) >> .git/info/grafts

This will make it look, locally, as if the master commit has the old-master commit as its parent, so you will see one more commit than you would with git replace.