I realize that git works by diff'ing the contents of files. I have some files that I want to copy. To absolutely prevent git from ever getting confused, is there some git command that can be used to copy the files to a different directory (not mv, but cp), and stage the files as well?
The short answer is just "no". But there is more to know; it just requires some background. (And as JDB suggests in a comment, I'll mention why git mv exists as a convenience.)
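In practice, the closest equivalent is an ordinary cp followed by git add. A minimal sketch, assuming a throwaway repository and hypothetical paths src/file.txt and dest/file.txt:

```shell
# Throwaway repository; all paths here are made up for illustration.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
mkdir src dest
echo "some content" > src/file.txt
git add src/file.txt

# There is no `git cp`: copy with plain cp, then stage the copy.
cp src/file.txt dest/file.txt
git add dest/file.txt

# Both paths are now in the index.
staged=$(git ls-files)
echo "$staged"
```

Nothing about the copy relationship is recorded; the index simply holds two paths with identical contents.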
Slightly longer: you're right that Git will diff files, but you may be wrong about when Git does these file-diffs.
Git's internal storage model is that each commit is an independent snapshot of all the files in that commit. The version of each file that goes into the new commit, i.e., the data in the snapshot for that path, is whatever is in the index under that path at the time you run git commit.1
The actual implementation, to the first level, is that each snapshotted file is captured in compressed form as a blob object in the Git database. The blob object is quite independent of every previous and subsequent version of that file, except for one special case: if a file's data have not changed in the new commit, Git re-uses the old blob. So when you make two commits in a row, each of which holds 100 files, and only one file is changed, the second commit re-uses 99 previous blobs, and need only snapshot one actual file into a new blob.2
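The blob re-use can be observed directly by comparing blob hash IDs across two commits. This is a small sketch, assuming a throwaway repository with two hypothetical files, keep.txt (never changed) and edit.txt (changed between commits):

```shell
# Throwaway repository; file names and identity are made up for illustration.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo "unchanged" > keep.txt
echo "version 1" > edit.txt
git add keep.txt edit.txt
git commit -q -m "first"

echo "version 2" > edit.txt
git add edit.txt
git commit -q -m "second"

# <commit>:<path> names the blob stored for that path in that commit.
keep_old=$(git rev-parse HEAD~1:keep.txt)
keep_new=$(git rev-parse HEAD:keep.txt)
edit_old=$(git rev-parse HEAD~1:edit.txt)
edit_new=$(git rev-parse HEAD:edit.txt)

echo "keep.txt: $keep_old -> $keep_new"   # same hash: the blob is re-used
echo "edit.txt: $edit_old -> $edit_new"   # different hashes: a new blob
```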
Hence the fact that Git will diff files doesn't enter into making commits at all. No commit depends on a previous commit, other than to store the previous commit's hash ID (and perhaps to re-use exactly-matching blobs, but that's a side effect of them exactly matching, rather than a fancy computation at the time you run git commit).
Now, all these independent blob objects do eventually take up an exorbitant amount of space. At this point, Git can "pack" objects into a .pack file. It will compare each object to some selected set of other objects (they may be earlier or later in history, and have the same file name or different file names, and in theory Git could even compress a commit object against a blob object or vice versa, though in practice it doesn't) and try to find some way to represent many blobs using less disk space. But the result is still, at least logically, a series of independent objects, retrieved completely intact in their original form using their hash IDs. So even though the amount of disk space used goes down (we hope!) at this point, all of the objects are exactly the same as before.
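To see that packing leaves objects logically unchanged, one can read a blob by hash ID before and after forcing a pack. A rough sketch, assuming a throwaway repository and using git gc as one way to trigger packing:

```shell
# Throwaway repository; file name and identity are made up for illustration.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo "some file contents" > file.txt
git add file.txt
git commit -q -m "initial"

hash=$(git rev-parse HEAD:file.txt)
before=$(git cat-file -p "$hash")   # read from a loose object

git gc --quiet                      # packs the loose objects

after=$(git cat-file -p "$hash")    # same hash ID, now read from the pack
packed=$(git count-objects -v | sed -n 's/^in-pack: //p')

echo "objects in pack: $packed"
echo "before: $before"
echo "after:  $after"
```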
So when does Git compare files? The answer is: Only when you ask it to. The "ask time" is when you run git diff, either directly:
git diff commit1 commit2
or indirectly:
git show commit # roughly, `git diff commit^@ commit`
git log -p # runs `git show commit`, more or less, on each commit
There are a bunch of subtleties about this (in particular, git show will produce what Git calls combined diffs when run on merge commits, while git log -p normally just skips right over the diffs for merge commits), but these, along with some other important cases, are when Git runs git diff.
It's when Git runs git diff that you can (sometimes) ask it to find, or not to find, copies. The -C flag, also spelled --find-copies=<number>, asks Git to find copies. The --find-copies-harder flag (which the Git documentation calls "computationally expensive") looks harder for copies than the plain -C flag. The -B (break inappropriate pairings) option affects -C. The -M aka --find-renames=<number> option also affects -C. The git merge command can be told to adjust its level of rename detection, but, at least currently, cannot be told to find copies, nor to break inappropriate pairings.
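As an illustration of copy detection happening at diff time rather than commit time, here is a small sketch. It assumes a throwaway repository and hypothetical file names; the copy is made with plain cp, and the source file is left unmodified, which is why --find-copies-harder is used alongside -C:

```shell
# Throwaway repository; file names and identity are made up for illustration.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

printf 'line one\nline two\nline three\n' > original.txt
git add original.txt
git commit -q -m "add original"

cp original.txt copy.txt
git add copy.txt
git commit -q -m "add copy"

# Without copy detection, the new file is reported as a plain addition.
plain=$(git diff --name-status HEAD~1 HEAD)

# With copy detection that also considers unmodified files as sources,
# the same pair of commits is reported as a copy.
harder=$(git diff -C --find-copies-harder --name-status HEAD~1 HEAD)

echo "plain:  $plain"
echo "harder: $harder"
```

The two commits are identical in both runs; only the options passed to git diff change what is reported.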
(One command, git blame, does somewhat different copy-finding and the above does not entirely apply to it.)
1If you run git commit --include <paths> or git commit --only <paths> or git commit <paths> or git commit -a, think of these as modifying the index before running git commit. In the special case of --only, Git uses a temporary index, which is a little bit complicated, but it still commits from an index; it just uses the special temporary one instead of the normal one. To make the temporary index, Git copies all the files from the HEAD commit, then overlays those with the --only files you listed. For the other cases, Git just copies the work-tree files into the regular index, then goes on to make the commit from the index as usual.
2In fact, the actual snapshotting, storing the blob into the repository, happens during git add. This secretly makes git commit much faster, since you don't normally notice the extra time it takes to run git add before you fire up git commit.
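The point in footnote 2 can be checked by looking for the blob right after git add, before any commit exists. A minimal sketch, assuming a throwaway repository and a hypothetical file.txt:

```shell
# Throwaway repository; the file name is made up for illustration.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

echo "snapshot me" > file.txt
git add file.txt

# :<path> names the index entry for that path; no commit has been made yet.
hash=$(git rev-parse :file.txt)
type=$(git cat-file -t "$hash")
content=$(git cat-file -p "$hash")

echo "$hash is a $type containing: $content"
```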
Why git mv exists
What git mv old new does is, very roughly:
mv old new
git add new
git add old
The first step is obvious enough: we need to rename the work-tree version of the file. The second step is similar: we need to put the index version of the file into place. The third, though, is weird: why should we "add" a file we just removed? Well, git add doesn't always add a file: in this case it detects that the file is in the index but no longer in the work-tree, and removes it from the index.
We could also spell that third step as:
git rm --cached old
All we're really doing is taking the old name out of the index.
But there's an issue here, which is why I said "very roughly". The index has a copy of each file that will be committed the next time you run git commit. That copy might not match the one in the work-tree. In fact, it might not even match the one in HEAD, if there is one in HEAD at all.
For instance, after:
echo I am a foo > foo
git add foo
the file foo exists in the work-tree and in the index. The work-tree contents and the index contents match. But now let's change the work-tree version:
echo I am a bar > foo
Now the index and work-tree differ. Suppose we want to move the underlying file from foo to bar, but (for some strange reason3) we want to keep the index contents unchanged. If we run:
mv foo bar
git add bar
we'll get I am a bar inside the new index file. If we then remove the old version of foo from the index, we lose the I am a foo version entirely.
So, git mv foo bar doesn't really move-and-add-twice, or move-add-and-remove. Instead, it renames the work-tree file and renames the in-index copy. If the index copy of the original file differs from the work-tree file, the renamed index copy still differs from the renamed work-tree copy.
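Continuing the foo and bar example, this sketch shows the index copy and the work-tree copy staying distinct across git mv. It assumes a throwaway repository; the variable names are only for illustration:

```shell
# Throwaway repository for the foo/bar example.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

echo I am a foo > foo
git add foo              # index copy: "I am a foo"
echo I am a bar > foo    # work-tree copy: "I am a bar"

git mv foo bar           # renames both the work-tree file and the index entry

index_copy=$(git show :bar)   # staged contents under the new name
worktree_copy=$(cat bar)      # work-tree contents under the new name

echo "index:     $index_copy"
echo "work-tree: $worktree_copy"
```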
It's very difficult to do this without a front end command like git mv.4 Of course, if you plan to git add everything, you don't need all of this stuff in the first place. And, it's worth noting that if git cp existed, it probably should also copy the index version, not the work-tree version, when making the index copy. So git cp really should exist. There also should be a git mv --after option, a la Mercurial's hg mv --after. Both should exist, but currently don't. (There's less call for either of these, though, than there is for straight git mv, in my opinion.)
3For this example, it's kind of silly and pointless. But if you use git add -p to carefully prepare a patch for an intermediate commit, and then decide that along with the patch, you would like to rename the file, it's definitely handy to be able to do that without messing up your carefully-patched-together intermediate version.
4It's not impossible: git ls-files --stage will get you the information you need from the index as it is right now, and git update-index allows you to make arbitrary changes to the index. You can combine these two, and some complex shell scripting or programming in a nicer language, to build something that implements git mv --after and git cp.
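As a rough sketch of what footnote 4 describes, here is a hypothetical git_cp shell function (the name and behavior are assumptions, not an existing Git command). It handles only a single tracked, unconflicted file, copies the work-tree file with cp, and copies the index entry by reading it with git ls-files --stage and writing it under the new path with git update-index --cacheinfo:

```shell
# Hypothetical helper, not part of Git: copy one tracked file, duplicating
# its *index* contents (not its work-tree contents) under the new path.
git_cp() {
    src=$1
    dst=$2
    # Output format: "<mode> <hash> <stage><TAB><path>"
    entry=$(git ls-files --stage -- "$src")
    if [ -z "$entry" ]; then
        echo "git_cp: not in index: $src" >&2
        return 1
    fi
    mode=${entry%% *}      # first space-separated field
    rest=${entry#* }
    hash=${rest%% *}       # second space-separated field
    cp -- "$src" "$dst" &&
        git update-index --add --cacheinfo "$mode,$hash,$dst"
}

# Demonstration in a throwaway repository.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
echo I am a foo > foo
git add foo              # index copy: "I am a foo"
echo I am a bar > foo    # work-tree copy: "I am a bar"

git_cp foo bar

index_copy=$(git show :bar)   # copied from the index entry for foo
worktree_copy=$(cat bar)      # copied from the work-tree file foo
echo "index:     $index_copy"
echo "work-tree: $worktree_copy"
```

A more complete version would need to deal with directories, multiple paths, and conflicted (non-zero stage) entries, none of which this sketch attempts.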