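Collapsing a Git repository's history

We have a Git project with quite a large history.

Specifically, early in the project there were quite a lot of binary resource files in the repository. These have since been removed, as they are effectively external resources.

However, because those files were committed earlier, the repository is over 200 MB, while a full checkout is currently about 20 MB.

What we would like to do is "collapse" the history so that the repository appears to have been created from a later revision. For example: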

1-----2-----3-----4-----+---+---+
                   \       /
                    +-----+---+---+
  1. Repository created
  2. Large set of binary files added
  3. Large set of binary files removed
  4. New intended "start" of the repository

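So, effectively, we want to lose the project history before a certain point. At that point there is only one branch, so there is no complication of dealing with multiple starting points and the like. However, we do not want to lose all of the history and start a new repository from the current version.

Is this possible, or are we doomed to have a bloated repository forever?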


Is git-fast-export what you are looking for?

NAME
git-fast-export - Git data exporter


SYNOPSIS
git-fast-export [options] | git-fast-import


DESCRIPTION
This program dumps the given revisions in a form suitable to be piped into
git-fast-import(1).


You can use it as a human readable bundle replacement (see git-bundle(1)), or as a kind
of an interactive git-filter-branch(1).

Thanks to JesperE's post I looked into git-filter-branch -- that may actually be what you want. It looks like you could retain your earlier commits too except they would be modified since your Big Files were removed. From the git-filter-branch man page:

Suppose you want to remove a file (containing confidential information or copyright violation) from all commits:

git filter-branch --tree-filter 'rm filename' HEAD

Be sure to read that man page... obviously you'd want to do this on a spare clone of your repository to make sure it works as expected.
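
To try this safely first, here is a minimal sketch that builds a throwaway repository and runs the same kind of tree filter on it. The file names (main.c, assets.bin) and commit messages are made up for illustration, and I use rm -f rather than plain rm so the filter does not fail on commits where the file does not exist yet. FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer Git versions print before running filter-branch.

```shell
#!/bin/sh
# Sketch: strip one file from every commit of a scratch repository.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Commit 1: ordinary source file (hypothetical name)
echo "source" > main.c
git add main.c
git commit -q -m "Create repository"

# Commit 2: stand-in for the large binary resources
printf 'pretend this is a large binary\n' > assets.bin
git add assets.bin
git commit -q -m "Add binary"

# Commit 3: the binary is removed again
git rm -q assets.bin
git commit -q -m "Remove binary"

# Rewrite every commit reachable from HEAD, deleting the file from each tree
git filter-branch --tree-filter 'rm -f assets.bin' HEAD

# Collect every path that any commit in the rewritten history touches
paths_in_history=$(git log --pretty=format: --name-only HEAD | sort -u)
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
echo "paths: $paths_in_history"
```

The commits are all still there afterwards (the "Add binary" one just becomes empty), but filter-branch keeps the old history under refs/original, so the space is not reclaimed until those refs and the reflog entries are gone.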

You can remove the binary bloat and keep the rest of your history. Git allows you to reorder and 'squash' prior commits, so you can combine just the commits that add and remove your big binary files. If the adds were all done in one commit and the removals in another, this will be much easier than dealing with each file.

$ git log --stat       # list all commits and commit messages

Search this for the commits that add and delete your binary files and note their SHA1s, say 2bcdef and 3cdef3.

Then, to edit the repo's history, run git rebase with its interactive option (-i), starting from the parent of the commit where you added your binaries. It will launch your $EDITOR and you'll see a list of commits starting with 2bcdef:

$ git rebase -i 2bcdef^    # generate a pick list of all commits starting with 2bcdef
# Rebasing zzzzzz onto yyyyyyy
#
# Commands:
#  pick = use commit
#  edit = use commit, but stop for amending
#  squash = use commit, but meld into previous commit
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
pick 2bcdef   Add binary files and other edits
pick xxxxxx   Another change
.
.
pick 3cdef3   Remove binary files; link to them as external resources
.
.

Insert squash 3cdef3 as the second line, directly below pick 2bcdef, and remove the original pick 3cdef3 line further down the list. You now have a list of actions for the interactive rebase which will combine the commits that add and delete your binaries into one commit whose diff is just any other changes in those commits. Save and close the editor; Git then reapplies all of the subsequent commits in order. If the rebase stops along the way (for example on a conflict), fix the problem and tell it to carry on:

$ git rebase --continue

This will take a minute or two.
You now have a repo whose history no longer has the binaries coming or going. But they will still take up space, because the old commits are still referenced from the reflog, and by default Git keeps reflog entries for unreachable commits for 30 days before they can be garbage-collected, so that you can change your mind. If you want to remove them now:

$ git reflog expire --expire=1.minute --all
# expire reflog entries older than 1 minute for every ref, including HEAD
$ git fsck --unreachable      # list the objects (commits, trees, blobs) that are now unreachable
$ git prune                   # delete the unreachable loose objects
$ git gc                      # repack what remains

Now you've removed the bloat but kept the rest of your history.
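
If you want to convince yourself that the cleanup really drops the objects, here is a small sketch on a scratch repository. It is not the rebase itself: a deleted side branch stands in for the commits that a rebase leaves behind, and the names (with-binary, assets.bin) are hypothetical. It uses --expire=now and gc --prune=now, which is a more aggressive variant of the commands above.

```shell
#!/bin/sh
# Sketch: show that an orphaned blob survives until the reflog is expired.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "source" > main.c
git add main.c
git commit -q -m "Create repository"
main_branch=$(git symbolic-ref --short HEAD)

# Commit a stand-in binary on a side branch, then throw the branch away
git checkout -q -b with-binary
printf 'pretend this is a large binary\n' > assets.bin
git add assets.bin
git commit -q -m "Add binary"
blob=$(git rev-parse HEAD:assets.bin)
git checkout -q "$main_branch"
git branch -q -D with-binary

# No branch refers to the blob now, but HEAD's reflog still does
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Drop all reflog entries, then prune unreachable objects immediately
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

Keep in mind that this only shrinks the local repository; other clones keep their own copies of the old objects until they are re-cloned or cleaned up the same way.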

You can use git filter-branch with grafts to make the commit number 4 the new root commit of your branch. Just create the file .git/info/grafts with just one line in it containing the SHA1 of commit number 4.

If you now do a git log or gitk you will see that those commands will display commit number 4 as the root of your branch. But nothing will have actually changed in your repository. You can delete .git/info/grafts and the output of git log or gitk will be as before. To actually make commit number 4 the new root you will have to run git filter-branch, with no arguments.
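
Here is a rough sketch of that procedure on a scratch repository with six commits, where "commit 4" is simply HEAD~2. One assumption to flag: instead of writing .git/info/grafts by hand it uses git replace --graft, because as far as I know recent Git versions deprecate the grafts file in favour of replace refs. The effect should be the same: the chosen commit is presented as parentless, and filter-branch then makes that permanent.

```shell
#!/bin/sh
# Sketch: make an existing commit the new root of the branch.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Six linear commits; the subjects are made up for the demo
for n in 1 2 3 4 5 6; do
    echo "change $n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# The commit that should become the new start of history ("commit 4")
new_root=$(git rev-parse HEAD~2)

# Pretend new_root has no parents (replacement for the grafts file)
git replace --graft "$new_root"

# Rewrite the branch so the truncated parentage becomes real
git filter-branch HEAD

# The replacement ref is no longer needed once history is rewritten
git replace -d "$new_root"

commit_count=$(git rev-list --count HEAD)
root_subject=$(git log --format=%s --max-parents=0 HEAD)
echo "commits: $commit_count, root: $root_subject"
```

As with the other approaches, the old commits remain on disk (filter-branch keeps a backup under refs/original) until those refs and the reflog are cleared and the repository is garbage-collected.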