Which commit has this blob?

小开

I thought this would be a generally useful thing to have, so I wrote up a little perl script to do it:

#!/usr/bin/perl -w

use strict;

my @commits;    # commits still to visit
my %seen;       # commits already queued, so shared history is walked once
my %trees;      # memo: tree id => 1 if the blob is somewhere below it
my $blob;

# Return 1 if $blob is reachable from the given tree, caching per tree.
sub blob_in_tree {
    my $tree = $_[0];
    if (defined $trees{$tree}) {
        return $trees{$tree};
    }
    my $r = 0;
    open(my $f, "git cat-file -p $tree|") or die $!;
    while (<$f>) {
        if (/^\d+ blob (\w+)/ && $1 eq $blob) {
            $r = 1;
        } elsif (/^\d+ tree (\w+)/) {
            $r = blob_in_tree($1);
        }
        last if $r;
    }
    close($f);
    $trees{$tree} = $r;
    return $r;
}

# Print the commit if its tree contains $blob, then queue its parents.
sub handle_commit {
    my $commit = $_[0];
    open(my $f, "git cat-file commit $commit|") or die $!;
    my $tree = <$f>;
    die unless $tree =~ /^tree (\w+)$/;
    if (blob_in_tree($1)) {
        print "$commit\n";
    }
    while (1) {
        my $parent = <$f>;
        last unless $parent =~ /^parent (\w+)$/;
        push @commits, $1 unless $seen{$1}++;
    }
    close($f);
}

if (!@ARGV) {
    print STDERR "Usage: git-find-blob blob [head ...]\n";
    exit 1;
}

# The first argument is the blob (full SHA-1); the rest are starting heads.
$blob = shift @ARGV;
if (@ARGV) {
    foreach (@ARGV) {
        handle_commit($_);
    }
} else {
    handle_commit("HEAD");
}
while (@commits) {
    handle_commit(pop @commits);
}

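I'll put this up on GitHub when I get home this evening.

Update: It looks like somebody already did this. That one uses the same general idea, but the details are different and the implementation is much shorter. I don't know which would be faster, but performance is probably not a concern here!

Update 2: For what it's worth, my implementation is orders of magnitude faster, especially for a large repository. That git ls-tree -r really hurts.

Update 3: I should note that my performance comments above apply to the implementation I linked in the first update. Aristotle's implementation performs comparably to mine. More details are in the comments for those who are curious.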

小开

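Best answer

Both of the following scripts take the blob's SHA1 as the first argument and, after it, optionally any arguments that git log will understand. For example, --all to search in all branches instead of just the current one, or -g to search in the reflog, or whatever else you fancy.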
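Both scripts need the blob's object name as their first argument. As a minimal sketch of where that name can come from (the helper names `blob_id_at` and `blob_id_of_file` are made up for illustration), one option is to resolve a `<rev>:<path>` expression with git rev-parse, another is to hash a file on disk with git hash-object:

```shell
#!/bin/sh
# Hypothetical helpers for looking up a blob's full object name.

# blob_id_at REV PATH: the blob stored at PATH in revision REV.
blob_id_at() {
    git rev-parse --verify --quiet "$1:$2"
}

# blob_id_of_file FILE: the id the file's current content would get as a blob.
blob_id_of_file() {
    git hash-object -- "$1"
}

# Example (inside a repository):
#   blob_id_at HEAD Makefile
#   blob_id_of_file Makefile
```

For an unmodified tracked file the two should agree; the result can then be passed to either script, followed by any git log arguments such as --all.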

Here it is as a shell script, short and sweet, but slow:

#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done

And an optimised version in Perl, still quite short but much faster:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

my $obj_name;

# Return 1 if $obj_name appears in this tree or in any of its subtrees.
sub check_tree {
    my ( $tree ) = @_;
    my @subtree;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)/
                or die "unexpected git-ls-tree output";
            return 1 if $2 eq $obj_name;
            push @subtree, $2 if $1 eq 'tree';
        }
    }

    check_tree( $_ ) && return 1 for @subtree;

    return;
}

# Cache results per tree id: unchanged subtrees are shared between commits.
memoize 'check_tree';

die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
    if not @ARGV;

# Expand a possibly abbreviated object name to the full SHA1.
my $obj_short = shift @ARGV;
$obj_name = do {
    local $ENV{'OBJ_NAME'} = $obj_short;
    `git rev-parse --verify \$OBJ_NAME`;
} or die "Couldn't parse $obj_short: $!\n";
chomp $obj_name;

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
    or die "Couldn't open pipe to git-log: $!\n";

while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $subject ) = split " ", $_, 3;
    print "$commit $subject\n" if check_tree( $tree );
}

小开

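While the original question does not ask for it, I think it is useful to also check the staging area to see if a blob is referenced. I modified the original shell script to do this and found what was referencing a corrupt blob in my repository: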

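#!/bin/sh
obj_name="$1"
shift
# Check the index (staging area) first.
git ls-files --stage \
| if grep -q "$obj_name"; then
    echo Found in staging area. Run git ls-files --stage to see.
fi

# Then walk the commits selected by the remaining git log arguments.
# tformat (rather than format) ends every line with a newline, so the
# read loop does not drop the last commit.
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done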
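The staging-area check above only says that the blob is in the index somewhere. A small sketch that also prints the staged paths could look like the following; `index_paths_for_blob` is a made-up name, and the parsing assumes the usual `<mode> <object> <stage><TAB><path>` layout of git ls-files --stage:

```shell
#!/bin/sh
# index_paths_for_blob BLOB: print every staged path whose object name
# starts with BLOB (so an abbreviated hash works too).
index_paths_for_blob() {
    git ls-files --stage | awk -v id="$1" '
        # Field 2 is the object name; the path follows the first tab.
        index($2, id) == 1 { sub(/^[^\t]*\t/, ""); print }
    '
}

# Example:
#   index_paths_for_blob 15d408e3
```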

小开

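So... I needed to find all files over a given size limit in a repo over 8GB in size, with over 108,000 revisions. I adapted Aristotle's perl script, along with a ruby script I wrote, to reach this complete solution.

First, run git gc. Do this to ensure all objects are in packfiles, because we don't scan objects that are not in pack files.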
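To check that the gc step really left nothing outside the packs, the loose-object count reported by git count-objects -v can be inspected. This is only a sketch, and `all_objects_packed` is a name invented here:

```shell
#!/bin/sh
# all_objects_packed: succeed only if the repository has no loose objects.
all_objects_packed() {
    # The "count:" line of git count-objects -v is the number of loose objects.
    loose=$(git count-objects -v | awk '$1 == "count:" { print $2 }')
    [ "$loose" = "0" ]
}

# Example:
#   git gc
#   all_objects_packed || echo "some objects are still loose" >&2
```

Recently created unreachable objects may stay loose after a plain git gc, so a non-zero count is a hint to look closer rather than proof of a problem.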

Next, run this script to locate all blobs over CUTOFF_SIZE bytes, capturing its output to a file:

#!/usr/bin/env ruby

require 'log4r'

# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')

# 10MB cutoff
CUTOFF_SIZE=1024*1024*10
#CUTOFF_SIZE=1024

begin

  include Log4r
  log = Logger.new 'git-find-large-objects'
  log.level = INFO
  log.outputters = Outputter.stdout

  git_dir = %x[ git rev-parse --show-toplevel ].chomp

  if git_dir.empty?
    log.fatal "ERROR: must be run in a git repository"
    exit 1
  end

  log.debug "Git Dir: '#{git_dir}'"

  pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
  log.debug "Git Packs: #{pack_files.to_s}"

  # For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby
  #
  # Short version is, git verify-pack flushes buffers only on line endings, so
  # this works, if it didn't, then we could get partial lines and be sad.

  types = {
    :blob => 1,
    :tree => 1,
    :commit => 1,
  }

  total_count = 0
  counted_objects = 0
  large_objects = []

  # Passing the command as an array avoids the shell, so pack paths
  # containing spaces are handled correctly.
  IO.popen(["git", "verify-pack", "-v", "--", *pack_files]) do |pipe|
    pipe.each do |line|
      # The output of git verify-pack -v is:
      # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
      data = line.chomp.split(' ')
      # types are blob, tree, or commit
      # we ignore other lines by looking for that
      next unless data[1] && types[data[1].to_sym] == 1
      log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
      hash = {
        :sha1 => data[0],
        :type => data[1],
        :size => data[2].to_i,
      }
      total_count += hash[:size]
      counted_objects += 1
      if hash[:size] > CUTOFF_SIZE
        large_objects.push hash
      end
    end
  end

  log.info "Input complete"

  log.info "Counted #{counted_objects} totalling #{total_count} bytes."

  log.info "Sorting"

  # Largest objects first.
  large_objects.sort! { |a,b| b[:size] <=> a[:size] }

  log.info "Sorting complete"

  large_objects.each do |obj|
    log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
  end

  exit 0
end
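For comparison, a similar list of oversized blobs can be sketched without the Ruby script by asking git cat-file for every object's type and size. This is a different technique from the verify-pack approach above, not a port of it; `large_blobs` is a hypothetical name, and the sketch assumes a Git new enough to have --batch-all-objects:

```shell
#!/bin/sh
# large_blobs CUTOFF_BYTES: print "<size> <sha1>" for every blob larger than
# the cutoff, biggest first. Loose and packed objects are both included.
large_blobs() {
    git cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectsize) %(objectname)' |
    awk -v cutoff="$1" '$1 == "blob" && $2 + 0 > cutoff + 0 { print $2, $3 }' |
    sort -rn
}

# Example: everything over 10MB
#   large_blobs $((1024 * 1024 * 10))
```

The sizes reported here are uncompressed object sizes, the same quantity the Ruby script reads from the third column of git verify-pack -v.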

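Next, edit the file to remove any blobs you don't want to look for, along with the INPUT_THREAD lines at the top. Once you have only the lines for the SHA1s you want to find, run the following script like this: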

cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log

The git-find-blob script is below.

#!/usr/bin/perl

# taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob
# and modified by Carl Myers <cmyers@cmyers.org> to scan multiple blobs at once
# Also, modified to keep the discovered filenames
# vi: ft=perl

use 5.008;
use strict;
use Memoize;
use Data::Dumper;

# Maps each full object name to the argument it was expanded from.
my $BLOBS = {};

MAIN: {

    memoize 'check_tree';

    die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
        if not @ARGV;

    while ( @ARGV && $ARGV[0] ne '--' ) {
        my $arg = $ARGV[0];
        #print "Processing argument $arg\n";
        open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
        my $obj_name = <$rev_parse>;
        close $rev_parse or die "Couldn't expand passed blob.\n";
        chomp $obj_name;
        #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
        print "($arg expands to $obj_name)\n";
        $BLOBS->{$obj_name} = $arg;
        shift @ARGV;
    }
    shift @ARGV; # drop the -- if present

    #print "BLOBS: " . Dumper($BLOBS) . "\n";

    # Note: check_tree reports hits for every blob in $BLOBS, so each pass
    # of this loop walks the same history; memoization keeps repeats cheap.
    foreach my $blob ( keys %{$BLOBS} ) {
        #print "Printing results for blob $blob:\n";

        open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
            or die "Couldn't open pipe to git-log: $!\n";

        while ( <$log> ) {
            chomp;
            my ( $tree, $commit, $subject ) = split " ", $_, 3;
            #print "Checking tree $tree\n";
            my $results = check_tree( $tree );

            #print "RESULTS: " . Dumper($results);
            if (%{$results}) {
                print "$commit $subject\n";
                foreach my $blob ( keys %{$results} ) {
                    print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
                }
            }
        }
    }

}

# Return { BLOB => [ PATH, ... ] } for every wanted blob found below $tree.
sub check_tree {
    my ( $tree ) = @_;
    #print "Calculating hits for tree $tree\n";

    my @subtree;

    # results = { BLOB => [ FILENAME1 ] }
    my $results = {};
    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        # example git ls-tree output:
        # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424    filename.txt
        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)\s+(.*)/
                or die "unexpected git-ls-tree output";
            #print "Scanning line '$_' tree $2 file $3\n";
            foreach my $blob ( keys %{$BLOBS} ) {
                if ( $2 eq $blob ) {
                    print "Found $blob in $tree:$3\n";
                    push @{$results->{$blob}}, $3;
                }
            }
            push @subtree, [$2, $3] if $1 eq 'tree';
        }
    }

    foreach my $st ( @subtree ) {
        # $st->[0] is tree, $st->[1] is dirname
        my $st_result = check_tree( $st->[0] );
        foreach my $blob ( keys %{$st_result} ) {
            foreach my $filename ( @{$st_result->{$blob}} ) {
                my $path = $st->[1] . '/' . $filename;
                #print "Generating subdir path $path\n";
                push @{$results->{$blob}}, $path;
            }
        }
    }

    #print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
    return $results;
}

The output will look like this:

<hash prefix> <oneline log message>
	path/to/file.txt
	path/to/file2.txt
	...
<hash prefix2> <oneline log msg...>

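And so on. Every commit which contains a large file in its tree will be listed. If you grep out the lines that start with a tab, and uniq that, you will have a list of all the paths you can filter-branch to remove, or you can do something more complicated.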
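The "grep the tab lines and uniq them" step might be sketched as follows. `unique_paths` is a made-up name, and the sketch assumes the log format produced by the script above, where one tab-indented line can hold several paths joined by ", " (so a path that itself contains ", " would be split incorrectly):

```shell
#!/bin/sh
# unique_paths LOGFILE: print every path from the tab-indented lines once.
unique_paths() {
    awk -F'\t' '
        # Only lines starting with a tab carry paths; split the joined list.
        /^\t/ { n = split($2, p, ", "); for (i = 1; i <= n; i++) print p[i] }
    ' "$1" | sort -u
}

# Example:
#   unique_paths large-file-paths.log > paths-to-remove.txt
```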

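Let me reiterate: this process ran successfully on a 10GB repo with 108,000 commits. It took much longer than I predicted when running on a large number of blobs though, over 10 hours. I will have to see if the memoize bit is working...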

小开

Unfortunately the scripts were a bit slow for me, so I had to optimize a bit. Luckily I had not only the hash but also the path of the file.

git log --all --pretty=format:%H -- <path> | xargs -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
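The same idea can be wrapped in a small function so that the hash and path are passed as arguments. This is a sketch rather than a drop-in replacement; `commits_with_blob_at` is a hypothetical name, and like the one-liner it only looks at commits that changed the path:

```shell
#!/bin/sh
# commits_with_blob_at HASH PATH: among the commits on any ref that touched
# PATH, print those whose tree holds a blob starting with HASH at PATH.
commits_with_blob_at() {
    # tformat ends each line with a newline, so read sees the last commit too.
    git log --all --pretty=tformat:%H -- "$2" |
    while read -r commit; do
        if git ls-tree "$commit" -- "$2" | grep -q " blob $1"; then
            echo "$commit"
        fi
    done
}

# Example:
#   commits_with_blob_at 15d408e3 path/to/file.txt
```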

小开

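Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

With Git 2.16 (Q1 2018), git describe would be a good solution, since it was taught to dig trees deeper to find a <commit-ish>:<path> that refers to a given blob object.

See commits 4dbc59a, c87b653 and others (16 Nov 2017), and commit 91904f5 and another (02 Nov 2017).
(Merged by Junio C Hamano -- gitster -- in commit 556de1a, 28 Dec 2017)

describe a blob

Sometimes users are given a hash of an object and they want to identify it further (e.g. use verify-pack to find the largest blobs, but what are these? Or this very SO question, "Which commit has this blob?")

When describing commits, we try to anchor them to tags or refs, as these are conceptually on a higher level than the commit. And if there is no ref or tag that matches exactly, we're out of luck.
So we employ a heuristic to make up a name for the commit. These names are ambiguous: there might be different tags or refs to anchor to, and there might be different paths in the DAG to travel to arrive at the commit precisely.

When describing a blob, we want to describe the blob from a higher layer as well, which is a tuple of (commit, deep/path), as the tree objects involved are rather uninteresting.
The same blob can be referenced by multiple commits, so how do we decide which commit to use?

This patch implements a rather naive approach: as there are no back pointers from blobs to the commits in which the blob occurs, we'll start walking from any tips available, listing the blobs in order of the commit, and once we have found the blob, we'll take the first commit that listed the blob.
For example: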
git describe --tags v0.99:Makefile
conversion-901-g7672db20c2:Makefile
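tells us the Makefile as it was in v0.99 was introduced in commit 7672db2.

The walking is performed in reverse order to show the introduction of a blob rather than its last occurrence.

That means the git describe man page adds to the purposes of this command:

Instead of simply describing a commit using the most recent tag reachable from it, git describe will actually give an object a human-readable name based on an available ref when used as git describe <blob>.

If the given object refers to a blob, it will be described as <commit-ish>:<path>, such that the blob can be found at <path> in the <commit-ish>, which itself describes the first commit in which this blob occurs in a reverse revision walk from HEAD.

But:

BUGS

Tree objects, as well as tag objects not pointing at commits, cannot be described.
When describing blobs, lightweight tags pointing at blobs are ignored, but the blob is still described as <commit-ish>:<path> despite the lightweight tag being favorable.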
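A minimal way to try the blob form of git describe is sketched below. The wrapper name `describe_blob` is invented here; --always is added on the assumption that the repository may have no tags, in which case the commit part should fall back to an abbreviated commit id:

```shell
#!/bin/sh
# describe_blob BLOB: name a blob as <commit-ish>:<path> (Git 2.16 or later).
# --tags also considers lightweight tags; --always falls back to an
# abbreviated commit id when no tag is reachable.
describe_blob() {
    git describe --tags --always "$1"
}

# Example:
#   describe_blob "$(git rev-parse v0.99:Makefile)"
```

Because the walk starts from HEAD, a blob that only exists on some other branch would not be found this way.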

小开

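In addition to git describe, which I mention in my previous answer, git log and git diff now benefit as well from the "--find-object=<object-id>" option, which limits the findings to changes that involve the named object.
That is in Git 2.16.x/2.17 (Q1 2018).

See commit 4d8c51a, commit 5e50525, commit 15af58c, commit cf63051, commit c1ddc46, commit 929ed70 (04 Jan 2018) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- on 23 Jan 2018)

diffcore: add a pickaxe option to find a specific blob

Sometimes users are given a hash of an object and they want to identify it further (e.g. use verify-pack to find the largest blobs, but what are these? Or this Stack Overflow question, "Which commit has this blob?")

One might be tempted to extend git-describe to also work with blobs, such that git describe <blob-id> gives a description as "<commit-ish>:<path>".
This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right.
The hard part to get right is picking the correct "commit-ish", as that could be the commit that (re-)introduced the blob or the commit that removed the blob; the blob could also exist in different branches.

Junio hinted at a different approach to solving this problem, which this patch implements.
Teach the diff machinery another flag for restricting the information to what is shown.
For example: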
$ ./git log --oneline --find-object=v2.0.0:Makefile
b2feb64 Revert the whole "ask curl-config" topic for now
47fbfde i18n: only extract comments marked with "TRANSLATORS:"
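we observe that the Makefile as shipped with 2.0 appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b.
The reason why these commits both occur prior to v2.0.0 is evil merges, which are not found using this new mechanism.

As noted in the comments by Marcono1234, you can combine that with the git log --all option (git log --all --find-object=<blob hash>).

This can be useful when you don't know which branch contains the object.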
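Putting --all and --find-object together, a small wrapper might look like this (`commits_touching_blob` is a name made up for the sketch):

```shell
#!/bin/sh
# commits_touching_blob OBJECT: list commits on any ref whose diff adds or
# removes the given object (Git 2.16 or later). OBJECT can be a hash or any
# expression that resolves to a blob, such as v2.0.0:Makefile.
commits_touching_blob() {
    git log --all --oneline --find-object="$1"
}

# Example:
#   commits_touching_blob v2.0.0:Makefile
```

A commit that replaces the blob with a new version counts as removing it, so a file that was added once and later modified typically shows up twice.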

小开

The most useful command for humans is probably

git whatchanged --all --find-object=<blob hash>

This shows, across --all branches, any commits that added or removed a file with that hash, along with what the path was.

git$ git whatchanged --all --find-object=b3bb59f06644
commit 8ef93124645f89c45c9ec3edd3b268b38154061a
⋮
diff: do not show submodule with untracked files as "-dirty"
⋮
:100644 100644 b3bb59f06644 8f6227c993a5 M      submodule.c


commit 7091499bc0a9bccd81a1c864de7b5f87a366480e
⋮
Revert "submodules: fix of regression on fetching of non-init subsub-repo"
⋮
:100644 100644 eef5204e641e b3bb59f06644 M  submodule.c

Note that git whatchanged already includes the before-and-after blob hashes in its output lines.
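The git-whatchanged documentation describes the command as essentially git log that defaults to raw-format diffs and skips merges, so a similar listing can be sketched with git log directly. The name `blob_changes` is hypothetical, and --no-abbrev is an addition here so that the raw lines show full object ids:

```shell
#!/bin/sh
# blob_changes BLOB: show commits on any ref that add or remove BLOB, with
# raw diff lines giving the before and after object ids and the path.
blob_changes() {
    git log --all --raw --no-merges --no-abbrev --find-object="$1"
}

# Example:
#   blob_changes b3bb59f06644
```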
