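How to determine if Git handles a file as binary or as text?

I know that Git somehow automatically detects whether a file is binary or text, and that .gitattributes can be used to set this manually if needed. But is there also a way to ask Git how it treats a file?

Say I have a Git repository with two files: an ascii.dat file containing plain text and a binary.dat file containing random binary data. Git handles the first .dat file as text and the second as binary. Now I want to write a Git web front end that has a viewer for text files and a special viewer for binary files (displaying a hex dump, for example). Of course I could implement my own text/binary check, but the viewer would be more useful if it relied on Git's own information about how it handles these files.

So how can I ask Git whether it treats a file as text or binary?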


I don't like this answer, but you can parse the output of git-diff-tree to see if it is binary. For example:

git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- MegaCli
diff --git a/megaraid/MegaCli b/megaraid/MegaCli
new file mode 100755
index 0000000..7f0e997
Binary files /dev/null and b/megaraid/MegaCli differ

as opposed to:

git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- megamgr
diff --git a/megaraid/megamgr b/megaraid/megamgr
new file mode 100755
index 0000000..50fd8a1
--- /dev/null
+++ b/megaraid/megamgr
@@ -0,0 +1,78 @@
+#!/bin/sh
[…]

Oh, and BTW, 4b825d… is the well-known SHA-1 of the empty tree. Git knows this ID specially, so it works even when no such object is stored in the repository.

builtin_diff() [1] calls diff_filespec_is_binary(), which calls buffer_is_binary(), which checks for any occurrence of a zero byte (NUL "character") in the first 8000 bytes (or the entire length if shorter).

I do not see that this “is it binary?” test is explicitly exposed in any command though.
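The zero-byte check described above can be approximated outside Git. The following is only a sketch: has_nul_in_prefix is a made-up helper name, not a Git command, and it assumes head -c and tr are available. It compares the size of the first 8000 bytes with and without NUL bytes.

```shell
# Hypothetical helper (not part of Git): succeeds if a NUL byte occurs
# in the first 8000 bytes of the file named by $1.
has_nul_in_prefix() {
    # Byte count of the examined prefix, with and without NUL bytes.
    total=$(head -c 8000 < "$1" | wc -c | tr -d ' ')
    kept=$(head -c 8000 < "$1" | tr -d '\000' | wc -c | tr -d ' ')
    # Any difference means at least one NUL byte was removed.
    [ "$total" -ne "$kept" ]
}
```

It can be used like has_nul_in_prefix binary.dat && echo binary || echo "not binary". It only looks at the working-tree file, so it ignores .gitattributes and anything else Git consults before falling back to this heuristic.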

git merge-file directly uses buffer_is_binary(), so you may be able to make use of it:

git merge-file /dev/null /dev/null file-to-test

When given a binary file, it seems to print an error message like error: Cannot merge binary files: file-to-test and exit with status 255. I am not sure I would want to rely on this behavior though.

Maybe git diff --numstat would be more reliable:

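isBinary() {
    # For binary files, --numstat prints "-<TAB>-<TAB>" instead of line counts.
    p=$(printf '%s\t-\t' -)
    # Compare the file against /dev/null; works outside a repository too.
    t=$(git diff --no-index --numstat /dev/null "$1")
    case "$t" in "$p"*) return 0 ;; esac
    return 1
}
isBinary file-to-test && echo binary || echo not binary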

For binary files, the --numstat output should start with - TAB - TAB, so we just test for that.


[1] builtin_diff() has strings like Binary files %s and %s differ that should be familiar.

git grep -I --name-only --untracked -e . -- ascii.dat binary.dat ...

will return the names of files that git interprets as text files.

The trick here is in these two git grep parameters:

  • -I: Don’t match the pattern in binary files.
  • -e .: A regular expression that matches any single character, so every non-empty text file produces a match (empty files are not listed).

You can use wildcards. Quote the pattern so that Git, not the shell, expands it:

git grep -I --name-only --untracked -e . -- '*.ps1'

You can use the command-line file utility. On Windows it is included in the Git installation and is normally located in the C:\Program Files\git\usr\bin folder. Note that this reports file's own detection, not Git's.

file --mime-encoding *

See more in Get encoding of a file in Windows

At the risk of getting slapped for poor code quality, I'm listing a C utility, is_binary, built around the original buffer_is_binary() routine in the Git source. Please see the internal comments for how to build and run it. It is easy to modify:

/***********************************************************
* is_binary.c
*
* Usage: is_binary [<pathname>]
*   Exit status 1 if the input looks binary, 0 if it does not,
*   3 if the file cannot be opened or read.
*   With no argument, standard input is read.
*
* Thanks to Git and Stackoverflow developers for helping with these routines:
* - the buffer_is_binary() routine from the xdiff-interface.c module
*   in git source code.
* - https://stackoverflow.com/questions/14002954/c-programming-how-to-read-the-whole-file-contents-into-a-buffer
* - https://stackoverflow.com/questions/1563882/reading-a-file-name-from-piped-command
* - https://stackoverflow.com/questions/6119956/how-to-determine-if-git-handles-a-file-as-binary-or-as-text
*
* To build:
*    % gcc is_binary.c -o is_binary
*
* To build debuggable (to push a few messages to stdout):
*    % gcc -DDEBUG=1 ./is_binary.c -o is_binary
*
* NOTES:
*  Only the first 8000 bytes are read, so piped input works:
*    % cat foo.tar | is_binary
*  Empty input is reported as non-binary (memchr() finds no NUL).
*
* Revision 1.4
*
* Tue Sep 12 09:01:33 EDT 2017
***********************************************************/
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define FIRST_FEW_BYTES 8000

/*
 * From xdiff-interface.c in Git. The key routine is memchr() from libc.
 * Checks for any occurrence of a zero byte (NUL character) in the first
 * 8000 bytes (or the entire length if shorter).
 */
int buffer_is_binary(const char *ptr, unsigned long size)
{
    if (FIRST_FEW_BYTES < size)
        size = FIRST_FEW_BYTES;
    return !!memchr(ptr, 0, size);
}

/* Read up to FIRST_FEW_BYTES from fp into buf; works for files and pipes. */
size_t fill_buffer(FILE *fp, char *buf)
{
    size_t total = 0, n;

    while (total < FIRST_FEW_BYTES &&
           (n = fread(buf + total, 1, FIRST_FEW_BYTES - total, fp)) > 0)
        total += n;
    return total;
}

int main(int argc, char *argv[])
{
    static char buf[FIRST_FEW_BYTES];
    const char *pathname = "<stdin>";
    FILE *fp = stdin;

    if (argc > 1) {
        /* Use argv[1] directly; no fixed-size copy that could overflow. */
        pathname = argv[1];
        fp = fopen(pathname, "rb");
        if (fp == NULL) {
            fprintf(stderr, "I'm sorry, Dave, I can't do that--");
            fprintf(stderr, "open the file '%s', that is.\n", pathname);
            exit(3);
        }
    }

    size_t size = fill_buffer(fp, buf);
    if (ferror(fp)) {
        fprintf(stderr, "Error reading '%s'.\n", pathname);
        exit(3);
    }
    if (fp != stdin)
        fclose(fp);

    /* Pass the number of bytes actually read, not an adjusted file size. */
    int result = buffer_is_binary(buf, size);

#ifdef DEBUG
    printf("File '%s' is %s; examined %lu bytes.\n", pathname,
           result ? "BINARY" : "NON-BINARY", (unsigned long)size);
#endif
    exit(result);
    /* easy check -- 'echo $?' after running */
}

@bonh gave a working answer in a comment

git diff --numstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- | grep "^-" | cut -f 3

It shows all files which git interprets as binaries.
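A variant that checks both count columns rather than only the first character can be sketched with awk. It relies on the numstat convention mentioned earlier on this page, where binary files show - in both the added and deleted columns. binary_paths_from_numstat is a made-up helper name, and the function only parses text on standard input.

```shell
# Hypothetical helper: reads `git diff --numstat` output on stdin and
# prints the path of every entry whose added and deleted counts are "-".
binary_paths_from_numstat() {
    awk -F '\t' '$1 == "-" && $2 == "-" { print $3 }'
}
```

It would be fed from the same command, for example git diff --numstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- | binary_paths_from_numstat.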

# considered binary (or with bare CR) file
git ls-files --eol | grep -E '^(i/-text)'


# files that do not have any line-ending characters (including empty files) - unlikely that this is a true binary file ?
git ls-files --eol | grep -E '^(i/none)'


#                                                        via experimentation
#                                                      ------------------------
#    "-text"        binary (or with bare CR) file     : not    auto-normalized
#    "none"         text file without any EOL         : not    auto-normalized
#    "lf"           text file with LF                 : is     auto-normalized when gitattributes text=auto
#    "crlf"         text file with CRLF               : is     auto-normalized when gitattributes text=auto
#    "mixed"        text file with mixed line endings : is     auto-normalized when gitattributes text=auto
#                   (LF or CRLF, but not bare CR)

Source: https://git-scm.com/docs/git-ls-files#Documentation/git-ls-files.txt---eol https://github.com/git/git/commit/a7630bd4274a0dff7cff8b92de3d3f064e321359
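To get only the path names instead of whole lines, the output can be split at the tab that separates the attribute columns from the path. This is a sketch that assumes the documented line layout (i/<eolinfo>, w/<eolinfo>, attr/<eolattr>, then a TAB and the path). index_binary_paths is a made-up helper name.

```shell
# Hypothetical helper: reads `git ls-files --eol` output on stdin and
# prints the paths whose index content is reported as "-text".
index_binary_paths() {
    # Field 1 holds the i/, w/ and attr/ columns; field 2 is the path.
    awk -F '\t' '$1 ~ /^i\/-text[ ]/ { print $2 }'
}
```

Usage would be git ls-files --eol | index_binary_paths. As the table above notes, "-text" also covers files with a bare CR, so the result is a superset of true binaries.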

Oh, by the way: be careful when setting the text attribute in .gitattributes, e.g. *.abc text. In that case all files matching *.abc are normalized, even if they are binary (any CRLF inside the binary data is converted to LF). This differs from the text=auto behaviour.

Use git check-attr --all.

This works regardless of whether the file has been staged or committed.

Tested on git version 2.30.2.

Assuming you have this in .gitattributes.

package-lock.json binary

The output is:

git check-attr --all package-lock.json
package-lock.json: binary: set
package-lock.json: diff: unset
package-lock.json: merge: unset
package-lock.json: text: unset

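For files with no matching attributes there is no output. In that case Git falls back to its content-based detection, which check-attr does not report.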

git check-attr --all package.json
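For a viewer, the diff attribute is the relevant one: per the gitattributes documentation, set means the file is diffed as text, unset means it is treated as binary, and unspecified means Git inspects the content. The sketch below turns that into a three-way answer. classify_diff_attr is a made-up helper name, and it assumes paths do not contain the sequence ": ".

```shell
# Hypothetical helper: reads `git check-attr diff -- <paths>` output on
# stdin (lines like "path: diff: unset") and prints "<path>TAB<verdict>".
classify_diff_attr() {
    awk -F ': ' '
        $2 == "diff" && $3 == "unset" { print $1 "\tbinary"; next }
        $2 == "diff" && $3 == "set"   { print $1 "\ttext"; next }
        # unspecified, or a named diff driver: attributes alone do not decide
        $2 == "diff"                  { print $1 "\tundecided" }
    '
}
```

Usage would be git check-attr diff -- package-lock.json package.json | classify_diff_attr. Files reported as undecided still need one of the content-based checks from the other answers.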