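Can Git treat ZIP files as directories and the files inside a ZIP as blobs?

The scenario

Imagine that I am forced to work with some files that are always stored inside .zip files. Some of the files inside the ZIP are small text files that change often, while others are larger but thankfully static (for example, images).

If I want to place these ZIP files in a Git repository, each ZIP is treated as a blob, so whenever I commit, the repository grows by the size of the ZIP file... even if only one small text file inside it changed!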

Why this is realistic

Microsoft Word 2007/2010 .docx and Excel .xlsx files are ZIP files...

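What I want

Is there any way to tell Git not to treat ZIP files as files, but rather as directories, and to treat their contents as files?

Benefits

But you say it wouldn't work?

I realize that without extra metadata this would lead to some ambiguity: on git checkout, Git would have to decide whether to create foo.zip/bar.txt as a file in a regular directory or as a file inside a ZIP file. However, I think this could be solved with configuration options.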

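Two ways this could be implemented (if it doesn't already exist)

  • Use a library such as minizip or IO::Compress::Zip inside Git
  • Somehow add a filesystem layer so that Git actually sees ZIP files as directories to begin with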

This doesn't exist, but it could easily exist in the current framework. Just as Git acts differently with displaying binary or ASCII files when performing a diff, it could be told to offer special treatment to certain file types through the configuration interface.

If you don't want to change the code base (although this is kind of a cool idea you've got), you could also script it yourself: use a pre-commit hook to unzip the archives and store their contents, and a post-checkout hook to return them to their .zip state on checkout. You would have to restrict these actions to the files (blobs / index entries) that are specified by git add.

Either way is a bit of work -- it's just a question of whether the other Git commands are aware of what's going on and play nicely.
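
A minimal sketch of the pre-commit half of that idea, assuming Info-ZIP's unzip is on the PATH. The function names and the ".d" suffix for the unpacked directory are made up for illustration, and the post-checkout half (zipping the directory back up) is left out.

```shell
#!/bin/sh
# Hypothetical pre-commit sketch: unpack every staged *.zip next to itself
# so that its members are staged as ordinary files.

# Maps an archive path to the directory that holds its unpacked content.
# The ".d" suffix is an assumption made for this sketch.
content_dir() {
    printf '%s.d\n' "$1"
}

# Unpacks all added, copied or modified *.zip files found in the index
# and stages the unpacked directories.
unpack_staged_zips() {
    git diff --cached --name-only --diff-filter=ACM -- '*.zip' |
    while IFS= read -r archive; do
        dir=$(content_dir "$archive")
        rm -rf "$dir"
        unzip -q -o "$archive" -d "$dir" </dev/null || exit 1
        git add -- "$dir"
    done
}
```

To try it, the two functions could be pasted into .git/hooks/pre-commit, followed by a call to unpack_staged_zips. The archive itself stays staged in this sketch, so it only illustrates the unpacking step, not the space saving.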

Often there are problems with pre-zipped files, because the applications expect the ZIP compression method and file order to be the ones they chose. I believe that OpenOffice .odf files have that problem.

That said, if you are simply using any old ZIP file as a way of keeping stuff together, then you should be able to create a few simple aliases that unzip and re-zip when required. The very latest MSysGit (aka Git for Windows) now ships both zip and unzip in its shell, so you can use them in aliases.

The project I'm currently working on uses ZIP files as the main local version control / archive, so I'm also trying to get a workable set of aliases for sucking these hundreds of ZIP files into Git (and getting them out again ;-) so that the coworkers are happy.

Use bup (presented in detail in GitMinutes #24)

It is the only Git-like system designed to deal with large (even very, very large) files, which means every version of a ZIP file only grows the repo by its delta (instead of by a full additional copy).

The result is an actual Git repo that regular Git commands can read.

I detail how bup differs from Git in "git with large files".


Any other workaround (like git-annex) isn't entirely satisfactory, as detailed in "git-annex with large files".

From Managing ZIP-based file formats in git:

Note: per comment from Ruben, this is only about getting a proper diff though, not about committing unzipped files.

Open your ~/.gitconfig file (create it if it does not exist yet) and add the following stanza:

[diff "zip"]
textconv = unzip -c -a

This uses “unzip -c -a FILENAME” to convert your ZIP file into ASCII text (unzip -c unzips to STDOUT). Next, create or modify the file REPOSITORY/.gitattributes and add the following:

*.pptx diff=zip

which tells Git to use the zip diff driver from the config for files matching the given pattern (in this case everything ending in .pptx). Now git diff automatically unzips the files and diffs the ASCII output, which is a little better than just “binary files differ”. On the other hand, due to the convoluted mess that the XML inside pptx files is, it doesn’t help a lot there; but for ZIP files containing text (for example, source code archives) it is actually quite handy.
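
The same setup can also be done from the command line. The sketch below is a variant of the steps above under one assumption: it writes the textconv driver to the repository's own config rather than to ~/.gitconfig. The function name setup_zip_textconv is made up for illustration.

```shell
#!/bin/sh
# Registers the "zip" diff driver in one repository and maps *.pptx to it.
# Usage: setup_zip_textconv /path/to/repo
setup_zip_textconv() {
    repo=$1
    # Same driver as the [diff "zip"] stanza, stored in the repository config
    git -C "$repo" config diff.zip.textconv "unzip -c -a" || return 1
    # Same mapping as the .gitattributes line
    printf '%s\n' '*.pptx diff=zip' >> "$repo/.gitattributes"
}
```

Because textconv only affects how diffs are displayed, the committed blob is still the original ZIP file.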

Zippey - A solution using Git file filter

My solution is to use a filter to "flatten" the ZIP file into a monolithic, expanded (possibly huge) text file. During git add/commit the ZIP file is automatically expanded to this text format for normal text diffing, and during checkout it is automatically zipped up again.

The text file is composed of records, each representing a file in the ZIP file. You can think of this text file as a text-based image of the original ZIP file. If a file in the ZIP is indeed text, it is copied into the text file as is; otherwise, it is Base64-encoded before being copied in. This keeps the text file always a text file.

Although this filter does not make each file in the ZIP file a blob, text files are mapped line to line (the line being the unit of the diff), while changes to binary files are represented by updates of their corresponding Base64. I think this is equivalent to what the OP imagines.

For details and prototype code, see the following link:

Zippey Git file filter

Also, credit to the source that inspired this solution: Description of how file filter works

The Java tool ReZipDoc, similar to Zippey by sippey, lets you handle ZIP files in a nicer way with Git.

How it works

When adding/committing a ZIP based file, Rezip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta information before each file). If those archived files are plain-text files, this method will play nicely with Git.
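
The effect of storing without compression can be tried out with the ordinary Info-ZIP tools. This is only a sketch of the idea, not Rezip itself: the function name store_repack is made up, and unlike a dedicated tool it may change the order of the archive members.

```shell
#!/bin/sh
# Repacks an existing ZIP archive with the "store" method (no compression).
# Usage: store_repack <source.zip> <destination.zip>
store_repack() {
    src=$1
    dst=$2
    work=$(mktemp -d) || return 1
    # Unpack into a scratch directory, then zip it again with -0 (store only).
    # -X leaves out extra file attributes so the result is more reproducible.
    unzip -q "$src" -d "$work/in" &&
        (cd "$work/in" && zip -q -0 -r -X ../out.zip .) &&
        mv "$work/out.zip" "$dst"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

After repacking, plain-text members can be found in the archive with a simple grep, which is what lets Git's delta compression work on them.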

Benefits

The main benefit of Rezip over Zippey is that the actual file stored in the repository is still a ZIP file. Thus, in many cases, it will still work as-is with the respective application (for example Open Office), even if it is obtained without going through a re-packing-with-compression filter.

How to use

Install the filter(s) on your system:

mkdir -p ~/bin
cd ~/bin


# Download the filter executable (raw file; the /blob/ URL returns an HTML page)
wget https://github.com/costerwi/rezip/raw/master/Rezip.class


# Install the add/commit filter
git config --global --replace-all filter.rezip.clean "java -cp ~/bin Rezip --store"


# (optionally) Install the checkout filter
git config --global --add filter.rezip.smudge "java -cp ~/bin Rezip"

Use the filter in your repository, by adding lines like these to your <repo-root>/.gitattributes file:

[attr]textual     diff merge text
[attr]rezip       filter=rezip textual


# Microsoft Office
*.docx  rezip
*.xlsx  rezip
*.pptx  rezip
# OpenOffice
*.odt   rezip
*.ods   rezip
*.odp   rezip
# Misc
*.mcdx  rezip
*.slx   rezip

The textual part is so that these files are actually shown as text files in diffs.
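
To check that the macros resolve as intended, git check-attr can be asked which attributes apply to a path. A small sketch, to be run from inside the repository; the helper name attr_value is made up here.

```shell
#!/bin/sh
# Prints the value of one attribute for one path,
# e.g. "rezip", "set", "unset" or "unspecified".
# Usage: attr_value <attribute> <path>
attr_value() {
    # git check-attr prints "<path>: <attribute>: <value>"; keep only the value
    git check-attr "$1" -- "$2" | sed 's/.*: //'
}
```

With the .gitattributes lines above in place, attr_value filter report.docx should print rezip, while a path that matches none of the patterns prints unspecified.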

Here is my approach:

  • Using Git clean/smudge filters to replace the archive files with a content summary:

    git config filter.zip.clean "unzip -v %f | tail -n +4 | head -n -2 | awk '{ print \$7,\$8 }' | grep -vE /$ | LC_ALL=C sort -sfk 2,2"
    git config filter.zip.smudge "cat"
    git config filter.zip.required true
    
  • Using a pre-commit hook to extract and add the archive content:

    #!/bin/sh
    #
    # Git archive extraction pre commit hook
    #
    # Created: 2021 by Vivien Richter <vivien-richter@outlook.de>
    # License: CC-BY-4.0
    # Version: 1.0.2


    # Configuration
    ARCHIVE_EXTENSIONS=$(cat .gitattributes | grep "zip" | tr -d '[][:upper:]' | cut -d " " -f1 | cut -d. -f2 | head -c -1 | tr "\n" "|")


    # Processing (paths containing whitespace are not supported by this loop)
    for STAGED_FILE in $(git diff --name-only --cached | grep -iE "\.($ARCHIVE_EXTENSIONS)$")
    do
        # Directory that holds the extracted content, next to the archive
        CONTENT_DIR="$(dirname "$STAGED_FILE")/.$(basename "$STAGED_FILE").content"
        # Deletes the old archive content
        rm -rf "$CONTENT_DIR"
        # Extracts the archive content, if the archive itself is not removed
        if [ -f "$STAGED_FILE" ]; then
            unzip -o "$STAGED_FILE" -d "$CONTENT_DIR"
        fi
        # Adds extracted or deleted archive content to the stage
        git add "$CONTENT_DIR"
    done
    
  • Using a post-checkout hook for packing the archives again for usage:

    #!/bin/sh
    #
    # Git archive packing post checkout hook
    #
    # Created: 2021 by Vivien Richter <vivien-richter@outlook.de>
    # License: CC-BY-4.0
    # Version: 1.0.0


    # Configuration
    ARCHIVE_EXTENSIONS=$(cat .gitattributes | grep "zip" | tr -d '[][:upper:]' | cut -d " " -f1 | cut -d. -f2 | head -c -1 | tr "\n" "|")
    # Remembers the repository root, so that archive paths stay valid after `cd`
    ROOT=$(pwd)


    # Processing (paths containing whitespace are not supported by this loop)
    for EXTRACTED_ARCHIVE in $(git ls-tree -dr --full-tree --name-only HEAD | grep -iE "\.($ARCHIVE_EXTENSIONS)\.content$")
    do
        # Gets filename
        FILENAME="$(dirname "$EXTRACTED_ARCHIVE")/$(basename "$EXTRACTED_ARCHIVE" | cut -d. -f2- | awk -F '.content' '{ print $1 }')"
        # Removes the dummy archive file
        rm -f "$FILENAME"
        # Creates the real archive file from inside the extracted archive;
        # the subshell leaves the working directory of the loop untouched
        (cd "$EXTRACTED_ARCHIVE" && zip -r9 "$ROOT/$FILENAME" $(find . -type f))
    done
    
  • Apply the filter in the .gitattributes file:

    # Macro for all file types that should be treated as ZIP archives.
    [attr]zip text filter=zip
    
    
    # Forces `LF` as line endings for text based files inside ZIP archives.
    **/*.content/** text=auto eol=lf
    
    
    # OpenDocument
    *.[oO][dD][tT] zip
    *.[oO][dD][sS] zip
    *.[oO][dD][gG] zip
    *.[oO][dD][pP] zip
    *.[oO][dD][mM] zip
    
    
    # Krita
    *.[kK][rR][aA] zip
    
    
    # VRoid Studio
    *.[vV][rR][oO][iI][dD] zip
    *.[fF][vV][pP] zip
    
  • Add binary-file handling (Git LFS) to the .gitattributes file:

    # Macro for all binary files that should use Git LFS.
    [attr]bin -text filter=lfs diff=lfs merge=lfs lockable
    
    
    # Images
    *.[jJ][pP][gG] bin
    *.[jJ][pP][eE][gG] bin
    *.[pP][nN][gG] bin
    *.[aA][pP][nN][gG] bin
    *.[gG][iI][fF] bin
    *.[bB][mM][pP] bin
    *.[tT][gG][aA] bin
    *.[tT][iI][fF] bin
    *.[tT][iI][fF][fF] bin
    *.[sS][vV][gG][zZ] bin
    
  • Add the following to the .gitignore file:

    # Auto generated LFS hooks
    .githooks/pre-push
    
    
    # Temporary files
    *~
    
  • Configure the repository:

    1. Install Git LFS.
    2. Prepare LFS by running git lfs install once.
    3. Set up the Git filter (the git config filter.zip.* commands above).
    4. Install the hooks by running git config core.hooksPath .githooks.
    5. Apply the checkout hook once by running .githooks/post-checkout.
    6. Apply the filter once by running git add -A.
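
Steps 2 to 6 can be collected into one script. This is a sketch under a few assumptions: it is run from the repository root, Git LFS is already installed (step 1), and the two function names are made up for illustration.

```shell
#!/bin/sh
# Steps 3 and 4: register the filter and point Git at the versioned hooks.
configure_zip_repo() {
    # Step 3: same filter commands as in the first bullet
    git config filter.zip.clean "unzip -v %f | tail -n +4 | head -n -2 | awk '{ print \$7,\$8 }' | grep -vE /$ | LC_ALL=C sort -sfk 2,2" &&
    git config filter.zip.smudge "cat" &&
    git config filter.zip.required true &&
    # Step 4: use the hooks stored in .githooks
    git config core.hooksPath .githooks
}

# Steps 2 to 6 in the order given in the list.
setup_zip_repo() {
    git lfs install &&          # step 2
    configure_zip_repo &&       # steps 3 and 4
    .githooks/post-checkout &&  # step 5
    git add -A                  # step 6
}
```

Calling setup_zip_repo once after cloning should leave the repository in the state the list describes; configure_zip_repo can also be run on its own, since it only touches the repository's local config.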

For an example see here: ZIP treatment for Git

Known issues