Why doesn't Git use a more modern SHA?

UPDATE: The above question and this answer are from 2015. Since then Google have announced the first SHA-1 collision: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Obviously I can only speculate from the outside looking in about why Git continues to use SHA-1, but these may be among the reasons:

Git was Linus Torvalds's creation, and Linus apparently does not want to substitute SHA-1 with another hashing algorithm at this time.
He makes plausible claims that successful SHA-1 collision-based attacks against Git are a good deal harder than achieving the collisions themselves. Since SHA-1 is weaker than it should be rather than completely broken, a workable attack is still a long way off, at least today. Moreover, he notes that a "successful" attack would achieve very little if the colliding object arrives later than the existing one, as the later one would just be assumed to be the same as the valid one and ignored (though others have pointed out that the reverse could occur).
Changing software is time-consuming and error-prone, especially when existing infrastructure and data built around the current protocols will have to be migrated. Even those who produce software and hardware products where cryptographic security is the sole point of the system are still in the process of migrating away from SHA-1 and other weak algorithms in places. Just imagine all those hardcoded unsigned char[20] buffers all over the place ;-). It's a lot easier to program for cryptographic agility at the start than to retrofit it later.
Performance of SHA-1 is better than the various SHA-2 hashes (probably not by so much as to be a deal-breaker now, but maybe was a sticking point 10 years ago), and the storage size of SHA-2 is larger.
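To make the "later colliding object is ignored" point concrete, here is a minimal sketch of a content-addressed store where the first writer wins. It is not Git's code: `ObjectStore`, `weak_id` and `find_collision` are hypothetical names, and the ID function is deliberately crippled (one byte of SHA-1) so that a collision can be brute-forced in a few iterations.

```python
import hashlib


def weak_id(data: bytes) -> str:
    """Deliberately weak ID: only the first byte of SHA-1, so collisions are cheap."""
    return hashlib.sha1(data).hexdigest()[:2]


class ObjectStore:
    """Toy content-addressed store in which an existing ID is never overwritten."""

    def __init__(self, id_func):
        self._id_func = id_func
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = self._id_func(data)
        # First writer wins: a later object with the same ID is silently dropped.
        self._objects.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


def find_collision(target: bytes) -> bytes:
    """Brute-force a different input that has the same weak ID as `target`."""
    wanted = weak_id(target)
    counter = 0
    while True:
        candidate = b"evil-%d" % counter
        if candidate != target and weak_id(candidate) == wanted:
            return candidate
        counter += 1


store = ObjectStore(weak_id)
good = b"legitimate content"
good_id = store.put(good)

evil = find_collision(good)
evil_id = store.put(evil)  # same ID, but the stored content does not change

print(good_id == evil_id, store.get(good_id) == good)
```

The same property cuts the other way: if the colliding object is stored first, it is the legitimate one that gets dropped, which is the "reverse" case mentioned above.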

My personal view: practical attacks are probably some time off, and even when they do occur, people will probably mitigate them at first by means other than changing the hash algorithm itself. Still, if you care about security, you should err on the side of caution in your choice of algorithms and keep revising your security strengths upwards, because attackers' capabilities only move in one direction. It would therefore be unwise to take Git as a role model, especially as its use of SHA-1 does not purport to provide cryptographic security.

This is a discussion of the urgency of migrating away from SHA1 for Mercurial, but it applies to Git as well: https://www.mercurial-scm.org/wiki/mpm/SHA1

In short: If you’re not extremely diligent today, you have much worse vulnerabilities than sha1. But despite that, Mercurial started over 10 years ago to prepare for migrating away from sha1.

work has been underway for years to retrofit Mercurial's data structures and protocols for SHA1's successors. Storage space was allocated for larger hashes in our revlog structure over 10 years ago in Mercurial 0.9 with the introduction of RevlogNG. The bundle2 format introduced more recently supports the exchange of different hash types over the network. The only remaining pieces are choice of a replacement function and choosing a backwards-compatibility strategy.

If git does not migrate away from sha1 before Mercurial does, you could always add another level of security by keeping a local Mercurial mirror with hg-git.

There is now a transition plan to a stronger hash, so it looks like in future it will use a more modern hash than SHA-1. From the current transition plan:

Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256

Best answer

Why does it not use a more modern version of SHA?

Dec. 2017: It will. And Git 2.16 (Q1 2018) is the first release to illustrate and implement that intent.

Note: see Git 2.19 below: it will be SHA-256.

Git 2.16 will propose an infrastructure to define what hash function is used in Git, and will start an effort to plumb that throughout various codepaths.

See commit c250e02 (28 Nov 2017) by Ramsay Jones.
See commit eb0ccfd, commit 78a6766, commit f50e766, commit abade65 (12 Nov 2017) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 721cc43, 13 Dec 2017)}

Add structure representing hash algorithm

Since in the future we want to support an additional hash algorithm, add a structure that represents a hash algorithm and all the data that must go along with it.
Add a constant to allow easy enumeration of hash algorithms.
Implement function typedefs to create an abstract API that can be used by any hash algorithm, and wrappers for the existing SHA1 functions that conform to this API.

Expose a value for hex size as well as binary size.
While one will always be twice the other, the two values are both used extremely commonly throughout the codebase and providing both leads to improved readability.

Don't include an entry in the hash algorithm structure for the null object ID.
As this value is all zeros, any suitably sized all-zero object ID can be used, and there's no need to store a given one on a per-hash basis.

The current hash function transition plan envisions a time when we will accept input from the user that might be in SHA-1 or in the NewHash format.
Since we cannot know which the user has provided, add a constant representing the unknown algorithm to allow us to indicate that we must look the correct value up.
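As a rough illustration of what this commit message describes, the sketch below models a table of hash algorithms with an "unknown" slot, binary and hex sizes, and a uniform constructor. It is only a Python analogy, not Git's C structure: `HashAlgo`, `HASH_ALGOS`, `digest_hex` and the constant names are made up for this example, and only SHA-1 is registered, as in the commit.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HashAlgo:
    """Everything that must travel with one hash algorithm (hypothetical layout)."""
    name: Optional[str]
    rawsz: int   # size of the binary object ID in bytes
    hexsz: int   # size of the hex object ID in characters
    new: Optional[Callable[[], object]]  # factory for a fresh hashing context


# Constants that allow easy enumeration; index 0 stands for "unknown".
HASH_UNKNOWN = 0
HASH_SHA1 = 1
HASH_NALGOS = 2

HASH_ALGOS = [
    HashAlgo(name=None, rawsz=0, hexsz=0, new=None),
    HashAlgo(name="sha1", rawsz=20, hexsz=40, new=hashlib.sha1),
]


def digest_hex(algo: HashAlgo, data: bytes) -> str:
    """Hash `data` through the abstract API, whatever the algorithm is."""
    if algo.new is None:
        raise ValueError("cannot hash with the unknown algorithm")
    ctx = algo.new()
    ctx.update(data)
    return ctx.hexdigest()


def null_oid_hex(algo: HashAlgo) -> str:
    """The null object ID is all zeros, so it is derived rather than stored per hash."""
    return "0" * algo.hexsz


sha1 = HASH_ALGOS[HASH_SHA1]
print(sha1.name, sha1.rawsz, sha1.hexsz, len(digest_hex(sha1, b"")))
```

Keeping both `rawsz` and `hexsz` is redundant (one is always twice the other), but it matches the readability argument made in the commit message.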

Integrate hash algorithm support with repo setup

In future versions of Git, we plan to support an additional hash algorithm.
Integrate the enumeration of hash algorithms with repository setup, and store a pointer to the enumerated data in struct repository.
Of course, we currently only support SHA-1, so hard-code this value in read_repository_format.
In the future, we'll enumerate this value from the configuration.

Add a constant, the_hash_algo, which points to the hash_algo structure pointer in the repository global.
Note that this is the hash which is used to serialize data to disk, not the hash which is used to display items to the user.
The transition plan anticipates that these may be different.
We can add an additional element in the future (say, ui_hash_algo) to provide for this case.
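A minimal sketch of the idea, assuming a Python stand-in for `struct repository`: the algorithm is chosen once at repository setup (hard-coded to SHA-1 here, as in the commit) and everything that writes objects asks the repository for it. The function bodies are illustrative only; the one Git-specific fact relied on is that a blob's ID is the hash of `blob <size>\0` followed by the content.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashAlgo:
    name: str
    hexsz: int


SHA1 = HashAlgo("sha1", 40)


@dataclass
class Repository:
    hash_algo: HashAlgo  # the hash used to serialize data to disk


def read_repository_format(config: dict) -> HashAlgo:
    """Hypothetical setup step: the config is ignored and SHA-1 is hard-coded."""
    return SHA1


the_repository = Repository(hash_algo=read_repository_format({}))


def the_hash_algo() -> HashAlgo:
    """Accessor standing in for the `the_hash_algo` constant."""
    return the_repository.hash_algo


def hash_blob(data: bytes) -> str:
    """Compute a blob ID with whatever algorithm the repository was set up with."""
    algo = the_hash_algo()
    header = b"blob %d\0" % len(data)
    return hashlib.new(algo.name, header + data).hexdigest()


print(hash_blob(b"hello\n"))
```

Because callers never name SHA-1 directly, switching the value returned at setup is, in principle, the only change they would see.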

Update, August 2018: with Git 2.19 (Q3 2018), Git picks SHA-256 as NewHash.

See commit 0ed8d8d (04 Aug 2018) by Jonathan Nieder (artagnon).
See commit 13f5e09 (25 Jul 2018) by Ævar Arnfjörð Bjarmason (avar).
^{(Merged by Junio C Hamano -- gitster -- in commit 34f2297, 20 Aug 2018)}

doc hash-function-transition: pick SHA-256 as NewHash

From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties.
All are good options from a security point of view.

SHA-256 has a number of advantages:

It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc).

When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration.

If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms when either one of them alone could break the security when we could just depend on one.

So SHA-256 it is.
Update the hash-function-transition design doc to say so.

After this patch, there are no remaining instances of the string "NewHash", except for an unrelated use from 2008 as a variable name in t/t9700/test.pl.

You can see this transition to SHA-256 in progress with Git 2.20 (Q4 2018):

See commit 0d7c419, commit dda6346, commit eccb5a5, commit 93eb00f, commit d8a3a69, commit fbd0e37, commit f690b6b, commit 49d1660, commit 268babd, and further commits in the same series (15 Oct 2018) by brian m. carlson (bk2204).
See commit 6afedba (15 Oct 2018) by SZEDER Gábor (szeder).
^{(Merged by Junio C Hamano -- gitster -- in commit d829d49, 30 Oct 2018)}

replace hard-coded constants

Replace several 40-based constants with references to GIT_MAX_HEXSZ or the_hash_algo, as appropriate.
Convert all uses of the GIT_SHA1_HEXSZ to use the_hash_algo so that they are appropriate for any given hash length.
Instead of using a hard-coded constant for the size of a hex object ID, switch to use the computed pointer from parse_oid_hex that points after the parsed object ID.
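The last point, using the position after the parsed object ID instead of a fixed offset of 40, can be sketched like this. The Python `parse_oid_hex` below only borrows the name of the C function; its signature, the `HEXSZ` table and the sample lines are assumptions made for the example.

```python
import string
from typing import Tuple

# Hex object ID length per algorithm (hypothetical lookup table).
HEXSZ = {"sha1": 40, "sha256": 64}
HEX_DIGITS = set(string.hexdigits)


def parse_oid_hex(line: str, algo: str) -> Tuple[str, str]:
    """Parse a leading hex object ID and return it with the text that follows it.

    Returning the remainder plays the role of the "pointer after the parsed
    object ID": callers never need to know how long the ID was.
    """
    hexsz = HEXSZ[algo]
    oid = line[:hexsz]
    if len(oid) != hexsz or not set(oid) <= HEX_DIGITS:
        raise ValueError("not a valid %s object ID: %r" % (algo, oid))
    return oid, line[hexsz:]


sha1_line = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 refs/heads/main"
oid, rest = parse_oid_hex(sha1_line, "sha1")
print(oid, repr(rest))

sha256_line = "0" * 64 + "\tpath/to/file"
oid256, rest256 = parse_oid_hex(sha256_line, "sha256")
print(len(oid256), repr(rest256))
```

Code written as `line[40:]` would silently cut a 64-character ID in half, whereas the caller above keeps working for both lengths.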

GIT_SHA1_HEXSZ is further removed/replaced with Git 2.22 (Q2 2019) and commit d4e568b.

That transition continues with Git 2.21 (Q1 2019), which adds a SHA-256 hash and plugs it through the code to allow building Git with the "NewHash".

See commit 4b4e291, commit 27dc04c, commit 13eeedb, commit c166599, commit 37649b7, commit a2ce0a7, commit 50c817e, commit 9a3a0ff, commit 0dab712, and further commits in the same series (14 Nov 2018 and 22 Oct 2018) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 33e4ae9, 29 Jan 2019)}

Add a base implementation of SHA-256 support (Feb. 2019)

SHA-1 is weak and we need to transition to a new hash function.
For some time, we have referred to this new function as NewHash.
Recently, we decided to pick SHA-256 as NewHash.
The reasons behind the choice of SHA-256 are outlined in this thread and in the commit history for the hash function transition document.

Add a basic implementation of SHA-256 based off libtomcrypt, which is in the public domain.
Optimize it and restructure it to meet our coding standards.
Pull in the update and final functions from the SHA-1 block implementation, as we know these function correctly with all compilers. This implementation is slower than SHA-1, but more performant implementations will be introduced in future commits.

Wire up SHA-256 in the list of hash algorithms, and add a test that the algorithm works correctly.

Note that with this patch, it is still not possible to switch to using SHA-256 in Git.
Additional patches are needed to prepare the code to handle a larger hash algorithm and further test fixes are needed.

hash: add an SHA-256 implementation using OpenSSL

We already have OpenSSL routines available for SHA-1, so add routines for SHA-256 as well.

On a Core i7-6600U, this SHA-256 implementation compares favorably to the SHA1DC SHA-1 implementation:
SHA-1: 157 MiB/s (64 byte chunks); 337 MiB/s (16 KiB chunks)
SHA-256: 165 MiB/s (64 byte chunks); 408 MiB/s (16 KiB chunks)
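If you want to get a feel for numbers like these on your own machine, a rough measurement can be sketched as follows. This is not the benchmark used in the commit: it times Python's `hashlib` (whatever backend your build uses, not SHA1DC), and feeding one context in fixed-size chunks is my assumption about what "64 byte chunks" and "16 KiB chunks" mean. The function name and the 8 MiB total are arbitrary.

```python
import hashlib
import time


def throughput_mib_s(algo_name: str, chunk_size: int,
                     total_bytes: int = 8 * 1024 * 1024) -> float:
    """Feed `total_bytes` to one hash context in `chunk_size` pieces; return MiB/s."""
    chunk = b"\0" * chunk_size
    rounds = total_bytes // chunk_size
    ctx = hashlib.new(algo_name)
    start = time.perf_counter()
    for _ in range(rounds):
        ctx.update(chunk)
    ctx.digest()
    elapsed = time.perf_counter() - start
    return (rounds * chunk_size) / (1024 * 1024) / elapsed


for algo_name in ("sha1", "sha256"):
    small = throughput_mib_s(algo_name, 64)
    large = throughput_mib_s(algo_name, 16 * 1024)
    print("%s: %.0f MiB/s (64 byte chunks); %.0f MiB/s (16 KiB chunks)"
          % (algo_name, small, large))
```

Expect the absolute figures to differ a lot from the ones quoted above; with small chunks the interpreter's call overhead dominates, so only the relative ordering on large chunks says much about the hash itself.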

sha256: add an SHA-256 implementation using libgcrypt

Generally, one gets better performance out of cryptographic routines written in assembly than C, and this is also true for SHA-256.
In addition, most Linux distributions cannot distribute Git linked against OpenSSL for licensing reasons.

Most systems with GnuPG will also have libgcrypt, since it is a dependency of GnuPG.
libgcrypt is also faster than the SHA1DC implementation for messages of a few KiB and larger.

For comparison, on a Core i7-6600U, this implementation processes 16 KiB chunks at 355 MiB/s while SHA1DC processes equivalent chunks at 337 MiB/s.

In addition, libgcrypt is licensed under the LGPL 2.1, which is compatible with the GPL. Add an implementation of SHA-256 that uses libgcrypt.

The upgrade effort goes on with Git 2.24 (Q4 2019).

See commit aaa95df, commit be8e172, commit 3f34d70, commit fc06be3, commit 69fa337, commit 3a4d7aa, commit e0cb7cd, commit 8d4d86b, commit f6ca67d, and further commits in the same series (18 Aug 2019) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 676278f, 11 Oct 2019)}

Instead of using GIT_SHA1_HEXSZ and hard-coded constants, switch to using the_hash_algo.

With Git 2.26 (Q1 2020), the test scripts are ready for the day when the object names will use SHA-256.

See commit 277eb5a, commit 44b6c05, commit 7a868c5, commit 1b8f39f, commit a8c17e3, commit 8320722, commit 74ad99b, commit ba1be1a, commit cba472d, and further commits in the same series (21 Dec 2019) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit f52ab33, 05 Feb 2020)}

Example:

t4204: make hash size independent

^{Signed-off-by: brian m. carlson}

Use $OID_REGEX instead of a hard-coded regular expression.

So, instead of using:

grep "^[a-f0-9]\{40\} $(git rev-parse HEAD)$" output

Tests are using

grep "^$OID_REGEX $(git rev-parse HEAD)$" output

And OID_REGEX comes from commit bdee9cd (13 May 2018) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 9472b13, 30 May 2018, Git v2.18.0-rc0)}

t/test-lib: introduce OID_REGEX

^{Signed-off-by: brian m. carlson}

Currently we have a variable, $_x40, which contains a regex that matches a full 40-character hex constant.

However, with NewHash, we'll have object IDs that are longer than 40 characters.

In such a case, $_x40 will be a confusing name.

Create a $OID_REGEX variable which will always reflect a regex matching the appropriate object ID, regardless of the length of the current hash.
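The same idea, a pattern derived from the current hash length instead of a literal `{40}`, can be sketched outside the shell test suite. Everything here is illustrative: `oid_regex`, `line_pattern` and the `HEXSZ` table are hypothetical helpers, not part of Git's test library.

```python
import re

# Hex object ID length per algorithm (hypothetical lookup table).
HEXSZ = {"sha1": 40, "sha256": 64}


def oid_regex(algo: str) -> str:
    """Regex fragment matching one full hex object ID for the given algorithm."""
    return "[0-9a-f]{%d}" % HEXSZ[algo]


def line_pattern(algo: str, head: str):
    """Pattern for a line of the form '<some object ID> <HEAD object ID>'."""
    return re.compile("^" + oid_regex(algo) + " " + re.escape(head) + "$")


head_sha1 = "a" * 40
head_sha256 = "b" * 64

sha1_output = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 " + head_sha1
sha256_output = "c" * 64 + " " + head_sha256

# The hard-coded variant only understands 40-character IDs.
hard_coded = re.compile("^[a-f0-9]{40} " + re.escape(head_sha256) + "$")

print(bool(line_pattern("sha1", head_sha1).match(sha1_output)))
print(bool(line_pattern("sha256", head_sha256).match(sha256_output)))
print(bool(hard_coded.match(sha256_output)))
```

The third check is the failure mode the commit avoids: a test written against 40 hex digits simply stops matching once object names grow to 64.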

And, still for tests:

See commit f303765, commit edf0424, commit 5db24dc, commit d341e08, commit 88ed241, commit 48c10cc, commit f7ae8e6, commit e70649b, commit a30f93b, and further commits in the same series (07 Feb 2020) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 5af345a, 17 Feb 2020)}

t5703: make test work with SHA-256

^{Signed-off-by: brian m. carlson}

This test used an object ID which was 40 hex characters in length, causing the test not only not to pass, but to hang, when run with SHA-256 as the hash.

Change this value to a fixed dummy object ID using test_oid_init and test_oid.

Furthermore, ensure we extract an object ID of the appropriate length using cut with fields instead of a fixed length.
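The difference between cutting by field and cutting by a fixed character count is easy to show in a few lines. This is just an analogy to the `cut` change described above; `first_field` and the sample line are made up for the example.

```python
def first_field(line: str, sep: str = " ") -> str:
    """Return everything before the first separator, whatever its length."""
    return line.split(sep, 1)[0]


sha256_oid = "d" * 64
line = sha256_oid + " refs/heads/main"

by_field = first_field(line)   # delimiter-based, like cutting by fields
by_length = line[:40]          # fixed-length, correct only for SHA-1

print(len(by_field), len(by_length))
```

With a SHA-256 object ID the fixed-length slice returns a truncated, invalid ID, while the field-based version returns the whole thing.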

Some codepaths were given a repository instance as a parameter to work in the repository, but passed the_repository instance to its callees, which has been cleaned up (somewhat) with Git 2.26 (Q1 2020).

See commit b98d188, commit 2dcde20, commit 7ad5c44, commit c8123e7, commit 5ec9b8a, commit a651946, commit eb999b3 (30 Jan 2020) by Matheus Tavares (matheustavares).
^{(Merged by Junio C Hamano -- gitster -- in commit 78e67cd, 14 Feb 2020)}

sha1-file: allow check_object_signature() to handle any repo

^{Signed-off-by: Matheus Tavares}

Some callers of check_object_signature() can work on arbitrary repositories, but the repo does not get passed to this function. Instead, the_repository is always used internally.
To fix possible inconsistencies, allow the function to receive a struct repository and make those callers pass on the repo being handled.

Based on:

sha1-file: pass git_hash_algo to hash_object_file()

^{Signed-off-by: Matheus Tavares}

Allow hash_object_file() to work on arbitrary repos by introducing a git_hash_algo parameter. Change callers which have a struct repository pointer in their scope to pass on the git_hash_algo from the said repo.
For all other callers, pass on the_hash_algo, which was already being used internally at hash_object_file().
This functionality will be used in the following patch to make check_object_signature() be able to work on arbitrary repos (which, in turn, will be used to fix an inconsistency at object.c:parse_object()).
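A compact sketch of the two changes, assuming Python stand-ins for the C functions: `hash_object_file()` takes the algorithm as an explicit argument, and `check_object_signature()` takes the repository and uses that repository's algorithm rather than a global. The `Repository` class is hypothetical, and the boolean return value is a simplification for the example, not the C function's actual return convention.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Repository:
    hash_algo: str  # name of the algorithm this repository stores objects with


def hash_object_file(algo: str, data: bytes, obj_type: str) -> str:
    """Hash '<type> <size>\\0<content>' with the algorithm passed in by the caller."""
    header = ("%s %d" % (obj_type, len(data))).encode() + b"\0"
    return hashlib.new(algo, header + data).hexdigest()


def check_object_signature(repo: Repository, oid: str,
                           data: bytes, obj_type: str) -> bool:
    """Recompute the ID using the repository being handled, not a global one."""
    return hash_object_file(repo.hash_algo, data, obj_type) == oid


sha1_repo = Repository("sha1")
sha256_repo = Repository("sha256")
content = b"hello\n"

sha1_oid = hash_object_file(sha1_repo.hash_algo, content, "blob")
print(sha1_oid)
print(check_object_signature(sha1_repo, sha1_oid, content, "blob"))
print(check_object_signature(sha256_repo, sha1_oid, content, "blob"))
```

The last call shows the kind of inconsistency the commits guard against: verifying an object with the wrong repository's algorithm fails even though the content is intact.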

Git 2.38 (Q3 2022) adds support for Nettle (libnettle) as another SHA-256 implementation.

See commit e555735 (10 Jul 2022) by brian m. carlson (bk2204).
^{(Merged by Junio C Hamano -- gitster -- in commit 4af2138, 18 Jul 2022)}

sha256: add support for Nettle

^{Signed-off-by: brian m. carlson}

For SHA-256, we currently have support for OpenSSL and libgcrypt because these two libraries contain optimized implementations that can take advantage of native processor instructions.

However:

OpenSSL is not suitable for linking against for Linux distros due to licensing incompatibilities with the GPLv2, and

libgcrypt has been less favored by cryptographers due to some security-related implementation issues, which, while not affecting our use of hash algorithms, has affected its reputation.

Let's add another option that's compatible with the GPLv2, which is Nettle.
This is an option which is generally better than libgcrypt because on many distros GnuTLS (which uses Nettle) is used for HTTPS and therefore as a practical matter it will be available on most systems.
As a result, prefer it over libgcrypt and our built-in implementation.

Nettle also has recently gained support for Intel's SHA-NI instructions, which compare very favorably to other implementations, as well as assembly implementations for when SHA-NI is not available.

A git gc on git.git sees a 12% performance improvement with Nettle over our block SHA-256 implementation due to general assembly improvements.
With SHA-NI, the performance of raw SHA-256 on a 2 GiB file goes from 7.296 seconds with block SHA-256 to 1.523 seconds with Nettle.
