The ETag does not seem to be the MD5 for multipart uploads (as per Gael Fraiteur's comment). In those cases it carries a suffix: a hyphen followed by a number. However, even the part before the hyphen does not seem to be the MD5, although it is the same length as an MD5. Possibly the suffix is the number of parts uploaded?
I have cross-checked jets3t and the management console against uploaded files' md5sum, and the ETag appears to equal the MD5 sum. You can just view the properties of the file in the AWS Management Console.
The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
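Based on those rules, a quick way to tell whether an ETag can even be a plain MD5 is to look at its shape (a minimal sketch of my own, not from the AWS docs; it is only a heuristic, since SSE-C/SSE-KMS objects also have 32-character hex ETags that are not MD5 digests):

import re

def etag_may_be_plain_md5(etag):
    # a multipart ETag looks like "<32 hex chars>-<part count>" and is never
    # a plain MD5; a bare 32-character hex ETag might be one
    return re.fullmatch(r"[0-9a-f]{32}", etag.strip('"')) is not None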
The easiest way would be to set the checksum yourself as metadata before you upload these files to your bucket:
ObjectMetadata md = new ObjectMetadata();
// store the MD5 you computed locally as user-defined metadata
// (sent as the x-amz-meta-md5 header), so it can be read back later
md.addUserMetadata("md5", localMd5Hex); // localMd5Hex: hex MD5 of the local file
PutObjectRequest req = new PutObjectRequest(BUCKET, KEY, new File("/path/to/file")).withMetadata(md);
tm.upload(req).waitForUploadResult();
Now you can access this metadata later without downloading the file:
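For example with boto3 rather than the Java SDK (a minimal sketch; the bucket and key names are placeholders, and it assumes the checksum was stored under a hypothetical "md5" user-metadata key at upload time):

import boto3

s3 = boto3.client("s3")
# HEAD request: fetches only the object's metadata, not its body
head = s3.head_object(Bucket="my-bucket", Key="my-key")
print(head["Metadata"]["md5"])  # user metadata keys come back lower-cased, without the x-amz-meta- prefix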
Below is what worked for me to compare a local file's checksum with the S3 ETag. I used Python:
import hashlib

import boto3


def md5_checksum(filename):
    # plain MD5 of the whole file, read in 1 MB chunks
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()


def etag_checksum(filename, chunk_size=8 * 1024 * 1024):
    # multipart-style ETag: MD5 of the concatenated binary part MD5s,
    # with "-<number of parts>" appended (chunk_size must match the part
    # size that was used for the upload)
    md5s = []
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))


def etag_compare(filename, etag):
    et = etag[1:-1]  # strip the surrounding double quotes returned by S3
    if '-' in et and et == etag_checksum(filename):
        return True
    if '-' not in et and et == md5_checksum(filename):
        return True
    return False


def main():
    session = boto3.Session(
        aws_access_key_id=s3_accesskey,
        aws_secret_access_key=s3_secret
    )
    s3 = session.client('s3')
    # head_object would also work here and avoids fetching the body
    obj_dict = s3.get_object(Bucket=bucket_name, Key=your_key)
    etag = obj_dict['ETag']
    validation = etag_compare(filename, etag)
    print(validation)
    return validation
This is a very old question, but I had a hard time finding the information below, and this is one of the first places I could find it, so I wanted to detail it in case anyone needs it.
The ETag is an MD5. But for multipart-uploaded files, it is computed differently: it is the MD5 of the concatenated binary MD5 digests of each uploaded part, with "-<number of parts>" appended.
So you don't need to compute the MD5 on the server side. Just get the ETag and that's all.
Say you uploaded a 14MB file and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. Since MD5 checksums are hex representations of binary data, just make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.
So the only other things you need are the ETag and the part size used for the upload. The ETag already carries a -NumberOfParts suffix, so you can divide the object size by the number of parts and round up to estimate the part size. 5 MB is the minimum part size and a common default. Part sizes are normally a whole number of megabytes, so you won't usually see something like 7.25 MB per part. So it is usually easy to work out the part size; see the sketch below.
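A rough sketch of that estimate (my own helper, not from the original answer; it assumes parts are a whole number of MiB, and the guess should be confirmed by recomputing the ETag with it, e.g. using the etag_checksum function shown earlier):

import math

def guess_part_size(object_size_bytes, etag):
    # the number of parts comes from the "-N" suffix of a multipart ETag
    parts = int(etag.strip('"').split('-')[1])
    mib = 1024 * 1024
    # average part size rounded up to the next whole MiB; this can be off
    # when the final part is much smaller than the others, so verify it
    return max(5 * mib, math.ceil(object_size_bytes / parts / mib) * mib)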
Checksum Retrieval – The new GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.
This function supports SHA-1, SHA-256, CRC-32, and CRC-32C for checking the integrity of the transmission.
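With boto3, for example (a sketch; the bucket and key are placeholders, and the Checksum field is only present if the object was uploaded with one of these checksum algorithms):

import boto3

s3 = boto3.client("s3")
attrs = s3.get_object_attributes(
    Bucket="my-bucket",
    Key="my-key",
    ObjectAttributes=["ETag", "Checksum", "ObjectParts", "ObjectSize"],
)
print(attrs.get("Checksum"))     # e.g. {'ChecksumSHA256': '...'} if one was stored
print(attrs.get("ObjectParts"))  # per-part checksums for multipart uploads, if applicable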
It appears that MD5 is not an option for the new checksum features, so this may not resolve your original question, but MD5 is being deprecated for good reasons, and if an alternate checksum works for you, this may be what you're looking for.
I have used the following approach with success. I present here a Python fragment with notes.
Let's suppose we want the MD5 checksum for an object stored in S3, and that the object was uploaded using the multipart upload process, so the ETag stored with the object is not the MD5 checksum we want. The following Python commands read the object's bytes, without saving the object to a local file, to compute the desired MD5 checksum. Note that this approach assumes an S3 resource connection to the account containing the object has already been established (e.g., s3 = boto3.resource('s3')) and that the boto3 and hashlib modules have been imported:
#
# specify the S3 object...
#
bucket_name = "raw-data"
object_key = "/date/study-name/sample-name/file-name"
s3_object = s3.Object(bucket_name, object_key)
#
# compute the MD5 checksum for the specified object...
#
s3_object_md5 = hashlib.md5(s3_object.get()['Body'].read()).hexdigest()
This approach works for all objects stored in S3 (i.e., objects that have been loaded with or without using the multipart upload process).
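Note that .read() above pulls the entire object into memory. For very large objects, a chunked variant (a sketch using the same s3_object as above) keeps memory use bounded:

#
# stream the body in 8 MiB chunks instead of reading it all at once...
#
m = hashlib.md5()
body = s3_object.get()['Body']
for chunk in iter(lambda: body.read(8 * 1024 * 1024), b''):
    m.update(chunk)
s3_object_md5 = m.hexdigest()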