MD5 Checksum in Autosync for Google Drive

July 22, 2015

File sync is quite simple, at least at the beginning. You set up the sync app to sync a folder D(evice) on your device with a folder C(loud) in the cloud storage. When a new file F(ile) appears in folder D, the app uploads it to folder C in the cloud. At this point we have the same file F on both sides.

Things get complicated fast when the file F changes. Perhaps it’s a photo you took with your phone. It’s been synced. Now you use an app on your phone to crop the photo a little and save it back under the same name. The sync app must be able to find out that:

  1. File F has been synced before, it was the same in folder D and folder C
  2. File F has changed in folder D
  3. File F has not changed in folder C

It then makes the right decision: upload the newly modified file F from folder D to folder C.
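The decision above can be sketched as a small function. This is only an illustration of the reasoning, not Autosync's actual code: the names `Action` and `decide` are made up for this example, and how the app works out the two "changed" flags is left open here.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "nothing"
    UPLOAD = "upload"
    DOWNLOAD = "download"
    CONFLICT = "conflict"


def decide(local_changed: bool, remote_changed: bool) -> Action:
    """Pick a sync action for a file that was identical on both sides
    after the previous sync (hypothetical helper, for illustration)."""
    if local_changed and remote_changed:
        # Both copies were modified since the last sync.
        return Action.CONFLICT
    if local_changed:
        # Only the device copy changed: push it to the cloud.
        return Action.UPLOAD
    if remote_changed:
        # Only the cloud copy changed: pull it to the device.
        return Action.DOWNLOAD
    return Action.NOTHING


# The cropped photo: changed in folder D, unchanged in folder C.
print(decide(local_changed=True, remote_changed=False))
```

Everything hinges on getting the two flags right, which is why detecting changes reliably matters so much.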

What would happen if

  1. File F is to be synced to both your phone and tablet. You paired folder C in the cloud storage with folder D on your phone and folder T on your tablet.
  2. We start with the good state: the same file F is synced to both devices and the cloud storage. We have three copies of the same thing.
  3. You’re somewhere without Internet access. Both your phone and tablet are offline.
  4. You edit file F on both your phone and tablet.
  5. Later you have an Internet connection again. Both your phone and tablet start to sync with the cloud storage.

We have a sync conflict. There are two new versions of a file that was previously in sync. Autosync for Google Drive cannot decide by itself which version you want to keep. It resolves the conflict by creating a new file “F (conflict 2015-07-22-10-15-23)” which holds one of the two versions. The other version keeps the original name F. The app syncs both files to all places. Later on you, the user, must decide which one to keep.
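As a rough illustration, a conflict name of that shape could be built like this. The timestamp layout is inferred from the example name above; the function name `conflict_name` is invented for this sketch, and the real app may treat file extensions or time zones differently.

```python
from datetime import datetime


def conflict_name(name: str, when: datetime) -> str:
    """Build a name like 'F (conflict 2015-07-22-10-15-23)'.

    Hypothetical helper: the format is guessed from the example
    in the text, not taken from the app's source.
    """
    stamp = when.strftime("%Y-%m-%d-%H-%M-%S")
    return f"{name} (conflict {stamp})"


print(conflict_name("F", datetime(2015, 7, 22, 10, 15, 23)))
# prints: F (conflict 2015-07-22-10-15-23)
```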

This is one example of why it’s very important to detect whether a file has changed. Failing to do so can cause data loss. Sync apps base their decisions on various file attributes: the name of the file and its location, the size, the last modified timestamp. Unfortunately both the size and the last modified timestamp may be inaccurate. The ultimate check is to compare the actual contents of both versions of the file. To do so, however, we have to download one version from the cloud storage to the device before we can compare it with the version already on the device. One photo may not be a big deal, but there can be many photos, or the file may be a really big video that takes many minutes to download. The huge cost makes comparing file contents impractical.

Luckily there is something called “checksum” or “hash”. It’s a fingerprint of the file content. We can compute this small fingerprint, exchange its values between the cloud and the device and compare the fingerprints instead of the actual file contents. If they are the same then we know the file contents are the same.
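Here is a minimal sketch of computing such a fingerprint with Python's standard `hashlib` module. The helper names `md5_of_chunks` and `md5_of_file` are made up for this example; reading in chunks simply keeps memory use low for big files such as videos.

```python
import hashlib
from typing import Iterable


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex MD5 digest of the concatenated chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read chunk_size bytes at a time."""
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        return md5_of_chunks(iter(lambda: f.read(chunk_size), b""))


# The digest depends only on the bytes, not on how they are split up.
print(md5_of_chunks([b"hel", b"lo"]) == md5_of_chunks([b"hello"]))  # prints: True
```

The result is a 32-character hex string no matter how large the file is, so it is cheap to send over the network and compare.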

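There are many algorithms to calculate such a fingerprint; two of the most common are MD5 and SHA-1. Google Drive supports MD5. MD5 is no longer considered safe for cryptographic purposes, because someone can deliberately construct two different files with the same MD5 value, but it’s more than good enough for our sync conflict detection, where we only need to notice ordinary changes to a file.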

Autosync for Google Drive is conservative by design. It would rather err on the safe side, i.e. it is better to declare a conflict than to silently lose a user’s modification of a file. The downside is that sometimes we get false conflicts. With the use of MD5 checksums to determine sync conflicts, the probability of false conflicts is now down to zero.
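To see where a false conflict can come from, consider a file that is saved again with exactly the same contents. The sketch below is illustrative only; `Snapshot` and the two helper functions are invented names, not the app's real data structures.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """What we know about one copy of a file (hypothetical structure)."""
    size: int
    mtime: float  # last modified time, seconds since the epoch
    md5: str      # hex MD5 of the contents


def changed_by_attributes(old: Snapshot, new: Snapshot) -> bool:
    # Conservative check: any difference in size or timestamp counts as a change.
    return (old.size, old.mtime) != (new.size, new.mtime)


def changed_by_checksum(old: Snapshot, new: Snapshot) -> bool:
    # Content-based check: only a different fingerprint counts as a change.
    return old.md5 != new.md5


synced = Snapshot(size=5, mtime=100.0, md5="5d41402abc4b2a76b9719d911017c592")
resaved = Snapshot(size=5, mtime=200.0, md5="5d41402abc4b2a76b9719d911017c592")

print(changed_by_attributes(synced, resaved))  # prints: True
print(changed_by_checksum(synced, resaved))    # prints: False
```

If both the device and the cloud copy look "changed" by attributes alone, the conservative answer is a conflict. Comparing checksums shows the contents never differed, so no conflict file needs to be created.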

This change appeared in Autosync for Google Drive version 1.7.2 which was released on Google Play earlier today.