Git Annex

Wojciech 'vifon' Siewierski

Git, the de facto industry-standard version control system (VCS), was designed to store small to medium-sized files. Why would anyone store big files in a VCS anyway?

The problem

People want to store big files in their VCSs.

ಠ_ಠ

Images, videos, audio files and other types of binary assets.

The consequences

Longer file processing time

Git frequently needs to compare files and calculate their hashes.

Bigger file ⇒ longer file processing time.

More big files ⇒ much longer repository processing time.
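
To make the cost concrete, here is a minimal sketch of hashing a file chunk by chunk. It is not Git's actual object hashing (SHA-256 from Python's hashlib is only a stand-in, and the helper name `sha256_stream` is made up for this example), but it shows why the work grows with file size: every byte has to pass through the hash function.

```python
import hashlib
import io


def sha256_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream chunk by chunk.

    Returns (hex digest, number of bytes read). The loop touches every
    byte once, so the cost is proportional to the size of the input.
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


if __name__ == "__main__":
    # In-memory stand-ins for a small and a (relatively) big file.
    small = io.BytesIO(b"x" * 1_000)
    big = io.BytesIO(b"x" * 10_000_000)
    print(sha256_stream(small)[1], "bytes hashed")
    print(sha256_stream(big)[1], "bytes hashed")
```

The second call has to process ten thousand times as many bytes as the first, and a multi-gigabyte file pushes that further with every operation that rehashes it.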

Larger repository

The files cannot be downloaded selectively
  • ubuntu-16.04.iso (4 GiB)
  • very-important-data.tar.gz (3 GiB)
  • promotional-video.mkv (2.4 GiB)

Larger repository history

January 2017
  • ubuntu-16.04.iso (4 GiB)
  • backups.tar.gz (2.8 GiB)
  • video.mkv (2.1 GiB)
April 2017
  • ubuntu-16.04.iso (4 GiB)
  • backups.tar.gz (3 GiB)
  • video.mkv (2.4 GiB)
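
A rough back-of-the-envelope calculation for the two snapshots above, as a sketch. It assumes that a file with the same name and size is unchanged and stored once, that every changed file is stored again in full, and that compression and delta storage save nothing on such binaries. The function name `history_size` is made up for this example.

```python
# Sizes in GiB, copied from the two snapshots above.
january = {"ubuntu-16.04.iso": 4.0, "backups.tar.gz": 2.8, "video.mkv": 2.1}
april = {"ubuntu-16.04.iso": 4.0, "backups.tar.gz": 3.0, "video.mkv": 2.4}


def history_size(snapshots):
    """Total size of every distinct file version across all snapshots.

    Assumption: the same (name, size) pair means unchanged content,
    which the repository stores only once.
    """
    versions = set()
    for snapshot in snapshots:
        versions.update(snapshot.items())
    return sum(size for _, size in versions)


if __name__ == "__main__":
    print(round(sum(april.values()), 1))              # 9.4  (current checkout)
    print(round(history_size([january, april]), 1))   # 14.3 (with history)
```

Under these assumptions the checkout needs 9.4 GiB, but a full clone carries 14.3 GiB, and the gap widens with every new version of a big file.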

The solution

"All problems in computer science can be solved by another level of indirection."
-- David J. Wheeler

  1. Don't commit the big files, commit small identifiers!
  2. Store the files somewhere else.
Enter Git Annex!
  • ubuntu-16.04.iso -> .git/annex/SHA256E-7f...2164.iso
  • backups.tar.gz -> .git/annex/SHA256E-30...ca7d.tar.gz
  • video.mkv -> .git/annex/SHA256E-85...2ea2.mkv
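
The indirection can be sketched in a few lines. This is a toy model, not Git Annex's implementation: the function `annex_file`, the `store` directory and the exact key format are assumptions made for illustration, loosely following the names shown above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def annex_file(path: Path, store: Path) -> Path:
    """Move `path` into `store` under a content-derived name and
    leave a symlink with the original name in its place."""
    # Reads the whole file into memory: fine for a demo, not for real data.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    # Hypothetical key format: hash plus the last file extension.
    key = f"SHA256E-{digest}{path.suffix}"
    store.mkdir(parents=True, exist_ok=True)
    target = store / key
    os.replace(path, target)   # the content now lives in the store
    path.symlink_to(target)    # the working tree keeps only a small link
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        video = work / "video.mkv"
        video.write_bytes(b"pretend this is a large video")
        target = annex_file(video, work / "store")
        print(video.is_symlink())
        print(video.read_bytes() == target.read_bytes())
```

What gets committed is the symlink, which is a few dozen bytes regardless of how large the content behind it is.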

What we get...

Shallow clones by default
No archival data downloaded until requested.
Narrow clones
Only the requested files are downloaded.
File size (for Git)
Git never operates on the big annexed data, only on the lightweight symlinks.

...and what we lose...

File modification
Files are read-only until explicitly "unlocked", which protects data integrity.
Portability
On Windows filesystems "direct mode" is used instead of symlinks.
Simplicity
Yet another abstraction layer.

...and beyond!

Git Annex is much more than just a solution to Git's limitations.

Git Annex needs to know where the files are actually stored. A lightweight registry of every file's locations is stored as Git objects, and it covers even repositories that are offline and files we don't have locally.


vifon@hell-latitude ~/annex λ git annex whereis my-file.txt
whereis my-file.txt (5 copies)
        0c60bf3b-6d49-46eb-bc1e-324cd0c435f6 -- [server]
        5784d71e-5ba0-4d35-a6b0-b1bcb0e49fed -- [NAS]
        9cf5d522-7cc2-4ea9-883a-6a9d1fd83eb5 -- dell [here]
        a5d11164-f0fb-49d0-9dfd-a11eaf745496 -- thinkpad
        df2d03e7-53d0-439d-9b0b-27cdb4345319 -- [USB HDD]
ok
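
Conceptually, the registry behind this output is a mapping from each file to the set of repositories holding its content. The toy model below is only a sketch of that idea: the class `LocationLog` and its methods are invented for illustration, and it uses the repository descriptions from the output above where the real registry identifies repositories by UUID.

```python
from collections import defaultdict


class LocationLog:
    """Toy registry: which repositories hold the content of which file."""

    def __init__(self):
        self._locations = defaultdict(set)

    def record(self, key, repo):
        """Note that `repo` now has a copy of `key`."""
        self._locations[key].add(repo)

    def forget(self, key, repo):
        """Note that `repo` dropped its copy of `key`."""
        self._locations[key].discard(repo)

    def whereis(self, key):
        """All known locations, including repositories that are offline."""
        return sorted(self._locations.get(key, ()))

    def sources(self, key, online):
        """Locations we could fetch from right now."""
        return sorted(self._locations.get(key, set()) & set(online))


if __name__ == "__main__":
    log = LocationLog()
    for repo in ["server", "NAS", "dell", "thinkpad", "USB HDD"]:
        log.record("my-file.txt", repo)
    print(len(log.whereis("my-file.txt")), "copies")
    print(log.sources("my-file.txt", online={"server", "dell"}))
```

Because the registry travels with the repository, a machine can answer "where is this file?" even when the repositories holding it are unplugged or unreachable.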
            

Thank you!