Prevent the tracking of very large files

Motivation

Because Git keeps a history of every change made to the files it tracks, adding very large files can quickly bloat a repository. This applies especially to binary files, although data files have the same effect (e.g. CSV files storing intermediate results, such as those we encourage you to use in this course). It is therefore generally considered best practice not to add very large files to a Git repository in the first place.

Note

In this context, “very large” means roughly 10 MB or more.

There is also a practical reason: GitHub enforces a hard limit of 100 MB per file and rejects any push that contains a file above it.

Adding a pre-commit hook script

Git can run arbitrary scripts each time certain actions are performed (so-called “hooks”). The following instructions (courtesy of ChatGPT) set up a hook that automatically checks the size of the files added in a commit and rejects the commit if at least one file exceeds a maximum size threshold.

First, navigate to your Git repository’s .git/hooks/ directory:

cd path/to/your/repo/.git/hooks/

Next, create a file named pre-commit:

touch pre-commit

and make it executable:

chmod +x pre-commit

Finally, edit the pre-commit file and add the following script:

#!/bin/bash

# Set the size limit in bytes (e.g., 5MB = 5242880 bytes)
MAX_SIZE=5242880

# Detect the operating system to use the correct stat syntax
if [[ "$OSTYPE" == "darwin"* ]]; then
    # macOS
    STAT_CMD="stat -f%z"
else
    # Linux and other Unix-like systems
    STAT_CMD="stat -c%s"
fi

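# Find newly added files (--diff-filter=A) that exceed the size limit.
# IFS= keeps leading and trailing whitespace in file names intact.
large_files=$(git diff --cached --name-only --diff-filter=A | while IFS= read -r file; do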
    if [ -f "$file" ]; then
        file_size=$($STAT_CMD "$file")
        if [ "$file_size" -gt "$MAX_SIZE" ]; then
            echo "$file"
        fi
    fi
done)

if [ -n "$large_files" ]; then
    echo "Error: The following files are larger than the allowed limit of $(($MAX_SIZE / 1024 / 1024))MB:"
    echo "$large_files"
    exit 1
fi

exit 0
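
The script above measures each file as it exists in the working directory, which is why it needs a platform-specific stat call. As a minimal sketch of an alternative, the size of the staged copy can be read from Git itself with git cat-file -s, which behaves the same on macOS and Linux. The function names staged_size and list_oversized_staged are hypothetical and not part of the original hook, and the sketch assumes it runs from the top level of the repository, as hooks do.

```shell
#!/bin/bash
# Hypothetical variant of the size check; the function names are illustrative only.

# Print the size in bytes of a file as it is staged in the index.
staged_size() {
    # ":path" names the index entry for path; cat-file -s prints its size.
    git cat-file -s ":$1" 2>/dev/null
}

# Print every newly added staged path whose staged size exceeds the limit (in bytes).
list_oversized_staged() {
    local max_size="$1" file size
    git diff --cached --name-only --diff-filter=A |
    while IFS= read -r file; do
        # Skip entries whose size cannot be read (e.g. submodule links).
        size=$(staged_size "$file") || continue
        if [ "$size" -gt "$max_size" ]; then
            echo "$file"
        fi
    done
}
```

Inside the hook, large_files=$(list_oversized_staged "$MAX_SIZE") could then stand in for the stat-based loop, with the rest of the script unchanged.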

And that’s it! You should now be protected from committing very large files to your Git repository. To test if it works, generate a large file:

# Simulate a large file with 10 MB (10485760 bytes) of random data.
# The byte count is written out because head on macOS does not accept the "10M" suffix.
head -c 10485760 /dev/urandom > large_file.blob

and attempt to git add and git commit it:

git add large_file.blob
git commit -m "Test commit with a large file"

You should see a message like:

Error: The following files are larger than the allowed limit of 5MB:
large_file.blob

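Once you’ve verified that it works, keep in mind that Git does not track anything inside .git/, so the hook exists only in your local clone and cannot be committed from .git/hooks/. To share it, store a copy of the script in a tracked directory of the repository and install it in each clone.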
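
One way to put the hook under version control is to copy it out of .git/hooks/ into a tracked directory and point Git at that directory through the core.hooksPath setting (available in Git 2.9 and later). The sketch below is an assumption rather than part of the original instructions: the function name share_pre_commit_hook and the directory name hooks are hypothetical, and it must be run from the top level of the repository.

```shell
#!/bin/bash
# Hypothetical helper; the default directory name "hooks" is an assumption.

# Copy the local pre-commit hook into a tracked directory and make Git use it.
# core.hooksPath is stored in .git/config, so every clone has to set it once.
share_pre_commit_hook() {
    local hooks_dir="${1:-hooks}"
    mkdir -p "$hooks_dir" &&
    cp .git/hooks/pre-commit "$hooks_dir/pre-commit" &&
    chmod +x "$hooks_dir/pre-commit" &&
    git config core.hooksPath "$hooks_dir"
}
```

After calling share_pre_commit_hook, the copy can be staged with git add hooks/pre-commit and committed like any other file. Anyone who clones the repository then runs git config core.hooksPath hooks once to activate it.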