Splitting and shrinking a git repository

I recently had to rewrite a git repository. It had two problems:

  • The first problem was small: a user had committed with a badly configured git, so the e-mail and username were not set correctly.
  • The second problem seemed trickier: I needed to split the git repository into two separate ones. To be precise, each of the two directories at the root (src and deps) had to become the root of its own repository.

I then dug into the documentation, which led me directly to ‘filter-branch’, the solution to both of my problems. The name of the command is almost self-explanatory: it is used to rewrite branches.

Splitting the git repository

A quick read of ‘git help filter-branch’ convinced me to give the ‘--subdirectory-filter’ option a try:

--subdirectory-filter

Only look at the history which touches the given subdirectory. The result will contain that directory (and only that) as its project root. Implies --remap-to-ancestor.

Thus, to split the repository, I simply had to copy it with a clone call and run the filter command in the copy:

git clone project project-src
cd project-src
git filter-branch --subdirectory-filter src

I did the same for the deps directory, and my two new repositories were ready to go.
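
The two splits can be tried safely on a throwaway repository first. The following is only a minimal sketch: the sandbox directory, the file names (main.c, lib.txt) and the "Demo User" identity are hypothetical, and FILTER_BRANCH_SQUELCH_WARNING merely silences the deprecation notice that recent git versions print before running filter-branch.

```shell
set -e
work=$(mktemp -d)                     # hypothetical sandbox, nothing else is touched
cd "$work"
export FILTER_BRANCH_SQUELCH_WARNING=1
# Hypothetical identity so that commits work without any global git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Tiny stand-in for the original project, with src/ and deps/ at its root.
git init -q project
(
    cd project
    mkdir src deps
    echo 'int main(void) { return 0; }' > src/main.c
    echo 'bundled library' > deps/lib.txt
    git add .
    git commit -q -m "Initial import"
)

# Same three steps as above, once per directory.
for dir in src deps; do
    git clone -q project "project-$dir"
    (cd "project-$dir" && git filter-branch --subdirectory-filter "$dir" >/dev/null 2>&1)
done

# Each new repository should now hold the content of its directory at the root.
src_files=$(git -C project-src ls-files)
deps_files=$(git -C project-deps ls-files)
echo "project-src:  $src_files"
echo "project-deps: $deps_files"
```

Working on clones keeps the original ‘project’ repository untouched, so a failed experiment costs nothing.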

At one point during this cleaning task, I wanted to avoid losing my directory structure: I wanted to keep the ‘src’ directory inside the ‘src’ repository. Thanks to the examples at the end of ‘git help filter-branch’, I found this trickier command:

git filter-branch --prune-empty --index-filter \
 'git rm -r --cached --ignore-unmatch deps' HEAD

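This literally does the following: for each commit, rewrite the index (--index-filter) by removing (rm) recursively (-r) every item of the ‘deps’ directory from it (--cached), without failing on commits where ‘deps’ does not exist (--ignore-unmatch). If a commit ends up empty, it is dropped from the history (--prune-empty).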
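
To see both effects (the directory disappearing and the empty commit being pruned), here is a small sketch on a throwaway repository. The three commits, the file names and the "Demo User" identity are hypothetical; the second commit touches only ‘deps’, so it is expected to vanish from the rewritten history.

```shell
set -e
work=$(mktemp -d)                     # hypothetical sandbox directory
cd "$work"
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q project
cd project
mkdir src deps
echo 'int main(void) { return 0; }' > src/main.c
git add src
git commit -q -m "Add source"
echo 'bundled library' > deps/lib.txt
git add deps
git commit -q -m "Add dependency"     # only touches deps/, becomes empty after the filter
echo '/* tweak */' >> src/main.c
git commit -q -a -m "Tweak source"

# The command from above: remove deps/ from every commit, drop commits left empty.
git filter-branch --prune-empty --index-filter \
    'git rm -r --cached --ignore-unmatch deps' HEAD >/dev/null 2>&1

remaining_files=$(git ls-tree -r --name-only HEAD)
remaining_commits=$(git rev-list --count HEAD)
echo "files:   $remaining_files"
echo "commits: $remaining_commits"
```

Unlike --subdirectory-filter, this keeps ‘src/main.c’ under its ‘src’ directory.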

Shrinking the resulting repository

The ‘deps’ directory was known to take a lot of disk space, so I checked the size of the ‘src’ repository. My old friend ‘du’ sadly told me that the split repository had the same size as the whole one! Something tricky was going on. After googling a little bit, I found out (mainly by reading Luke Palmer’s post) that git never destroys a commit immediately. It is still present as an object in the .git/objects directory. To actually get rid of it, you have to tell git that some objects have expired and can now be destroyed. The following command destroys all unreachable objects that are more than one day old:

git gc --aggressive --prune=1day
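
If ‘du’ still reports the old size after the gc, a likely cause is that filter-branch keeps the pre-rewrite history reachable under refs/original/ and in the reflog, so nothing is unreachable yet. The sketch below follows the steps listed in the shrinking checklist of the filter-branch manual: drop the backup refs, expire the reflog, then collect garbage. It runs on a throwaway repository (the file names and the 1 MiB random file are hypothetical) and uses --prune=now instead of a one-day grace period so that the effect is visible immediately.

```shell
set -e
work=$(mktemp -d)                     # hypothetical sandbox directory
cd "$work"
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q project
cd project
mkdir src deps
echo 'int main(void) { return 0; }' > src/main.c
head -c 1048576 /dev/urandom > deps/big.bin   # incompressible stand-in for bulky deps
git add .
git commit -q -m "Initial import"
big_blob=$(git rev-parse HEAD:deps/big.bin)

git filter-branch --prune-empty --index-filter \
    'git rm -r --cached --ignore-unmatch deps' HEAD >/dev/null 2>&1

size_before=$(du -sk .git | cut -f1)

# 1. Remove the backup refs that filter-branch stored under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
# 2. Forget the old commits recorded in the reflogs.
git reflog expire --expire=now --all
# 3. Collect every object that is now unreachable.
git gc -q --aggressive --prune=now

size_after=$(du -sk .git | cut -f1)
if git cat-file -e "$big_blob" 2>/dev/null; then blob_state=present; else blob_state=gone; fi
echo ".git: ${size_before}K -> ${size_after}K, big blob is $blob_state"
```

Steps 1 and 2 throw away the only safety net left after a rewrite, so they are best run on a clone once the new history has been checked.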

‘Unreachable objects’ means objects that exist but that aren’t reachable from any of the reference nodes. This definition is taken from ‘git help fsck’. The ‘fsck’ command can be used to check the validity and connectivity of objects in the database. For example, to display unreachable objects, you can run:

git fsck --unreachable
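
A quick way to watch this happen is to create an unreachable object on purpose. In the sketch below (throwaway repository, hypothetical file names and identity), ‘git hash-object -w’ writes a blob into the object database without any commit, tree or index entry pointing at it; fsck should then report it, and a gc with --prune=now should remove it.

```shell
set -e
work=$(mktemp -d)                     # hypothetical sandbox directory
cd "$work"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
echo 'tracked' > file.txt
git add file.txt
git commit -q -m "Initial commit"

# Store a blob that nothing references.
orphan=$(echo 'nobody points at me' | git hash-object -w --stdin)

before=$(git fsck --unreachable 2>/dev/null)   # lists the orphan blob
git gc -q --prune=now                          # no grace period: prune it right away
after=$(git fsck --unreachable 2>/dev/null)    # nothing left to report

echo "before gc: $before"
echo "after gc:  ${after:-<nothing>}"
```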

Fixing committer name

My problem with badly authored commits still remained. According to the documentation, the --env-filter option was the one I needed. The idea is that the filter is run for every commit of the branch, with some environment variables set:

GIT_COMMITTER_NAME=Eric Leblond
GIT_AUTHOR_EMAIL=eleblond@example.com
GIT_COMMIT=fbf7d74174bf4097fe5b0ec559426232c5f7b540
GIT_DIR=/home/regit/git/oisf/.git
GIT_AUTHOR_DATE=1280686086 +0200
GIT_AUTHOR_NAME=Eric Leblond
GIT_COMMITTER_EMAIL=eleblond@example.com
GIT_INDEX_FILE=/home/regit/git/oisf/.git-rewrite/t/../index
GIT_COMMITTER_DATE=1280686086 +0200
GIT_WORK_TREE=.

If you modify one of them and export the result, the commit will be modified accordingly. For example, my problem was that the commits from ‘jdoe’ were in fact from ‘John Doe’, whose e-mail is ‘john.doe@example.com’. I thus ran the following command:

git filter-branch -f --env-filter '
if [ "${GIT_AUTHOR_NAME}" = "jdoe" ]; then
GIT_AUTHOR_EMAIL=john.doe@example.com;
GIT_AUTHOR_NAME="John Doe";
fi
export GIT_AUTHOR_NAME
export GIT_AUTHOR_EMAIL
'
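
The result can be checked with ‘git log’. The following sketch replays the same filter on a throwaway repository with two commits, one from the misconfigured ‘jdoe’ and one from another author. The e-mail ‘jdoe@localhost’, the second author and the committer identity are hypothetical values chosen for the demonstration.

```shell
set -e
work=$(mktemp -d)                     # hypothetical sandbox directory
cd "$work"
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
echo one > file.txt
git add file.txt
git commit -q --author="jdoe <jdoe@localhost>" -m "Commit from the misconfigured machine"
echo two >> file.txt
git commit -q -a --author="Other Dev <other@example.com>" -m "Commit with a correct identity"

# Same filter as above: only commits authored by 'jdoe' are changed.
git filter-branch -f --env-filter '
if [ "${GIT_AUTHOR_NAME}" = "jdoe" ]; then
    GIT_AUTHOR_NAME="John Doe"
    GIT_AUTHOR_EMAIL="john.doe@example.com"
fi
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
' >/dev/null 2>&1

# Oldest commit first, one "name <email>" line per commit.
authors=$(git log --reverse --format='%an <%ae>')
echo "$authors"
```

Only the author fields are touched here; fixing a wrong committer identity would need the same treatment for GIT_COMMITTER_NAME and GIT_COMMITTER_EMAIL.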

Git shows here once again that it was made by demigod hackers who developed it to solve their own source management problems.