Cleaning up the git repo a bit


#1

When cloning the whole JUCE repo with its full history, we end up with 260 MB.

I had a look at the biggest files, below is the result.
I’m no git expert at all, but I guess it should somehow be possible to “fully” remove some of those files (AnimationAppExample.app…) from git to reduce the size a bit?

$ git rev-list --objects --all | grep "$(git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -20 | awk '{print$1}')"

8b2307c77f12b25e7c0737317e9215a814248f5f examples/AnimationAppExample/Builds/iOS/build/Debug/AnimationAppExample.app/AnimationAppExample
6c4e12b39e5659c43d4b858165301f96dc017929 examples/AnimationAppExample/Builds/iOS/build/Debug/AnimationAppExample.app/AnimationAppExample
aeb45da05b2c75a5092629f45ee85144e0679199 examples/AnimationAppExample/Builds/MacOSX/build/Debug/AnimationAppExample.app/Contents/MacOS/AnimationAppExample
a90cdff769311c3968faed6e0e6714e5bd88a31c examples/AudioAppExample/Builds/MacOSX/build/Debug/AudioAppExample.app/Contents/MacOS/AudioAppExample
7df9d4e5dc2c619965f78894ac9893eed9684759 examples/AnimationAppExample/Builds/MacOSX/build/Debug/AnimationAppExample.app/Contents/MacOS/AnimationAppExample
11493ecea2e16a09ef60c8dcff4fa3a99db822e9 juce_amalgamated.h
0e9a7136ad15d4d7a7cdcefa12345053b019b251 juce_amalgamated.cpp
50a7a737c873d35fbf8afce26190f4d94bf69ee8 juce_amalgamated.h
73153201aac54c51a260e12a30e2bf91651954b5 docs/JuceAPIDocs.zip
1309c29bf585ff63034bfbf441397653c0059a7f extras/prebuilt/JuceDemo.dmg
1a4c03928c4f858df5f5368764d54cc7fdc6d255 extras/prebuilt/PluginHost.dmg
a3ed2e60c6a08b6c3558fe1e0623f47d014bcc2c docs/JuceAPIDocs.zip
96685b5b795aa113f2683d4565a70c2db8b20e4d extras/prebuilt/JuceDemo.dmg
7282fc1d3ac7b49eae6d2f44aeb16b6702715cb8 docs/JuceAPIDocs.zip
ff6bb70f74cddaac47591975593cca3398d55b96 docs/JuceAPIDocs.zip
106140bac6e4abef8af6ca57aabb9e0d64a502ff docs/JuceAPIDocs.zip
340856cb7208d068881280735a6f9c2e27d0f5b0 extras/prebuilt/JuceDemo.dmg
5ab0099ca721598f9157a4f04e3e70679a805769 extras/prebuilt/JuceDemo.dmg
3af340affeae374059369797bd14f844dd6796be juce_amalgamated.cpp
bc3a95b8cab3e2f05c726840853d6ca0867d4821 juce_amalgamated.cpp
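
As a possible alternative to the one-liner above, here is a minimal sketch that lists the largest blobs anywhere in history together with their sizes, using `git cat-file --batch-check`. It builds a throwaway repository first so that it can be run anywhere; the file names `small.txt` and `big.bin` are made up for the demo. In a real clone, only the pipeline at the end should be needed.

```shell
set -e

# Throwaway demo repository (hypothetical file names, not the JUCE repo).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'hello\n' > small.txt
dd if=/dev/zero of=big.bin bs=1000 count=200 2>/dev/null
git add .
git commit -q -m "add files"

# Deleting the file in a later commit does not remove it from history.
git rm -q big.bin
git commit -q -m "remove big file"

# List every object reachable from any ref, keep only blobs,
# and sort by size (third column), largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -n -r |
  head -n 5)

echo "$largest"
```

Each output line has the form `blob <sha> <size in bytes> <path>`, so a file that was deleted long ago still shows up if it is reachable from any commit.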

#2

I don’t think this is possible without forcing all JUCE users to re-clone the repository, which we don’t want to do.


#3

I’m also not much of a Git expert, but can we somehow purge the old history and leave just the last few years, without messing up the commit SHAs?


#4

No way!*

*unless you know how to take advantage of the SHA-1 collisions recently found by Google :sweat_smile: With those, you could theoretically manage to craft commits that have different content but still the same final checksum.
That would make for an excellent April Fool’s joke :innocent:


#5

Maybe juce5 could start as a fork made with --depth=1, so users can decide for themselves whether they want the original with almost 10 years of history or only the latest development.

But I am also only an average git user, don’t know if that would work…


#6

The problem with doing a fork is that it’d be a new repo, with none of the existing github forks or stars. We’ll have a think about ways we might be able to do this without it messing up everyone’s repos.


#7

That would be bad because that would forget all previous history that has led to the current code.
It would be very hard to know what a certain patch changed, and why.

A better approach would be to still create a new fork, but with rewritten history, where all those huge files are simply not tracked but the commit history and the majority of changes can still be reconstructed if necessary.


#8

Yes, you are absolutely right. My idea was more to have both in parallel, merging all commits from the juce5 repo into the original one, so you can work with either the full history or the new one. But I just realised that there is probably not much benefit in maintaining two repos. Also, everybody can clone with --depth anyway…


#9

I wonder if it’s possible to take a github repo, make a shallow clone of it, and then force-push it back to github without breaking everything?


#10

For that, I think you could do it safely by taking advantage of the fact that you can actually create a new commit with no parent in the same repo.

In other words, that would be a new “Initial commit” for a new branch which is completely unrelated to the others. There, you can do what you want while still keeping the current develop and master intact.

When the result in this “purged” branch is satisfactory, you can move the develop and master branch labels to it (while perhaps keeping a “legacy” branch label on the old line of development for some time, just in case…)
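
A minimal sketch of the parentless-commit idea, run in a throwaway repository. The branch names `purged` and `legacy` follow the wording above; the file `a.txt` and everything else is made up for the demo.

```shell
set -e

# Throwaway demo repository standing in for the real one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo one > a.txt
git add a.txt
git commit -q -m "old commit 1"
echo two >> a.txt
git commit -q -am "old commit 2"

old_branch=$(git symbolic-ref --short HEAD)
old_tip=$(git rev-parse HEAD)

# Start a branch whose first commit has no parent.
# The index still holds the current files, so they become the new root commit.
git checkout -q --orphan purged
git commit -q -m "New initial commit"

# Later, once the result is satisfactory: keep a label on the old line,
# then move the main branch label over to the new one.
git branch legacy "$old_tip"
git branch -f "$old_branch" purged

echo "purged has $(git rev-list --count purged) commit(s)"
echo "legacy still points at $old_tip"
```

Until the labels are moved, the existing branches are untouched. Moving them is still a non-fast-forward change for anyone who has already cloned the repo.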


#11

…but if by “shallow” you mean with truncated history, then I’d advise against it for the reasons mentioned above.

It would be better to have the “purged” branch be the same as the original, but with the references to the big files removed. This would be possible by fiddling with the history-rewriting commands of git, such as git filter-branch.
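
A rough sketch of what that could look like, using `git filter-branch --index-filter` in a throwaway repository. The path `docs/Big.zip` and the `legacy` branch name are placeholders for the demo, not a recommendation for the real repo. The git documentation itself now points to `git filter-repo` as a faster alternative, which is not shown here.

```shell
set -e

# Throwaway demo repository with one large file mixed into normal history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir docs
dd if=/dev/zero of=docs/Big.zip bs=1000 count=200 2>/dev/null
echo 'int main() {}' > main.cpp
git add .
git commit -q -m "add source and a big zip"
echo '// change' >> main.cpp
git commit -q -am "edit source"

old_tip=$(git rev-parse HEAD)
git branch legacy          # keep a label on the untouched history

# Newer git versions print a warning and pause before running; this skips it.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Rewrite the current branch, dropping the big file from every commit.
git filter-branch \
  --index-filter 'git rm -q --cached --ignore-unmatch docs/Big.zip' \
  HEAD > /dev/null

new_tip=$(git rev-parse HEAD)
echo "old tip: $old_tip"
echo "new tip: $new_tip"
echo "commits kept: $(git rev-list --count HEAD)"
```

The commit messages and the other changes survive, but every rewritten commit gets a new hash, which is exactly the problem for existing clones and submodules discussed in this thread.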


#12

Ideally I think it’d be good for us to keep just e.g. 5 years history and have an archived copy of the full one for people who need to go through the ancient stuff. That’d shrink it considerably.
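
For comparison, individual users can already approximate “only the last few years” on their own side with a date-limited shallow clone, without anything changing on the server. This is only a sketch under assumptions: it uses a throwaway local repository with made-up commit dates, and the cut-off date is arbitrary. It needs a git version that supports `--shallow-since`.

```shell
set -e

# Throwaway "origin" with three commits spread over several years.
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo"

for year in 2010 2013 2016; do
  echo "$year" >> log.txt
  git add log.txt
  GIT_AUTHOR_DATE="$year-06-01 12:00:00 +0000" \
  GIT_COMMITTER_DATE="$year-06-01 12:00:00 +0000" \
    git commit -q -m "work from $year"
done

# file:// is used because depth options are ignored for plain local paths.
git clone -q --shallow-since="2012-02-01" "file://$work/origin" "$work/recent"

recent_count=$(git -C "$work/recent" rev-list --count HEAD)
echo "commits in the date-limited clone: $recent_count"
```

The commits that are fetched keep their original hashes; the clone simply stops at the cut-off date.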


#13

I’m still not entirely convinced that would be a good idea. However, since that won’t stop you from considering it, I’d suggest retaining the history from around commit aa6e9d38deca22d661218cabcbb745f6a0fea64b onwards.
That is the commit that brought the modules structure into the main line. It dates from February 2012, which also seems to fit the 5-year timeframe that you have in mind.

That being said, also be aware that altering the history of the repo in a destructive way will break all projects that reference JUCE as a submodule, so I think the commits of the “legacy” line should be kept living in the main repo for quite some time before being dropped completely (and possibly moved to a different, “archive” repo)


#14

Well, we’d never do that. This’d only be something we’d do if we could find a way that wouldn’t affect people’s existing clones.


#15

I’m no git expert by any stretch of the imagination… but perhaps the post by Campbell in this thread on Stack Overflow may show an option.

Rail


#16

This is another way of partitioning off your git history: https://git-scm.com/blog/2010/03/17/replace.html

However, unless I’m missing something clever, I think any process which modifies the history of a branch will change the hashes of the commits on it - which will, in turn, break submodules using it.
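
To illustrate why the hashes change, here is a small sketch in a throwaway repository. A commit hash covers the parent pointers as well as the file contents, so a commit with identical files but no parent gets a different hash. The file `a.txt` and the commit messages are made up for the demo.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo one > a.txt
git add a.txt
git commit -q -m "first"
echo two >> a.txt
git commit -q -am "second"

tip=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')

# Create a commit object with exactly the same tree (same files),
# but without the parent link to "first".
orphan=$(git commit-tree "$tree" -m "second")
orphan_tree=$(git rev-parse "$orphan^{tree}")

echo "tree of original tip:  $tree"
echo "tree of no-parent tip: $orphan_tree"
echo "original tip:  $tip"
echo "no-parent tip: $orphan"
```

The two trees match while the commit hashes differ, and any commit built on top of the new one would differ as well. A submodule pinned to the old hash would no longer find it on the rewritten branch.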


#17

There’s an article from Atlassian about this:
https://www.atlassian.com/blog/git/handle-big-repositories-git

Personally, I don’t have a problem with a full JUCE clone being 260 MB.
I usually only clone it once on a machine anyway (and then sometimes copy the folder + switch branch if I need multiple simultaneous versions for some reason).

Shallow clones seem to be a good solution if full clones really can’t be used (I never had to use it though). “Shallow clones used to be somewhat impaired citizens of the Git world as some operations were barely supported. But recent versions (1.9 and above) have improved the situation greatly, and you can properly pull and push to repositories even from a shallow clone now.”
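
A minimal sketch of the shallow-clone workflow, using a throwaway local repository in place of the real remote (all names here are made up for the demo). It clones only the latest commit and then shows that the full history can still be fetched later with `git fetch --unshallow`.

```shell
set -e

# Throwaway "origin" with three commits.
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo"

for n in 1 2 3; do
  echo "$n" >> log.txt
  git add log.txt
  git commit -q -m "commit $n"
done

# file:// is used because --depth is ignored for plain local paths.
git clone -q --depth=1 "file://$work/origin" "$work/shallow"
cd "$work/shallow"

shallow_count=$(git rev-list --count HEAD)
echo "commits after shallow clone: $shallow_count"

# Changed your mind? Fetch the rest of the history.
git fetch -q --unshallow
full_count=$(git rev-list --count HEAD)
echo "commits after unshallow: $full_count"
```

Against a real remote, the same `--depth=1` option goes on the usual `git clone <url>` command.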


#18

I wonder if the newly-discovered SHA-1 vulnerability would let us fake up the hash for an early commit so that we could do a git filter-branch that keeps all subsequent hashes the same? :slight_smile:


#19

Mine above was just a joke :joy: However, if this ever gets applied, it would be one of those very exceptional cases in which something that appeared to be a bug turns out to be a useful feature.


#20

(Oh, sorry - I didn’t notice that you’d already made the same joke earlier!)