Why does cloning from VSTS return old unreferenced objects?
UPDATE (2017-08-09):
We rolled out commit reachability bitmap indexes to VSTS and removed the clone cheat mentioned below. Cloning will no longer download unreachable objects! We still don't have true object-level git gc on the server yet, but clone sizes will be smaller now.
TFS on-prem will get these changes in v.Next (not in any TFS 2017 updates, but the next major release). As Brian Harry mentioned, we should have a release candidate for v.Next in a few weeks.
We'll probably expand on this in future blog posts, but unlike core Git, we use Roaring bitmaps instead of EWAH bitmaps. Daniel Lemire has some great blog posts and publications on bitmap indexes that we greatly enjoyed and benefited from.
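For readers who want to see a reachability bitmap index in practice, core Git can write its own (EWAH-based) one during a repack. The sketch below is only an illustration against a throwaway local repo; it says nothing about how the Roaring-based index on VSTS is built or stored. The scratch directory, file name, and demo identity are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: ask core Git to write its EWAH reachability bitmap for a scratch repo.
# This is core Git behavior only, not the VSTS server implementation.
set -e

repo=$(mktemp -d)   # hypothetical scratch location
cd "$repo"
git init -q .

echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "first commit"

# -a: put all objects into a single pack
# -d: delete packs made redundant by the new one
# --write-bitmap-index: write a .bitmap file next to the new pack
git repack -a -d -q --write-bitmap-index

# Count the bitmap files that now sit beside the pack
bitmap_count=$(ls .git/objects/pack/*.bitmap 2>/dev/null | wc -l | tr -d ' ')
echo "bitmap files: $bitmap_count"
```

A server that has such an index can answer "which objects are reachable from these refs" without walking the whole commit graph for every clone.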
Original Post:
Note: "core Git" refers to the official base Git implementation, as opposed to Visual Studio, GitHub, or VSTS, which may involve non-standard implementations or behavior.
A customer asked:
We removed some unwanted binaries from our repo on visualstudio.com by following the instructions at https://help.github.com/articles/remove-sensitive-data/. We force-pushed to master and deleted all our other branches.
After running git gc locally, our local repo is now 5 MB, but git clone from visualstudio.com still returns 100 MB. The old unreferenced blobs are still being sent down by the server. How do we git gc (or some equivalent) on the server as well?
There are two issues here:
1. There is no equivalent to git gc on VSTS yet.
Our server preserves the history of every ref/branch update to Git repos, including deleted branches. This is analogous to the "reflog" in core Git. On VSTS, we expose the reflog via the REST API and the Branch Updates (i.e. pushes) tab in Web Access. Similarly to core Git, objects in the reflog are still considered to be referenced and will not be deleted by git gc. Core Git can eventually prune old reflog entries via git prune or git gc, but VSTS does not have that functionality yet.
2. Large fetches are expensive for the server to calculate, so we cheat a little.
Large fetches (and clones) have historically been very expensive in both core Git and VSTS due to the "counting objects" phase. https://githubengineering.com/counting-objects/ has a nice explanation of the problem, as well as how core Git and GitHub have (cleverly) improved the performance with bitmap indexes.
Unfortunately, VSTS does not have that perf fix yet. Instead, it cheats a bit and blindly streams back every object that exists on the server if the client has nothing and asks for all branches and tags (e.g. for git clone). This is generally reasonable, until a user decides to dereference most of the objects in their repo to save space!
I suspect that the customer would not have minded the lack of gc in his scenario if we only sent reachable objects during clone.
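Both issues can be reproduced locally with core Git. The sketch below builds a throwaway repo, deletes a branch, and compares the number of objects stored with the number reachable from refs. It then shows that a plain git gc keeps the deleted branch's objects, while expiring the reflog and pruning immediately drops them. The branch and file names are made up for the example, and none of this runs against VSTS.

```shell
#!/bin/sh
# Sketch: stored vs. reachable objects in a scratch core Git repo.
set -e

repo=$(mktemp -d)   # hypothetical scratch location
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo keep > keep.txt
git add keep.txt
commit "keep this"

# Commit an unwanted file on a side branch, then delete that branch
git checkout -q -b big-binaries
echo "pretend this is a large binary" > big.bin
git add big.bin
commit "unwanted binary"
git checkout -q master
git branch -q -D big-binaries

# Every object in the object store, reachable or not
count_all() { git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' '; }
# Only objects reachable from refs (what a clone actually needs)
count_reachable() { git rev-list --objects --all | wc -l | tr -d ' '; }

reachable=$(count_reachable)
stored_before=$(count_all)

# Plain gc: the HEAD reflog still points at the deleted branch's commit
git gc -q
stored_after_gc=$(count_all)

# Drop all reflog entries, then prune unreachable objects right away
git reflog expire --expire=now --all
git gc -q --prune=now
stored_after_prune=$(count_all)

echo "reachable=$reachable before=$stored_before" \
     "after_gc=$stored_after_gc after_prune=$stored_after_prune"
```

The gap between the stored count and the reachable count is what a "send everything" clone pays for.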
Until these issues are fixed for VSTS, what workarounds are there?
Delete the repo from the server (EDIT: or rename it) and re-push it.
This works, but is sub-optimal. In the new repo, you won't be able to see old pull request details, branch update history, or any links from other areas like builds or work items.
Trick the server by not cloning everything at once:
mkdir newRepo
cd newRepo
git init
# <repo URL> is a placeholder for your repo's clone URL
git remote add origin <repo URL>
# fetch one branch first
git fetch origin master
# fetch everything else
git fetch origin
Comments
- Anonymous
March 30, 2016
I've just spent the better part of the day trying to shrink the repo on TFS 2015, now I know why everything failed. Thanks for this post, it is really helpful. What are the plans for fixing this?
- Anonymous
June 01, 2016
The comment has been removed
- Anonymous
August 09, 2017
It rolled out on VSTS, and will ship in TFS on-prem v.Next as well (see update above post)
- Anonymous
August 28, 2017
Is it implemented in VSTS now? How can we quickly check?
- Anonymous
August 28, 2017
Yes. At the time that I mentioned that "It rolled out on VSTS", the rollout was already complete. You can verify this with the following steps:
1. Create a new branch in your local copy of some repo on VSTS. Assume the new branch name is "NewBranch".
2. Create a new commit in NewBranch. Assume the commit ID is abc123.
3. Push NewBranch to the repo on VSTS.
4. At this point if you reclone the repo from VSTS, in the new clone:
* "git branch -rv" should show origin/NewBranch and its commit ID abc123
* "git cat-file -p abc123" should show the contents of the commit
5. Delete NewBranch from the repo on VSTS.
6. At this point if you reclone the repo from VSTS, in the new clone:
* "git branch -rv" should NOT show origin/NewBranch (this has always been the case)
* "git cat-file -p abc123" should say that abc123 is not valid (unlike in the past when abc123 could get cloned even if NewBranch was deleted)
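The verification steps above can be sketched as a script. To keep it self-contained, the version below uses a local bare repo reached over a file:// URL as a stand-in for the server, so it demonstrates core Git's behavior rather than VSTS's. To check a real VSTS repo, the same sequence would be run with its clone URL as the remote. The directory names (server.git, local, clone-before, clone-after) are assumptions for the example.

```shell
#!/bin/sh
# Sketch: is a deleted branch's commit still delivered by a fresh clone?
# A local bare repo over file:// stands in for the server here.
set -e

work=$(mktemp -d)   # hypothetical scratch location
cd "$work"
git init -q --bare server.git
git --git-dir=server.git symbolic-ref HEAD refs/heads/master
remote="file://$work/server.git"

commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

git init -q local
cd local
git symbolic-ref HEAD refs/heads/master
git remote add origin "$remote"
echo base > base.txt
git add base.txt
commit "base"
git push -q origin master

# Steps 1-3: create NewBranch, commit on it, push it
git checkout -q -b NewBranch
echo new > new.txt
git add new.txt
commit "commit on NewBranch"
commit_id=$(git rev-parse HEAD)
git push -q origin NewBranch
cd "$work"

# Step 4: a fresh clone should contain the commit
git clone -q "$remote" clone-before
if git -C clone-before cat-file -e "$commit_id^{commit}" 2>/dev/null
then before=present; else before=missing; fi

# Step 5: delete NewBranch on the server
git -C local push -q origin --delete NewBranch

# Step 6: a fresh clone should no longer contain the commit
git clone -q "$remote" clone-after
if git -C clone-after cat-file -e "$commit_id^{commit}" 2>/dev/null
then after=present; else after=missing; fi

echo "before delete: $before, after delete: $after"
```

The file:// URL matters: cloning from a plain local path copies the object store directly instead of going through the normal fetch negotiation, which would hide the difference being tested.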
- Anonymous
July 06, 2017
Is this functionality still in backlog?
- Anonymous
July 10, 2017
The comment has been removed
- Anonymous
August 09, 2017
It rolled out on VSTS, and will ship in TFS on-prem v.Next as well (see update above post)