ContentSync – a content-based file copy tool

Suppose you need to sync the contents of two large folders, Source and Destination. Normally robocopy Source Destination /MIR does the job. However, if a file's timestamp changed but its byte content didn't, robocopy will still copy the file and change the timestamp of the destination to match the source.

This is the safe and expected behavior. However, there are whole classes of tools that track file modification based on the timestamp alone. For instance, MSBuild will trigger a cascading rebuild of all dependencies of a file if its timestamp has changed (even if the actual file contents are exactly the same). This is called overbuilding. A scalable build system should detect that the file contents didn't change and avoid doing any work in this case.

Or take another example: suppose you need to upload hundreds of thousands of files to Azure using MSDeploy. If the timestamp on those files has changed, MSDeploy will upload those files even if the actual content is the same.

In general, if you’re deciding whether a file was modified based off the timestamp, you’re bound to schedule unnecessary work that could have been avoided if you checked whether the actual file bytes have changed.
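To make the distinction concrete, here is a minimal Python sketch of a content check that ignores timestamps entirely. The function name same_content and the chunk size are my own choices for illustration; this is not code from any of the tools mentioned above.

```python
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files hold identical bytes, ignoring timestamps."""
    # Different sizes mean different content, so nothing needs to be read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk settles it
            if not chunk_a:
                return True  # both files ended with no difference found
```

A timestamp check would report a file as modified whenever it was rewritten; this check reports it as modified only when at least one byte differs, at the cost of reading both files.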

Long story short, I wrote an open-source tool to sync/mirror directories based on the file content, not the timestamp:

https://github.com/KirillOsenkov/ContentSync

Usage: ContentSync.exe <Source> <Destination>
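For readers who want to see the idea rather than the tool, here is a rough Python sketch of a content-based mirror. It is not the actual ContentSync code; the function name content_sync and its return value are made up for this example, and it leans on Python's standard filecmp.cmp with shallow=False to compare bytes instead of timestamps.

```python
import filecmp
import os
import shutil


def content_sync(source, destination):
    """Mirror source into destination, touching only files whose bytes differ."""
    copied, deleted = [], []
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        dest_root = os.path.normpath(os.path.join(destination, rel))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_root, name)
            # shallow=False forces a byte-by-byte comparison instead of
            # trusting size and modification time.
            if os.path.isfile(dst) and filecmp.cmp(src, dst, shallow=False):
                continue  # identical content: keep the file and its timestamp
            shutil.copy2(src, dst)
            copied.append(os.path.relpath(dst, destination))
    # Walk bottom-up and remove whatever no longer exists in source.
    for root, _dirs, files in os.walk(destination, topdown=False):
        rel = os.path.relpath(root, destination)
        src_root = os.path.normpath(os.path.join(source, rel))
        for name in files:
            if not os.path.isfile(os.path.join(src_root, name)):
                os.remove(os.path.join(root, name))
                deleted.append(os.path.normpath(os.path.join(rel, name)))
        if rel != "." and not os.path.isdir(src_root):
            os.rmdir(root)
    return copied, deleted
```

Calling content_sync("Index", "Staging") would leave every unchanged file in Staging with its old timestamp, which is the property that matters for timestamp-driven tools downstream.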

I needed this for the exact Azure website deployment scenario: I maintain https://source.roslyn.io, and we’ve recently moved from on-premises Microsoft hosting to Azure. I use https://github.com/KirillOsenkov/SourceBrowser to regenerate the website every night, and that’s 160,000 modified files totaling over 1 GB. You don’t want to upload that to Azure every night.

So I inserted an intermediate step into my deployment script – I ContentSync from the freshly built Index folder (it’s not incremental, all the files are generated from scratch every time) into a Staging folder from the last time. ContentSync doesn’t touch the files that haven’t actually changed (and that’s 99.9…%), and so MSDeploy doesn’t upload them since the timestamp is the same.

While I’m here, kudos to David Ebbo and his excellent https://github.com/davidebbo/WAWSDeploy that I use to simply deploy a folder to Azure (read more here: https://blog.davidebbo.com/2014/03/WAWSDeploy.html). It uses MSDeploy internally and hides all the details I don’t want to know about behind a super simple UX.

Comments

  • Anonymous
    January 09, 2016
    Was the usechecksum option not suitable?

  • Anonymous
    January 10, 2016
    Betty - I don't know about the usechecksum option. Is this in robocopy? Do you have more details?

  • Anonymous
    January 10, 2016
    Now this is embarrassing :) Betty, thanks for the pointer :) Turns out usechecksum is the MSDeploy option. I should try that. Don't know how I missed that. Now the whole tool wasn't really needed in my case!

  • Anonymous
    January 10, 2016
    Well, no wonder I couldn't find the option previously. It is not listed in msdeploy /? help.

  • Anonymous
    January 14, 2016
    Should this blog post be updated with a how-to for "UseCheckSum"?

  • Anonymous
    January 15, 2016
    Since I'm not an expert on msdeploy I'd rather not write about something I don't know about and have never used. After a quick search it seems that that functionality has bugs and maybe other things I'm not aware of. I'd rather leave this as an exercise to the reader :)

  • Anonymous
    January 17, 2016
    Thanks for publishing awesome stuff here! I use github.com/.../Octodiff (a la rdiff - used by the Octopus Deploy team to deploy a release on the destination machine) to deploy only the changed part of a package. Octodiff is supposed to work on zip or any large binary file. It may not be so effective in your case - it's just one more solution. ^)

  • Anonymous
    January 19, 2016
    Sergio - thanks! Octodiff seems to be complementary to this tool - it knows how to efficiently copy a single file (whereas I just use File.Copy). Also my tool is extra bad in that respect: it reads the remote file twice (first to compare, then to actually copy).