Understanding data versioning in machine learning
Data versioning, also known as version control for data, is the practice of systematically tracking changes made to data over time. It's analogous to version control systems used in software development, such as Git, but applies to datasets and data-related assets. Data versioning is important for several reasons:
- Reproducibility: Data versioning enables researchers, data scientists, and analysts to reproduce and validate their work by providing a historical record of changes to the data. This history is crucial for ensuring the reliability and integrity of data-driven experiments and analyses.
- Collaboration: When many individuals or teams are working on the same dataset, data versioning helps manage concurrent changes and avoid conflicts. It provides a way to merge or compare different versions of the data.
- Auditing and Compliance: In regulated industries or organizations with strict data governance requirements, data versioning ensures that data changes are traceable and compliant with relevant standards and regulations.
- Error Recovery: Data versioning lets you roll back to an earlier version of the data if a newer version turns out to be flawed.
Even with all the success in machine learning, especially with deep learning and its applications in business, data scientists still lack best practices for organizing their projects and collaborating effectively. This is a critical challenge: while ML algorithms and methods are no longer tribal knowledge, they're still difficult to develop, reuse, and manage. Data versioning tools help address these problems. One such tool is DVC (Data Version Control).
Learn about data version control
DVC is your "Git for data"!
DVC is a data science tool that takes advantage of existing software engineering tool sets. It helps machine learning teams manage large datasets, make projects reproducible, and work together better. For individuals or teams who store and process data files or datasets to create other data or machine learning models, DVC provides a comprehensive solution so users can:
- Track and save data and machine learning models the same way you capture code.
- Create and switch between versions of data and ML models.
- Understand how datasets and ML artifacts were built.
- Compare model metrics among experiments.
- Adopt engineering tools and best practices in data science projects.
Explore solutions with DVC
DVC lets users capture the versions of data and models in Git commits, while storing the data itself on-premises or in cloud storage. It also provides a mechanism to switch between these different data contents. The result is a single history for data, code, and ML models that you can traverse.
With DVC, users can enable data versioning through codification. By producing simple metadata files once, users can describe what datasets and ML artifacts to track, and can store this metadata in Git instead of the large files themselves. DVC can create snapshots of data, restore earlier versions, and record metrics as they evolve.
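The tracking and restoring steps described above might look like the following sketch. It assumes Git and DVC are installed and that the current directory is a repository already set up with git init and dvc init. The helper names, the example path data/data.xml, and the commit message are hypothetical; dvc add and dvc checkout are the DVC commands doing the work.

```shell
# Sketch only: helper names and paths are hypothetical, not part of DVC.

# Track a data file with DVC and record its small .dvc pointer file in Git.
track_dataset() {
    data_path=$1
    message=$2
    # Stores the file content in the DVC cache and writes <data_path>.dvc;
    # DVC also lists the data file in a .gitignore in the same directory.
    dvc add "$data_path"
    git add "$data_path.dvc" "$(dirname "$data_path")/.gitignore"
    git commit -m "$message"
}

# Restore the data as it was at an earlier Git revision.
restore_dataset() {
    data_path=$1
    revision=$2
    # Bring back the old pointer file, then let DVC sync the data to match it.
    git checkout "$revision" -- "$data_path.dvc"
    dvc checkout "$data_path.dvc"
}

# Example usage (commented out so that sourcing this file runs nothing):
# track_dataset data/data.xml "Add raw data"
# restore_dataset data/data.xml HEAD~1
```

Only the pointer file and the .gitignore entry go into Git here, which is why the repository stays small even when the dataset is large.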
As you use DVC, unique versions of data files and directories are cached in a systematic way, preventing file duplication. The working data store is separate from the workspace to keep the project light, but remains connected via file links automatically handled by DVC.
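To see why caching by content prevents duplication, consider the toy sketch below. It is an illustration of the general idea only, not DVC's actual cache layout or implementation: the cache directory, the cache_file helper, and the sample files are all made up for this example, and it assumes the md5sum utility is available (as on most Linux systems).

```shell
# Illustration only: a toy content-addressed cache, not DVC's real implementation.

workdir=$(mktemp -d)
cache="$workdir/cache"
mkdir -p "$cache"

# Store a file under a name derived from the hash of its content.
cache_file() {
    digest=$(md5sum "$1" | cut -d' ' -f1)
    # Copy only if this exact content is not cached yet.
    [ -e "$cache/$digest" ] || cp "$1" "$cache/$digest"
    echo "$digest"
}

printf 'a,b\n1,2\n' > "$workdir/v1.csv"
printf 'a,b\n1,2\n' > "$workdir/copy_of_v1.csv"   # same content, different name
printf 'a,b\n1,3\n' > "$workdir/v2.csv"           # changed content

h1=$(cache_file "$workdir/v1.csv")
h2=$(cache_file "$workdir/copy_of_v1.csv")
h3=$(cache_file "$workdir/v2.csv")

# Three files were cached, but only two distinct contents are stored.
cached_count=$(ls "$cache" | wc -l | tr -d ' ')
echo "cached objects: $cached_count"

rm -rf "$workdir"
```

Because the stored name depends only on the content, identical files collapse into one cache entry, while any change to the data produces a new entry alongside the old one.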
Benefits of using the DVC approach include:
- Lightweight: DVC is a free, open-source command-line tool that doesn't require databases, servers, or any other special services.
- Consistency: Projects stay readable because file names are stable. Each tracked data file is represented by a small pointer file, so file and directory names in your source code don't have to change between data versions. DVC updates the pointer to reference the matching content in storage, and each pointer identifies a unique version of the dataset.
- Efficient data management: Use familiar, cost-effective storage for data and models (for example, Azure Blob Storage or S3), free from Git hosting constraints. DVC optimizes storing and transferring large files.
- Collaboration: Distribute project development and share its data internally and remotely, or reuse it in other places.
- Data compliance: Review data modification attempts as Git pull requests. Audit the project's immutable history to know when datasets or models were approved, and why.
- GitOps: Connect data science projects with the Git ecosystem. Git workflows open the door to advanced CI/CD tools, specialized patterns such as data registries, and other best practices.
Use Azure Blob Storage with DVC
DVC remotes allow access to external storage locations to track and share data and ML models. Usually, these are shared between devices or team members who are working on a project. For example, a team member can download data artifacts created by colleagues without spending time and resources to regenerate them locally.
Main uses of remote storage:
- Synchronize large files and directories tracked by DVC.
- Centralize or distribute data storage for sharing and collaboration.
- Back up different versions of datasets and models (saving space locally).
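A typical sharing round trip with a remote could be sketched as follows, assuming a default DVC remote and a Git remote are already configured. The helper names are hypothetical; dvc push, dvc pull, git push, and git pull are the underlying commands.

```shell
# Sketch only: helper names are hypothetical, not part of DVC.

# Upload DVC-tracked data to the default remote, then share the pointer files via Git.
publish_changes() {
    dvc push
    git push
}

# Get the latest pointer files from Git, then download the data they reference.
sync_workspace() {
    git pull
    dvc pull
}

# Example usage (commented out so that sourcing this file runs nothing):
# publish_changes   # run by the team member who produced new data
# sync_workspace    # run by colleagues who want that data locally
```

The order matters in both directions: data is pushed before the pointers that reference it, and pointers are pulled before the data they describe.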
Use Azure Blob Storage
Start with the dvc remote add command to define the remote. Set a name and a valid Azure Blob Storage URL:
dvc remote add -d myremote azure://<mycontainer>/<path>
- <mycontainer>: name of a blob container. DVC will try to create it if needed.
- <path>: optional path to a virtual directory in your container.
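The remote usually also needs credentials. The sketch below shows one possible setup that uses a storage connection string, assuming DVC is installed with Azure support (for example, via pip install "dvc[azure]"). The helper names and argument order are hypothetical; dvc remote add and dvc remote modify with the connection_string option are the DVC commands being wrapped. Other authentication methods exist and may suit your environment better.

```shell
# Sketch only: helper names are hypothetical, not part of DVC.

# Build an Azure remote URL from a container name and an optional path.
azure_remote_url() {
    container=$1
    prefix=${2:-}
    if [ -n "$prefix" ]; then
        echo "azure://$container/$prefix"
    else
        echo "azure://$container"
    fi
}

# Register the remote as the default and store a connection string for it.
configure_azure_remote() {
    remote_name=$1
    url=$2
    connection_string=$3
    dvc remote add -d "$remote_name" "$url"
    # --local writes to .dvc/config.local, which Git ignores,
    # so the secret is not committed with the project.
    dvc remote modify --local "$remote_name" connection_string "$connection_string"
}

# Example usage (commented out so that sourcing this file runs nothing):
# configure_azure_remote myremote "$(azure_remote_url mycontainer data)" "$AZURE_STORAGE_CONNECTION_STRING"
```

Keeping the secret in the local config file means each team member supplies their own credentials, while the remote name and URL are shared through the committed configuration.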
Integrate with Azure Machine Learning Service
Azure Machine Learning Service (AML) fully integrates with Azure Blob Storage to give authenticated access to data for AI training and inference workloads. However, AML doesn't offer a native data versioning capability, which can result in data duplication and management headaches during the AI experimentation and model generation lifecycle.