From 63558d4a24a32d09f2b23e90413fe00804667d97 Mon Sep 17 00:00:00 2001
From: Nabanita Dash
Date: Sat, 8 Jun 2019 20:28:31 +0530
Subject: [PATCH 1/2] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 5c57cd6..561e101 100644
--- a/README.md
+++ b/README.md
@@ -28,6 +28,8 @@ But, UNF is not perfect. The problems include:
 - It is quite sensitive to data structure (e.g., "wide" and "long" representations of the same dataset produce different UNFs)
 - It is not a version control system and provides essentially no insights into what changed, only that a change occurred
 
+[DVC](https://github.com/iterative/dvc) is a data-versioning tool that works much like Git. You can learn about it through [this introductory post](https://blog.dataversioncontrol.com/data-version-control-beta-release-iterative-machine-learning-a7faf7c8be67) and [this tutorial](https://blog.dataversioncontrol.com/data-version-control-tutorial-9146715eda46).
+
 All of these tools also focus on the data themselves, rather than associated metadata (e.g., the codebook describing the data). While some data formats (e.g., proprietary formats like Stata's .dta and SPSS's .sav) encode this metadata directly in the file, it is not a common feature of widely used text-delimited data structures. Sometimes codebooks are modified independently of data values and vice versa, but it's rare to see large public datasets provide detailed information about changes to either the data or the codebook, except in occasional releases.
 
 Another major challenge to data versioning is that existing version control tools are not well-designed to handle provenance. When data is generated, stored, or modified, a software-oriented version control system has no obvious mechanism for recording *why* values in a dataset are what they are or why changes are made to particular values. A commit message might provide this information, but as soon as a value is changed again, the history of changes *to a particular value* is lost in the broader history of the data file as a whole.

From 64cb8230c4a1610bf489aa47657a3429af9dd3ca Mon Sep 17 00:00:00 2001
From: Nabanita Dash
Date: Sun, 9 Jun 2019 19:52:22 +0530
Subject: [PATCH 2/2] Changed links for learning DVC

Added getting-started and tutorial links

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 561e101..15f10ad 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ But, UNF is not perfect. The problems include:
 - It is quite sensitive to data structure (e.g., "wide" and "long" representations of the same dataset produce different UNFs)
 - It is not a version control system and provides essentially no insights into what changed, only that a change occurred
 
-[DVC](https://github.com/iterative/dvc) is a data-versioning tool that works much like Git. You can learn about it through [this introductory post](https://blog.dataversioncontrol.com/data-version-control-beta-release-iterative-machine-learning-a7faf7c8be67) and [this tutorial](https://blog.dataversioncontrol.com/data-version-control-tutorial-9146715eda46).
+[DVC](https://github.com/iterative/dvc) is a data-versioning tool that works much like Git. You can learn about it through [the getting-started guide](https://dvc.org/doc/get-started) and [the tutorial](https://dvc.org/doc/tutorial).
 
 All of these tools also focus on the data themselves, rather than associated metadata (e.g., the codebook describing the data). While some data formats (e.g., proprietary formats like Stata's .dta and SPSS's .sav) encode this metadata directly in the file, it is not a common feature of widely used text-delimited data structures. Sometimes codebooks are modified independently of data values and vice versa, but it's rare to see large public datasets provide detailed information about changes to either the data or the codebook, except in occasional releases.
 