Thoughts on data longevity
Some data can be expected to stick around for a long time. Longevity, however, comes with some interesting challenges.
At the physical level, the piece of hardware on which data lives may deteriorate to the point that the data is unrecoverable. Having multiple copies tends to help with that, as you don't have a single point of failure, although you have to be able to detect the corruption in the first place! Classic solutions involve checksums or hashes at the logical or physical level, ranging from a single parity bit to digests as long as you want.
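To make the "detect it first, then fall back to another copy" idea concrete, here is a minimal sketch in Python. It assumes you recorded a SHA-256 digest when the data was archived; the helper names (`sha256_hex`, `find_intact_copy`) and the sample bytes are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_intact_copy(copies, expected_digest):
    """Return the first copy whose digest matches, or None if all are corrupt."""
    for copy in copies:
        if sha256_hex(copy) == expected_digest:
            return copy
    return None


# Hypothetical archived record and the digest stored alongside it.
original = b"customer ledger, 1994"
recorded_digest = sha256_hex(original)

# Simulate bit rot on one copy by flipping a single bit.
damaged = bytearray(original)
damaged[0] ^= 0x01

copies = [bytes(damaged), original]
recovered = find_intact_copy(copies, recorded_digest)
```

Note that the digest only tells you which copy is bad; it is the second copy that actually saves you.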
At a logical level, you may run into the problem of not having the software to even read the data. For example, you may simply have lost your copy of your old DOS-based database manager, or it won't run correctly on the operating systems you currently have. Other times, one particular app may be able to read the data, but has no capability to export it into anything that other software can consume.
In cases such as the latter, having the data in a format that can be consumed externally is very useful. The attribute that helps here is avoiding dependencies, which reduces the chance that you won't be able to satisfy them later. For example, even a plain text data file can be problematic if I don't know the code page or if I can't easily use or write software that uses that one particular code page.
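As a small illustration of the code page problem, here is a sketch in Python that decodes the same four bytes under two different code pages. The byte values are my own example: they spell "café" in the old DOS code page 437, but come out differently under Windows-1252.

```python
# Four bytes as they might sit in an old DOS-era text file.
raw = bytes([0x63, 0x61, 0x66, 0x82])

# Decoded with the code page the file was written in (assumed here: CP437).
as_cp437 = raw.decode("cp437")

# Decoded with a different, equally plausible guess (Windows-1252).
as_cp1252 = raw.decode("cp1252")

print(as_cp437)   # the intended text
print(as_cp1252)  # same bytes, wrong last character
```

Neither decode raises an error, which is exactly the trouble: without knowing the code page, you can't tell which reading is the right one.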
And so, here is my list of reasons why XML can help with a lot of these issues for long-term data storage.
- Text-based, which means that pretty much every programming language will be able to access it.
- Carries the encoding information for the text in the XML declaration, so you have some hope of picking the right code page or Unicode encoding.
- "Self documenting". I put this one in quotes because a tag name is usually no replacement for deeper understanding of the data; however it's a much, much better story than trying to reverse engineer a binary file.
- Flexible. There are very few constraints as to what you can represent in XML; you aren't limited to certain shapes and don't have to satisfy logical constraints imposed by the format.
- Comments! You can embed comments as to when / where information was archived from, and anything else you want, for human consumption.
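Here is a quick sketch of a few of these points working together, using Python's standard `xml.etree.ElementTree` module. The document itself (element names, the comment, the sample customer) is made up for illustration; the point is that the encoding declaration travels with the data, so the reader doesn't have to guess.

```python
import xml.etree.ElementTree as ET

# A hypothetical archived file, stored as ISO-8859-1 bytes.
# The declaration names the encoding and the comment records provenance.
archived = (
    "<?xml version='1.0' encoding='ISO-8859-1'?>\n"
    "<!-- Archived from a hypothetical legacy ledger; sample data only. -->\n"
    "<customers>\n"
    "  <customer id='1'><name>Jos\u00e9</name></customer>\n"
    "</customers>\n"
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(archived)
names = [customer.findtext("name") for customer in root.iter("customer")]
print(names)
```

The default parser skips the comment when building the tree, but it is still right there in the file for any human who opens it in a text editor.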
And that's what I can come up with at the moment - I'm sure there are more, so feel free to add them in the comments on this post.
Enjoy!