The Myth of Unstructured Data

There's no such thing as "unstructured data."

Yes, you hear that term all the time in the database industry, and it has appeared in countless books and articles. It's all wrong.

Data has meaning, and it is impossible to have meaning without structure.

Some people categorize data as either relational or "unstructured," with relational data being data that fits well into columns and rows and "unstructured" data being everything else, such as narrative text and images.

The truth is that what is referred to as relational data is actually data with very simple structures, and what is referred to as "unstructured" data is actually data with very complex structures.

Relational data is data with very simple structures of list-based relationships. A single column of data in a normalized data table is a list. Each value in that list is an instance of the attribute that defines that list. For example, a list of colors might have "color" as the name of the column and red, white, and blue as values in separate rows in that column. In that case, "red," "white," and "blue" are related to each other because they share the definition that they are colors. Two columns in a normalized table hold data with a one-to-one relationship between those columns, as the value "Aaron" in a FirstName column could relate to the value "Adams" in a LastName column when representing a specific person. Two columns with a one-to-many or many-to-many relationship have to be in separate tables in a normalized design. When normalized data is denormalized into star or snowflake designs, its management changes, and its storage may change, but it's still the same data with the same simple, list-based relationships.
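The list-based relationships described above can be sketched in a few lines of Python. This is only an illustration, not a database design: the second person ("Beth Baker"), the phone numbers, and the helper phones_for are hypothetical and were invented for the example.

```python
# A single column is a list: every value is an instance of the attribute "color".
color = ["red", "white", "blue"]

# Two columns with a one-to-one relationship: values at the same position
# belong to the same person. "Beth"/"Baker" is a hypothetical second row.
first_name = ["Aaron", "Beth"]
last_name = ["Adams", "Baker"]
people = list(zip(first_name, last_name))

# A one-to-many relationship (one person, many phone numbers) lives in a
# separate table, keyed by the person's row index. The numbers are made up.
phones = [(0, "555-0100"), (0, "555-0101"), (1, "555-0199")]


def phones_for(person_id):
    """Return every phone number recorded for one person."""
    return [number for pid, number in phones if pid == person_id]


# Denormalizing joins the two tables into one wide table. The layout changes,
# but the values and the relationships between them stay the same.
denormalized = [(*people[pid], number) for pid, number in phones]
```

In the denormalized list, "Aaron" and "Adams" simply repeat once per phone number. Nothing about the data itself has become more complex.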

Data with complex structures, erroneously referred to as unstructured, is primarily narrative text, images (single and sequential), audio, and executable code. Narrative text, such as novels and news stories, has complexly related grammatical structures. Still images, such as photographs and drawings, have complex structures of related patterns of color, luminance, etc. Sequences of images, such as video and animation, have the same complex structures as still images with the added dimension of time. Audio data has complex structures of related frequencies and amplitudes. Executable computer code has complex structures of related commands and variable values designed for sequential or parallel CPU consumption.

Some data can have hybrid levels of complexity, such as XML, which uses a set of simply structured tags that may contain either simply structured data or complexly structured data.
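A small Python sketch, using the standard library's xml.etree.ElementTree, shows the hybrid case. The XML document, its tag names (record, color, note), and its contents are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: simply structured tags wrapping one simple value
# and one piece of narrative text.
doc = """<record>
  <color>red</color>
  <note>The barn was painted red long before anyone thought to ask why.</note>
</record>"""

root = ET.fromstring(doc)

# The simple part fits a column-and-row model directly: tag name -> value.
simple = {child.tag: child.text for child in root if child.tag != "note"}

# The narrative part is extracted just as easily, but its meaning lives in
# grammar that the tags say nothing about.
narrative = root.find("note").text
```

The tags themselves are easy to enumerate, and the color value would drop straight into a table column. The sentence inside the note element keeps all of its grammatical complexity no matter how simple its wrapper is.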

If text, something visual, or audio has no structure, it is meaningless, and is therefore not data. If text has no coherent structure, it is random characters. If visuals have no coherent structures, they are meaningless. If audio has no coherent structure, it is random noise.

So, data is either simple or complex. It is never unstructured.