An Introduction to Big Data Concepts
The idea that data collected in computerized systems could be used to inform and thereby improve decision making has been around for quite some time. Over the last couple of decades, ideas of how to assemble a decision support system have coalesced around the concept of a data warehouse.
The construction of a proper data warehouse requires a non-trivial investment. This investment is made with the expectation of benefits, but these are often difficult to enumerate prior to the warehouse’s construction and subsequent employment. For this reason, the data warehouse requires a leap of faith.
For many years, preparation for this leap was a significant part of the conversation with customers interested in Business Intelligence (BI). Today, in recognition of the data warehouse as a tool for navigating business challenges and uncertainty, the conversation tends to focus on maximizing the impact of BI on the organization.
As customers focus on how best to extract insights from data, there is growing recognition of untapped data resources, especially unstructured data. These data remain largely untapped because:
- The value of these data relative to the cost of their processing and storage is low.
- These data are not easily stored and analyzed within the confines of the traditional data warehouse.
To illustrate these points, consider the data in a web log. These data could be very insightful to a business interested in engaging customers through a website. However, individual data records, holding information on a single page request or single image retrieval, are not likely to be high in value, especially over the longer periods of time in which data are stored in a traditional data warehouse.
Furthermore, the structure of many elements within the log records, such as the URI of the referrer or the query string associated with a requested resource, is highly variable in nature. Differing questions posed against these data may require them to be interpreted in differing ways. Significant pre-processing of the data in order to fit them neatly into the traditional data warehouse may be unnecessary or even counter-productive.
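To make the point about differing interpretations concrete, here is a minimal Python sketch. It assumes the log uses the common Apache "combined" layout; the sample line, the `q` and `utm_source` parameter names, and the helper functions are all hypothetical and chosen only for illustration. The same record is read two ways: once to find what a visitor searched for, and once to find which campaign referred the visitor.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical log line, assumed to follow the Apache "combined" layout.
LOG_LINE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/search?q=red+shoes&page=2 HTTP/1.1" 200 5123 '
    '"https://www.example.com/landing?utm_source=newsletter" "Mozilla/5.0"'
)

# The record-level structure is well understood and easy to capture.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_record(line):
    """Split one log line into its top-level fields."""
    match = COMBINED.match(line)
    if match is None:
        raise ValueError("line does not match the assumed combined log format")
    return match.groupdict()


def search_terms(record):
    """Interpretation 1: what did the visitor search for?"""
    query = urlsplit(record["target"]).query
    return parse_qs(query).get("q", [])


def campaign_source(record):
    """Interpretation 2: which campaign referred the visitor, if any?"""
    query = urlsplit(record["referrer"]).query
    values = parse_qs(query).get("utm_source", [])
    return values[0] if values else None


record = parse_record(LOG_LINE)
print(search_terms(record))
print(campaign_source(record))
```

Neither interpretation is fixed when the record is stored. Each is applied at the time a question is asked, which is why forcing these fields into a rigid schema up front may discard information that a later question needs.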
Web logs are a commonly cited form of unstructured data. A better term for these data may be complex or mixed-type data, as at some level these data have a well understood and meaningful structure. However, this structure often exists at a level of granularity higher than the level at which analysis is to be performed, and it’s this mismatch that leads to the unstructured moniker. Other forms of unstructured data include XML or JSON documents, images, video, and PDF, Word, or HTML documents.
The challenges of working with unstructured data, illustrated in the web log example, are often characterized in terms of four Vs. The four Vs are identified as:
- Volume – Defined as the total number of bytes associated with the data. Unstructured data are estimated to account for 70-85% of the data in existence and the overall volume of data is rising.
- Velocity – Defined as the pace at which the data are to be consumed. As volumes rise, the value of individual data points tends to diminish more rapidly over time.
- Variety – Defined as the complexity of the data in this class. This complexity defies traditional means of analysis.
- Variability – Defined as the differing ways in which the data may be interpreted. Differing questions require differing interpretations.
The four Vs articulate the broad challenges of working with unstructured data, but the dominant challenge tends to be data volume. As a result, the effort to extract insights from unstructured data is often referred to as Big Data.
Because of the challenges of the four Vs, Big Data necessitates an alternative approach to Business Intelligence. This alternative approach, which we might refer to as the unstructured data warehouse or the Big Data warehouse, does not invalidate the traditional data warehouse but does acknowledge its limitations in extracting insights from the full range of available data resources. Exactly what the unstructured data warehouse is, and how it will relate to the traditional (structured) data warehouse, has yet to be determined, but ideas are beginning to coalesce around distributed, algorithmic technologies such as Apache Hadoop.
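As a rough illustration of the style of processing associated with Hadoop, the following Python sketch imitates the map, shuffle, and reduce steps of the MapReduce model in a single process. It does not use any Hadoop API; the function names and the simplified "METHOD PATH" input lines are assumptions made for this example. On a real cluster the map and reduce steps would run in parallel across many machines, each working on its own portion of the data.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical, heavily simplified log lines: "METHOD PATH".
LINES = [
    "GET /home",
    "GET /products",
    "GET /home",
    "POST /cart",
    "GET /home",
]


def map_phase(line):
    """Emit a (path, 1) pair for each well-formed record."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(LINES)
print(counts)
```

The raw lines are interpreted only inside `map_phase`, so a different question can be answered by swapping in a different map function without reloading or restructuring the stored data.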