Datasets

A dataset can be described as “a collection of related sets of information composed of separate elements, but which can be manipulated as a unit by a computer” (Oxford English Dictionary, 2019).

In the following example, the dataset is the entire table “Shoe Sale Orders”. Its elements are the data under each column title (Merchant, for example), the data in each row (identified by its order number), and the value found where a row and a column intersect.

This table, which lists a set of shoe sale orders, is an example of a dataset:

Order number  Date    Merchant           Number of items  Style         Price  Transaction fee
1101          15-Jan  Tailwind Traders   100              Flip flops    500    20
1102          15-Jan  Fabrikam, Inc.     50               Flats         499    10
1103          15-Jan  Fabrikam, Inc.     50               Kitten heels  1000   10
1104          15-Jan  Northwind Traders  100              Ballerina     799    20
1105          16-Jan  Tailwind Traders   50               Kitten heels  1000   10
1106          16-Jan  Tailwind Traders   50               Flats         499    10
1107          17-Jan  Fabrikam, Inc.     50               Ballerina     799    10
1108          17-Jan  Northwind Traders  100              Flip flops    500    20
1109          17-Jan  Tailwind Traders   100              Ballerina     799    20
1110          18-Jan  Tailwind Traders   50               Boots         1200   10
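
To make the idea of rows, columns, and their intersections concrete, here is a minimal Python sketch that holds four rows of the table as a list of dictionaries. The key names (`order_number`, `items`, `fee`, and so on) and the helper functions `cell` and `items_by_merchant` are illustrative assumptions, not part of the original table.

```python
# Four rows copied from the "Shoe Sale Orders" table.
# Each dictionary is one row; each key stands for one column title.
# The key names are assumptions chosen for this sketch.
orders = [
    {"order_number": 1101, "date": "15-Jan", "merchant": "Tailwind Traders",
     "items": 100, "style": "Flip flops", "price": 500, "fee": 20},
    {"order_number": 1102, "date": "15-Jan", "merchant": "Fabrikam, Inc.",
     "items": 50, "style": "Flats", "price": 499, "fee": 10},
    {"order_number": 1104, "date": "15-Jan", "merchant": "Northwind Traders",
     "items": 100, "style": "Ballerina", "price": 799, "fee": 20},
    {"order_number": 1105, "date": "16-Jan", "merchant": "Tailwind Traders",
     "items": 50, "style": "Kitten heels", "price": 1000, "fee": 10},
]


def cell(rows, order_number, column):
    """Return the value where the row for order_number meets the column."""
    for row in rows:
        if row["order_number"] == order_number:
            return row[column]
    raise KeyError(f"no order with number {order_number}")


def items_by_merchant(rows):
    """Treat the dataset as a unit: total the items ordered per merchant."""
    totals = {}
    for row in rows:
        totals[row["merchant"]] = totals.get(row["merchant"], 0) + row["items"]
    return totals


# One element of the dataset: the intersection of row 1104 and column "style".
print(cell(orders, 1104, "style"))

# The whole dataset manipulated as a unit.
print(items_by_merchant(orders))
```

A list of dictionaries is only one possible representation. It is used here because it needs nothing beyond the standard library and keeps each row and column name visible.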

There are three kinds of datasets:

  • Private datasets
  • Public datasets
  • Semi-public datasets

Benefits of public-use datasets

  • Time saved, as data is already collected
  • Low or no cost
  • Typically a larger and more geographically diverse sample size than what the researcher might be able to access
  • May have annual data collection, allowing for cross-sectional, longitudinal, historical, or trend analysis questions
  • Initial data management (entry, prep, and organization) is done by the organization
  • Typically an abbreviated review by the institutional review board (IRB)
  • Accessible
  • National/regional or international data possibilities
  • Can generate new insight into research that other researchers have conducted with the dataset

Drawbacks of public-use datasets

  • Fixed sampling design, such as the sample age and geography, might not be appropriate for your research question
  • Data collection time period might be out of date or inappropriately timed to capture desired change
  • Limited to the variables offered and the way those variables were operationalized; might not have all variables of interest or offer a full picture
  • May require knowing or learning additional analytic techniques, such as using sampling weights
  • Time is needed to understand the original research design and data management in order to make correct assumptions, analytical choices, and conclusions
  • Might be inappropriate for your research question
  • Lack of control over quality of data collected
  • Limited number of datasets with experimental designs
  • Many archived datasets are old, and good-quality publications don't accept analyses that draw on heavily mined and outdated data
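
The sampling weights mentioned in the drawbacks above can be illustrated with a short Python sketch. It assumes the simplest case, in which each respondent's weight is the number of people in the population that the respondent represents. The ages and weights are made-up values for illustration only, and `weighted_mean` is a hypothetical helper rather than a function from any survey package.

```python
def weighted_mean(values, weights):
    """Return the mean of values, with each value counted by its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up survey of three respondents (illustrative values only).
ages = [20, 40, 60]
# Assumed meaning: how many people in the population each respondent represents.
weights = [300, 100, 100]

# Ignoring the weights treats every respondent as equally representative.
unweighted = sum(ages) / len(ages)

# Applying the weights lets the 20-year-old stand in for three times as many people.
weighted = weighted_mean(ages, weights)

print(unweighted)  # 40.0
print(weighted)    # 32.0
```

The gap between the two results shows why an analysis that ignores the weights supplied with a public-use dataset can reach the wrong conclusion. Real surveys usually document their own weighting scheme, which should take precedence over this sketch.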