Datasets
A dataset can be described as “a collection of related sets of information composed of separate elements, but which can be manipulated as a unit by a computer” (Oxford English Dictionary, 2019).
In the example below, the dataset is the entire table “Shoe Sale Orders”. Its elements are the column titles (Merchant, for example), the rows (one per order number), and the values found where a row and a column intersect.
This table listing a set of shoe sales orders is an example of a dataset:
Order number | Date | Merchant | Number of items | Style | Price | Transaction fee |
---|---|---|---|---|---|---|
1101 | 15-Jan | Tailwind Traders | 100 | Flip flops | 500 | 20 |
1102 | 15-Jan | Fabrikam, Inc. | 50 | Flats | 499 | 10 |
1103 | 15-Jan | Fabrikam, Inc. | 50 | Kitten heels | 1000 | 10 |
1104 | 15-Jan | Northwind Traders | 100 | Ballerina | 799 | 20 |
1105 | 16-Jan | Tailwind Traders | 50 | Kitten heels | 1000 | 10 |
1106 | 16-Jan | Tailwind Traders | 50 | Flats | 499 | 10 |
1107 | 17-Jan | Fabrikam, Inc. | 50 | Ballerina | 799 | 10 |
1108 | 17-Jan | Northwind Traders | 100 | Flip flops | 500 | 20 |
1109 | 17-Jan | Tailwind Traders | 100 | Ballerina | 799 | 20 |
1110 | 18-Jan | Tailwind Traders | 50 | Boots | 1200 | 10 |
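To make the idea of "manipulated as a unit" concrete, the following is a minimal Python sketch that loads the table above and works with it as a single object. The snake_case column names and the helper functions `load_orders`, `cell`, and `items_per_merchant` are illustrative assumptions, not part of the original material.

```python
import csv
import io
from collections import defaultdict

# The "Shoe Sale Orders" table transcribed as CSV text.
# Column names are our own snake_case versions of the table headers.
RAW = """\
order_number,date,merchant,number_items,style,price,trans_fee
1101,15-Jan,Tailwind Traders,100,Flip flops,500,20
1102,15-Jan,"Fabrikam, Inc.",50,Flats,499,10
1103,15-Jan,"Fabrikam, Inc.",50,Kitten heels,1000,10
1104,15-Jan,Northwind Traders,100,Ballerina,799,20
1105,16-Jan,Tailwind Traders,50,Kitten heels,1000,10
1106,16-Jan,Tailwind Traders,50,Flats,499,10
1107,17-Jan,"Fabrikam, Inc.",50,Ballerina,799,10
1108,17-Jan,Northwind Traders,100,Flip flops,500,20
1109,17-Jan,Tailwind Traders,100,Ballerina,799,20
1110,18-Jan,Tailwind Traders,50,Boots,1200,10
"""


def load_orders(text):
    """Parse the CSV text into a list of row dictionaries (one per order)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Convert the numeric columns from strings to integers.
        for field in ("number_items", "price", "trans_fee"):
            row[field] = int(row[field])
        rows.append(row)
    return rows


def cell(rows, order_number, column):
    """Return the value where a given row (order) and column intersect."""
    for row in rows:
        if row["order_number"] == order_number:
            return row[column]
    raise KeyError(order_number)


def items_per_merchant(rows):
    """Treat the dataset as a unit: total the items ordered by each merchant."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["merchant"]] += row["number_items"]
    return dict(totals)


orders = load_orders(RAW)
print(cell(orders, "1104", "merchant"))  # Northwind Traders
print(items_per_merchant(orders))
```

A single lookup such as `cell(orders, "1104", "merchant")` corresponds to one row-and-column intersection, while `items_per_merchant` operates on the whole collection at once.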
There are three kinds of datasets:
- Private datasets
- Public datasets
- Semi-public datasets
Benefits of public-use datasets
- Time saved, as data is already collected
- Low or no cost
- Typically a larger and more geographically diverse sample size than what the researcher might be able to access
- May have annual data collection, allowing for cross-sectional, longitudinal, historical, or trend analysis questions
- Initial data management (entry, prep, and organization) is done by the organization
- Typically an abbreviated review by the institutional review board (IRB)
- Accessible
- National/regional or international data possibilities
- Can generate new insights into research that others have conducted with the same dataset
Drawbacks of public-use datasets
- Fixed sampling design, such as the sample age and geography, might not be appropriate for your research question
- Data collection time period might be out of date or inappropriately timed to capture desired change
- Limited to the variables offered and the way they were operationalized; might not include all variables of interest or offer the full picture
- May require knowing or learning additional analytical techniques, such as using sampling weights
- Time is needed to understand the original research design and data management in order to make correct assumptions, analytical choices, and conclusions
- Might be inappropriate for your research question
- Lack of control over quality of data collected
- Few datasets with experimental designs are available
- Many archived datasets are old, and good-quality publications don't accept analyses that draw on heavily mined and outdated data
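One of the drawbacks above mentions sampling weights. As a minimal sketch, assume a hypothetical survey in which each respondent carries a weight giving the number of people in the population that the respondent represents. The values below and the function name `weighted_mean` are invented purely for illustration, and real public-use datasets document their own weighting schemes, which should be followed instead.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, with each value counted in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical respondent ages and sampling weights (invented for this sketch).
ages = [20, 30, 40, 50]
weights = [100, 100, 300, 500]

unweighted = sum(ages) / len(ages)       # treats every respondent equally
weighted = weighted_mean(ages, weights)  # accounts for who each respondent represents

print(unweighted)  # 35.0
print(weighted)    # 42.0
```

Ignoring the weights here would understate the average age, because the older respondents in this made-up sample each stand in for more people.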