Types of data
Research data can be described in many different ways. For example, they can be classified by source or by physical format. Sources of data include registers (e.g. administrative, historical, voting results, medical), existing research data, population groups and communications. Physical formats of data include numerical, textual, still image, geospatial, audio, video and software. Regardless of source and physical format, data are often defined by how they are created or captured. Examples include electronic text documents, spreadsheets, laboratory notebooks, field notebooks and diaries, questionnaires, transcripts and codebooks, audiotapes and videotapes, photographs and films, examination results, specimens, samples, artefacts, slides, database schemas, database contents, models, algorithms and scripts, workflows, standard operating procedures and protocols, experimental results, metadata and other data files such as literature review records and email archives.
When we speak about “new data”, we mean data that have emerged quite recently. Such data are sometimes referred to as Big Data, but neither term has an agreed definition.
The scholarly literature usually describes Big Data by their attributes. The three core attributes all start with the letter "V": Volume, Velocity and Variety (Couper, 2013).
- Volume means that Big Data are very large and that processing them demands great computational power.
- Velocity stands for the fact that Big Data are produced continuously, with new data emerging every moment.
- Variety reminds us that Big Data are unstructured and messy and thus not ready for immediate analysis.
Some authors add two more Vs, Veracity and Value (e.g., Wamba et al., 2015):
- Veracity tells us that Big Data must be carefully examined from the perspective of their trustworthiness. In other words, researchers should be careful about the quality of Big Data.
- Value means that Big Data potentially generate valuable insights that are important for decision-makers, policy-makers, researchers and various organizations.
The OECD defines six categories of Big Data according to their source:
A: Data stemming from the transactions of government, for example, tax and social security systems.
B: Data describing official registration or licensing requirements.
C: Commercial transactions made by individuals and organisations.
D: Internet data, deriving from search and social networking activities.
E: Tracking data, monitoring the movement of individuals or physical objects subject to movement by humans.
F: Image data, particularly aerial and satellite images but including land-based video images.
Social media data (category D) are the data from platforms like Facebook, Twitter, Instagram or YouTube. These data are created by the users of such platforms. Researchers can access these data in three main ways: 1) direct cooperation with the companies or platforms, 2) buying from data resellers, and 3) via APIs (one might add web scraping to the list, but most platforms discourage its use).
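To make the API route more concrete: many platform APIs return results one page at a time, together with a cursor or token that points to the next page. The following is a minimal sketch of that general pattern, not the interface of any particular platform. The response shape (`items`, `next_cursor`) and the names `collect_posts` and `fake_fetch_page` are assumptions made for illustration, and the network call is replaced by a stand-in so the example runs offline.

```python
from typing import Callable, Optional

# Hypothetical page shape: {"items": [...], "next_cursor": str or None}.
# Real platforms use their own field names; check their documentation.


def collect_posts(
    fetch_page: Callable[[Optional[str]], dict],
    max_items: int = 1000,
) -> list:
    """Follow cursors page by page until no cursor remains or max_items is reached."""
    posts: list = []
    cursor: Optional[str] = None  # None requests the first page
    while len(posts) < max_items:
        page = fetch_page(cursor)
        posts.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:  # the last page has no cursor
            break
    return posts[:max_items]


# Stand-in for a real HTTP request (illustrative data only).
_FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"items": [{"id": 3}], "next_cursor": None},
}


def fake_fetch_page(cursor: Optional[str]) -> dict:
    return _FAKE_PAGES[cursor]


if __name__ == "__main__":
    print(collect_posts(fake_fetch_page))
```

In practice, `fetch_page` would wrap an authenticated request to the platform, and the `max_items` cap is one simple way to stay within the rate limits and terms of use that platforms typically impose.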