File formats and data conversion
We use software to create text documents, websites, databases, photos, 3D models, and movies. Software developers regularly release new versions of their products, and there is no guarantee that new software can still open files created with earlier versions (compatibility). Some software packages disappear from the market altogether. Converting between file formats may be costly or may result in loss of information or reduced data quality. This is exactly why the choice of file formats should be planned carefully.
Short-term data processing: file formats for operability
The choice of file format depends on your research phase. Choices for short-term data processing may differ from the choices you make for long-term data preservation.
For short-term operability, it is advisable to choose a file format that is associated with the specific software you intend to use for data analysis. Following discipline-specific standards and customs is generally the way to go. However, you should take into consideration how widespread these standards are and to what extent they allow data processing by people outside your own discipline.
Proprietary file formats are owned and copyrighted by a specific company. Their specifications are usually not publicly available, and their future development depends on the decisions and circumstances of their owner. Thus, the risk of obsolescence is high. However, some proprietary formats, such as Rich Text Format (*.rtf), MP3, MPEG, JPG, MS Excel (*.xls), SPSS (*.sav, *.por) and Stata (*.dta), are widely used and you may assume that they will remain usable for a reasonable time.
Long-term data preservation: file formats for the future
Standard, open and widespread formats are advisable for long-term storage as they typically undergo fewer changes. Unlike proprietary formats (see above), open formats have publicly available specifications. Some of them are standardised and maintained by a standards organisation, so we may assume that they will remain readable in the future. Examples of open formats are PDF/A, CSV, TIFF, ASCII, Open Document Format (ODF), XML, Office Open XML, JPEG 2000, PNG, SVG, HTML, XHTML, RSS, CSS, etc.
A very useful tool for finding an appropriate format for different types of data is the table of Recommended file formats provided by the UK Data Service (n.d.-b).
Data conversion and possible data loss
Data files, depending on the nature of the data, are based on text encoding, binary encoding, or both. Binary-encoded information can be read only by specialised software, whereas text-encoded information can be read by a wide range of software, including text editors.
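The difference can be illustrated with a minimal Python sketch that stores the same three numbers once as text and once as binary. The binary layout used here (three little-endian 64-bit floats) is an assumption made for the example, not a property of any particular file format.

```python
import struct

values = (1.5, 2.25, 3.0)

# Text encoding: the bytes are readable in any text editor.
as_text = ",".join(str(v) for v in values).encode("ascii")

# Binary encoding: three little-endian 64-bit floats.
# This layout is an assumption of the sketch; a reader must know it in advance.
as_binary = struct.pack("<3d", *values)

print(as_text)
print(as_binary)

# The binary bytes can only be recovered by software that knows the layout.
recovered = struct.unpack("<3d", as_binary)
print(recovered)
```

The text version can be interpreted without any documentation, while the binary version is meaningless unless the reading software knows how the bytes are arranged.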
It is advisable to store your data for use in the future, which means converting them from a current data format to a long-term preservation format. Most software applications offer export or exchange formats that allow a text-formatted file to be created for importing into another program. A typical example is Microsoft Excel, which, through the 'Save As' command, can save spreadsheet data in comma-separated values format (*.csv). The structure of the rows and columns is preserved through commas and line returns. However, multiple worksheets must be saved as separate *.csv files, and any text formatting or macros in the native format will be lost on conversion.
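The following Python sketch illustrates the one-file-per-worksheet rule using only the standard library. It does not read a real Excel file: the workbook is a hypothetical in-memory dictionary, and the function name export_sheets_to_csv is invented for this example.

```python
import csv
import tempfile
from pathlib import Path


def export_sheets_to_csv(sheets, out_dir):
    """Write each worksheet (name -> list of rows) to its own CSV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in sheets.items():
        path = out_dir / f"{name}.csv"
        # newline="" lets the csv module control the line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written


# Hypothetical workbook: two worksheets held in memory as lists of rows.
workbook = {
    "respondents": [
        ["id", "age", "city"],
        [1, 34, "Prague"],
        [2, 51, "Bergen, Norway"],  # a comma inside a value gets quoted
    ],
    "codebook": [["variable", "label"], ["age", "Age in years"]],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = export_sheets_to_csv(workbook, tmp)
    csv_text = paths[0].read_text(encoding="utf-8")
    with paths[0].open(newline="", encoding="utf-8") as handle:
        rows_back = list(csv.reader(handle))

print(csv_text)
print(rows_back)
```

Rows and columns survive the export, but every value comes back as a string, and nothing about formatting or formulas is stored in the CSV files.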
During the process of data conversion, important pieces of information may be lost:
- When converting a statistical dataset (e.g., survey data), parts of the dataset may be lost, such as missing data definitions or decimal places; data types may change (e.g., numeric to string); and values may be truncated;
- In texts (e.g., transcriptions of speech), formatting such as highlighting, bold text, headers and footers may be lost;
- In images, resolution may be reduced and layers or colours may be lost;
- In audiovisual data, conversion may reduce sound quality;
- Some file formats are designed specifically to save space, which they achieve by discarding information and reducing data quality. For example, .jpg removes detail from images, while .tiff can retain the full information. Similarly, .mp3 is a lossy format for audio data, while .wav keeps the detailed information.
For this reason, the conversion itself should be done by a researcher familiar with the data, so that they can check for undesirable changes in the data that occurred as a result of the conversion.
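Part of such a check can be automated. The sketch below, written in Python with the standard library only, sends a small invented dataset through a CSV round trip and lists every cell that changed. The helper names roundtrip_csv and report_changes are hypothetical, and None stands in for a defined missing value.

```python
import csv
import io


def roundtrip_csv(rows):
    """Write rows to CSV text and read them back, as a converter would."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer))


def report_changes(original, converted):
    """Return (row, column, before, after) for every cell that changed.

    Only overlapping cells are compared, so row counts should be
    checked separately to detect truncation.
    """
    changes = []
    for r, (row_a, row_b) in enumerate(zip(original, converted)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                changes.append((r, c, a, b))
    return changes


original = [
    ["id", "income", "answer"],
    [1, 2500.75, "yes"],
    [2, None, "no"],  # None stands for a defined missing value
]
converted = roundtrip_csv(original)
changes = report_changes(original, converted)

print("same number of rows:", len(original) == len(converted))
for change in changes:
    print(change)
```

In this example the numeric values come back as strings and the missing value becomes an empty string, which are exactly the kinds of changes a researcher familiar with the data should review.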
Because national character sets differ, you should also pay attention to character encoding. Some encodings (e.g., Windows-1250) cover only a limited group of character sets. As a result, a matching language environment (here, Central European languages) has to be set to display the text correctly, which is not always possible. Other encodings (e.g., UTF-8) allow symbols from several character sets to be displayed correctly at the same time.
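A short Python sketch shows the effect. The sample strings are chosen only for illustration, and Latin-1 stands in for "a Western European environment" as an assumption of the example.

```python
text = "Dvořák"

# Bytes written under Windows-1250 (Central European languages) ...
raw = text.encode("cp1250")
# ... are displayed incorrectly when read under a Western European encoding.
misread = raw.decode("latin-1")

# Windows-1250 has no Cyrillic letters, so this text cannot be stored in it.
mixed = "Dvořák and Чехов"
try:
    mixed.encode("cp1250")
    fits_cp1250 = True
except UnicodeEncodeError:
    fits_cp1250 = False

# UTF-8 holds both scripts in the same file and round-trips exactly.
utf8_ok = mixed.encode("utf-8").decode("utf-8") == mixed

print(misread, fits_cp1250, utf8_ok)
```

Stating the encoding explicitly when reading and writing files, rather than relying on the system default, avoids most of these problems.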
TIP: Plan ahead to simplify data publication
Different data archives have different preferred formats. Knowing about these preferred formats in advance can save you time later when you want to archive and publish your data. Preferred formats are usually widely used, independent of specific software, and have open specifications. You can find more information on how preferred file formats differ between archives in RDNL (n.d.). For example, compare the UK Data Service (n.d.-b) and DANS (2023), mentioned above, with the National Archives of Australia (n.d.) and the 4TU.Centre for Research Data (2023).