The structure of a data file is supported by the organisation of its variables. Variable names and labels contribute to this structure: they allow part of the documentation to be integrated into the data file and help researchers orient themselves in the data set. At the same time, variable names should be short and should respect the usual requirements of standard software, because they are used as calling codes in software operations.
The position of variables in the data file, their names and labels should reflect the following:
Organising your data
Data files also include supplementary variables which facilitate orientation and management, ensure integrity, or are necessary to perform some analyses. As a rule, you should include a unique identifier (or set of identifiers) for cases (individual respondents) in the file. A unique identifier is an identification code for the case. Identifiers are usually numbers, for example 0001, 0002, 0003, etc. To facilitate orientation, they are usually placed at the very beginning of the file.
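The identifier convention described above can be sketched in a few lines of Python. This is only an illustration: the function names `make_case_ids` and `find_duplicate_ids` and the default width of four digits are assumptions made for the example, not part of any standard.

```python
def make_case_ids(n_cases, width=4):
    """Return zero-padded sequential identifiers such as '0001', '0002'."""
    return [str(i).zfill(width) for i in range(1, n_cases + 1)]


def find_duplicate_ids(ids):
    """Return identifiers that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for case_id in ids:
        if case_id in seen and case_id not in duplicates:
            duplicates.append(case_id)
        seen.add(case_id)
    return duplicates


ids = make_case_ids(3)
print(ids)                                  # ['0001', '0002', '0003']
print(find_duplicate_ids(ids + ["0002"]))   # ['0002']
```

Storing the identifiers as zero-padded text keeps them the same width for every case and makes an accidental duplicate easy to detect before analysis.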
Other variables may help to distinguish between different sources of information, methods of observation, temporal or other links. Yet others may provide information about the organisation of data collection such as interviewer ID or interviewing date or distinguish cases which belong to various groups.
For analysis, it is essential to be able to distinguish data that result from oversampling strategies, different waves of research, etc., especially if the groups of cases they define are to be analysed in different ways.
For each variable in the data file, you should set the variable width, i.e. the number of characters or the length of the integer and fractional parts of a number. The set number of characters or digits for each variable is reserved for every case, even if the field is left blank.
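As a rough illustration of reserved widths, the following Python sketch writes one case as a fixed-width record. The layout (variable names, widths and decimals) and the helper `format_record` are hypothetical and chosen only for the example.

```python
# Hypothetical layout: (variable name, total width, decimals or None for text).
LAYOUT = [("CASEID", 4, None), ("AGE", 3, 0), ("WEIGHT", 6, 2)]


def format_record(case, layout=LAYOUT):
    """Format one case as a fixed-width line; blanks fill missing values."""
    fields = []
    for name, width, decimals in layout:
        value = case.get(name)
        if value is None:
            text = ""                         # width is still reserved
        elif decimals is None:
            text = str(value)                 # text variable
        else:
            text = f"{value:.{decimals}f}"    # fixed number of decimals
        if len(text) > width:
            raise ValueError(f"{name}: {text!r} exceeds width {width}")
        fields.append(text.rjust(width))
    return "".join(fields)


print(repr(format_record({"CASEID": "0001", "AGE": 34, "WEIGHT": 71.5})))
# '0001 34 71.50'
print(repr(format_record({"CASEID": "0002", "WEIGHT": 80.25})))
# '0002    80.25'
```

Both records have the same length even though the second case has no age, which is what allows software to read each variable from a fixed position.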
Naming variables
The basic rules for variable naming are as follows:
Start with a letter. Do not start with a number, a question mark, an exclamation mark or a special character such as #, &, $, @ (these are often reserved for specific purposes in software applications);
Do not use spaces in variable names;
Keep them short. Variable names are also used as calling codes in software operations, so they should respect the usual requirements of standard software. The usual convention is to make variable names no longer than eight characters;
Do not use diacritics (marks above or below a letter) or nation-specific characters;
Make them meaningful (so they can be used for better orientation in the data files).
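The mechanical rules in this list can be checked automatically. The sketch below is one possible check written in Python; the function name `check_variable_name` and the wording of the messages are assumptions for illustration, and the eight-character limit is simply the recommendation given above (individual software packages may allow more). Whether a name is meaningful cannot be tested this way and still needs human judgement.

```python
import re

MAX_NAME_LENGTH = 8  # limit recommended in the text above


def check_variable_name(name, max_length=MAX_NAME_LENGTH):
    """Return a list of rule violations for a proposed variable name."""
    if not name:
        return ["name is empty"]
    problems = []
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with an unaccented letter")
    if any(ch.isspace() for ch in name):
        problems.append("contains spaces")
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    if not name.isascii():
        problems.append("contains diacritics or non-ASCII characters")
    return problems


for candidate in ["AGE", "1stjob", "birth year", "#weight"]:
    print(candidate, "->", check_variable_name(candidate) or "ok")
```

An empty list means that none of the checked rules is violated.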
There are three basic approaches to naming variables:
Using numeric codes that reflect the variable’s position in a system (e.g. V001, V002, V003...);
Using codes that refer to the research instrument (e.g. question number in a questionnaire: Q1a, Q1b, Q2, Q3a...);
Using mnemonic names that refer to the content of variables (e.g. BIRTH for the year of birth, AGE for respondent’s age etc.). The word mnemonic means “memory aid”.
Variable labels
Variable labels provide a short description of the variable. They can be longer than the eight characters recommended for variable names. Although size limits are less strict here, it is advisable to keep variable labels brief and to find an adequate compromise between clarity and length. Keep in mind that many analytical outputs are presented as tables, so excessively long labels can result in large and impractical tabulations. Long labels may also complicate format conversions: in some analytical outputs, or after conversion, only part of a long label is kept, and the loss of the remainder may make the label incomprehensible.
Examples of variable labels include a short or full version of the question, or a question code if variable names are not constructed around them. E.g.:
The variable label is adapted from the number and question-wording from the questionnaire: “B10 - How old are you?”;
The descriptive label is “Age of a respondent”;
Schematically this becomes: “Respondent: AGE”.
To reach the widest audience possible, the preferred language for variable naming is English.
Labels for variable values
Variables have two or more values (a variable with only one value is called a constant and is, in fact, not a variable). Sometimes you must assign labels to the values of variables. You do not need to assign labels to values of continuous variables like age (in years), height (in metres) or weight (in kilograms), because their units are generally known. This is different for nominal and ordinal variables. A nominal variable like gender has two values, usually represented by 0 and 1 in the data. You should assign the labels "male" and "female" to these two values, so that you and any other researcher who might use the data know which value represents which gender. The same applies to ordinal scales, for example an agree-disagree scale with values 1, 2, 3, 4 and 5, where 1 represents "completely disagree" and 5 "completely agree". You must label these values so that you and others know what degree of agreement or disagreement each number represents.
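Value labels can be thought of as a lookup table from codes to meanings. The Python sketch below illustrates the idea under several assumptions: the variable names `SEX` and `AGREE`, the assignment of 0 to "male" and 1 to "female", and the wording of the intermediate scale points 2 to 4 are all invented for the example, since the text above fixes only the end points of the scale.

```python
# Hypothetical value labels; only the scale end points come from the text.
VALUE_LABELS = {
    "SEX": {0: "male", 1: "female"},  # which code means which is an assumption
    "AGREE": {
        1: "completely disagree",
        2: "disagree",                     # assumed wording
        3: "neither agree nor disagree",   # assumed wording
        4: "agree",                        # assumed wording
        5: "completely agree",
    },
}


def label_value(variable, code, value_labels=VALUE_LABELS):
    """Return the label for a code, or the code itself if none is defined."""
    labels = value_labels.get(variable)
    if labels is None:           # e.g. a continuous variable such as age
        return str(code)
    return labels.get(code, f"{code} (unlabelled)")


def unlabelled_codes(variable, observed, value_labels=VALUE_LABELS):
    """List observed codes of a variable that have no value label."""
    return sorted(set(observed) - set(value_labels.get(variable, {})))


print(label_value("AGREE", 5))                   # completely agree
print(label_value("AGE", 34))                    # 34
print(unlabelled_codes("AGREE", [1, 5, 9, 9]))   # [9]
```

Checking for unlabelled codes before sharing a file is a simple way to find values that another researcher would not be able to interpret.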
Two different concepts of variable naming and labelling in the data file from the International Social Survey Programme
The International Social Survey Programme (ISSP) is a continuing, long-term international programme of survey research on important sociological topics. It brings together pre-existing social science projects and coordinates research goals, thereby adding a cross-national perspective to the individual national studies. Established in 1984, it now has almost 50 member countries. The ISSP surveys are organised annually.
Each ISSP survey contains two international modules:
ISSP thematic module: A specific topic of the survey is selected for each year. There are about ten topics, which are repeated at regular intervals. However, sometimes a topic is skipped or replaced by a new one.
ISSP background variables module: This module includes a set of harmonised sociodemographic variables and is repeated every year. However, there are also frequent changes in this set of variables.
Two different concepts of variable naming and labelling are used for these two modules.
Table: Excerpt from the variable list of the international dataset from ISSP 2009 on ‘Social Inequalities’ (ISSP Research Group, 2017).
Variable name    Variable label

ISSP 2009 thematic module variables
V73              Q24a Describe yourself: I work hard to complete my daily tasks
V74              Q24b Describe yourself: I perform to the best of my ability
V75              Q24c Describe yourself: I work hard to maintain my performance on a task
V76              Q25a Describe yourself as <14-15-16> years old: I tried hard to go to school every day
V77              Q25b Describe yourself as <14-15-16> years old: I performed to the best of my ability

ISSP background variables
SEX              R: Sex
AGE              R: Age
MARITAL          R: Marital status
COHAB            R: Steady life-partner
EDUCYRS          R: Education I: years of schooling
DEGREE           R: Education II-highest education level
AR_DEGR          Country-specific education: Argentina
AT_DEGR          Country-specific education: Austria
AU_DEGR          Country-specific education: Australia
BE_DEGR          Country-specific education: Belgium
In the table we see two approaches to variable naming and labelling:
Simple variable names: The first, thematic part of the file contains simple variable names (numeric codes). The question numbers from the common international questionnaire are included in the variable labels, which helps users orient themselves in the data file. Each question number is followed by the literal question wording, sometimes shortened enough to keep the label brief while remaining comprehensible. Some ISSP surveys allow alternative wording of questions; the possible alternatives are enclosed in inequality signs (angle brackets). Similarly, where the wording is country-specific (e.g., country name, the currency used), a general term is given in inequality signs.
Mnemonic names of variables: The second part contains background variables and uses mnemonic variable names that refer to their contents. These background variables are not directly linked to the wording of questions in the international questionnaire but are instead constructed from national versions of the data. Their names refer to their contents and, at the same time, to the links between them (e.g., DEGREE = the education variable transformed into an internationally comparable form, XX_DEGR = education variables using the original country-specific coding). Moreover, the set of mnemonic names of background variables is standardised across different ISSP surveys, which makes it easier to merge ISSP data files across time and to construct time-series databases.
TIP! Mnemonic variable names may help to establish links between sets of variables within a data file. In addition, if the same mnemonic naming convention is used in repeated surveys, merging data over time becomes easier.
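Both uses of mnemonic names can be sketched in Python. The example below is a simplified illustration, not the ISSP's own procedure: the helper names `country_specific_degree_vars` and `stack_waves`, the wave labels and all record values are made up, and the pattern assumes that country-specific education variables follow the XX_DEGR form shown in the table.

```python
import re

# Assumed pattern: two-letter country prefix followed by _DEGR.
COUNTRY_DEGREE = re.compile(r"^([A-Z]{2})_DEGR$")


def country_specific_degree_vars(names):
    """Map each country prefix to its country-specific education variable."""
    found = {}
    for name in names:
        match = COUNTRY_DEGREE.match(name)
        if match:
            found[match.group(1)] = name
    return found


def stack_waves(waves):
    """Stack records from several waves, keeping only shared variable names."""
    per_wave = [set().union(*(rec.keys() for rec in records))
                for records in waves.values()]
    common = sorted(set.intersection(*per_wave)) if per_wave else []
    stacked = []
    for wave, records in waves.items():
        for rec in records:
            row = {"WAVE": wave}          # keeps the source wave identifiable
            row.update({name: rec.get(name) for name in common})
            stacked.append(row)
    return common, stacked


print(country_specific_degree_vars(["DEGREE", "AR_DEGR", "AT_DEGR", "EDUCYRS"]))
# {'AR': 'AR_DEGR', 'AT': 'AT_DEGR'}

# Made-up records: thematic variables differ, background names are shared.
waves = {
    "wave_1": [{"CASEID": "0001", "SEX": 1, "AGE": 34, "V5": 2}],
    "wave_2": [{"CASEID": "0001", "SEX": 0, "AGE": 51, "V73": 4}],
}
common, stacked = stack_waves(waves)
print(common)   # ['AGE', 'CASEID', 'SEX']
```

Because the background variables carry the same names in both waves, they line up without any renaming, while the numerically coded thematic variables drop out. The added WAVE variable, together with CASEID, keeps each case uniquely identifiable after merging.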