Quantitative coding
Quantitative coding is the process of categorising collected non-numerical information into groups and assigning numerical codes to these groups. Numeric coding is supported by all statistical software and, among other things, facilitates data conversion and the comparison of measurements.
For closed-ended questions in survey questionnaires, the coding scheme is often incorporated directly into the questionnaire and data are entered numerically. This process is automated in computer-assisted interviewing (CAPI, CATI, etc.), where an answer and its code are saved immediately on a computer in the course of data collection. Answers can also be coded on paper questionnaires, where coders record codes in a designated place on the questionnaire before the data are digitised. If the numerical codes are not incorporated in your questionnaire, set up a detailed procedure for coding the different answer alternatives.
More complex coding exercises, e.g. for textual answers in survey questionnaires, require an independent coding process with a clearly defined design: a coding structure and, if there are several coders, a procedure and schedule of exercises.
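As a minimal sketch of such a coding procedure, the snippet below maps the answer alternatives of one closed-ended question to numeric codes. The question, its labels, the codes and the function name `code_answer` are all hypothetical and only illustrate the idea.

```python
# Hypothetical coding scheme for one closed-ended question
# (labels and codes are illustrative, not taken from any real questionnaire).
SATISFACTION_CODES = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}


def code_answer(answer, scheme):
    """Return the numeric code for a closed-ended answer.

    Raises ValueError for answers outside the scheme, so that unexpected
    input is flagged instead of being coded silently.
    """
    key = answer.strip().lower()
    if key not in scheme:
        raise ValueError(f"Answer not in coding scheme: {answer!r}")
    return scheme[key]


answers = ["Satisfied", " neutral ", "Very satisfied"]
codes = [code_answer(a, SATISFACTION_CODES) for a in answers]
print(codes)  # [4, 3, 5]
```

Rejecting unknown answers is a design choice in this sketch: it forces the coder to extend the scheme explicitly rather than improvise a code.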
Documentation
The meaning of codes must be documented. Specialised analytic software (SPSS, SAS, Stata, etc.) lets the user assign labels directly to the codes. For the principles of constructing labels, see the sub-section 'Organisation of variables'. If the software does not allow you to assign code labels directly to the data, you have to document the codes in a separate document as part of the metadata.
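Where labels cannot be stored with the data, a separate codebook file can serve as that document. The following is one possible sketch using only the Python standard library; the variable name, labels and column layout are assumptions made for illustration.

```python
import csv
import io

# Hypothetical variable with its variable label and value labels.
CODEBOOK = {
    "q1_satisfaction": {
        "label": "Satisfaction with public transport",
        "values": {
            1: "Very dissatisfied",
            2: "Dissatisfied",
            3: "Neutral",
            4: "Satisfied",
            5: "Very satisfied",
            -1: "Refusal",
        },
    },
}

FIELDS = ["variable", "variable_label", "code", "value_label"]


def codebook_rows(codebook):
    """Flatten the codebook into one row per (variable, code) pair."""
    rows = []
    for variable, spec in codebook.items():
        for code, value_label in sorted(spec["values"].items()):
            rows.append({
                "variable": variable,
                "variable_label": spec["label"],
                "code": code,
                "value_label": value_label,
            })
    return rows


def write_codebook_csv(codebook, handle):
    """Write the flattened codebook as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(codebook_rows(codebook))


# Write to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
write_codebook_csv(CODEBOOK, buffer)
print(buffer.getvalue())
```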
Coding recommendations
The coding recommendations below are inspired by ICPSR (2012).
All identification variables should be included at the beginning of your data file. Identification variables usually include a unique identification of your study/data file, unique ID numbers of cases in your data file (e.g. ID of the respondent, ID of his/her household, etc.) as well as the identification of other characteristics essential for analysis (e.g. identification of different methods of data collection or sources, identification of the over-sample, etc.).
Code categories should be mutually exclusive, exhaustive, and precisely defined. Ambiguity will cause coding difficulties and problems with the interpretation of the data. You should be able to assign each response of the respondent into one and only one category.
Recording original data, such as age and income, is more useful than collapsing or bracketing the information. With original or detailed data, secondary analysts can determine other meaningful brackets on their own rather than being restricted to those chosen by others.
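To illustrate why original values are more flexible, the sketch below keeps raw ages and derives two different sets of brackets from them afterwards. The bracket boundaries and the helper name `make_bracketer` are hypothetical. Because each bracket is a half-open interval, every value falls into one and only one category.

```python
from bisect import bisect_right


def make_bracketer(lower_bounds):
    """Build a function that assigns a value to a numbered bracket.

    Bracket i (starting at 1) covers [lower_bounds[i-1], lower_bounds[i]);
    the last bracket is open-ended. The brackets are therefore mutually
    exclusive and exhaustive for all values >= the lowest bound.
    """
    bounds = sorted(lower_bounds)

    def bracket(value):
        if value < bounds[0]:
            raise ValueError(f"{value} is below the lowest bound {bounds[0]}")
        return bisect_right(bounds, value)

    return bracket


# Raw ages as recorded (illustrative data).
ages = [17, 18, 34, 35, 64, 65, 90]

# Two alternative bracketings derived from the same raw data.
coarse = make_bracketer([0, 18, 65])
fine = make_bracketer([0, 18, 35, 50, 65])

print([coarse(a) for a in ages])  # [1, 2, 2, 2, 2, 3, 3]
print([fine(a) for a in ages])    # [1, 2, 2, 3, 4, 5, 5]
```

Had only the coarse brackets been stored, the finer grouping could not have been recovered.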
Responses to closed-ended questions should retain the original coding scheme to avoid errors and confusion. For open-ended questions, investigators can either use a predetermined coding scheme or construct a coding scheme based on major categories that emerge in survey responses. Any coding scheme and its derivation should be reported in study documentation.
Responses recorded as full verbatim (word for word) must be reviewed for disclosure risk and if necessary treated in accordance with applicable personal data protection regulations.
It is advisable to verify the coding of selected cases by repeating the process with an independent coder. This provides a means of verifying both the coder's work and the functionality of your coding scheme.
If a series of responses requires more than one field or if the response is very complex (for example a detailed description of one's occupation), it is advisable to apply a coding scheme distinguishing between major, secondary and any possible lower-level categories. The first digit of the code identifies a major category, the second digit can distinguish specific responses within the major categories, etc.
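One way to quantify such a check is to compare the codes of the two coders case by case. The sketch below computes simple percent agreement and, as an addition not prescribed by the text, Cohen's kappa, which corrects agreement for chance. The data and function names are illustrative.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of cases to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same non-empty set of cases")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of the coders' marginal proportions per code.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both coders used a single identical code throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Codes assigned to the same eight cases by two coders (illustrative data).
coder_a = [1, 1, 2, 2, 3, 3, 1, 2]
coder_b = [1, 1, 2, 3, 3, 3, 1, 1]

print(percent_agreement(coder_a, coder_b))       # 0.75
print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.628

# Indices of cases to review together when refining the coding scheme.
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print(disagreements)  # [3, 7]
```

Reviewing the disagreeing cases often reveals ambiguous category definitions rather than coder mistakes.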
The International Standard Classification of Occupations (ISCO) (International Labour Organisation, 2016) is an example of such a hierarchical category scheme. An example of its use is given below.
Consider the following ...
The use of standardised classifications and coding schemes brings many advantages, e.g.:
- Comparability with data from other studies using the same concept;
- Comprehensibility for researchers who work with these concepts.
A disadvantage lies in the necessity to adapt your research intentions in line with the concept of the coding scheme.
Several standardised classification and coding schemes exist that you can use. For coding occupations it is the International Standard Classification of Occupations (ISCO) (International Labour Organisation, 2016), for coding education it is the International Standard Classification of Education (ISCED) (UNESCO, 2011), for geographic territories it is the Nomenclature of territorial units for statistics (NUTS) (Eurostat, 2013), for economic activities it is the Statistical classification of economic activities (NACE) (Eurostat, 2008), for languages it is ISO 639-2 (Library of Congress, n.d.), for diseases it is the International Classification of Diseases (ICD) (World Health Organisation, 2016), etc.
Occupational classifications such as the International Standard Classification of Occupations (ISCO) (International Labour Organization, 2010) are examples of widespread standard coding schemes. ISCO is an example of a hierarchical category scheme.
Occupational information has several dimensions and in questionnaire surveys, these need to be collected in detail. This is, as a rule, done by means of one or more open-ended questions.
The current version, ISCO-08, uses four-digit codes. The table below shows some examples.
2 Professionals |
Source: International Labour Organization (2016).
For an example of a recommended methodology of collection of information on occupations see Ganzeboom (2010).
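The hierarchical logic of a four-digit code can be sketched as follows: each leading prefix of the code identifies one level of the hierarchy, so detailed codes can be aggregated to broader groups without recoding. The function names and the sample codes are illustrative; codes are kept as strings so that the digits are treated as positions rather than as a number.

```python
from collections import Counter


def hierarchy_levels(code, depth=4):
    """Split a hierarchical code into its prefixes, broadest level first.

    For example, "2111" yields ["2", "21", "211", "2111"].
    """
    text = str(code)
    if len(text) != depth or not text.isdigit():
        raise ValueError(f"Expected a {depth}-digit code, got {code!r}")
    return [text[:i] for i in range(1, depth + 1)]


def major_group(code):
    """Return the first digit, i.e. the broadest category of the code."""
    return hierarchy_levels(code)[0]


print(hierarchy_levels("2111"))  # ['2', '21', '211', '2111']

# Illustrative four-digit codes aggregated to their major groups.
occupation_codes = ["2111", "2221", "5120", "2310"]
print(Counter(major_group(c) for c in occupation_codes))
# Counter({'2': 3, '5': 1})
```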
Not all the questions in a questionnaire are answered by all respondents, which results in missing values at the variable level in the data file (so-called item non-response). It is crucial for data integrity to distinguish at least those cases where values are missing because the variable is not applicable to the respondent in question.
Furthermore, it is often useful for analysis to record whether a value is missing because the respondent did not know the answer, refused to answer, simply did not answer, or for some other reason (see the example below). Information on missing values is always an important part of your documentation and promotes the transparency of your research. Bear in mind, however, that your software may limit how many types of missing values you can distinguish in analysis.
It is advisable to establish a uniform system for coding missing values for the entire database. Typically, negative values or values like 7, 8, 9 or 97, 98, 99 or 997, 998, 999, etc. (where the number of digits corresponds to the variable’s format and the number of valid values) are used for numeric coding of missing values. The coding scheme for missing values should prevent overlapping codes for valid and missing values. For instance, whenever the digit zero is used for missing values, we should bear in mind that zero may represent a valid value for many variables such as personal income.
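The rule that missing-value codes must match the variable's format without colliding with valid values can be checked mechanically. The sketch below is one way to do so; the function names and the example variable (age in years, 0 to 120) are assumptions for illustration.

```python
def missing_codes(width, count=3):
    """Return the `count` highest values that fit in `width` digits.

    width=1 gives [7, 8, 9], width=2 gives [97, 98, 99], and so on.
    """
    top = 10 ** width - 1
    return list(range(top - count + 1, top + 1))


def overlapping_codes(valid_values, missing):
    """Return the codes that are used both as valid and as missing values."""
    return sorted(set(valid_values) & set(missing))


# Hypothetical variable: age in completed years, valid range 0 to 120.
valid_ages = range(0, 121)

print(missing_codes(2))                                 # [97, 98, 99]
print(overlapping_codes(valid_ages, missing_codes(2)))  # [97, 98, 99]
print(overlapping_codes(valid_ages, missing_codes(3)))  # []
```

In this example the two-digit codes collide with plausible ages, so the three-digit codes 997, 998 and 999 would be the safer choice.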
Respondents in surveys sometimes do not answer all questions in a questionnaire. It is advisable to distinguish between the various reasons why data are missing (ICPSR, 2012). The following situations are distinguished in survey research (frequently used acronyms are given in brackets):
- No answer (NA): The respondent did not answer a question when he/she should have;
- Refusal: The respondent explicitly refused to answer;
- Don’t Know (DK): The respondent did not answer a question because he/she had no opinion or did not know the information required for answering. As a result, the respondent chose ‘don’t know’, ‘no opinion’ etc. as the answer;
- Processing Error: The respondent provided an answer but, for some reason (interviewer error, illegible record, incorrect coding etc.), it was not recorded in the database.
- Not Applicable/Inapplicable (NAP/INAP): A question did not apply to the respondent. For example, a question was skipped following a filter question (e.g. respondents without a partner did not answer partner-related questions) or some sets of questions were only asked of random subsamples.
- No Match: In this case, data are drawn from different sources, and information from one source cannot be matched with a corresponding value from another source.
- No Data Available: The question should have been asked, but the answer is missing for a reason other than those above or for an unknown reason.
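The distinctions above can be carried into a data file as a uniform set of missing-value codes. The sketch below assigns a hypothetical negative code to each reason and separates valid values from missing ones before analysis. The specific codes are an assumption, and negative codes only suit variables that have no valid negative values.

```python
from collections import Counter

# Hypothetical uniform scheme: one negative code per reason for missingness.
MISSING_REASONS = {
    -1: "No answer",
    -2: "Refusal",
    -3: "Don't know",
    -4: "Processing error",
    -5: "Not applicable",
    -6: "No match",
    -7: "No data available",
}


def split_missing(values, reasons=MISSING_REASONS):
    """Separate valid values from missing-value codes.

    Returns two parallel lists: the values with missing codes replaced by
    None, and the reason label for each case (None where the value is valid).
    """
    valid = [None if v in reasons else v for v in values]
    labels = [reasons.get(v) for v in values]
    return valid, labels


# Illustrative monthly income variable; note that 0 is a valid value.
income = [2500, -2, 0, -5, 3100, -3, -2]
valid, why = split_missing(income)

print(valid)  # [2500, None, 0, None, 3100, None, None]
print(Counter(label for label in why if label))
# Counter({'Refusal': 2, 'Not applicable': 1, "Don't know": 1})
```

Keeping the reason labels alongside the cleaned values preserves the information for documentation even when the analysis software supports only a single missing type.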
Training coders to prevent coder variance
Coders may vary in the way they assign codes to variable values, i.e. each of them applies the same coding scheme in a slightly different way. This results in so-called “coder variance”. Coder variance is a specific source of non-sampling error (i.e. error in addition to the statistical “sampling” error) and may cause systematic deviations in the sample.
Coding of textual information is a complicated cognitive process, and the coder may exert a significant influence on the information that appears in the database, as well as become a source of systematic error. That is why the implementation of complicated coding schemes often requires a theoretically and technically well-founded design, as well as specific coder competencies and training.