CPA2.1 Data Acquisition and Ingest

Purpose

The main purpose of this process area is to acquire, appraise and select data and metadata based on a set of defined criteria, i.e. to identify the data and information that the repository will preserve [ISO 16363]. The process area also contains the services and functions that accept data from a data producer/depositor and transfer (ingest) the data into the archive or repository for long-term preservation, to ensure that a high-quality data package can be served to users.

An established set of criteria will aid the repository in determining the types of information that it is willing, or required, to accept. This is necessary in order to make it clear to funders, depositors, and users what responsibilities the repository is taking on and what aspects are excluded. It is also a necessary step in defining the information which is needed from the information producers or depositors [ISO 16363]. During ingest, quality checks are performed on the data and documentation, and information (value) is added that is relevant for the preservation, discovery, and reuse of the data.

CC2.1: Capability Completeness of Data Acquisition and Ingest

  1. Initial: There is some awareness of the need for planning, but this is not fully described or communicated. There is an intent to meet the specific objectives but there is no evidence of a procedure in place to do so.
  2. Partial: There is at least one required activity that is not present or not at a repeatable level. There is evidence that the direction the organisation is taking will lead to a complete capability in this area.
  3. Complete: All required activities are shown to be present in the organisation and at least at a defined level.

 

SO2.1.1: Define content coverage

To establish, maintain and communicate criteria that aid in determining the types of information that the repository is willing, or required, to accept.

RA2.1.1.1: Methods for acquisition and selection of data

The repository has a methodical approach to identify and communicate with potential data depositors.

(0) Not defined:

No methodical approach.

(1) Initial:

Some ad hoc approaches towards relevant communities; no procedures for approaching and communicating with potential data depositors; no agreements with government, organisations or other institutions.

(2) Repeated/partial:

Regular contacts with some communities and users are established, but formalised agreements to support the acquisition and selection methods are lacking.

(3) Defined:

Formal agreements in place, either with government, funders or research institutions (e.g. funded researchers are obliged to deposit data at the repository).

(4) Managed:

Agreements and methods are regularly reviewed and updated, and are monitored for compliance with policies, processes and procedures.

(5) Optimised:

Contact and communication methods/mechanisms are regularly reviewed based on surveys or other outreach mechanisms towards Designated Community and other relevant stakeholders.

 

RA2.1.1.2: Documentation/Metadata requirements

The repository clearly specifies the information (documentation, metadata) that needs to be associated with the data that is to be deposited. This is necessary so that the deposited information can be interpreted and re-used. [maps to: Annex 2, section 1]

(0) Not defined:

Not specified; no awareness.

(1) Initial:

There is some awareness of the documentation and metadata that is needed for deposit, but it is not formalised; information communicated to users/depositors on an ad hoc basis.

(2) Repeated/partial:

Documentation and metadata requirements implicitly defined by acquisition and deposit activities and routines; no formal or explicit requirements in policy or other written documents.

(3) Defined:

A written formal specification of required information is explicitly defined (e.g. in a collection policy); requirements are compliant with metadata standards that are used and can be understood by Designated Community (e.g. DDI); metadata requirements are accessible and communicated to users/depositors.

(4) Managed:

Documentation and metadata requirements are aligned with policies and other processes and procedure documents. There are regular reviews and assessments (of success) of the information requirements.

(5) Optimised:

Regular reviews and updates of requirements based on technology watch, monitoring of, and communication with Designated Community and other relevant stakeholders.
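
As an illustration of the written specification described at the Defined level of RA2.1.1.2, the following minimal sketch checks a deposit's metadata record against a list of required fields. The field names and the `missing_fields` helper are hypothetical and are not taken from DDI or any other standard; a repository would derive its own list from its collection policy.

```python
# Hypothetical required-metadata specification; field names are illustrative only.
REQUIRED_FIELDS = ("title", "creator", "description", "date_of_collection")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a metadata record."""
    missing = []
    for field_name in REQUIRED_FIELDS:
        value = record.get(field_name)
        # Treat None and whitespace-only strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field_name)
    return missing


if __name__ == "__main__":
    deposit = {"title": "Household survey", "creator": "A. Depositor", "description": " "}
    print(missing_fields(deposit))
```

A non-empty result could be communicated back to the depositor before the deposit is accepted.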

RA2.1.1.3: Collection policy

The repository uses a collection policy to address and guide the acquisition, selection and ingestion of data and metadata.

(0) Not defined:

Not applicable, or there is no awareness of a collection policy.

(1) Initial:

There is no collection policy; acquisition and ingestion are performed on an ad hoc basis.

(2) Repeated/partial:

A collection policy is not formally defined, but there are some repeatable procedures in place; acquisition and ingestion follow a regular pattern - they have developed to the stage where similar procedures are followed by different people undertaking the same task.

(3) Defined:

A collection policy is defined and it is connected to specific processes and procedures.

(4) Managed:

The collection policy is monitored and measured for compliance with processes and procedures; actions are taken where processes appear not to be working effectively or not to be in accordance with the policy.

(5) Optimised:

At regular intervals processes and procedures are measured and assessed; processes, functions and mechanisms are under constant improvement and continuously integrated into the collection policy.

SO2.1.2: Receive submission

The repository has in place functions that provide the appropriate storage capability or devices to receive a deposit/submission from a depositor/data producer.

RA2.1.2.1: Systems for submission

The repository has in place a system (interfaces, devices) that can properly deal with the submission of information from a depositor.

(0) Not defined:

No devices or interfaces for data depositing are in place.

(1) Initial:

Information is being deposited on an ad hoc basis, with case-by-case agreements; no written procedures in place; deposits are accommodated to the individual depositor; information is deposited without considering formats, data and metadata quality, virus checks, integrity/authenticity, etc.

(2) Repeated/partial:

Some devices/interfaces for deposit in place; to a large extent dependent on personal contact between repository and depositor (guidance); minor automation.

(3) Defined:

A complete set of devices/interfaces for information deposits is in place; written processes and procedures are in place; deposit procedures are to a large extent automated.

(4) Managed:

Deposit systems are monitored and measured; adjustments to procedures are made accordingly.

(5) Optimised:

Deposits are monitored and measured; regular contact with user communities; deposit devices, interfaces and procedures are adjusted and optimised based on technology watch, monitoring of, and communication with Designated Community and other relevant stakeholders.

 

RA2.1.2.2: Authentication and authorisation

The repository uses AAI (Authentication and Authorisation Infrastructure) or other direct or federated user authentication approaches to appropriately verify the identity of the depositor. [maps to: Annex 2, section 2]

(0) Not defined:

No authentication approach in place.

(1) Initial:

There is some awareness of the need to implement authentication approaches; deposit identity control is performed on an ad hoc, case-by-case basis.

(2) Repeated/partial:

An authentication infrastructure is emerging through repeated use of authentication approaches, but it lacks standardisation and formalisation.

(3) Defined:

The organisation uses digital identities, identity management, authentication and authorisation to control deposits and deposit identities; AAI is formalised, systematised and documented.

(4) Managed:

All AAI mechanisms and functions are measured, assessed and regularly reviewed and updated.

(5) Optimised:

All AAI mechanisms and functions are monitored and measured; there are systemised reviews and updates of AAI systems based on technology watch and formal, regular communication with Designated Community and other relevant stakeholders.

 

RA2.1.2.3: Requests of provenance information

The repository has mechanisms and procedures in place to appropriately verify the source/ownership of the information that is being deposited. The provenance information supports proper citation of the data when it is reused.

(0) Not defined:

No mechanisms; no awareness of provenance information.

(1) Initial:

Awareness of the need to collect provenance information; requests and collection are done in an unsystematic and ad hoc manner; no agreements or other formal documentation are in place.

(2) Repeated/partial:

Provenance information requested repeatedly and consistently, but no formalised procedures or mechanisms in place.

(3) Defined:

For all deposits there are requirements for provenance information; mechanisms, documents, and formalised agreements are in place; processes and procedures are documented.

(4) Managed:

Mechanisms for requesting and providing provenance information are regularly reviewed and updated.

(5) Optimised:

Mechanisms for requesting and providing provenance information are monitored and measured; there are systemised reviews and updates of provenance systems based on technology watch and formal, regular communication with Designated Community and other relevant stakeholders.

RA2.1.2.4: Citations

The repository offers and provides functions and mechanisms for proper data citations.

(0) Not defined:

No citation practices are evident.

(1) Initial:

Citations are offered when requested by the depositor; ad hoc and case-by-case approach; no practices or strategies are written down.

(2) Repeated/partial:

Citation practices are repeated and offered regularly, but lack formalisation and systemisation.

(3) Defined:

Citations are required and offered to all depositors; formalised through templates or other written documents; processes and procedures are documented.

(4) Managed:

Citation mechanisms are regularly reviewed and updated.

(5) Optimised:

Mechanisms for requesting and providing citation are monitored and measured; there are systemised reviews and updates of citation mechanisms based on technology watch and formal, regular communication with Designated Community and other relevant stakeholders.
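
The citation template mentioned at the Defined level of RA2.1.2.4 could be as simple as the following sketch. The element order, the `format_citation` function and its parameters are assumptions made for illustration; this document does not prescribe a citation style.

```python
def format_citation(creators: list, year: int, title: str,
                    publisher: str, identifier: str) -> str:
    """Build a data citation string from a fixed, hypothetical template."""
    # Multiple creators are separated by semicolons in this illustrative style.
    names = "; ".join(creators)
    return f"{names} ({year}). {title}. {publisher}. {identifier}"


if __name__ == "__main__":
    print(format_citation(["Doe, J."], 2020, "Survey data",
                          "Example Repository", "example-id-001"))
```

Generating the string from stored metadata, rather than typing it by hand, is one way to offer the same citation form to every depositor.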

 

RA2.1.2.5: Conditions placed on content, deposit licenses

The repository has in place mechanisms and functions that allow the depositor to place access conditions on the information that is being deposited.

(0) Not defined:

No awareness; no mechanisms/procedures in place.

(1) Initial:

Some awareness of the issue; conditions of use are set up and agreed upon when required by the depositor, but most deposits are done without any conditions on use.

(2) Repeated/partial:

Most depositors are offered an opportunity to define conditions of use on the information they deposit, but conditions are not formally defined; no template or set of predefined categories.

(3) Defined:

All depositors are offered the opportunity to set conditions of use on the information that is being deposited; a set of access conditions is formally defined in categories or a template.

(4) Managed:

Regular review and updates of set of conditions; aligned with high level policies.

(5) Optimised:

Regular review and updates of set of conditions; formalised feedback mechanisms and cooperation with user groups and other relevant stakeholders (e.g. how to deal with funder policies that have open access requirements).

 

RA2.1.2.6: Legal transfer of custody; agreements on rights/responsibilities

The repository has in place agreements that confirm the legal transfer (or other consensual agreements) of the information that is being deposited; the agreement includes a clear definition of the roles and responsibilities of repository and depositor (i.e. the repository legally takes control of the deposited material so that it can make the necessary changes to the data to prepare it for long-term storage and to distribute it to its consumers). The contractual and legal regulations make sure that the deposited material does not infringe any intellectual property rights (IPR) of any other person(s) or institution(s) (e.g. it is not derived from a licensed or commercial product).

(0) Not defined:

No agreements on legal transfer of custody.

(1) Initial:

Awareness of legal custody, but no formal or written agreements in place; some agreements are made ad hoc, on an individual basis; some/most deposits are made without any kind of agreements on legal transfer of custody.

(2) Repeated/partial:

Most deposits are made with agreements in place, but adjustments are made on an individual basis when needed. Agreements are not formalised into ‘templates’ that are used regularly and systematically.

(3) Defined:

Formal, written agreements and contracts in place; responsibilities and legal transfer of custody clearly defined. Contractual templates are being used consistently; legal and contractual framework is regularly reviewed and updated; all legal and contractual regulations are aligned to higher level policies; roles and responsibilities are identified and maintained.

(4) Managed:

Monitoring of the usage of agreements and contracts; actions are taken where contracts/agreements appear not to be working effectively or are not in accordance with higher level policies; reviewed and updated regularly.

(5) Optimised:

The usage and success of transfers and agreements are continuously assessed; monitoring of wider legal framework (e.g. national and EU regulations); regular and formalised contact with relevant stakeholders.

RA2.1.2.7: Receipt or request to resubmit

The repository has in place mechanisms/procedures that send a receipt/confirmation of deposit to the depositor; if quality assurance (see below) shows errors or insufficiency in the deposited material, the repository requests a resubmission.

(0) Not defined:

No awareness; no mechanisms/procedures in place.

(1) Initial:

Some awareness; some depositors receive receipts when requested; no procedures for contacting the depositor to request resubmission when errors are identified.

(2) Repeated/partial:

Most depositors receive receipts/confirmation of deposit; attempts are made to correct errors and insufficiencies by contacting the depositor, but there are no written procedures for depositor contact and/or resubmission.

(3) Defined:

Written procedures and formalised systems for receipts and requests to re-submit are in place. The system for confirmation/receipts may be partly automated.

(4) Managed:

Receipts and confirmation systems are regularly reviewed and updated. The receipt system may be partly or fully automated.

(5) Optimised:

Receipts and confirmation systems are measured, monitored and updated regularly based on systematised user feedback mechanisms. The receipt and control system may be partly or fully automated.
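
The receipt-or-resubmit decision described for RA2.1.2.7 can be sketched as follows. The `DepositResponse` type, the `respond_to_deposit` function and the message wording are hypothetical; the list of problems is assumed to come from the quality assurance checks under SO2.1.3.

```python
from dataclasses import dataclass, field


@dataclass
class DepositResponse:
    deposit_id: str
    accepted: bool
    message: str
    problems: list = field(default_factory=list)


def respond_to_deposit(deposit_id: str, qa_problems: list) -> DepositResponse:
    """Issue a receipt when QA found no problems, otherwise request resubmission."""
    if qa_problems:
        return DepositResponse(
            deposit_id=deposit_id,
            accepted=False,
            message="Please correct the listed problems and resubmit.",
            problems=list(qa_problems),  # copy so the caller's list is not shared
        )
    return DepositResponse(deposit_id=deposit_id, accepted=True,
                           message="Deposit received.")
```

Returning a structured response rather than free text makes it easier to automate the receipt system, partly or fully, at the higher maturity levels.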

 

SO2.1.3: Quality Assurance

The repository has quality control checks to ensure the safety, completeness and understandability of data and metadata deposited.

RA2.1.3.1: Completeness and correctness

The repository has processes and mechanisms in place which verify the deposited material for completeness and correctness. That is, the repository inspects data files to ensure that variables and values are accurate according to the documentation supplied; that variables and values are sufficiently labelled for reuse; and that variable names in a dataset match variable names in a codebook. Data can also be checked for completeness and correctness using checksums (or similar approaches). This is necessary in order to detect and correct errors in the material and/or potential transmission errors between the depositor and the repository.

(0) Not defined:

No awareness; no inspections of data. Data deposited without data inspection.

(1) Initial:

Some awareness of data inspection; there are non-systematised (manual) checks of deposited material (e.g. checked for basic metadata completion; visual checks of data, etc.), but the procedures are ad hoc. There are no written procedures or process documents and there is a lack of mechanisms for detecting technical errors in deposit/transmission. There are no formalised processes and procedures for rectifying data and/or metadata.

(2) Repeated/partial:

There are non-systematised (manual) checks of deposited material in place; processes and procedures are repeated, but they are not formalised or documented. Rectifications are performed repeatedly, either by the repository or by returning data to depositor.

(3) Defined:

Systemised checks of all data and metadata are in place; procedures and processes defined and formalised in written documents. Functions and mechanisms are in place for rectifying data and metadata, including processes for contacting and/or returning data to depositor when necessary. Some processes may be automated.

(4) Managed:

All completeness and correctness checks and modifications are measured and registered; processes and procedures are reviewed and updated regularly. Automated processes are implemented where appropriate.

(5) Optimised:

All completeness and correctness checks and modifications are measured, registered and assessed regularly; processes and procedures are reviewed and updated regularly based on technology watch and formalised communication with Designated Community.
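
Two of the checks described for RA2.1.3.1 lend themselves to automation: comparing a checksum supplied by the depositor with one computed on receipt, and comparing variable names in a dataset with those in its codebook. The sketch below uses SHA-256 from Python's standard `hashlib` module. The choice of algorithm and all function names are assumptions, since this document refers only to "checksums (or similar approaches)".

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, expected_hex: str) -> bool:
    """Return True if the file's digest equals the depositor-supplied value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


def compare_variable_names(dataset_vars: list, codebook_vars: list) -> dict:
    """Report variables that appear in only one of dataset and codebook."""
    dataset_set, codebook_set = set(dataset_vars), set(codebook_vars)
    return {
        "missing_from_codebook": sorted(dataset_set - codebook_set),
        "missing_from_dataset": sorted(codebook_set - dataset_set),
    }
```

A checksum mismatch points to a transmission error, whereas a variable-name mismatch points to an error in the material itself; both would normally be recorded and, where necessary, referred back to the depositor.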

RA2.1.3.2: Authenticity checks of deposited material

The repository has adequate specifications enabling the recognition (and parsing) of the material that is to be deposited. The repository must be able to determine what the contents of the deposited material are with regard to the technical construction of its components (its authenticity).

(0) Not defined:

No specifications; no awareness of authenticity checks.

(1) Initial:

Awareness of the need to check the authenticity of the deposited material; some manual checks performed, but in an unsystematic/informal and ad hoc manner.

(2) Repeated/partial:

Material is checked for authenticity on a regular basis, following a regular pattern, but mostly manually and individually; there are no formalised process descriptions or procedures in place.

(3) Defined:

Deposited material is checked for authenticity; checks are done systematically and in an automated/semi-automated way; processes and procedures are formalised in written documents; documentation of the validity and authenticity of the object/material construction is provided.

(4) Managed:

Processes and procedures are aligned to strategies and policies; authenticity checks are regularly reviewed and updated; authenticity checks are highly automated.

(5) Optimised:

Checks and procedures are measured and regularly updated; processes and procedures are aligned to technology watch routines; full automation where possible.
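
One common way to recognise the technical construction of a deposited file, as RA2.1.3.2 requires, is to inspect its leading bytes rather than trusting the file extension. The sketch below checks three well-known signatures (PDF, PNG and ZIP). The signature table is deliberately tiny and the `identify_format` and `extension_agrees` functions are hypothetical; a production repository would more likely rely on a dedicated format identification tool and registry.

```python
from typing import Optional

# Leading-byte signatures for a few widely used formats.
# Note: ZIP-based formats (e.g. office documents) share the ZIP signature.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known signature."""
    for signature, label in SIGNATURES.items():
        if data.startswith(signature):
            return label
    return None


def extension_agrees(filename: str, data: bytes) -> bool:
    """Check that the file extension matches the format found in the content."""
    detected = identify_format(data)
    return detected is not None and filename.lower().endswith("." + detected)
```

A file whose extension disagrees with its detected format, or whose format cannot be identified at all, would be a candidate for manual inspection.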

 

RA2.1.3.3: Quality control standards and reporting mechanisms

There are quality control standards and reporting mechanisms in place to include details of how any data and metadata issues are resolved (e.g. data/information returned to the data provider for rectification, fixed by the repository, noted by quality flags in the data file, and/or included in the accompanying metadata.)

(0) Not defined:

No quality controls in place.

(1) Initial:

No formal criteria exist; standards are applied inconsistently. There are no reporting mechanisms in place.

(2) Repeated/partial:

Quality control reports are produced repeatedly on an individual basis, but there is a lack of coordination, and processes and mechanisms are not fully documented. Issues are resolved ad hoc.

(3) Defined:

Formal quality control standards and reporting mechanisms are in place, and all issues are logged and resolved.

(4) Managed:

There are regular assessments and reviews of the quality control standards and reporting mechanisms; automation is implemented where appropriate.

(5) Optimised:

All controls and reports are measured and regularly reviewed; processes and procedures are aligned to technology watch routines and changes are implemented accordingly; full automation where possible.
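
The reporting mechanism described for RA2.1.3.3 implies that every issue is logged together with how it was resolved. A minimal sketch of such a log follows. The `Resolution` categories mirror the examples given in the activity description, while the class, field and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Resolution(Enum):
    RETURNED_TO_PROVIDER = "returned to data provider for rectification"
    FIXED_BY_REPOSITORY = "fixed by the repository"
    QUALITY_FLAG = "noted by quality flag in the data file"
    NOTED_IN_METADATA = "included in the accompanying metadata"


@dataclass
class QualityIssue:
    deposit_id: str
    description: str
    resolution: Optional[Resolution] = None  # None means the issue is still open


def open_issues(log: list) -> list:
    """Return the logged issues that have no recorded resolution yet."""
    return [issue for issue in log if issue.resolution is None]
```

Counting open and resolved issues over time is one simple measure that could feed the regular reviews expected at the Managed and Optimised levels.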