4.1 Metadata system(s)
The Integrated Metadata System
The Integrated Metadata System consists of several subsystems: Concepts, Statistical Classifications, Statistical Sources (comprising the components Methodological Documents and Data Collection Instruments and, in future, Administrative Sources and Questions) and Variables.
Fig. 4. Macro Architecture of the Integrated Metadata System
Fig. 5. Conceptual Model of the Statistical Metadata System
Purposes of the system
The main purposes of the integrated statistical metadata system are:
- To support the whole life cycle of surveys;
- To act as a central repository for statistical metadata, serving as a source for other databases that support the design, production and dissemination of statistics, as well as management;
- To establish terminology for statistical metadata;
- To constitute an instrument for statistical harmonisation and coordination of the NSS, standardising the documentation of surveys, among other elements;
- To implement a homogeneous environment for its technological infrastructure.
Fig. 6. Conceptual Model of the Concepts Subsystem
Concept - unit of knowledge created by a unique combination of characteristics (ISO 1087-1:2000, Terminology work - Vocabulary - Part 1: Theory and application).
The concepts and definitions recorded in the database are classified by subject area and organised in glossaries. Each glossary corresponds to a theme in the Official Statistics website.
The main attributes of the concepts are: code, name, definition, notes on the definition and source. Other attributes are required for the management of the system, such as status (proposed, in use, SC-approved) and the dates on which the concept was proposed, came into use and was approved by the SC. It is possible to establish a relationship between two concepts. Of the possible relationship types, synonymy and homonymy have already been implemented.
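The attributes listed above can be pictured as a simple record. The following is a minimal sketch in Python, not the actual database schema: the class and field names, the concept codes and the placeholder definitions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    PROPOSED = "proposed"
    IN_USE = "in use"
    SC_APPROVED = "SC-approved"


class RelationType(Enum):
    # The two relationship types the text reports as implemented.
    SYNONYMY = "synonymy"
    HOMONYMY = "homonymy"


@dataclass
class Concept:
    # Descriptive attributes named in the text.
    code: str
    name: str
    definition: str
    source: str
    notes: str = ""
    # Management attributes named in the text.
    status: Status = Status.PROPOSED
    proposed_on: Optional[date] = None
    in_use_since: Optional[date] = None
    approved_on: Optional[date] = None
    # Each entry is (relationship type, code of the related concept).
    relations: List[Tuple[RelationType, str]] = field(default_factory=list)


def relate(a: Concept, b: Concept, kind: RelationType) -> None:
    """Record a relationship between two concepts on both records."""
    a.relations.append((kind, b.code))
    b.relations.append((kind, a.code))


# Hypothetical concepts; codes and texts are placeholders.
first = Concept("C001", "Household", "Placeholder definition.", "Placeholder source")
second = Concept("C002", "Private household", "Placeholder definition.", "Placeholder source")
relate(first, second, RelationType.SYNONYMY)
```

Storing the relationship on both records is one possible design choice; a separate relationship table would serve equally well and would extend more naturally to further relationship types.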
There is a generic glossary of concepts used throughout statistical activity entitled "Metadata Terminology" and a list of abbreviations and acronyms used in the documentation of surveys.
There are plans to extend the system with other types of relationship, which would make it possible to view the concepts of a particular area in the form of a conceptual system.
As a result of the integration of the different subsystems, the detail page of each concept shows its use in methodological documents, classifications and variables.
The concepts are available on the Official Statistics website, with access from the home page, and can be browsed alphabetically within each glossary. An advanced search has also been implemented that allows several search criteria to be combined.
The concepts registered in the database are being translated into English; 20% of them are already available in English.
The conceptual model of the classifications subsystem was developed on the basis of the Neuchâtel model, a simplified version of which is shown in Figure 7.
Fig. 7. Conceptual model of the Classifications Subsystem
The main purposes of this subsystem are:
- To constitute a reference for the NSS on national, EU and international nomenclatures and classifications used in statistics;
- To constitute an instrument for the harmonisation and coordination of statistical information;
- To constitute a management tool for nomenclatures and classifications.
Essentially, it provides access to three different types of information:
- National and international classifications and their description;
- Code lists (other grouping types);
- Correspondence tables.
Classification family - comprises a number of classifications that are related from a certain point of view (e.g. products, economic activities, countries).
Classification - describes the ensemble of one or several consecutive classification versions. It is a "name" which serves as an umbrella for the classification version(s).
Classification version - a structured list of discrete, exhaustive, mutually exclusive categories defined by codes and designations intended to typify all units of a certain population in relation to a defined property. A classification version has a certain normative status and is valid for a given period of time.
Classification level - a level of aggregation of a classification; all categories at the same level have the same code structure. In a hierarchical classification the items of each level but the highest (most aggregated) level are aggregated to the nearest higher level. A linear classification has only one level.
Classification item - represents a category at a certain level within a classification version or variant.
Correspondence table - relationship between different versions of the same classification or between versions of different classifications.
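The definitions above can be sketched in code. The example below is only an illustration under stated assumptions: the class names, the `aggregate` and `targets_for` helpers and all codes are hypothetical and do not reproduce the subsystem's actual implementation. It shows a hierarchical classification version in which each item below the highest level points to an item at the nearest higher level, and a correspondence table linking codes of two versions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    code: str
    designation: str
    level: int                    # 1 = highest (most aggregated) level
    parent: Optional[str] = None  # code of the item at the nearest higher level


@dataclass
class ClassificationVersion:
    name: str
    items: Dict[str, Item] = field(default_factory=dict)

    def add(self, item: Item) -> None:
        self.items[item.code] = item

    def aggregate(self, code: str, to_level: int) -> str:
        """Follow parent links upwards until the requested level is reached."""
        item = self.items[code]
        if to_level > item.level:
            raise ValueError("cannot aggregate to a more detailed level")
        while item.level > to_level:
            if item.parent is None:
                raise ValueError(f"item {item.code} has no parent")
            item = self.items[item.parent]
        return item.code


@dataclass
class CorrespondenceTable:
    source: str                   # name of the source classification version
    target: str                   # name of the target classification version
    links: Dict[str, List[str]] = field(default_factory=dict)

    def targets_for(self, code: str) -> List[str]:
        """Return the target codes linked to a source code (empty if none)."""
        return self.links.get(code, [])


# Hypothetical three-level version.
version = ClassificationVersion("Demo activities, version 1")
version.add(Item("A", "Agriculture", 1))
version.add(Item("A1", "Crop production", 2, parent="A"))
version.add(Item("A11", "Cereals", 3, parent="A1"))

# Hypothetical correspondence: one source item split into two target items.
table = CorrespondenceTable(
    "Demo activities, version 1",
    "Demo activities, version 2",
    {"A11": ["X01", "X02"]},
)
```

Here `version.aggregate("A11", 1)` walks from the most detailed item up to `"A"`, and `table.targets_for("A11")` returns both target codes, reflecting a one-to-many correspondence.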
This subsystem allows users:
- To consult and export classification versions, respective correspondence tables and indexes, when they exist;
- To consult a set of normalised attributes that characterise each classification version;
- To consult other specific and relevant attributes of certain classification versions;
- To consult documentation related to each classification version;
- To consult variants of a classification version;
- To consult, by date, "floating" classification versions.
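Consulting a classification version by date can be sketched as a filter on validity periods. This sketch assumes that a "floating" version is one whose items carry their own validity dates; that reading, together with the names `DatedItem` and `items_valid_on` and the sample data, is an assumption for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DatedItem:
    code: str
    designation: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the item is still valid


def items_valid_on(items: List[DatedItem], day: date) -> List[DatedItem]:
    """Return the items whose validity period includes the given day."""
    return [
        item
        for item in items
        if item.valid_from <= day
        and (item.valid_to is None or day <= item.valid_to)
    ]


# Hypothetical items of a version that is updated continuously.
items = [
    DatedItem("01", "Area A", date(2000, 1, 1), date(2009, 12, 31)),
    DatedItem("02", "Area B", date(2000, 1, 1)),
    DatedItem("03", "Area C", date(2010, 1, 1)),
]

codes_2005 = [item.code for item in items_valid_on(items, date(2005, 6, 30))]
codes_2012 = [item.code for item in items_valid_on(items, date(2012, 1, 1))]
```

With this sample data, the consultation for mid-2005 yields items 01 and 02, while the one for 2012 yields items 02 and 03.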
The classifications are accessible through the home page of the Official Statistics website.
The conceptual model of the variables subsystem is based on the international standard ISO/IEC 11179, "Information Technology - Specification and Standardization of Data Elements" (Figure 8).
Fig. 8. Conceptual Model of the Variables Subsystem
The variables subsystem provides a database of variables standardised and harmonised with their respective concepts, classifications, explanatory notes and calculation formulae.
The main purposes of the variables subsystem are:
- To support the questionnaire and survey design;
- To improve statistical coordination;
- To support the dissemination of statistical data;
- To assist the definition of normalised and/or harmonised variables;
- To promote comparability of data by using normalised variables.
Variables family - a classification for variables in general to facilitate the search for variables in the system.
Property - characteristic or attribute common to all members of an object class; a property is a concept.
Object class - a set of ideas, abstractions or things in the real world that can be identified with explicit boundaries and meaning, and whose properties and behaviour follow the same rules.
Object classes in this subsystem are:
- Statistical units;
Conceptual variable - a property of an object class described independently from any particular representation.
Representation class - a component of the definition of the variable indicating the type of data it represents (code, ratio, quantity, etc).
Value domain - a set of permissible values and their associated meanings. The value domains may be:
- Categorical (or discrete);
Variable - the smallest identifiable unit of data in this subsystem for which a value domain, a unit of measure, versions and permissible values can be specified.
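The way these definitions fit together can be sketched as follows. This is an illustrative reading of the terms defined above, not the subsystem's data model: the class names, the `permits` helper and the sample variable are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ConceptualVariable:
    # A property of an object class, independent of any representation.
    object_class: str
    property: str


@dataclass
class ValueDomain:
    representation_class: str           # e.g. code, ratio, quantity
    permissible_values: Dict[str, str]  # value -> associated meaning

    def permits(self, value: str) -> bool:
        """Check whether a value belongs to this (categorical) domain."""
        return value in self.permissible_values


@dataclass
class Variable:
    name: str
    conceptual_variable: ConceptualVariable
    value_domain: ValueDomain
    unit_of_measure: Optional[str] = None
    version: str = "1"


# Hypothetical categorical variable; codes and meanings are placeholders.
sex = Variable(
    name="Sex",
    conceptual_variable=ConceptualVariable(object_class="Person", property="Sex"),
    value_domain=ValueDomain("code", {"M": "Male", "F": "Female"}),
)
```

The same conceptual variable could be paired with a different value domain to give another variable, which is the separation between meaning and representation that the definitions describe.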
Statistical indicator - a data element that represents statistical data for a specified time, place, and other characteristics. It consists of a cross-reference between an aggregate variable and classification variables called dimensions. Each indicator has at least two dimensions: time and geography.
Example: Resident population by place of residence, sex and age group.
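The rule that each indicator has at least the time and geography dimensions can be sketched as a validation step. The class names and the `role` labels below are assumptions made for illustration; the example indicator mirrors the one given above, with place of residence taken as the geography dimension and a hypothetical reference period as the time dimension.

```python
from dataclasses import dataclass
from typing import List

REQUIRED_ROLES = {"time", "geography"}


@dataclass(frozen=True)
class Dimension:
    label: str
    role: str  # "time", "geography" or "other"


@dataclass
class Indicator:
    aggregate_variable: str
    dimensions: List[Dimension]

    def __post_init__(self) -> None:
        # Reject indicators that lack the time or geography dimension.
        roles = {dimension.role for dimension in self.dimensions}
        missing = REQUIRED_ROLES - roles
        if missing:
            raise ValueError(
                f"indicator lacks required dimension(s): {sorted(missing)}"
            )


population = Indicator(
    "Resident population",
    [
        Dimension("Reference period", "time"),
        Dimension("Place of residence", "geography"),
        Dimension("Sex", "other"),
        Dimension("Age group", "other"),
    ],
)
```

Creating an indicator with only the "Sex" dimension would raise a `ValueError`, since neither required dimension is present.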
At present, all the statistical indicators disseminated on the Official Statistics website are registered in this subsystem, with complete metadata in Portuguese and English.
Data collection instruments subsystem
The data collection instruments subsystem stores, and publishes through the user interface, all the questionnaires (files still in preparation) that serve as reference data collection instruments for NSS surveys. Images of the questionnaires are also available, together with some of their characteristics, such as frequency and the variables observed.
The main purposes of the data collection instruments subsystem are:
- To constitute a repository of data collection instruments used in NSS surveys;
- To constitute a management tool for collection instruments.
There are basically two types of statistical data collection instruments:
I. General characterization
Code / Version / Approval date
Statistical activity / Statistical domain
Relation with EUROSTAT / other entities
Type of survey
Type of data source
Begin / end date
II. Methodological characterization
National and international recommendations
VI. Data collection instruments
VII. Abbreviations and acronyms
Fig. 10. Standard format of the methodological document
Fig. 11. Conceptual Model of the Methodological Document Subsystem
Survey - a statistical activity belonging to a predefined statistical method and involving the collection, processing, refinement, analysis, study and dissemination of data on the characteristics of a population. Four basic types of surveys are considered: sample survey, census, analytical study and statistical study.
Questionnaire - an identifiable instrument containing questions designed to collect data from respondents.
Method - a structured approach to solving a problem.
This entity contains the characterisation of methods of collecting data, designing samples, allocating answers and estimating and calculating errors, among others.
Universe - all the elements (people, entities, objects or events) with a given common characteristic.
Sampling frame - a list of units belonging to a given population, used to select samples. A sampling frame must be characterised by its design methodology, updating system and quality control.
Sample - a subset of a population or universe.
Fig. 12. Interaction between metadata subsystems and the life cycle of statistical operations
I - Inserted
C - Consulted
4.2 Costs and Benefits
The whole system was implemented in-house, with one exception: the prototype system for the management of methodological documents, which was outsourced. This approach reduces costs and makes the system easier to maintain. On the other hand, since there are not enough IT technicians to meet all the agency's needs, implementation takes longer.
Since 2003, an average of three IT technicians per year have been assigned to the development of the statistical metadata system.
4.3 Implementation strategy
The general lines of the different subsystems were presented to and approved by the Board and the Council of Directors in May 2002; the subsystems were then detailed and implemented. The detailed specification of each subsystem defined its information requirements, user interfaces, uploading and updating procedures, rules on content and plans for the use of existing information.
Implementation priorities are defined on the basis of the institution's needs.
After the general lines of the metadata system mentioned in point 1, "Metadata Strategy", were approved, the system was implemented as follows:
- We studied the implementation of metadata systems by other statistical institutes, such as that of Statistics Canada (2002-2004).
- We defined the system's conceptual model to integrate its different components.
- An existing subsystem of statistical concepts implemented in 1994 was initially thought to be appropriate.
- We implemented a classification subsystem (2003-2006).
- We defined a standard format for methodological documents in surveys (2003-2004), which was approved by the Statistical Council for documenting all NSS surveys (2005).
- We implemented a prototype subsystem to store methodological documents (2003-2004).
- We reformulated a questionnaire management subsystem implemented in 1997 (2006-2007).
- We implemented the variables subsystem (2004-2007).