3.1 Metadata Classification
The essence of Stats SA's meta-information system is captured by how the organisation uses its metadata. Metadata is used internally to enable statistical production processes; that is, metadata serves as essential input at various stages of statistical production. The production processes, in turn, also produce metadata. This metadata is important for documenting the trail of activities during the statistical production process, and that documentation informs related issues such as the assessment of data quality and its interpretation.
Categories of Metadata
Because of this diversity of metadata usage, it was decided that the contents of the meta-information system should be aligned with these usage activities. The natural progression of this decision was a project to classify all of the organisation's metadata. The following categories of metadata were adopted by Stats SA:
- Survey Metadata
Often referred to as dataset metadata, survey metadata is used to describe, access and update datasets and data structures. Stats SA chose to call this type of metadata survey rather than dataset metadata because some of the metadata, such as information about "the population which the data describe", refers to the broader aspects of the survey, not only the dataset.
- Definitional Metadata
This is metadata describing the concepts used in producing statistical data. These concepts are often encapsulated in the measurement variables used to collect statistical data. Descriptive text is used to define individual concepts; the concepts are further grouped into logical topics. These main topics are effectively classifications of data. Hence, Stats SA's package of definitional metadata includes classifications drawn from different study domains.
- Methodological Metadata
These metadata relate to the procedures by which data are collected and processed. They may include sampling, collection methods, editing processes, etc.
- System Metadata
System metadata refers to active metadata used to drive automated operations. Some examples of system metadata are:
- Publication or dataset identifiers
- Date of last update
- File size
- Mapping between logical names and physical names of files
- Dataset input flows
- Access methods to databases
- Coordinates as kept in the metadata store
- Table and column definitions, schema, and mappings of data
- Operational Metadata
This is metadata arising from, and summarising the results of, implementing the procedures. Examples include respondent burden, response rates, edit failure rates, costs, and other quality and performance indicators.
The different components of Stats SA's meta-information system are logically grouped according to these categories of metadata. This means that the database for the meta-information system has different data structures corresponding to these metadata categories. We have recently (June 2007) finished developing the first metadata component, the survey metadata capturing tool, which is the subject of this case study.
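The grouping above can be sketched as one record type per metadata category, mirroring the separate data structures in the meta-information database. This is an illustrative sketch only: the field names are assumptions, not Stats SA's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch: one record type per metadata category.
# All field names are assumptions, not Stats SA's actual schema.

@dataclass
class SurveyMetadata:            # describes the survey and its datasets
    title: str
    target_population: str
    keywords: list = field(default_factory=list)

@dataclass
class SystemMetadata:            # active metadata driving automated operations
    dataset_id: str
    physical_file: str           # mapping from logical name to physical file
    last_updated: str

@dataclass
class OperationalMetadata:       # summarises results of running the procedures
    response_rate: float
    edit_failure_rate: float

# Each category maps to its own structure in the meta-information database.
record = SurveyMetadata(title="Quarterly Labour Force Survey",
                        target_population="Persons aged 15-64",
                        keywords=["employment", "labour"])
```

The point of the separation is that passive, descriptive metadata (survey, definitional) and active metadata (system) can evolve independently while still referring to the same survey.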
How Metadata Fit into Other Organisational Systems
As already stated, the development of Statistics South Africa's metadata management system (Meta-Information system) is part of a larger system, the ESDMF. The central components of the ESDMF will follow the completion of the meta-information system, because the ESDMF is driven by the metadata. Although the ESDMF is a new system, it is merely a means to centralize the organisation's disparate statistical information systems. Figure 6 below shows the conceptual ESDMF subsystems and how they are placed relative to other organizational subsystems. The metadata subsystem supports the entire statistical cycle.
Figure 6: Conceptual components for the ESDMF in relation to other subsystems
3.2 Metadata used/created at each phase
Metadata are used and/or produced in each phase of the statistical value chain. This strong link between the SVC and metadata informs the development of the metadata subsystem.
Stats SA's Statistical Value Chain
Statistics South Africa's core areas, i.e., those divisions in the organization responsible for the production of statistics, have up to now operated using different approaches. Although it is generally understood in the organization that there are many commonalities in the way different divisions conduct their work, no attempt has been made to formalize a standard statistical production process for the entire organization. The development of the SVC for the organization is a move to correct this situation. The SVC is a generalisation of the activities that need to take place from the beginning to the end of a statistical production process.
Stats SA envisions its statistical cycle along the lines of Michael Porter's Value Chain Model, which Porter explained in his 1985 book, "Competitive Advantage: Creating and Sustaining Superior Performance". Hence we refer to our statistical cycle as the SVC. The value chain categorizes the value-adding activities of an organization. Figure 7 below is a schematic diagram of the main phases of Stats SA's SVC.
Figure 7: High level phases of Stats SA's Statistical Value Chain
The SVC was designed to be general, catering for most scenarios of statistical production. For example, not all phases of the value chain will be used by all surveys. Figure 8 below shows a flowchart of statistical production within the context of the SVC. It can be seen that old frequent surveys might not follow the same path as new frequent or once-off surveys.
Figure 8: Flowchart of a statistical production using phases of the SVC
A high level description of the main phases of the SVC was given in section 2.4 above. In this section we give a detailed view of the activities involved in each phase.
The Need phase consists of the following activities:
Determine the need
The objectives and purpose of the particular survey or research must be defined. This starts with interviews with the organisation or individual(s) requesting the new survey. It is an iterative process that concludes with a statement of need.
Determine Information Requirements
A need for a survey or study is triggered by requirements for information that solves a given problem. A clear determination of the nature and extent of this information or data is needed. This is done through consultations with domain experts from the community in need of the information.
Develop Budget and Plan
Similar to any project that requires resources, a statistical production project has to have a cost-benefit analysis as a foundation of its business case. During this phase, only a high level plan is produced.
Obtain Financial Support
Generally, Stats SA's projects are large and critical; thus they need substantial financial investment. Because the government pays for them, an intensive process of budget approval has to be undertaken to ensure accountability.
Stats SA projects are funded by the National Treasury under the Ministry of Finance. For large projects to go ahead, ministerial approval is required.
The following activities are contained in the Design phase:
Develop Detailed Project Plan
The output of the Need phase consists of high level aspects of the proposed survey. All Stats SA's surveys must go through detailed planning. For new priority projects, the responsibility for such planning lies with the organisation's Programme Office. The Programme Office has the overall responsibility for running the project to completion, after which, the future running of the project (in the case of frequent surveys) is handed over to the survey area.
Develop Survey Methodology
The goal of the survey methodology is to ensure that the statistics collected during the survey are reliable and representative of the survey's target population. For existing surveys, the survey methodology is often already in place. For new and re-engineered surveys, new survey methodologies are developed.
Design and Test Questionnaires
Questionnaire design is aimed at ensuring that the required information from a survey is realized. It consists of getting both the content and the layout of the questionnaire correct. The process iterates between constructing survey questions and testing whether the responses to those questions address the problem the survey is intended to solve. Questionnaire testing is initially done "behind-the-glass", during which employees of the organisation are randomly selected for participation. Thereafter, pilot tests are conducted in the field with small population groups in the same way the actual survey will be conducted.
Design Operational Requirements
Survey operations are concerned with the tasks of getting data from respondents or other data sources. Operational requirements must detail all the technical and logistical issues that need to be sorted out in order to have a successful survey. These vary from resource issues to technologies needed to conduct the survey.
Design Computer System
The system to be used during the statistical production process consists of many related sub-systems that may be implemented through computer technology. Data collected during a statistical survey are captured in a computer system for processing. A number of technologies are required to ensure that data are moved from their sources of collection to the computer.
Activities contained in the Build phase are as follows:
Build a Collection Vehicle
Stats SA collects statistical data through one of the following survey methods:
- Sample survey using questionnaires
- Administrative surveys, using IT communications methods to access data stored in other organisations' databases.
Building a collection vehicle consists of ensuring, through building customised solutions or procuring existing ones, that all the necessary infrastructure and items for conducting the survey are in place.
Build a Technology Solution
A technology solution should include all the technological components required to support the entire SVC. These may include hardware such as scanners and Optical Character Recognition (OCR) tools for capturing questionnaire-based data, database management systems, data analysis tools and information dissemination tools.
Test Technology Solution
Before a technology solution is put into production, it must be tested by the prospective users. This ensures that the functionality required by the users is included in the system. Issues concerning ease of use and integration of systems are also addressed. At a technical level, this testing may identify system bugs that were missed during the technical tests done by the developers.
Implement Technology Solution
The implementation of a solution means that it is deemed ready to perform productive work. Users are therefore trained on how to use the system, and thereafter certain people are granted access rights to it.
Contained in the Collect phase are the following activities:
Enumerators must be highly trained so that they are able to explain to respondents:
- the reasons for collecting the data, how the respondents were chosen to be part of the survey, and how the information is planned to be used to improve the functions of the agency and standards of living;
- whether responses to the collection of information are voluntary or mandatory (citing authority: Statistics Act);
- the nature and extent of confidentiality to be provided (citing authority: Statistics Act);
- an estimate of the average respondent burden, together with a request that the public direct to the agency any comments concerning the accuracy of this estimate and any suggestions for reducing the burden.
Respondent management must be done in ways that reduce the burden of the survey on the respondent. Burden reduction includes keeping re-visits to respondents to a minimum and keeping the questionnaire to a reasonable length.
When a survey is conducted by enumerators visiting respondents, the respondents must be notified by Stats SA about the pending survey. This notification must include information such as the objective of the survey, the date(s) when the enumerators will be visiting, etc. Post Out refers to the process of notifying respondents by sending letters via the post detailing this information. Administrative data does not have this requirement, though legal arrangements (e.g. Memoranda of Agreement, Service Level Agreements) are put in place in advance so that the other party is able to provide the data.
Data acquisition at Stats SA includes both direct methods (e.g. sample surveys and censuses) and administrative methods. In most direct acquisitions, data are captured on paper-based questionnaires. In a few other cases, electronic media may be used. Figure 9 below shows a flowchart of how Stats SA acquires its data.
Close off Collection
The collection period is usually specified at the design stage of the survey. The end of the last day of the defined collection period automatically ushers in the closure of field data collection.
The Process phase consists of performing the following activities:
Capturing Data into Electronic Form
This applies only to questionnaire based collection methods. Questionnaires are either scanned or manually entered by data capturers into computer databases. Data collected from other electronic systems might only need to be transformed into Stats SA's data formats.
Perform Macro Edits
Macro edits detect individual errors through: (1) checks on aggregated data, and (2) checks applied to the whole body of records. The checks are typically based on models, either graphical or numerical, that determine the impact of specific fields in individual records on the aggregate estimates.
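As an illustration of such aggregate-level checks, a macro edit might flag any record whose contribution dominates the aggregate estimate. This is a minimal sketch: the field name, threshold and data are illustrative assumptions, not a Stats SA procedure.

```python
# Hypothetical macro edit: flag records whose share of the aggregate total
# exceeds a chosen threshold (here 20%), since such records dominate the
# estimate and deserve individual review.
def macro_edit_flags(records, field="turnover", threshold=0.20):
    total = sum(r[field] for r in records)
    flagged = []
    for r in records:
        impact = r[field] / total if total else 0.0
        if impact > threshold:      # record has outsized impact on the aggregate
            flagged.append((r["id"], impact))
    return flagged

records = [{"id": 1, "turnover": 100},
           {"id": 2, "turnover": 120},
           {"id": 3, "turnover": 900}]   # record 3 dominates the total
print(macro_edit_flags(records))         # only record 3 exceeds the threshold
```

A reviewer would then inspect the flagged records individually before the aggregate estimates are published.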
Item non-response may result in missing values in a survey dataset. Statistical organizations use imputation methods to calculate estimate values to fill in the missing values. Imputation is implemented using mathematical algorithms through computer programs.
Estimation of missing values should not be confused with the overall statistical estimates which form the main goal of a survey. Statistical estimates are calculated by aggregating all of the collected data. These are often called macro data, and are contrasted with micro data, which are detailed data collected from the respondents.
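A minimal sketch of item non-response imputation, using mean imputation as one of the simpler algorithms. The method choice and variable are assumptions for illustration; statistical agencies typically use more sophisticated methods such as hot-deck or regression imputation.

```python
# Mean imputation: replace missing item values (None) with the mean of the
# observed values for that item.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [v if v is not None else mean for v in values]

incomes = [2500, None, 3100, 2800, None]   # None marks item non-response
print(impute_mean(incomes))                # missing entries become 2800.0
```

Note that imputed values only fill gaps at the micro-data level; the macro-level statistical estimates are still computed afterwards by aggregating the completed dataset.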
The primary output of the processing is "clean" datasets that are ready to be analysed. Analysis tools can only process data whose format and structure they understand. Part of producing datasets is packaging them into structures and formats that conform to Stats SA's analysis packages.
Statistical data analysis consists of the following activities:
Produce Statistical results
This is the process where results are produced based on the processing that was done on the data. The ultimate goal of any survey is to produce statistical estimates of the characteristics of the statistical unit of interest.
Validate Statistical Results
This is where estimates are assessed against expectations, data are compared with those from previous periods, and quality measures are assessed to ensure good data quality.
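The comparison with previous periods can be sketched as a simple period-on-period tolerance check. The 5% tolerance and the indicator names are illustrative assumptions, not Stats SA's validation rules.

```python
# Flag estimates whose change from the previous period exceeds a tolerance,
# marking them for closer review before release.
def validate_against_previous(current, previous, tolerance=0.05):
    suspicious = {}
    for key, value in current.items():
        prev = previous.get(key)
        if prev:
            change = abs(value - prev) / prev   # relative period-on-period change
            if change > tolerance:
                suspicious[key] = change
    return suspicious

current  = {"employment_rate": 0.436, "unemployment_rate": 0.31}
previous = {"employment_rate": 0.441, "unemployment_rate": 0.25}
print(validate_against_previous(current, previous))  # unemployment change > 5%
```

A flagged indicator is not necessarily wrong; it simply triggers investigation into whether the movement is real or an artefact of processing.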
Interpret Statistical Results
Numbers are meaningless if they are presented without any accompanying explanation. This is one quality dimension that Stats SA caters for: all data that get released should be accompanied by the corresponding metadata.
Prepare Content for Dissemination
This is the process where particular measures are taken to ensure that content from the survey does not disclose information concerning any identifiable respondent. These include: a) for microdata: removal of respondent identifiers, content reduction and content modification; b) for tabular data: sensitive-cell correction methods such as cell collapsing or suppression.
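Cell suppression of the kind mentioned above can be sketched as follows: cells in a frequency table whose count falls below a minimum are suppressed so that small groups of respondents cannot be identified. The threshold of 5 and the table contents are illustrative assumptions.

```python
# Primary cell suppression: hide table cells whose count falls below a
# minimum threshold, so small respondent groups cannot be identified.
def suppress_small_cells(table, min_count=5):
    return {cell: (count if count >= min_count else "suppressed")
            for cell, count in table.items()}

frequency_table = {"district A": 120, "district B": 3, "district C": 47}
print(suppress_small_cells(frequency_table))  # district B is suppressed
```

In practice, agencies also apply secondary suppression, hiding additional cells so that suppressed values cannot be recovered from row and column totals.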
Perform Quality Control
This process entails making sure that all quality measures in SASQAF have been implemented correctly and the results thereof are known.
- Receive and Validate Content
During this process, the dissemination team goes through a checklist of what was supposed to be accomplished and whether it was done accordingly and correctly. The content received by the team consists of macro and micro data, and other products such as published reports.
- Manage Dissemination Repositories
Data to be disseminated are kept in databases (dissemination repositories), from which they are extracted when disseminated. These repositories store datasets (including both micro and macro data), reports and other documents.
- Pre-release for Publishing
This process entails preparations before release regarding tables, corporate formatting standards, electronic distribution and hard-copy outputs.
- Manage First Release
This is where distribution media are managed and controlled in order to ensure that different categories of users of statistical information get access to relevant information. Release timelines are handled within this process.
- Handle Customers
Handling customers is part of customer relationship and stakeholder management. A system to handle customer enquiries exists. Stats SA's Support and Informatics Services unit handles customer enquiries, categorises main users and other users, consults users to determine their needs, and ensures data are distributed to users in a timely manner.
Metadata Description Matrix
The implemented Survey Metadata Capture Tool of the ESDMF captures the following metadata:
Descriptions are provided for section headings.
1. Active Metadata Set
Displayed immediately under this section are the file identifier and status of the current/active metadata set, i.e. the metadata set that the user is currently capturing, editing or viewing.
2. Overview
The elements accessible from this section collectively provide a brief description of the survey.
The Overview section comprises the following items:
3. Generic Information
The elements accessible from this section collectively provide generic information about the survey time frames.
The Generic Information section comprises the following items:
Series Time Frames
4. Primary Data Source
The elements accessible from this section describe external inputs to the survey.
The Primary Data Source section comprises the following item:
External and Internal Data Sources
5. Methodology
The elements accessible from this section collectively describe the activities conducted and the methods and processes used which are specific to the survey.
The Methodology section comprises the following items:
Seasonal and Working Day Adjustment
6. Data Quality Report
The element accessible from this section provides a hyperlink to the data quality report for the data release.
The Data Quality Report section comprises the following items:
7. Documentation
The elements accessible from this section provide hyperlinks to additional documentation related to the survey.
The Documentation section comprises the following item: Documentation
8. Contact
The elements accessible from this section provide information concerning the contact person who will manage enquiries related to the data or information produced by the survey.
The Contact section comprises the following item: Contact Person
9. Loaded Metadata Sets
This section lists the file identifiers and statuses of metadata sets created by the current user. It enables the current user to switch between his/her metadata sets.
Table 2 below shows the metadata captured with the Metadata Capture Tool against the Statistical Value Chain, with examples for each stage of the SVC.
Statistical Value Chain

Overview
Description: Brief overview of the survey that highlights its background, purpose, history and usage
Examples: Title of survey, Series status, Objective of survey, Keywords, Main users and usage; Metadata file identifier, Metadata version; Target population, Main topics

Survey Time Frames
Description: Information about the time frames within which the life cycle of the survey will be managed
Examples: Frequency of series, Start date of survey, End date of survey; Reference period, Collection period, Product release date

Type of Survey
Description: Classification of a survey according to its statistical activity, which involves the collection, compilation and publication of statistical data
Examples: Derived, Direct (e.g. Sample or Census) and Administrative

Primary Data Source
Description: Information that describes or identifies the administrative data source
Examples: Administrative data information (i.e. title of survey from primary data source, primary data source description, contact person from primary data source)

Methodology
Description: Information about the processes that are put in place and the methods used to collect, process, analyse and publish a statistical release
Examples: Survey population, Instrument design, Collection, Editing/Error detection, Imputation, Estimation, Disclosure/Confidentiality control, Seasonal adjustments, Revisions, Data variables, Dissemination; Methodological soundness, Integrity and Accessibility

Data Quality Report
Description: Information about the quality measures used and the errors obtained as a result of executing the statistical processes
Examples: Sampling errors and Non-sampling errors

Documentation
Description: Attach any documents with extra information related to a specific section of the template
Examples: Any additional documents that describe the concepts and
Table 2: Relationships between various categories of metadata inputs and different phases of the SVC
The following table shows the stage of the SVC at which metadata is used:
Statistical Value Chain

Overview
Metadata File Identifier, Metadata version; Objective, Main topics; Title of survey, Series number, Series status, Abstract, History of survey, Keywords, Users and usage

Survey Time Frame
Collection period, Reference period; Frequency of series, Start date of survey, Product release date, End date of survey

Type of Survey
Derived, Direct (e.g. Sample or Census) and Administrative; Methodological soundness, Integrity, Accessibility

Primary Data Source
Administrative data information (e.g. title of survey from primary data source, primary data source description, contact person from primary data source)

Methodology
Survey population, Instrument design, Sample design, Collection, Quality evaluation, Data variables; Methodological soundness, Integrity and Accessibility; Quality evaluation, Data editing, Imputation, Seasonal adjustment, Revisions, Data variables; Quality evaluation, Estimation, Data variables; Quality evaluation, Disclosure/Confidentiality control, Dissemination methods

Data Quality Report
Sampling errors and Non-sampling errors
Table 3: Metadata used at each stage of the SVC, grouped by metadata category, with examples for each group