3.1 Metadata Classification
The CSB of Latvia does not have a formal classification of metadata. However, its metadata could be classified into the following six groups:
1. Dissemination metadata - all metadata intended for end users, such as classifications, data interpretation notes, etc.
2. Metadata on quality.
3. Metadata for data collection purposes. This metadata is used by interviewers and respondents, for example: various instructions for interviewers and respondents for coordinating their activities, interviewer's guidelines, etc. This group includes a great amount of information and therefore should be presented as a separate one.
4. Metadata for statistical data processing purposes. All metadata that is used in IMD SDMS and allows statistical data to be produced through the cycle of statistical data processing.
5. Operational metadata (paradata). Data about all statistical processes at the NSI. It is not related to survey paradata.
6. System metadata - all information referring to the IT environment, including necessary information for supporting this environment.
3.2 Metadata used/created at each phase
This section describes the metadata used at each phase of the SBPM of the CSB. The subject of this case study is IMD SDMS. For this reason, only the main points of those sub-processes that are supported by IMD SDMS are described, namely the sub-processes marked in blue on the SBPM.
The most important sub-processes, which affect the quality of statistics to a greater degree, are:
2.5 Design statistical processing & workflow methodology;
3.1 Build data collection instrument;
3.2 Build or enhance process components;
4.3 Run collection;
5.3 Analyze, validate, review, edit & impute;
7.1 Update output systems;
7.3 Manage release of dissemination products.
Phase 3 - Build
This phase provides a real test before data collection is run in Phase 4.
In the first case, this phase is not complicated if:
- all data collection instruments and necessary components have already been built and it is not the first iteration of the phase;
- all workflows were configured previously and the production system, for example for a specific survey, was applied when the first collection of that survey was done.
In the second case, this phase is somewhat more complicated because a new survey is going to be collected. The second case is described below, including the six sub-processes of the phase.
These sub-processes are not sequential from top to bottom; mostly they occur in parallel and some of them are iterative.
When a design is approved and all requirements in sub-process 2.5 (this process is strongly linked to Phase 8 - Archive) are specified, sub-process 3.1 comes into force.
This sub-process consists of building or improving the various components of the production system. It can be time-consuming when a questionnaire is described in the system for the first time and the existing functionality is insufficient to cover all the needs of that questionnaire (for example, when it is a non-standard questionnaire compared to other business statistics questionnaires). It takes very little time when the existing tools are in line with the requirements set by a particular questionnaire.
The following steps must be carried out in this sub-process:
*1. Registration of the statistical survey and attachment of all necessary methodological information to the survey's version*. A new survey (questionnaire) should be registered in the system. For each survey a questionnaire version should be created, which is valid for at least one year. If the questionnaire content and/or layout does not change, the current version and its description in IMD SDMS can be used for the next year.
*2. Description of indicators, attributes and content of chapters of the statistical questionnaire*. Each survey contains one or more data entry tables or chapters (data matrices). A table can be constant, with a fixed number of rows and columns, or it can have a variable number of rows or columns. For each chapter it is necessary to describe the rows and columns with their codes and names in IMD SDMS. This information is needed for automatic generation of the data entry application, data validation, etc. The last step in the description of the questionnaire content and layout is cell formation. Cells are the smallest data unit in survey data processing. A cell is created as a combination of a row and a column from the survey version side and a variable from the indicators and attributes side. As an example of a fixed-structure table, Figure 1 shows the structure of the Retail Trade Statistics Questionnaire from the metadata point of view. All necessary survey variables, with or without attributes, must be defined in IMD SDMS. Using these variables, the system automatically creates the user interfaces for statisticians. The principle of creating variables is defined by the formula:
INDICATOR + ATTRIBUTE (classification) = VARIABLE, where
ATTRIBUTES = dimensions or vectors of INDICATORS. Vectors are always classifications, and they could be as follows:
- Kind of activity - NACE
- Territory, etc.
For example:
Number of employees + no attribute = Number of employees, total
Number of employees + kind of activity (NACE) = Number of employees broken down by kind of activity
Number of employees + location (Territory classification) = Number of employees broken down by territory
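The formula can be illustrated with a minimal Python sketch. The class names (Indicator, Attribute, Variable) and the generated names are hypothetical and serve only to illustrate the principle; they are not taken from IMD SDMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Indicator:
    name: str


@dataclass(frozen=True)
class Attribute:
    # An attribute is always a classification (e.g. NACE, Territory).
    classification: str
    label: str


@dataclass(frozen=True)
class Variable:
    # VARIABLE = INDICATOR + ATTRIBUTE; the attribute may be absent.
    indicator: Indicator
    attribute: Optional[Attribute] = None

    @property
    def name(self) -> str:
        if self.attribute is None:
            return f"{self.indicator.name}, total"
        return f"{self.indicator.name} broken down by {self.attribute.label}"


employees = Indicator("Number of employees")
nace = Attribute("NACE", "kind of activity")
territory = Attribute("Territory classification", "territory")

variables = [
    Variable(employees),
    Variable(employees, nace),
    Variable(employees, territory),
]
names = [v.name for v in variables]
```

One indicator thus yields several variables, one per attached classification, plus the total when no attribute is attached.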
*3. Maintenance of validation (error detection) rules of statistical survey*.
These rules are described in IMD SDMS using a pseudo language.
*4. Description of conditions for aggregations and output tables of statistical survey*.
At this step micro data turns into macro data. If necessary, classification records are grouped. Aggregation conditions: SUMM, COUNT (frequency), MAX1, MAX2, MIN, MEAN.
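As a rough illustration of how micro data turns into macro data, the following Python sketch groups records by a classification code and applies one aggregation condition. The record layout, the function name aggregate and the sample values are assumptions made for this example. MAX1 and MAX2 are left out because their exact meaning is not defined here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical micro data: one record per respondent.
micro = [
    {"nace": "4711", "turnover": 120.0},
    {"nace": "4711", "turnover": 80.0},
    {"nace": "4719", "turnover": 50.0},
]

# Aggregation conditions named as in the text (MAX1/MAX2 omitted).
AGGREGATIONS = {
    "SUMM": sum,
    "COUNT": len,
    "MIN": min,
    "MEAN": mean,
}


def aggregate(records, group_key, value_key, condition):
    """Group micro records by a classification code and aggregate a value."""
    func = AGGREGATIONS[condition]
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {code: func(values) for code, values in groups.items()}


totals = aggregate(micro, "nace", "turnover", "SUMM")
counts = aggregate(micro, "nace", "turnover", "COUNT")
means = aggregate(micro, "nace", "turnover", "MEAN")
```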
In addition the steps mentioned below can take place as well:
*5. Description of layout for electronic statistical survey.*
*6. Description of derived variables.* IMD SDMS calculates derived variables for a defined questionnaire version using micro data from this version or from the version(s) of other survey(s). This step is used for Structural Business Statistics (SBS), where all necessary variables are combined as a quasi-survey in IMD SDMS.
The main objective of this sub-process is to be sure that the workflow (from data collection to dissemination) specified in sub-process 2.5 works in practice. There is no need to check the workflow with regard to the archiving phase, because once a questionnaire has been described in IMD SDMS, archiving poses no problem.
This process starts with testing of the production system and ends with approval of its successful operation. In this process both IT developers and statisticians need to be convinced that the numerous components work together through the configured workflow.
This sub-process always occurs in parallel with sub-process 3.4, through small-scale testing of data collection with special respondents selected for testing purposes. In this sub-process the qualitative and quantitative information of the business process is matched.
This match is possible only if the questionnaire's version has been described beforehand in IMD SDMS, so that all quantitative and qualitative information about this version is available (e.g. the number of variables in the questionnaire's version or the list of attributes attached to the version).
Successful results of the previous processes lead to the training of users and the attachment of documentation. Some technical documentation can be enclosed as well.
Phase 4 - Collect
This phase is based on both the methodology created in the Design phase, namely in sub-process 2.5, and the collection instruments prepared in the Build phase.
The sample is selected, validated and documented. Successful results of this process lead to the attachment of the list of respondents.
This sub-process provides for checking the availability of all data collection elements, for example: the list of respondents has been attached, all data input user interfaces have been prepared, all questionnaires have been printed, electronic survey forms are available, etc. This work has to be done by the statisticians responsible for their surveys.
This sub-process records when the first contact between the respondent and the statisticians was made. The first contact occurs at the beginning of each calendar year, when the list of questionnaires and the questionnaire forms are sent out to respondents.
After that there are two scenarios:
- the respondent decides to fill in the questionnaire electronically and either applies to the CSB for an account or, if an account already exists, fills in the electronic questionnaire;
- the respondent provides a completed paper questionnaire to the CSB.
Actually, we discussed whether or not to show this sub-process on the SBPM of the CSB. The reason is that our system has a raw database only for data collected electronically. When data is collected from paper questionnaires it is classified, coded and edited simultaneously; in that case the data is entered by advanced statisticians, who validate the data during input.
As a result, the CSB has no separate data entry operators and therefore needs fewer human resources for data input.
This sub-process is simultaneously related to several sub-processes: 5.2 - Classify & code; 5.3 - Analyze, validate, review, edit & impute; 8 - Archive.
The main purposes of sub-process 5.3 are to analyse the collected data, check its accuracy, correct it, and obtain a clean data set at the end. Involved organizational units and their roles: data collection unit, respective subject-matter section (e.g. Trade statistics section).
Looking at 4.4 from another perspective, when electronic surveys are also used for data collection the situation is slightly different. Electronic respondents (who fill in electronic questionnaires) do not get the full list of validations during data input. This approach is used to avoid burdening respondents. Therefore, in this sub-process a "raw database" of data submitted by respondents through electronic surveys is created.
Phase 5 - Process
In this phase, the statistical data is analyzed, checked and "refined".
It includes activities such as integrating data with other sources. These sources can be a mixture of external or internal data sources or extracts from administrative sources.
The integration process comprises a data import function supported by IMD SDMS.
For example, some variables of one employment statistics survey can be integrated with variables of another employment statistics survey. Using special applications of IMD SDMS, data from different sources that refer to the same unit can be matched and linked.
Sub-process 5.2 (see also sub-process 4.4)
In this sub-process the input data is classified and coded. Numerous pre-defined classifications (local, national, international and validation classifications for calendar values) are maintained in IMD SDMS. During data input, the system automatically provides a list of all codes of the relevant classification, with a textual description for each code. Respondents or statisticians just need to choose the corresponding code if a particular data entry cell requires the use of a classification.
Sub-process 5.3 (see also sub-process 4.4)
All activities in this sub-process are operated at micro data level. Within the sub-process, error checking is done (error detection, for example: the sum of breakdowns by NACE does not equal the total), together with the handling of item non-response and miscoding, and data imputation (imputed values are always flagged).
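The error detection example mentioned above (a NACE breakdown that does not add up to the total) could be expressed as a check of roughly the following form. This is a Python sketch only; the function name, the tolerance parameter and the message text are hypothetical and do not reproduce the pseudo language used in IMD SDMS.

```python
def check_breakdown_sum(total, breakdown, tolerance=0.0):
    """Return an error message if the breakdown does not add up to the total.

    `breakdown` maps classification codes (here NACE) to reported values.
    Returns None when the rule is satisfied.
    """
    diff = sum(breakdown.values()) - total
    if abs(diff) > tolerance:
        return f"Sum of NACE breakdown differs from total by {diff:+g}"
    return None


# Hypothetical questionnaire values.
ok = check_breakdown_sum(100, {"4711": 60, "4719": 40})
error = check_breakdown_sum(100, {"4711": 60, "4719": 30})
```

A rule of this kind returns nothing for a consistent record and a message for an inconsistent one, which the statistician can then review and correct.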
New variables are derived within this sub-process using a special application in IMD SDMS, where all arithmetic formulas are described with pseudo code expressions. The statistician responsible for a given survey just needs to press a button to launch the calculations. Each arithmetic formula has its own execution priority.
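The role of formula priorities can be sketched as follows. The formulas, variable names and the convention that a lower number runs first are assumptions for illustration; the actual pseudo code expressions of IMD SDMS are not shown in this case study.

```python
# Each entry: (priority, derived variable name, formula over the record).
# Assumption: formulas with a lower priority number are executed first.
formulas = [
    (2, "turnover_per_employee", lambda r: r["net_turnover"] / r["employees"]),
    (1, "net_turnover", lambda r: r["turnover"] - r["vat"]),
]


def derive(record, formulas):
    """Apply formulas in priority order so later ones can use earlier results."""
    result = dict(record)
    for _, name, func in sorted(formulas, key=lambda f: f[0]):
        result[name] = func(result)
    return result


# Hypothetical micro data record.
record = {"turnover": 1210.0, "vat": 210.0, "employees": 4}
derived = derive(record, formulas)
```

Priorities matter because the second formula depends on the result of the first: without ordering, turnover_per_employee would be evaluated before net_turnover exists.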
During the description of the questionnaire's version in IMD SDMS all (associated) derived statistical units are defined.
After the weights have been created and imported at micro level into IMD SDMS, the system aggregates the micro data. Aggregation characteristics and aggregation algorithms (within one version of the defined survey) were described earlier at step 3.2 or can be described in sub-process 5.6.
If there are problems with data aggregation, IMD SDMS informs users about the type of problem (for example: the aggregation field is not filled in for certain statistical units).
The main point about aggregated data is that they are aggregated at the lowest classification level (for example: NACE family (4 digit); CPA (6 digit); PRODCOM (8 digit), etc.). It should be underlined that data analysis for aggregates is available at the lowest classification level as well.
This sub-process provides macro data sets (or output tables), which are used as the input to Phase 6 - Analyze. All versions of provisional or final macro data sets are stored in IMD SDMS together with the calendar dates on which they were created.
Phase 6 - Analyze
In this phase statistics are produced, analysed and prepared for dissemination.
This sub-process can be divided into two parts: the first part is covered by IMD SDMS and the second is covered by other processing tools.
This sub-process brings together the results of output tables (created after data aggregation) and/or the results of producing additional measurements such as indices, trends or seasonally adjusted series, and the recording of quality characteristics.
IMD SDMS can create output tables (within one version of the defined survey). The main point about output tables is that they are summarized at the highest classification level (for example: data summarized by NACE section "F" - Construction). In that case analytical data for output tables is available at the highest classification level. This applies to data that is published in absolute values.
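The relation between aggregates kept at the lowest classification level and an output table at the highest level can be sketched in Python. The mapping below covers only a few NACE Rev. 2 divisions and is illustrative; the function name roll_up and the sample values are hypothetical, and summing is appropriate only for values published in absolute terms.

```python
# Illustrative excerpt: NACE Rev. 2 division (first two digits) -> section.
SECTION_BY_DIVISION = {
    "41": "F", "42": "F", "43": "F",  # Construction
    "45": "G", "46": "G", "47": "G",  # Wholesale and retail trade
}

# Hypothetical aggregates at the lowest (4-digit) level.
lowest_level = {"4120": 300.0, "4211": 150.0, "4711": 200.0}


def roll_up(aggregates):
    """Summarize 4-digit aggregates at section level by summing values."""
    table = {}
    for code, value in aggregates.items():
        section = SECTION_BY_DIVISION[code[:2]]
        table[section] = table.get(section, 0.0) + value
    return table


output_table = roll_up(lowest_level)
```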
Other processing tools, such as Demetra, Access, SQL, etc., cover the other part of this sub-process.
In this sub-process statisticians validate the quality of the outputs produced at micro and macro levels. Divergence from expectations is analyzed. This sub-process is performed with the following components: IMD SDMS analytical tools, OLAP (working with an OLAP data cube makes it possible to get from macro data to micro data while analyzing a divergence), SQL procedures and SPSS tools.
Only the IMD SDMS analytical tools are described here.
IMD SDMS analytical tools are intended for quick analysis of macro and micro data. The analytical tools for micro data make it easy to create any kind of data request on individual data in different breakdowns and for different periods. These tools can export to XLS or ACCESS for further processing.
The analytical tools for macro data allow requests on aggregated data sets at different levels of aggregation. The results of requests can be exported to XLS or ACCESS for further analysis.
This sub-process is covered by IMD SDMS as well. IMD SDMS automatically applies confidentiality rules to macro data sets, checking for primary disclosure.
There are three main primary confidentiality conditions, and each of them is marked in the provided data set (output tables) by its own color, which is very handy for statisticians. The confidentiality conditions are described in detail in the CSP Confidentiality handbook.
Phase 7 - Disseminate
This phase manages the release of the statistical products to customers. This phase deals with checking data and metadata readiness for dissemination.
In this sub-process PC-Axis tools are widely used, as they help to map data and metadata for loading into the dissemination output file system.
Phase 8 - Archive
This phase deals with micro data and metadata using the applications of the Data electronic archiving (DEA) system in IMD SDMS.
The DEA system prepares statistical documents (surveys) in electronic format for deposit in the State Archive of Latvia.
This phase is made up of four sub-processes, which are generally sequential from top to bottom:
This sub-process determines the archiving rules, namely the conditions and the medium of archiving.
The conditions under which data and metadata should be archived:
- data and associated metadata are archived for the year three years earlier (for example: in 2010, data for 2007 will be archived);
- archived data must correspond to the data structure of IMD SDMS; if it does not, it must be matched to the corresponding structure using a special IMD SDMS application. This applies when data not stored in IMD SDMS has to be archived;
- the archiving process must be carried out either from external data in a file or from data stored in IMD SDMS;
- respondents' data collected in sub-process 5.3 is archived, but without data that has been imputed.
The medium of archiving:
- DEA provides a medium by which the archiving document is created (this archiving document includes different kinds of information, such as respondents' data, the survey questionnaire, a thesaurus (the structured content of the archived information), etc.)
Preservation of data and associated metadata:
- data is prepared by DEA in HTML 4.01 using Baltic-1257 encoding and is burned onto a data carrier (CD-R);
- data is preserved on the IS server and is available for viewing in IMD SDMS.
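A minimal Python sketch of producing an HTML 4.01 document in a Baltic code page is given below. It assumes that "Baltic-1257" refers to the Windows-1257 code page (Python codec cp1257); the function name, document layout and sample content are hypothetical and do not reproduce the actual DEA output.

```python
import html


def build_archive_document(title, rows):
    """Build a small HTML 4.01 document and encode it as Windows-1257 bytes."""
    body = "\n".join(
        f"<tr><td>{html.escape(str(key))}</td><td>{html.escape(str(value))}</td></tr>"
        for key, value in rows
    )
    doc = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">\n'
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1257">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head><body>\n"
        f"<table>\n{body}\n</table>\n"
        "</body></html>\n"
    )
    # cp1257 is the Python codec name for the Windows-1257 (Baltic) code page.
    return doc.encode("cp1257")


# Hypothetical content with Latvian diacritics.
payload = build_archive_document("Mazumtirdzniecība 2007", [("Rīga", 1250)])
```

Declaring the charset in the document and encoding the bytes with the same code page keeps Latvian diacritics readable when the archived file is opened later.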
This sub-process includes matching the data structure for archiving.
This sub-process comprises the following activities:
- identifying data and metadata for archiving in line with the rules defined in 8.1;
- if necessary, formatting those data and metadata for the repository after matching;
- transferring data and metadata to the repository;
- creating the archiving documents;
- verifying that the data and metadata have been successfully archived.
Data and associated metadata are disposed of in line with the rules in 8.1 and are prepared by DEA in HTML 4.01 using Baltic-1257 encoding. Data is burned onto a data carrier (CD-R). Archived data is easily retrieved from the DEA system. A special flag is set in IMD SDMS, and the data is available for viewing in IMD SDMS.
3.3 Metadata relevant to other business processes
Apart from the metadata management process for statistical data processing purposes, all other business processes use metadata as well.
The business processes can provisionally be named as follows:
1. Metadata dissemination processes
2. Quality management process
3. Metadata management process for data gathering purposes
4. Metadata management process for statistical data processing purposes (described in detail within this case study)
5. Operational processes
6. System metadata
First of all, it should be noted that process 4, the metadata management process for statistical data processing purposes, was described in detail in section 3.2 for the particular case of IMD SDMS.