2.1 Statistical business process model
Within the framework of the STAT+ project, a model of the statistical life cycle (called the "4-layer-model" because of the four data systems it defines) was elaborated at Statistics Austria in 2002. The model distinguishes between the following types of statistical projects: surveys, registers and analytical systems.
Surveys are the most "typical" and most commonly occurring form of statistical project at Statistics Austria. One can differentiate between primary surveys (in which the collection of raw data is one of the steps of the process) and secondary surveys (which process data that already exist and were often collected for non-statistical purposes).
There are also mixed types, e.g. surveys in which data from secondary sources are used to augment the data collected by questionnaire. Some surveys are undertaken only once, others are repeated at regular or irregular intervals - although the sets of variables collected in each repetition do not have to be identical. It is therefore useful to further subdivide the survey structure: each survey consists of one or more survey versions, and each survey version consists of one or more survey instances, i.e. concrete executions. For example, in the case of a survey with monthly periodicity, the data collection of each year might be seen as a new survey version with twelve survey instances.
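The survey / survey version / survey instance hierarchy described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, the variable names and the survey itself are hypothetical and do not come from Statistics Austria's systems.

```python
from dataclasses import dataclass, field


@dataclass
class SurveyInstance:
    """One concrete execution, tied to a reference period."""
    reference_period: str  # e.g. "2009-03"


@dataclass
class SurveyVersion:
    """A fixed set of variables shared by all of its instances."""
    label: str
    variables: list[str]
    instances: list[SurveyInstance] = field(default_factory=list)


@dataclass
class Survey:
    name: str
    versions: list[SurveyVersion] = field(default_factory=list)


def monthly_version(year: int, variables: list[str]) -> SurveyVersion:
    """Build one yearly version with twelve monthly instances."""
    instances = [SurveyInstance(f"{year}-{month:02d}") for month in range(1, 13)]
    return SurveyVersion(label=str(year), variables=variables, instances=instances)


# Hypothetical monthly survey whose variable set changes from one year to the next.
survey = Survey(name="Hypothetical monthly survey")
survey.versions.append(monthly_version(2008, ["turnover", "employees"]))
survey.versions.append(monthly_version(2009, ["turnover", "employees", "hours_worked"]))
```

Keeping the variable list on the version rather than on the survey reflects the point that repetitions need not collect identical sets of variables.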
In contrast to the data of a survey instance, which pertain to a certain reference date or period, registers are usually updated continuously. Maintaining a register is thus a core process that is typical for a statistical project of type "register" but unknown for projects of type "survey". Another fundamental difference is that register data are used as resources for workflows in other statistical projects, e.g. when drawing samples, for addressing, registration of incoming questionnaires and administration of reminders. Commonly, specific (database) applications are developed to carry out these functions.
Analytical projects and systems (as, for instance, national accounts) characteristically do not collect raw data on specific observation units, but use data from other statistical projects and evaluate them or combine them into a coherent, integrated model.
Data which form the input to a statistical project (or which are collected in an early phase of processing, in the case of primary surveys) often pertain to individual observation units, e.g. individual persons, households, enterprises, events, etc., and are termed "microdata". However, the input may also consist of macrodata, i.e. data pertaining to collectives. In addition to these, metadata also enter into a statistical project and form an important resource for the steps carried out in its processing.
The output of a statistical project consists predominantly of macrodata - cross-classified tables, multidimensional data cubes and time series being the most important categories - and metadata. More rarely, (anonymized) sets of microdata may be produced. Macrodata and certain related metadata are often combined into an "information object" (e.g., a press release consisting of a table and descriptive text). Such information objects may also be composites of smaller information objects (as with a printed publication containing several tables and descriptive metadata - e.g., analytical texts).
The Statistics Austria model of the statistical life cycle distinguishes between the following phases in the production of statistical information. This description applies to statistical projects of the type "survey". Registers exhibit different core processes (creation, maintenance and use of the register), although the contents of a register may also form the basis for the production of statistics and information, which can be identified with the relevant phases of a survey. The line between surveys and analytical projects is not always clearly defined; in fact, one could argue that analytical projects are a special type of statistical survey that creates statistical information from input data, the main difference being that the methods applied differ from those used in "typical" surveys. The phases are:
Phase 1: Planning, design and system development
The survey is set up in this phase. Given specific requirements (e.g. EU regulations) and the information needs of internal and external parties (e.g., statistics users in government and the economy), the survey must be prepared to satisfy these as well as possible while simultaneously minimizing the effort of the statistics producers and the burden on the data providers.
The output of this phase consists of metadata of various kinds - e.g., a description of the survey's goals; a description of the characteristics to be surveyed; definitions; value domains; classifications; questionnaires and comments explaining them; a list of validation rules, etc. The metadata created in a survey may form the input to other phases or other surveys and be reused there.
The development of tools for actually conducting the survey (e.g., electronic questionnaires, editing software, programs for checking consistency and plausibility) is also a component of the first phase.
Phase 2: Data production
Whereas the decisions taken in the design phase and the metadata and tools which are created there apply to the whole survey or at least a survey version, the focus of the activities undertaken in the following phases lies mostly on the current survey instance (excepting activities in which data from more than one survey instance are processed, as in the creation of time series).
Data production can be subdivided into three sub-phases:
- In pre-production, activities such as drawing the sample, printing and posting paper questionnaires, loading Web questionnaires with respondent-specific initial data, etc. are undertaken.
- The actual survey/measurement/observation of the statistical raw data is termed core production. This includes conducting interviews, filling in paper or electronic questionnaires, registering and roughly checking questionnaires which arrive, mailing reminders, data entry from paper questionnaires, etc. In secondary surveys, this sub-phase includes acquisition of the secondary data and, if necessary, reformatting or recoding it. The collected data are stored in the so-called Raw Data System (RDS).
- Post-production includes all activities necessary to improve the quality of the raw data. Among these activities are validation and consistency checks, examination and correction of dubious information, and imputation of missing values. The results of this phase are the "authentic" data (ADS: Authentic Data System); of these several versions may exist, especially in complex and voluminous surveys (e.g., preliminary version at a certain cut-off date and final version at a later date).
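The step from raw data to "authentic" data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records, the single validation rule and the use of mean imputation are hypothetical and are not taken from the procedures actually used at Statistics Austria.

```python
from statistics import mean

# Hypothetical raw records as they might sit in the Raw Data System (RDS).
raw_data = [
    {"id": 1, "employees": 12, "turnover": 340.0},
    {"id": 2, "employees": -3, "turnover": 120.0},  # implausible value
    {"id": 3, "employees": 7, "turnover": None},    # missing value
]


def is_valid(record: dict) -> bool:
    """Validation rule (assumed): employee counts must not be negative."""
    return record["employees"] is not None and record["employees"] >= 0


def post_production(records: list[dict]) -> tuple[list[dict], list[int]]:
    """Impute missing turnover and list the ids that need manual examination.

    The input records are left untouched, so the raw data stay available.
    """
    flagged = [r["id"] for r in records if not is_valid(r)]
    known = [r["turnover"] for r in records if r["turnover"] is not None]
    fill = mean(known)  # mean imputation, chosen only for brevity
    processed = []
    for r in records:
        cleaned = dict(r)
        if cleaned["turnover"] is None:
            cleaned["turnover"] = fill
            cleaned["turnover_imputed"] = True  # keep a trace of the imputation
        processed.append(cleaned)
    return processed, flagged


authentic_data, needs_review = post_production(raw_data)
```

Flagged records are only listed here, not corrected; in practice the examination and correction of dubious information would precede the release of a version of the authentic data.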
A large part of the metadata created during the design phase enters the second phase as input. As in later phases, however, new metadata are also produced (e.g., the response rate, which is an attribute of the survey instance).
Phase 3: Statistics production
In this phase, the contents of the Authentic Data System (consisting mainly of microdata) are processed further. To do this, data from other surveys may occasionally be accessed. Some of the processing steps undertaken are aggregation into macrodata, calculation of statistical measures and indices, various methods for increasing the quality and comparability of statistical information (e.g., seasonal adjustments), and creation of time series. The results are data sets which are stored in the Statistical Data System (SDS) and are available to internal and often also external users. The SDS primarily contains multi-dimensional data cubes, although anonymized sets of microdata may occasionally also be created in this phase.
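Aggregation of microdata into macrodata, the first processing step named above, can be sketched as follows. The records, the dimension names and the `aggregate` helper are hypothetical; the sketch shows the principle of building cube cells and says nothing about how the SDS is actually implemented.

```python
from collections import defaultdict

# Hypothetical microdata: one record per observation unit.
microdata = [
    {"region": "Wien", "sector": "C", "employees": 12},
    {"region": "Wien", "sector": "G", "employees": 5},
    {"region": "Tirol", "sector": "C", "employees": 30},
    {"region": "Wien", "sector": "C", "employees": 8},
]


def aggregate(records, dimensions, measure):
    """Sum `measure` over all records sharing the same dimension values."""
    cube = defaultdict(int)
    for record in records:
        cell = tuple(record[d] for d in dimensions)
        cube[cell] += record[measure]
    return dict(cube)


# Two-dimensional cross-classification and a one-dimensional margin.
cube = aggregate(microdata, dimensions=("region", "sector"), measure="employees")
by_region = aggregate(microdata, dimensions=("region",), measure="employees")
```

Each key of `cube` is one cell of a cross-classified table; choosing fewer dimensions yields a coarser table from the same microdata.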
In part, the transformations which are carried out here have already been planned in the design phase and are applied to the data of each survey instance. In part, however, ad hoc analyses may also take place, which use the existing data material in ways not foreseen when the survey was planned. This underlines the importance of comprehensive and easily accessible documentation of all of a survey's design decisions, of the data sets and of the transformation processes (in whatever phase of the statistical life cycle they may be created or executed).
Phase 4: Information production
In this last phase, "information objects" such as tables, charts, articles, press releases etc. are created from the data stored in the ADS and the SDS and their metadata and disseminated via various media (internet, print publications, etc.).
The following figure presents the phases described above, the data systems, the registers and the meta-information systems. On the one hand, the latter provide input and various services for the activities carried out during the statistical life cycle; on the other, they also accept the metadata created as output from each phase. Thus metadata systems and registers form an infrastructure layer accompanying the whole production process. The data systems are drawn as broader than the "process arrows" in order to point out that they contain data from various surveys and that an individual phase of a survey may accept input data from more than one statistical project.
Figure 4: "4-layer-model"
In actual fact the workflows are of course not quite as linear as the figure might suggest; on the contrary, complex control flows (branches, loops) often occur. Moreover, events in later phases may have retroactive effects on the survey's design and lead to adjustment of the current or future survey instances (e.g., changes to the validation rules).
Phases 1 to 7 of the Generic Statistical Business Process Model ("Specify Needs", "Design", "Build", "Collect", "Process", "Analyse" and "Disseminate") can be mapped to the "4-layer-model". The phase "Evaluate" is regarded as an ongoing process, but is not explicitly mentioned in the model. The over-arching process of "Metadata Management" is represented by the "Metadata systems and registers infrastructure layer". "Archiving" and "Quality Management" are not part of the model.
The Generic Statistical Business Process Model incorporates many more details than the 4-layer-model. Therefore the GSBPM (with the addition of the four data systems) would appear to be appropriate for use in future metadata projects.
2.2 Current system(s)
ISIS (short for Integrated Statistical Information System) is a statistical output database which was developed in the early 1970s and has been consistently maintained and developed further since then. It contains thousands of multi-dimensional data cubes as well as metadata of various kinds (e.g., short descriptions of the data cubes and the underlying surveys; keywords and a hierarchically structured topic tree are furnished for data searching) and implements a large part of the Statistical Data System (SDS) in the life cycle model. Although ISIS is still very modern from the point of view of the conceptual design of its contents, the software itself has reached the end of its life span, as only one programmer still possesses sufficient technical know-how to maintain the mainframe Assembler and PL/I programs. Because of this, a successor system (ISIS New) is currently being developed on the basis of the SuperSTAR product range of the Australian company Space-Time Research.
Currently the project "e-Quest New" is running with the goal of replacing the Visual Basic components with a Java-based solution. At the same time, better integration of the stand-alone and Web questionnaire subsystems is being pursued.
- Publication Database:
Using document management software from the company Stellent (which has since been acquired by Oracle), the publication data system PDS was created during the last few years. It stores all publications (i.e., documents of various types, ranging from tables through print publications and press releases to the so-called standard documentations) together with metadata relating to the documents. Since the Web re-launch on June 1st 2007, Stellent has also been used as a Web content management system. The subject matter experts now create Web pages in the form of standardized Word documents which are automatically converted to HTML and copied to the correct position in Statistics Austria's website on the basis of associated metadata (in particular a hierarchical topic and navigation structure). The navigation structure is also used for generating links to related documents with data and metadata. The online directory of print publications (many of which can be downloaded free of charge as PDF files) was also implemented in the Stellent system.
- Classification Database:
In 2006 the Classification Database KDB was released. This allows Web access to almost 20 voluminous classifications such as PRODCOM, NACE, COICOP, SITC and CPA, including comments and correspondences. More than one version is available for several classifications.
An application for interactive editing and processing of classifications has not yet been developed.
- Statistical Table Format STF:
STF is an XML specification which permits cross-classified tables to be stored together with extensive metadata in a hardware- and application-independent format - for long-term storage, among other uses. Converters from STF to Excel and HTML and from Excel tables to STF are supplied. When Excel tables are checked into the Stellent publication database, they are automatically converted to STF format. ISIS query results can also be stored in STF format.
- Standard documentations:
The standard documentations - which can be downloaded as PDF documents over the Web - serve as the most important source of metadata about statistical projects and the quality of the statistical results they produce. The documents exhibit a standardised chapter structure and so far describe more than 100 statistical projects or survey versions, in part in great detail (they range from 8 to 100 pages; in many cases further documents are provided as attachments which can be accessed via hyperlinks in the text). One disadvantage is that they are usually written and made available to the users of the statistics in a separate, additional work step after the fact, although they contain many documentation elements which come into existence in the early phases of planning and preparing the statistical project. Another weak point is that no quantitative quality indicators are included.
This system was implemented through a Word template. Every manager of a statistical project is obliged to use this template when compiling a standard documentation.
The main headings are the following:
- Important Hints
- General Information
- Statistical Concepts, Methods
- Production of Statistics, Processing, Quality Assurance
- Publication (Accessibility)
Every chapter is divided into subsections which are more or less standardized.
- Release calendar:
The calendar of planned releases is available at http://www.statistik.at/web_de/ueber_uns/veroeffentlichungstermine/index.html. It consists of two PDF files which are updated on a regular basis (in the first one releases are sorted by date, in the second by statistical domain).
From the same Web address, a file with information on the dates of data transmissions to Eurostat can be downloaded. There is also a link to the advance release calendar at the SDDS site of the IMF.
The planned press releases of the upcoming week are published at http://www.statistik.at/web_de/presse/presseservice/index.html
- Database of administrative data:
This is an MS-Access application available only to internal users which contains information about administrative data sources.
2.3 Costs and Benefits
Metadata systems form a fundamental information infrastructure for the production of statistics. More than 15 years ago, Bo Sundgren wrote the following about this topic:
"Statistical metainformation systems (...) exhibit some characteristics, which are typical for infrastructures:
- They require collective commitment and relatively large investments, which (at least initially) have to be financed by the organization as a whole.
- They have to be designed on the basis of partially unknown needs, some of which require "intelligent guesses" about the future.
- They have to be planned for a wide range of usages and users, some of whom may have conflicting needs.
- Once they exist, the marginal cost of using them is relatively low, at least in comparison with the initial investment."
(Bo Sundgren, Organizing the Metainformation Systems of a Statistical Office, Statistics Sweden 1992)
When metadata can be utilized to standardize and automate production processes ("active metadata"; see section 3.1), the costs for the development of metadata systems (which in many cases are quite substantial) are offset by monetary benefits which in the long run may amount to major cost savings. One example of this is Statistics Austria's metadata-driven electronic questionnaire system e-Quest. Compared to the development of a tailor-made electronic questionnaire for a single survey, its initial development costs were inevitably higher. But e-Quest now facilitates the cost-effective creation of electronic questionnaires. Because the system was used repeatedly in many different statistical projects, the break-even point was reached quickly.
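The break-even reasoning can be made explicit with a small calculation. The cost model is an assumption made for illustration (one-off setup cost plus a constant cost per questionnaire), and the function and parameter names are hypothetical; no actual cost figures for e-Quest are implied.

```python
import math


def break_even_projects(generic_setup: float, generic_per_use: float,
                        tailor_made_per_use: float) -> int:
    """Smallest number of projects n at which the generic system is no more
    expensive than a tailor-made questionnaire for every project, i.e.
    generic_setup + n * generic_per_use <= n * tailor_made_per_use."""
    saving_per_use = tailor_made_per_use - generic_per_use
    if saving_per_use <= 0:
        raise ValueError("the generic system never pays off")
    return math.ceil(generic_setup / saving_per_use)


# Purely illustrative cost units, not real figures.
n = break_even_projects(generic_setup=500, generic_per_use=10, tailor_made_per_use=60)
```

The larger the saving per questionnaire and the more projects reuse the system, the sooner the initial investment is recovered.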
The situation in the case of developing systems for the collection and administration of passive metadata is, however, quite different. Passive metadata are an integral component of statistical information. Their availability and easy accessibility contribute to the quality of statistical products, but in many cases do not result in cost reductions (they may even increase the work load of subject matter statisticians). Opportunity costs caused by the non-existence of centralized end-to-end metadata systems are rarely found in accounting systems. Thus high investments are accompanied "only" by a gradual gain in quality (which may not even be recognized by all user groups). Under these circumstances it is understandable that in times of economic crisis the willingness to invest in metadata projects is not high.
The concept of "high-quality statistics" is a dynamic one. The needs and requirements of users are changing and will probably increase in the future, e.g. with regard to harmonization of statistics or the linkage of data with relevant metadata items (or of metadata items with related metadata items), so that they can be accessed at the push of a button. If metadata are stored in the continuous text of bulky documents, these new requirements cannot be met. The management of metadata in an "atomic" and structured form, however, is a challenge with respect to both financial resources and personnel.
The fundamental principles of metadata management, which have been defined by experts during recent years (and which can be found, for example, in part A of the Common Metadata Framework), will increasingly become accepted standards and the state of the art for the production and dissemination of statistical information.
The task of implementing these standards can certainly not be carried out at short notice. In this respect, it is not easy to answer the question whether to continue building isolated metadata systems whenever the need for one specific system arises, or whether to strive for an integrated system based on a global architecture. The first approach is certainly less expensive in the short run and produces quicker results, but in the long term it will cause quite substantial "repair" costs.
2.4 Implementation strategy
Similar to the BASIS 2000+ concept, a modular implementation approach was a major design principle of the IMS. In order to minimize the complexity of the complete system, the individual components (subsystems) should be able to work independently, communicating with each other and the central "Registry" by means of a web service and program interface layer. Thus - considering the limited resources - stepwise realization and gradual commissioning and expansion of the IMS (in the sense of "evolution instead of revolution") should be facilitated.
Regarding the integration of previously existing legacy systems into the IMS, several options are possible. A very simple form of coupling can be realized by manually registering information objects (for example a classification from the Classification Database) in the IMS Registry. A tighter and more sophisticated integration will require some programming effort, so that a legacy system can communicate with other components of the IMS via web services.
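The simple form of coupling, manual registration of an information object in a central registry, can be sketched as follows. This is a hypothetical in-memory illustration: the class names, the fields, the identifier scheme and the location string are assumptions and do not describe the actual IMS Registry or its interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    object_id: str      # identifier inside the registry (hypothetical scheme)
    object_type: str    # e.g. "classification"
    source_system: str  # legacy system that still holds the object
    location: str       # where the object can be retrieved (hypothetical)


class Registry:
    """In-memory stand-in for a central registry of information objects."""

    def __init__(self) -> None:
        self._entries: dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        """Add an entry; identifiers must be unique."""
        if entry.object_id in self._entries:
            raise ValueError(f"{entry.object_id} is already registered")
        self._entries[entry.object_id] = entry

    def find(self, object_type: str) -> list[RegistryEntry]:
        """Return all registered objects of the given type."""
        return [e for e in self._entries.values() if e.object_type == object_type]


# Manual registration of a classification held in the Classification Database.
registry = Registry()
registry.register(RegistryEntry("cls-nace", "classification", "KDB", "kdb://NACE"))
```

In this form the legacy system itself remains unchanged; only a pointer to its content is recorded. The tighter integration mentioned above would instead have the legacy system expose such information through web services.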