3.1 Metadata Classification
Statistics Austria has no "official" classification of metadata. However, during the conceptual work for BASIS 2000+, STAT+ and the integrated metadata system (IMS), a multidimensional approach was worked out, similar to Bo Sundgren's proposal in working paper 7 of the METIS 2008 meeting.
In this model metadata content is considered to be the focal point. Statistical metadata appear in many different forms: e.g., as title of a table, text in a document (describing, for example, conceptual objects like a survey, a statistical concept or a validation rule), source code statements in a software program, technical attributes of a file, and so on. In principle these metadata items can be seen as instances of a set of object types, which are connected by different kinds of relations (which themselves are part of the metadata).
These different types of metadata can be investigated from varying points of view. The IMS project team differentiated between the following six dimensions:
1. Dimension "Function":
This dimension describes metadata's purpose. Basically, metadata are required for the following reasons:
- for searching and finding statistical information
- in order to interpret statistical data
- in order to access data
- for processing data and producing statistics
- for managing statistical projects
2. Dimension "Statistical Life Cycle":
Statistics production can be described as a process which transforms input data into output data via several steps and using statistical methods. At Statistics Austria, this statistical life cycle is structured in the form of "statistical projects" of different types (surveys, registers, analytical projects or systems). This is described in more detail in section 2.1.
3. Dimension "Users":
Statistical metadata are not an end in themselves, but are required by different groups of users for varying purposes. "User" must here be understood in a broad sense, comprising not only persons but also IT systems.
We can distinguish roughly between two main groups of users: external users (who do not belong to the NSI) and internal users. External users are mostly "consumers" of statistics, although they may also be providers of raw data (respondents). Internal users are often both "producers" and consumers. External users may include politicians, scientists, economic enterprises, journalists, private persons and international organisations. In general, the users in these groups differ in their previous knowledge, in the level of detail they expect from the statistical information they are seeking, and in the resources at their command. The amount of metadata required may also vary within these relatively heterogeneous groups, and the requirements may evolve over time.
4. Dimension "active / passive":
This dimension treats the degree to which metadata play an active role in statistics production, i.e. controlling the process or automating processing steps (e.g. when an electronic questionnaire is generated automatically based on the specification of a survey's questions). For efficient production of statistics, as many active metadata elements as possible should be defined directly by the statistical subject matter experts.
5. Dimension "formatted / unformatted":
A distinction can be drawn between formatted and unformatted data. The structure of the former is agreed beforehand (e.g., every record in a file consists of the same sequence of data fields, which in turn exhibit prearranged characteristics such as data type, length, etc.; or a data file conforms to a predefined XML schema) and thus lends itself easily to automated processing. Unformatted data on the other hand - texts, graphics, voice etc. - are much more difficult and costly to process, especially where IT programs are expected to "understand" their contents. Statistical metadata often occur in unformatted form, e.g. as text in documents.
6. Dimension "manual / automatic":
The criterion by which this dimension classifies metadata is whether they are recorded manually by the persons entrusted with planning and implementing statistical projects, or whether they are created automatically by tools.
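The six dimensions can be read as independent attributes attached to each metadata item. The following is a minimal sketch of that idea in Python, not a description of the IMS data model: the class and field names, the life cycle phase labels and the two sample items are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Function(Enum):
    # Dimension 1: the five purposes listed above
    SEARCH = "searching and finding statistical information"
    INTERPRET = "interpreting statistical data"
    ACCESS = "accessing data"
    PROCESS = "processing data and producing statistics"
    MANAGE = "managing statistical projects"


class UserGroup(Enum):
    # Dimension 3: the two main groups of users
    EXTERNAL = "external"
    INTERNAL = "internal"


@dataclass(frozen=True)
class MetadataItem:
    """One metadata item classified along the six dimensions (hypothetical layout)."""
    name: str
    functions: frozenset       # dimension 1: set of Function values
    life_cycle_phase: str      # dimension 2: free text, phase names are assumed
    users: frozenset           # dimension 3: set of UserGroup values
    active: bool               # dimension 4: True if the item drives a processing step
    formatted: bool            # dimension 5: True if the structure is agreed beforehand
    automatic: bool            # dimension 6: True if the item is created by a tool


def machine_usable(items):
    """Return the names of items that are both active and formatted."""
    return [item.name for item in items if item.active and item.formatted]


# Illustrative sample items; the classifications are assumptions.
items = [
    MetadataItem(
        name="question specification",
        functions=frozenset({Function.PROCESS}),
        life_cycle_phase="data collection",
        users=frozenset({UserGroup.INTERNAL}),
        active=True, formatted=True, automatic=False,
    ),
    MetadataItem(
        name="table title",
        functions=frozenset({Function.SEARCH, Function.INTERPRET}),
        life_cycle_phase="dissemination",
        users=frozenset({UserGroup.EXTERNAL, UserGroup.INTERNAL}),
        active=False, formatted=False, automatic=False,
    ),
]

print(machine_usable(items))
```

Treating the dimensions as separate fields keeps them orthogonal: an item can be selected by any combination of dimensions without a fixed hierarchy among them.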
Apart from these dimensions, which serve as a means to describe and understand the multilayered topic "statistical metadata", other important aspects must be taken into consideration within the context of metadata management and the development of metadata systems.
When quality in statistics is discussed, in most cases the quality of data and statistical results is meant. In this context, a definition of quality as well as quality criteria have been elaborated, and many NSIs have introduced routines for quality reporting within their institutions.
Compared to data quality, the topic of "metadata quality" has received much less attention. In our opinion, the definition of quality criteria for metadata should become a central task of international working groups in the future.
Metadata management comprises organizational questions within an NSI (for instance: is there a central metadata unit? If yes: what are its tasks?), but also issues regarding the registration and administration of metadata items (for example access rights, stewardship, life-cycle status, locking of items while they are updated).
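The administration issues named above (stewardship, life-cycle status, locking of items during updates) can be made concrete with a small sketch. It is only an illustration under stated assumptions: the classes RegisteredItem and MetadataRegistry, the status values and the user names are hypothetical and do not describe an existing system at Statistics Austria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisteredItem:
    item_id: str
    content: str
    steward: str                      # person or unit responsible for the item
    status: str = "draft"             # life-cycle status (assumed values, e.g. "draft", "released")
    locked_by: Optional[str] = None   # user currently updating the item, if any


class MetadataRegistry:
    """In-memory registry; a real system would persist items and check access rights."""

    def __init__(self):
        self._items = {}

    def register(self, item):
        self._items[item.item_id] = item

    def get(self, item_id):
        return self._items[item_id]

    def lock(self, item_id, user):
        item = self._items[item_id]
        # Refuse the lock if another user already holds it.
        if item.locked_by not in (None, user):
            raise PermissionError(f"{item_id} is locked by {item.locked_by}")
        item.locked_by = user

    def update(self, item_id, user, new_content):
        item = self._items[item_id]
        # Only the holder of the lock may change the item.
        if item.locked_by != user:
            raise PermissionError(f"{user} must lock {item_id} before updating")
        item.content = new_content

    def unlock(self, item_id, user):
        item = self._items[item_id]
        if item.locked_by == user:
            item.locked_by = None


registry = MetadataRegistry()
registry.register(RegisteredItem("concept-1", "old definition", steward="unit A"))
registry.lock("concept-1", "alice")
registry.update("concept-1", "alice", "revised definition")
registry.unlock("concept-1", "alice")
```

An update attempted without holding the lock raises PermissionError, which is the behaviour the phrase "locking of items while they are updated" suggests.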
In the process of software development metadata play a decisive role. In order to produce software of high quality and in an economic way, the availability of tools - to support the management of "software metadata" (including the source code of the programs) and to provide services to alleviate the software engineers' work - has long been recognized as necessary. Especially when several programmers are cooperating in a software project, the storage and administration of all information items in a central repository seems indispensable.
The production of statistics exhibits a high degree of similarity to the production of software. However, in statistics the advantages offered by specialized tools and a centralized metadata repository are not yet generally accepted.
Numerous papers point out that the development of a long-term strategy forms a necessary and fundamental basis for the step-by-step realization of an integrated metadata system. The elaboration of a "construction plan" as a flexible and extendable architecture is cost-intensive and time-consuming, but it is also an investment in a stable foundation which will pay off in the future.
Some important general goals of a metadata strategy are:
- centralization of metadata;
- identification of "atomic" metadata items, their structure and their mutual relations, and decomposition of so far unstructured metadata into such elements;
- storage of these atomic and structured metadata items in databases;
- integration of isolated subsystems into a complete system;
- end-to-end processing.
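The goals of identifying atomic metadata items with their mutual relations and storing them in databases can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table layout, the column names and the sample rows are assumptions chosen for illustration and are not taken from the IMS design.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id          INTEGER PRIMARY KEY,
        object_type TEXT NOT NULL,   -- e.g. survey, concept, validation rule
        name        TEXT NOT NULL
    );
    CREATE TABLE relation (
        source_id INTEGER NOT NULL REFERENCES item(id),
        target_id INTEGER NOT NULL REFERENCES item(id),
        kind      TEXT NOT NULL      -- the relation is itself metadata
    );
""")

# Hypothetical atomic items and the relations between them.
conn.executemany(
    "INSERT INTO item (id, object_type, name) VALUES (?, ?, ?)",
    [(1, "survey", "example survey"),
     (2, "concept", "example concept"),
     (3, "validation rule", "example rule")],
)
conn.executemany(
    "INSERT INTO relation (source_id, target_id, kind) VALUES (?, ?, ?)",
    [(1, 2, "uses"), (3, 2, "checks")],
)


def related_to(conn, item_name):
    """Return (source name, relation kind) for every item pointing at item_name."""
    return conn.execute(
        """
        SELECT src.name, rel.kind
        FROM relation AS rel
        JOIN item AS src ON src.id = rel.source_id
        JOIN item AS tgt ON tgt.id = rel.target_id
        WHERE tgt.name = ?
        ORDER BY src.id
        """,
        (item_name,),
    ).fetchall()


print(related_to(conn, "example concept"))
```

Once items and relations are stored in this atomic form, a question such as "which surveys and rules depend on this concept?" becomes a simple query instead of a search through unstructured documents.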
3.2 Metadata used/created at each phase
In the "4-layer model" metadata are represented as an "infrastructure" layer accompanying the phases of statistical production; in every phase newly produced metadata are stored in the metadata systems and existing metadata are accessed and perhaps re-used. A higher degree of model detail concerning different types of metadata was not attempted, however.
3.3 Metadata relevant to other business processes
For the purpose of cost planning and controlling, SAP software is used.