Federal Statistical Office (Destatis), Germany
Total number of staff
(as of January 2007, 27.8% part time employees. Regional (Länder) offices are entirely independent organisations and have their own figures. There is no official report on the total number of people employed in official statistics in Germany.)
Contact person for metadata
Head of Section C304 - "Tools and Standards"
Please use the official template and send your inquiry to the attention of "Referat C304".
0049 (0) 611 / 75 - 1
Table of contents
- 1. Introduction (German Federal Statistical Office)
- 2. Modelling the Information and Processes of a Statistical Organization (German Federal Statistical Office)
- 3. Statistical Metadata in each phase of the Statistical Business Process (German Federal Statistical Office)
- 4. Statistical Metadata Systems (German Federal Statistical Office)
- 5. System and design issues (German Federal Statistical Office)
- 6. Organizational and workplace culture issues (German Federal Statistical Office)
- 7. Lessons learned (German Federal Statistical Office)
1.1 Metadata Strategy
Metadata management has been an issue in the statistical system in Germany for many years. As is perhaps typical of a federal system, solutions have been found and implemented in isolated areas, but they have not been coordinated through a common strategy. The current situation therefore resembles a "bottom-up" approach rather than a unified "top-down" solution.
The experience at Destatis and in the Verbund, however, shows that there is a strong need for a more coherent approach to handling metadata in the future. Several key projects in the Verbund - like standardisation of production or quality management - depend on standardised structures and concepts to understand the content of the different statistical activities in a coherent and uniform way. A metadata strategy would also help to provide a framework for the different projects.
Any future metadata strategy would need to be formulated in agreement with at least the most important stakeholders, and it would need to be approved by the responsible committees. Therefore, it is not likely to take shape and become formally adopted in the near future. In the recent past there were several projects - independently planned and implemented - that involved centralised metadata management. The task is to combine the projects in such a way that at least the outline of a common metadata strategy starts to emerge.
1.2 Current Situation
There are currently two major projects that involve centralised metadata management.
1. Census metadata
The Census 2011 in Germany is carried out by the Verbund. It is based on a method that combines administrative sources with survey data. To deal with content from so many different sources a strong metadata management is needed. Hence, a working group has been instituted to deal with metadata issues and especially with the development of an IT-system for the management of census metadata. The working group is staffed with methodologists, IT-experts and subject matter statisticians from some Länder offices and Destatis.
The Census 2011 is currently the most important project and certainly the most pressing issue for Destatis in terms of metadata. To deal with the complexity of the project, it has been broken down into several sub-projects for which business cases are being written.
2. SteP - Standardisation of Production
SteP is a joint initiative of the Verbund to standardise production. A major objective of SteP is the design and deployment of generic IT-tools as building blocks of a standardised IT-landscape. Although SteP currently deals predominantly with IT-issues, a stronger involvement of subject matter experts should strengthen its outreach in the future.
SteP is organised around a simple process model that names the basic processes, mainly in the collection and processing stages (see fig. 2). There are sub working groups (called "steps") dealing with individual aspects of the statistical value chain. A sub working group for metadata - called "step 12-metadata" - has recently been established.
The idea behind this sub working group was to develop a metadata portal. This web portal shall allow users to access the metadata stored in already existing IT-systems. Basically, every system that stores metadata could become part of this project. At first, the portal is only intended for internal users. Since there is no general metadata model that standardises and explains the content of the underlying systems, the results would only be confusing for any outside reader.
In general, SteP has so far been a successful project for Destatis and the Verbund. In several of the most urgent areas, production was streamlined and economies of scale could be exploited. There is now a centralised storage facility for finalised micro data (accessible only to Länder offices) and a data editing tool that can be integrated into existing environments. Apart from the metadata portal, important ongoing projects within SteP include a database for incoming data.
A drawback of SteP is that there is no underlying, generally accepted metadata model guiding the project. While data can be passed on along the production chain, metadata is left behind, resulting among other things in redundant storage of metadata. As the standardisation process continues, this could become a problem of greater concern. Internal users might be confused by different IT-systems each using a different structure and a different terminology. The interoperability of the systems might also suffer because a metadata model usually embodies a generalised understanding of the way statistical activities are structured.
The task for step 12-metadata is to find a way to harmonise the different IT-systems in a way that the metadata stored can first be accessed and understood by users and secondly be shared by all IT-systems along the value chain.
Apart from these major projects, there are several other activities that involve centralized metadata management issues at Destatis. There is for example a close cooperation between quality reporting and metadata management since they overlap in many ways.
2. Statistical metadata systems and the statistical business process
2.1 Statistical business process model
There are several business models in use in Destatis and the Verbund. The first is a Destatis business process model prepared by the administration department (fig.1). It highlights supporting and management functions that are not part of the METIS GSBPM-model. While management processes focus on strategic issues, support processes cover the functions needed to maintain operations. The core processes include the management of statistical activities and methodology development.
There is a different model in use in the Verbund that guides the SteP project (fig. 2). It has been approved by the heads of the statistical offices in the Verbund. Unlike the Destatis model, it focuses entirely on the core processes of statistical production.
The two models cover different aspects of the statistical life cycle but neither is comprehensive, leaving out especially design and to a lesser extent dissemination issues. When the focus of metadata management at Destatis shifted towards preparing the Census 2011, there was a need for a more comprehensive model. Since the Census 2011 is a new undertaking, especially the neglect of design issues in both models was felt as a drawback.
For purposes of the Census 2011, the metadata working group for the Census has therefore decided to use an adaptation of the METIS GSBPM-model as a starting point to capture the census metadata (fig. 3). If it is found useful in the Census, this model may be more broadly used. Obviously, the METIS GSBPM-model can be adapted to encompass all the elements that the previous two process models have. This should make it possible to profit from the work that went into these models while elements neglected so far can also be included.
Given that most of the SteP-processes are also covered by the census model, it is possible to draw a comparison between the two models, as presented below (fig. 4). (There is no comparison between the METIS GSBPM-model and our census model presented here since they differ only marginally.)
In the near future a working group of SteP will publish a handbook on the standardisation of statistical production which is entirely based on the GSBPM. In 2011 it is expected that Destatis and the German Verbund will adopt the GSBPM without any changes in the two top levels and use it in several areas.
2.2 Current system(s)
2.2.1. GENESIS (in use)
GENESIS is a cube database used in the Verbund by many statistical offices. It is based on an extensive data and metadata model and handles its metadata internally. First drafts of the system date back to 1994. At that time, GENESIS was intended as a data warehousing solution mainly to store macro data for internal purposes. Although it is also used in this way, its main purpose has become to serve as a dissemination database to internet users (since 2002).
In many ways GENESIS overturned existing habits of disseminating data at Destatis and in the Verbund when it was introduced. The cube model along with the standardised metadata entry forced a new way of thinking onto subject matter statisticians. Constrained by organisational issues - especially coordination in the Verbund - and legacy IT-systems, it often stretched the resources of the subject matter departments and the central coordination unit. Despite the age of the design, it is only now that its full potential is being realised. Especially in combination with the centralised micro data storage built by SteP, it is possible to populate the cubes faster and hence build larger cubes and publish faster. GENESIS is integrated into the web pages of the offices in the Verbund. At Destatis it is linked to the press releases so that interested users can search for additional data.
GENESIS was implemented using a programming language called Natural and a pre-relational database technology named ADABAS. ADABAS was first introduced in 1971 and - with many updates - is still heavily used in legacy software at public institutions in Germany.
GENESIS has several clones, with each office having its own database. There is also a GENESIS clone with nationwide data at a regional level. The GENESIS model itself has over the years proven its worth as a data and metadata model for a dissemination database. One Land office (Rheinland-Pfalz) deployed a new dissemination database a few years ago which is based on the same GENESIS model while using relational technology.
2.2.2. RDC-Metadata system (being populated)
With the establishment of research data centres (RDCs), statistical offices in the Verbund began to realise the need for a database holding metadata that could explain the content of the research data files to the researchers. The decision was made to expand the metadata part of the existing GENESIS-system. The result was a metadata system that contains information especially on the level of individual data files, on the level of statistical activities and on the level of variables.
Each variable ought to be entered only once and is then tied to the data file and thereby to the statistical activity it is used in. To avoid the duplicated entry of variables with different names but similar content, an editorial team reviews each variable individually. This basically follows the same idea that was employed in the GENESIS database.
It is interesting to compare the (meta-) data model of the RDC-Metadata system (essentially an expanded GENESIS model) with the other models like the Neuchâtel model or ISO 11179. In some ways they are similar, but the idea of a conceptual variable or of an ISO 11179 data element scheme does not exist. Therefore, variable definitions have to be harmonised at a very low level. The variable is modelled as an object with a definition and a value domain. Categorical variables have their categories (called value domain items in Neuchâtel speak) as objects of their own. The value domain is not modelled as an object on its own. Therefore, variations in the value domain of a variable necessitate the entry of a new variable. As a result, the number of variables rises and the system today stores 5,600 variables for the micro data files of 33 statistical activities.
Nevertheless, the RDC-Metadata system has been successfully implemented and is popular with researchers using the data centres. Since it is not possible to access the research data files via the internet, any prior information about their content is welcome. The system is not yet fully populated as metadata exists only for 35 of the planned 60 statistical activities.
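The restriction noted above, that the value domain is not modelled as an object of its own, can be illustrated with a minimal sketch. The class and field names are hypothetical and are not taken from the RDC-Metadata system; the sketch only shows why two variables that share a definition but differ in their categories end up as two separate entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """A category of a categorical variable (an object of its own)."""
    code: str
    label: str


@dataclass(frozen=True)
class Variable:
    """Hypothetical variable record: the value domain is embedded, not shared."""
    name: str
    definition: str
    categories: tuple[Category, ...]  # value domain stored inline


# Same definition, but the second data file uses a coarser value domain.
marital_a = Variable(
    name="marital_status",
    definition="Legal marital status of the person",
    categories=(
        Category("1", "single"),
        Category("2", "married"),
        Category("3", "widowed"),
        Category("4", "divorced"),
    ),
)
marital_b = Variable(
    name="marital_status",
    definition="Legal marital status of the person",
    categories=(Category("1", "single"), Category("2", "not single")),
)

# Because the value domain is part of the variable, the two records differ
# and both must be stored, although they share one definition.
stored_variables = {marital_a, marital_b}
distinct_definitions = {v.definition for v in stored_variables}
```

In a model with a separate conceptual variable, the shared definition would be entered once and the two value domains attached to it, which is the harmonisation step the text describes as missing.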
2.2.3. Output oriented metadata system (intended)
The success of the RDC-Metadata system quickly led to a decision to use the same system not only for the RDC-relevant statistical activities but to apply it across the board. The result is the idea of an "output oriented metadata system". Although a business case does not exist, it could be used to document the metadata of the finalized micro data files. A cost analysis of this project still has to be undertaken, but from what can be said today, it seems unlikely that the original idea of harmonising variables at such a very detailed level can be realized by way of an editorial team. With already 5,600 variables for the metadata of 33 statistical activities in the RDC-system, the figure is likely to rise significantly when variables for up to 390 statistical activities have to be stored. In any case, the number of variables stored in the system will most likely be too high to harmonise the variables by comparing them one to one.
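The scale of the problem can be made concrete with a rough calculation. The sketch below uses the figures given in the text and assumes, purely for illustration, that the number of variables grows linearly with the number of statistical activities; the real growth may be flatter if variables are reused across activities.

```python
from math import comb

variables_now = 5_600      # variables currently stored (from the text)
activities_now = 33        # statistical activities covered (from the text)
activities_target = 390    # upper bound of activities named in the text

# Assumption: the number of variables grows linearly with the activities.
per_activity = variables_now / activities_now
variables_projected = round(per_activity * activities_target)

# A one-to-one comparison means checking every unordered pair of variables.
pairs_now = comb(variables_now, 2)
pairs_projected = comb(variables_projected, 2)

print(f"variables per activity: {per_activity:.0f}")
print(f"projected variables:    {variables_projected:,}")
print(f"pairs to compare now:   {pairs_now:,}")
print(f"pairs to compare later: {pairs_projected:,}")
```

Under this assumption the number of pairwise comparisons grows quadratically, from roughly 16 million to more than 2 billion, which supports the conclusion that an editorial team cannot harmonise the variables one by one.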
2.2.4. Statistikdatenbank (in use, being reengineered)
The Statistikdatenbank stores metadata for all statistical activities at a very high level. It exists currently in the form of two MS-Access databases. One is used to maintain the central catalogue of all statistical activities (called EVAS - Einheitliches Verzeichnis aller Statistiken) of the Verbund. The second one is used for management purposes, containing basic information on methodology, legal background, etc. The reengineering will combine this information in a single application that will allow accessing and querying the information via the internal web portal of the Verbund. As a result, general information on all statistical activities will be visible to all users in the Verbund.
In the course of its further development, the Statistikdatenbank will become a central hub for the management of statistical activities at Destatis and in the Verbund. Every new statistical activity will first have to be registered in the Statistikdatenbank and is then identifiable by its unique EVAS-code (registration meant as a business process, not necessarily in a strict IT-sense). The Statistikdatenbank can easily be amended and combined with other metadata storages at Destatis that use the same EVAS-catalogue. For example, it is planned to integrate the quality reports directly into this application. Quality reports contain partially overlapping information but are currently stored as single text files written according to a given template. It is conceivable that other EVAS-based systems - like the database used to compile Destatis' Strategy and Programme Plan or internal accounting databases - will also be loosely attached or linked to the Statistikdatenbank.
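The hub role described above can be sketched as a small registry in which every activity is registered once under its EVAS-code and other systems attach their own references to that code. The class, method names, system names and the code value used below are hypothetical placeholders, not the design of the Statistikdatenbank.

```python
class ActivityRegistry:
    """Hypothetical registry of statistical activities keyed by EVAS-code."""

    def __init__(self):
        self._titles = {}   # EVAS-code -> title of the statistical activity
        self._links = {}    # EVAS-code -> {system name: reference}

    def register(self, evas_code, title):
        """Register a new activity; each EVAS-code may be used only once."""
        if evas_code in self._titles:
            raise ValueError(f"EVAS-code {evas_code} is already registered")
        self._titles[evas_code] = title
        self._links[evas_code] = {}

    def attach(self, evas_code, system, reference):
        """Link a record held in another EVAS-based system to the activity."""
        if evas_code not in self._titles:
            raise KeyError(f"unknown EVAS-code {evas_code}")
        self._links[evas_code][system] = reference

    def lookup(self, evas_code):
        """Return the title and all attached references for one activity."""
        return {
            "title": self._titles[evas_code],
            "links": dict(self._links[evas_code]),
        }


registry = ActivityRegistry()
registry.register("00001", "Example activity")            # placeholder code
registry.attach("00001", "quality_reports", "qr/00001")   # placeholder link
entry = registry.lookup("00001")
```

Refusing to attach anything to an unregistered code mirrors the rule that registration comes first, here in the IT sense only; the text stresses that registration is primarily a business process.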
2.2.5. KlassService (in use, being reengineered)
KlassService is a tool developed by the Bavarian State Office for Statistics and Data Processing. It is used to classify and code answers entered in free text fields in questionnaires. It currently houses only two classifications (the German NACE and PRODCOM versions). Since the administration of standard classifications is under the responsibility of Destatis rather than the Länder, the classifications and the additional thesaurus are maintained by Destatis using a web interface. KlassService has also been declared a standard IT-tool under the SteP guidelines. As such, it is used to support the classifying and coding of responses in many offices of the Verbund.
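Coding a free-text answer against a thesaurus can be sketched as a normalised lookup. This is only an illustration of the general idea, not of how KlassService works internally; the function names and the two thesaurus entries are assumptions made for the example.

```python
import unicodedata


def normalise(text):
    """Lower-case, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.split())


# Illustrative thesaurus entries mapping search terms to a classification code.
RAW_THESAURUS = {
    "Bäckerei": "10.71",
    "bakery": "10.71",
}
THESAURUS = {normalise(term): code for term, code in RAW_THESAURUS.items()}


def code_answer(answer):
    """Return the classification code for a free-text answer, or None."""
    return THESAURUS.get(normalise(answer))
```

Answers that cannot be matched return `None` and would be left for manual coding.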
KlassService was built using an ADABAS database and is now considered a legacy system. Because of rising maintenance costs, the Bavarian State Office expressed the wish to move to relational technology. At the same time, Destatis' classification department was making plans to build a comprehensive classification server. The classification department had previously advised the Turkish National Statistical Institute on the design of such a system.
As a result of these initiatives, a business case was drafted that involved a redesign of the old KlassService in three successive stages. The first stage basically consists of the database itself and basic import functionalities. The succeeding stages will focus amongst other things on the user interfaces. The first stage is being carried out by the Bavarian State Office for Statistics and Data Processing. The later stages will be put out to tender in the Verbund.
The new KlassService will bear little resemblance to the old system. It will be based on the Neuchâtel Terminology, Part I, which will only be slightly altered to fit the relational technology employed. Web service functionalities enable connections to other databases and IT-tools (namely to other metadata systems). The system will also be designed to support multiple language versions of the classifications.
2.2.6. Census metadata system (being drafted)
According to a decision made by the heads of the offices of the Verbund, a separate metadata system has to be developed for the Census 2011. The system will be of modular design and so several drafts for individual business cases have to be written. Some of the applications could possibly remain in use after the census has been completed and - if applicable - be employed in other statistical activities as well. A decision on the implementation will be made in collaboration with the IT-working group for the Census.
Issues of census metadata management include:
2.2.6.a Database for methodology documents and other documentation
It is standard practice among statisticians to deliver most of the documentation in written form. A sophisticated methodology, the need for coordination between many parties involved and a very intense preparation phase lead to an enormous amount of text files being written for the Census 2011. However, there is currently no tool to store such documentation in a structured way in the Verbund. In order not to change existing work practices, the first measure to be taken for the Census 2011 will be a relatively simple document management system. It will be structured according to the Census process model (fig. 3), requiring statisticians to deliver documentation in the form of a text file for each applicable phase on level 2 of the model (see also 3.2.). The documents provided will largely be existing documentation, restructured according to the process model.
2.2.6.b Database for variables, value domains and statistical units
To move the documentation of variables for the Census 2011 from written text files to a more accessible and regularly updated form, a database for variables will have to be realised. The draft for this application is currently being written. It will be based on the Neuchâtel Terminology Model (Part II, Variables and related objects).
2.2.6.c Database for matrix and processes
To fully document the statistical data collected, processed, analysed and disseminated in the 2011 census, the different data holdings ought to be documented. According to the current plan, this will either be an extension of the variable database or an independent system. A separate draft will be written for this part, but it will also be based on the Neuchâtel Terminology Model (Part II).
2.2.6.d Connection to production tools
To realize the potential of metadata and reduce duplicated entries, the metadata system ought to be connected to production tools. A draft will be written to explain the connections between the systems and how a coupling of the different tools can be realized.
Several standard classifications will be used in the census. Since these classifications should be stored in the KlassService database, avoiding duplicated entry, these systems must be linked in some way. A draft will be prepared to explain the connections between the systems.
2.2.7. .BASE - Common IT-Applications for Statistical Surveys (in use)
.BASE (Basis Anwendungen fuer Statistische Erhebungen) is the umbrella name for several IT-tools - developed for the Verbund - to support a standardised e-workflow and forms an important part of the SteP-project. Some of the .BASE tools - notably a data editing tool - are metadata driven and load their metadata from a central storage named "survey database".
The survey database registers every survey in the Verbund. A statistical activity may consist of one or more individual surveys. For every survey, several resources can be uploaded and accessed in the survey database. Apart from text files and other documentation, several XML-files containing metadata to drive production processes can be stored. These XML-files contain for example registered variables and executable code to drive data editing processes in different IT-environments.
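How metadata from such a file can drive a data editing step may be sketched as follows. The XML layout, element names and attributes are invented for this example and do not reflect the actual .BASE file format; the sketch also uses simple range rules instead of executable code.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata file: registered variables with simple range rules.
SURVEY_XML = """
<survey id="example-survey">
  <variable name="age" min="0" max="120"/>
  <variable name="employees" min="0" max="100000"/>
</survey>
""".strip()


def load_rules(xml_text):
    """Read the variable definitions and return {name: (min, max)}."""
    root = ET.fromstring(xml_text)
    return {
        var.get("name"): (int(var.get("min")), int(var.get("max")))
        for var in root.iter("variable")
    }


def edit_record(record, rules):
    """Return the names of variables that are missing or out of range."""
    failures = []
    for name, (low, high) in rules.items():
        value = record.get(name)
        if value is None or not low <= value <= high:
            failures.append(name)
    return failures


rules = load_rules(SURVEY_XML)
```

The editing function contains no survey-specific logic; changing the metadata file changes the checks, which is the sense in which a tool is metadata driven.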
The metadata in the survey database is clearly on a technical level. In the terminology of the Neuchâtel model the variables are on a level lower than the conceptual level. It is obvious that the survey database would therefore provide an almost perfect vehicle to transport conceptual metadata (being stored in classification servers or variable databases) into the production process. This would link conceptual and production metadata and - together with a powerful data warehouse at the end of the statistical value chain (see GENESIS above) - would almost finalise a metadata driven production process (provided other IT-tools would use the survey database as well).
To that end, however, several steps will have to be taken beforehand. The survey database was not designed with international metadata standards in mind. For obvious reasons, the focus of the designers was to connect production tools to the database. Given the myriad ways to design a statistical activity, it is unsurprising that the definition of the term "survey" remains somewhat ambiguous. This is currently not a problem for the existing .BASE tools but it will become more of a problem when the survey database is connected to more production tools and other metadata storages (like classification servers or variable databases). Therefore, an overarching metadata model and a standardised terminology are needed to integrate additional production and metadata systems, to facilitate the interoperability of the SteP-tools and thus to ensure the overall success of the SteP project.
2.3 Costs and Benefits
Metadata management contributes directly to the realisation of major objectives in Destatis' corporate strategy. It enables the further standardization of processes, the harmonisation of statistics and the monitoring of data quality (see corporate strategy). Metadata systems help with the documentation of surveys. To ensure that the public trusts in the data which Destatis and the Verbund produce and to be able to claim that the data has been compiled according to an appropriate methodology, a good documentation is indispensable. With central metadata systems in place, duplicated entry of metadata becomes unnecessary, it will be possible to share information easily, to drive production systems and to keep internal and external users informed about the statistical activities. A metadata model that allows for the correct representation of the metadata of all statistical activities can itself be a powerful tool in the standardisation of business processes and IT-systems because it represents a common structure for all statistical activities.
2.4 Implementation strategy
Since several IT-systems that run on metadata are already in place and given the complexity of the issue, we have decided in favour of a stepwise implementation strategy. The Statistikdatenbank and the new KlassService are the systems that will become operational first while a variable database and a tool for managing textual documentation (both part of the Census 2011-project) are next in line. A detailed project management is in place in the Census project. The development of KlassService is managed by the Bavarian State Office for Statistics and Data Processing. The progress of the Statistikdatenbank is dependent on the resources of Destatis' IT-department.
The major design work on the Census 2011-related systems will have to be finished in the first half of 2010. As the census will be conducted in May 2011, maintenance and user support for the systems could have become a major task by then. After 2011, attention might turn to generalising the lessons learned and broadening metadata management, with involvement in SteP gaining in importance.
3. Metadata in each phase of the statistical business process
3.1 Metadata Classification
The RDC-metadata system uses a classification that distinguishes between semantic, technical and administrative metadata. Semantic metadata include definitions of variables and other definitions as well as all kinds of methodological documentation. Technical metadata define metadata on the level of record types. Administrative metadata is mainly information about the responsible persons and institutions.
The RDC-classification reflects the need to classify metadata according to different levels of abstraction. Although the RDC-system does not have a separate conceptual level in the sense of the Neuchâtel-model, the term semantic metadata can be seen as synonymous with conceptual metadata. In the future we might need to supplement this classification with a contextual level functioning as a mediating level between the conceptual and technical levels.
With the broadening of Destatis' approach to metadata and the involvement in the Census 2011, it soon became clear that additional classifications had to be introduced to reflect a stronger focus on the statistical process chain. However, it also became clear that there were endless possibilities to structure and classify metadata. Initial experiments with the proposed CMF-classification were made before we realised that Sundgren (2008)** was right to state that multiple linear classifications of metadata exist and that each of them serves a purpose. Roughly following Sundgren, we decided to classify the metadata by form into structured, semi-structured and unstructured metadata. Structured metadata is metadata that exists in metadata systems being structured according to some information or metadata model. Semi-structured metadata exists in the form of written text in a linear order where each text file is the instance of some given template. Typical examples of this kind of metadata are quality reports (or this case study). Unstructured metadata basically consists of text files (methodological documents, etc.) that are structured only on the basis of the author's needs and taste. This classification works fairly well in the census, where most of the metadata is of the unstructured kind.
In addition to using proper classifications*** we also distinguish metadata according to user groups and according to attachment objects (like statistical activity or statistical activity instance). These distinctions do not constitute classifications in the strict sense, because - as of yet - we have neither an exhaustive list of user groups nor an exhaustive list of attachment objects (the latter being the same as an overarching exhaustive metadata model). However, we do classify metadata according to the processes that use or produce these metadata thereby using the process model as a classification.
Apart from classification, the terms quality metadata and production metadata are used in the office. Quality metadata refers to an after-the-fact interpretation of metadata and applies to all metadata that is deemed important for evaluating data quality. Since such an evaluation must be based on existing metadata (frequently called documentation in this context), the degree to which such metadata exists is itself an important quality indicator and part of quality metadata. Production metadata is a term often heard in connection with software development, indicating metadata used to execute and control (sub-)processes in the production of statistical data.
** Sundgren, Bo (2008): Classifications of Statistical Metadata. Paper presented at the Joint UNECE/Eurostat/OECD work session on statistical metadata (METIS), Luxembourg, 9-11 April 2008.
*** Classification meant as a list of mutually exclusive categories that exhaustively classifies each object within its scope according to some explicit or implicit criteria
3.2 Metadata used/created at each phase
So far, no process model has been used at Destatis to guide the collection of metadata across all statistical activities. Within the census, however, our adapted version of the METIS GSBPM-model will be used in this way (see 2.1). For each sub process we have established a set of metadata objects for documentation. Each documentation object can be structured, semi-structured or unstructured. Variables (as a general concept, including all object types in the Neuchâtel model), statistical units and rules for generating variables are seen as structured documentation objects. Other objects are essentially text documents, to be delivered as PDF, Word or Excel files. So far, there are 41 of these textual documentation objects. Some of them follow relatively straightforwardly from their respective processes. This is the case with drafts for new statistical laws (1.4 in the process model, there are individual laws for most statistical activities), business cases for IT-systems (1.4), technical specifications for IT-systems from the client's side (1.5) and technical specifications (plus handbooks) for IT-systems from the developer's side (2.1). In other cases, more general documentation objects were requested, like "description of output" which could be any document detailing the planned products for the census (1.1). Important aspects for the assessment of data quality were covered by documentation objects on the sub-processes coding (4.3), data editing (4.4) and imputing missing values (4.5), with one object elaborating on the intended procedure of the respective processes and one consisting of after-the-fact documentation (including quantitative information like the number of edited records).
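Using the process model as a checklist for documentation can be sketched as a mapping from sub-process numbers to the documentation objects expected for them. The sub-process numbers and object names follow the examples in the text; the data structure, the function name and the decision to treat the textual objects as unstructured are assumptions made for this illustration.

```python
from enum import Enum


class Form(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Sub-process number -> expected documentation objects and their form.
# Assumption: the textual objects are treated as unstructured here.
EXPECTED = {
    "1.1": {"description of output": Form.UNSTRUCTURED},
    "1.4": {
        "draft statistical law": Form.UNSTRUCTURED,
        "business case for IT-system": Form.UNSTRUCTURED,
    },
    "4.4": {
        "data editing: intended procedure": Form.UNSTRUCTURED,
        "data editing: after-the-fact documentation": Form.UNSTRUCTURED,
    },
}


def missing_objects(delivered):
    """Return, per sub-process, the expected objects not yet delivered."""
    gaps = {}
    for sub_process, expected in EXPECTED.items():
        absent = sorted(set(expected) - delivered.get(sub_process, set()))
        if absent:
            gaps[sub_process] = absent
    return gaps


delivered = {
    "1.1": {"description of output"},
    "1.4": {"draft statistical law"},
}
gaps = missing_objects(delivered)
```

A report of this kind would also yield a simple indicator of how complete the documentation is, which the text identifies as part of quality metadata.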
3.3 Metadata relevant to other business processes
In general, all metadata collected along the core process chain is also relevant to other business processes, albeit often on a more condensed level. The Destatis process model (featured in 2.1 as the first model) details these other business processes that are not always part of the METIS GSBPM-model.
The processes that need more detailed metadata are "management of statistics (statistical activities)", "methodology development" and (not mentioned) "quality and metadata management". Apart from the core processes, management and support processes also need metadata, although mostly either in a very general form or very detailed according to specific requests. To deal with this issue, the Statistikdatenbank will be made available to more users with the possibility to link to budget and accounting systems or other resource planning software.
4. System and design issues
4.1 IT Architecture
Given that the Verbund - explicitly and implicitly - follows a step-by-step approach, no single, all-encompassing metadata system exists or will exist in the future. Instead, the metadata architecture will consist of different independent systems. Each system will be developed on its own and will therefore have its own IT-architecture, both in terms of data model and business architecture.
To allow internal and (at a later stage) external users to access the metadata stored in the existing metadata repositories (including various applications that in one way or another store metadata) a web portal will be set up (see 1.2 "metadata portal"). In order to connect to the portal, each of the participating applications will need to have web service functions.
However, when an assessment of prospective underlying systems began, it soon became clear that establishing a technical connection between the systems and the portal was not the decisive issue. Instead, it turned out that the systems had different ideas on how to structure metadata. Not only were the formats and data models unique, each system also had its own terminology. Sometimes the same term meant different things in different systems. While this was acceptable or even desired with respect to the different tasks of the systems, it certainly complicated interaction. Hence, no meaningful presentation of the content could be given without a shared metadata model and a common terminology to name and identify the metadata.
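The terminology problem can be illustrated with a small crosswalk that maps each system's local term to a shared concept. This is only a sketch: the system names, terms and concept identifiers below are hypothetical and are not taken from any Destatis or Verbund system.

```python
from typing import Dict, List, Tuple

# (system, local term) -> shared concept identifier.
# All entries are hypothetical examples, not real system vocabulary.
CROSSWALK: Dict[Tuple[str, str], str] = {
    ("system_a", "unit"): "statistical_unit",
    ("system_b", "unit"): "unit_of_measure",
    ("system_b", "respondent"): "statistical_unit",
}


def to_shared(system: str, term: str) -> str:
    """Translate a system-specific term into the shared concept."""
    key = (system, term.lower())
    if key not in CROSSWALK:
        raise KeyError(f"no shared concept mapped for {term!r} in {system!r}")
    return CROSSWALK[key]


def local_terms(concept: str) -> List[Tuple[str, str]]:
    """List every (system, term) pair that denotes the given shared concept."""
    return sorted(key for key, value in CROSSWALK.items() if value == concept)
```

In this sketch the same local term ("unit") resolves to two different shared concepts depending on the system, while two different local terms resolve to the same concept. A portal would need this kind of mapping before it could present content from several systems side by side.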
In order to arrive at such an overarching model several international metadata models were reviewed. However, with the Census 2011 having become the most important issue, the insights gained will first find application in the metadata system for the census. Within step 12 - metadata (see 1.), a possible overarching model as a solution for the interaction of the different systems still has to be discussed. Meanwhile, a first step towards a metadata portal will be to make Statistikdatenbank accessible via web services to internal users of the Verbund.
With respect to the census, the work began with choosing a metadata model that would fit the requirements of the census. The census is based on existing register information that will be combined with multiple surveys for data not covered by the registers. As a result, there are different data sources to be described, which often contain slightly different variables. Therefore, the task of a comprehensive metadata management is to track slight changes - for example in the meaning of terms or in the coding of enumerated variables - while reusing existing information to a large extent.
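As a minimal sketch of what tracking such slight changes could look like, the following compares the code list of an enumerated variable in two data sources and reports which codes can be reused as they are and which differ. The function name, the result structure and the example codes are assumptions made for illustration; they do not describe the census metadata system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CodeListDiff:
    reused: List[str]                     # codes identical in both sources
    added: Dict[str, str]                 # codes only in the second source
    removed: Dict[str, str]               # codes only in the first source
    relabelled: Dict[str, Tuple[str, str]]  # code -> (old label, new label)


def diff_code_lists(base: Dict[str, str], other: Dict[str, str]) -> CodeListDiff:
    """Compare two code lists given as code -> label mappings."""
    shared = base.keys() & other.keys()
    return CodeListDiff(
        reused=sorted(c for c in shared if base[c] == other[c]),
        added={c: other[c] for c in sorted(other.keys() - base.keys())},
        removed={c: base[c] for c in sorted(base.keys() - other.keys())},
        relabelled={c: (base[c], other[c])
                    for c in sorted(shared) if base[c] != other[c]},
    )


# Hypothetical example: the same variable as coded in a register and in a survey.
register_codes = {"1": "single", "2": "married", "3": "widowed"}
survey_codes = {"1": "single", "2": "married or in partnership", "4": "divorced"}
diff = diff_code_lists(register_codes, survey_codes)
```

Only the entries listed under `added`, `removed` and `relabelled` would need new documentation; everything under `reused` could point to the existing definition.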
After reviewing the Neuchâtel model, the responsible working group decided that it best fitted the requirements. The object types in the model appear to represent nearly all conceivable meta-information needed to describe data in different settings. The Neuchâtel model therefore offers the possibility to integrate our existing systems (mainly .BASE and GENESIS) as well as systems that are currently being realised (KlassService and Statistikdatenbank). A high level overview exists that explains the conceptual connections between the systems (see below). It is based on the METASTAT@FSO design developed by the Swiss Federal Statistical Office which implements the Neuchâtel terminology to a large extent.
KlassService and Statistikdatenbank are not part of the census funded systems but will play a role in the management of the census metadata and are therefore part of the model. Not represented is a tool for document management that is being developed as part of the census. The centrepiece of the model is a variable server that is currently being drafted (see also 2.2).
4.2 Metadata Management Tools
Links between data and metadata exist in the .BASE-system. As part of this system, there is a tool for defining data editing rules on the basis of pre-defined metadata, namely variables and their value domains, which are stored in a separate repository (survey database). Reuse of metadata is encouraged by allowing users to share their variables. Before a user can delete or change a variable, all other users of this variable are asked to agree.
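The consent rule for shared variables can be sketched as follows. This is a toy model under stated assumptions: the class, its methods and the way approvals are passed in are hypothetical and do not reflect the actual .BASE implementation.

```python
from typing import Dict, Iterable, Set, Tuple


class VariableRepository:
    """Toy repository of shared variables (hypothetical, not the .BASE design)."""

    def __init__(self) -> None:
        self._domains: Dict[str, Tuple[str, ...]] = {}
        self._users: Dict[str, Set[str]] = {}

    def register(self, name: str, value_domain: Iterable[str], user: str) -> None:
        if name in self._domains:
            raise ValueError(f"variable {name!r} already exists")
        self._domains[name] = tuple(value_domain)
        self._users[name] = {user}

    def share(self, name: str, user: str) -> None:
        """Record that another user reuses an existing variable."""
        self._users[name].add(user)

    def _require_consent(self, name: str, requested_by: str,
                         approvals: Iterable[str]) -> None:
        # Every user of the variable other than the requester must have agreed.
        missing = self._users[name] - {requested_by} - set(approvals)
        if missing:
            raise PermissionError(f"approval missing from: {sorted(missing)}")

    def change_domain(self, name: str, value_domain: Iterable[str],
                      requested_by: str, approvals: Iterable[str]) -> None:
        self._require_consent(name, requested_by, approvals)
        self._domains[name] = tuple(value_domain)

    def delete(self, name: str, requested_by: str,
               approvals: Iterable[str]) -> None:
        self._require_consent(name, requested_by, approvals)
        del self._domains[name]
        del self._users[name]

    def __contains__(self, name: str) -> bool:
        return name in self._domains
```

A variable that is used by only one person can be changed or deleted freely; as soon as it is shared, the change is blocked until all other users have approved it.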
The GENESIS cube database also features a metadata repository. In contrast to the .BASE-system, GENESIS stores its metadata internally. Before a cube can be created, its variables have to be defined. Reuse of existing metadata is facilitated by an editorial team that checks each variable individually. There are currently about 1,300 active cube variables in the system at Destatis (for a total of 180 statistical activities) - a figure that has proven to be manageable.
4.3 Standards and formats
So far, mainly internal and national standards are applied. The .BASE-system runs on several nationally designed XML-formats. DatML/SDF (Survey Definition Format) describes the survey (esp. the variables). DatML/EDT holds the metadata that defines the data editing rules. DatML/ASK is metadata to set up electronic questionnaires.
GENESIS has its own database model, which can be seen as a national standard since most Länder offices use either the GENESIS database or a database based on the model. GENESIS is used to send data and metadata to Eurostat using the SDMX standard.
The reengineered KlassService will be based on the Neuchâtel model, Part I.
There is no standard format used in the Statistikdatenbank. Nevertheless, the understanding of the term statistical activity ("Statistik" in German) is much the same as in the Neuchâtel model. Therefore, Statistikdatenbank can interact with other systems within a distributed metadata system that allocates different functions to different systems according to the Neuchâtel model.
4.4 Version control and revisions
In theory, there seem to be two different ways to version metadata. One is to attach validity periods (valid from, valid until) to metadata objects. This is done in nearly all databases. The other is to create additional object types for the versions of a metadata object type (instantiations). In this case, one object type holds the general information on an object, and a second object type captures the list of versions that record the changes to an instance of the first object type over time.
Validity periods are used in the .BASE-system, for example to identify active surveys. Inactive surveys are disposed of, if not archived (archiving functionality planned).
Instantiations, i.e. separate object types for versions, are used in the KlassService where - following the Neuchâtel model - the classification versions are an object type of their own. General information (that does not change over time) about classifications is captured by the object type classification.
Instantiations have also been introduced in the RDC-metadata system where each statistical activity has a list of statistical activity instances capturing the individual features of each successive survey.
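The two versioning approaches described in this section can be contrasted in a short sketch. The class and attribute names are illustrative assumptions, loosely following the wording above; they are not the data models of .BASE, KlassService or the RDC-metadata system, and the dates are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


# Approach 1: validity periods attached directly to the metadata object.
@dataclass
class Survey:
    name: str
    valid_from: date
    valid_until: Optional[date] = None  # None means "still valid"

    def is_active(self, on: date) -> bool:
        return self.valid_from <= on and (
            self.valid_until is None or on <= self.valid_until
        )


# Approach 2: versions are an object type of their own.
@dataclass
class ClassificationVersion:
    label: str
    valid_from: date


@dataclass
class Classification:
    name: str  # general information that does not change over time
    versions: List[ClassificationVersion] = field(default_factory=list)

    def version_on(self, on: date) -> Optional[ClassificationVersion]:
        """Return the latest version that had come into force by the given date."""
        candidates = [v for v in self.versions if v.valid_from <= on]
        return max(candidates, key=lambda v: v.valid_from, default=None)


# Hypothetical usage with invented names and dates.
survey = Survey("Example survey", date(2005, 1, 1), date(2008, 12, 31))
classification = Classification(
    "Example classification",
    [
        ClassificationVersion("Edition A", date(2003, 1, 1)),
        ClassificationVersion("Edition B", date(2008, 1, 1)),
    ],
)
```

With validity periods, an object is simply active or inactive on a given date. With separate version objects, the parent object stays stable while each version carries its own description, which makes it possible to document what changed from one version to the next.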
4.5 Outsourcing versus in-house development
There is a combination of in-house development, Verbund development and outsourcing (see list below).
- GENESIS has been developed as a Verbund project, with the programming work being shared by several offices.
- The RDC-metadata system and the projected output oriented metadata system are spin-offs from GENESIS.
- .BASE was an outsourced development with substantial input to the business case by Destatis' IT-department.
- KlassService redesign is a Verbund project. It is being developed by the Bavarian State Office for Statistics and Data Processing as was the original KlassService.
- Statistikdatenbank is a Destatis project carried out as an in-house development.
- All metadata systems to be developed for the census will be outsourced under the general rules laid out for the IT-development of the census.
4.6 Sharing software components of tools
A major problem in software sharing is language. In most systems the user interfaces are in German only. Perhaps more importantly, few of our systems allow content to be stored in more than one language. Two exceptions are the redesigned KlassService, which supports any number of languages, and GENESIS, which supports English content as well as German.
So far, there has not been any case of software sharing between Destatis or the Verbund and any external partner. It is not impossible, however. Most of the IT-systems are owned either by Destatis or by the Verbund. Any prospective effort to share IT-technology will have to be reviewed by the responsible committees.
4.7 Additional materials / Links to additional materials such as data models.
On request (mostly in German).
5. Organizational and workplace culture issues
5.1 Overview of roles and responsibilities
A central metadata unit has been established. Currently the central metadata unit works on the coordination of the different parts in the organisation(s) that deal with metadata.
Metadata used by Destatis' output database (GENESIS) is coordinated by the dissemination department, where an editorial team reviews the cube descriptions. Similar solutions are used at the Länder offices.
Metadata that describes public and scientific use files is stored in the RDC-metadata system. The RDC-system is jointly maintained by Destatis' RDC and the RDCs of the Länder offices.
Production metadata is maintained by the subject matter departments. The design of the .BASE-system provides for different roles.
5.2 Metadata management team
The central metadata unit currently consists of 3 people and was established in 2007. As mentioned, there are two working groups in the Verbund, which are responsible for coordinating metadata related activities in the Verbund. One is step 12 - metadata as part of SteP (introduced 2008). This group is responsible for the general coordination of metadata. The other is the Census working group responsible for dealing with census issues, which started work at the end of 2007.
Details of roles and responsibilities for the new systems are still being worked on.
5.3 Training and knowledge management
There are no courses entirely dedicated to metadata management. Metadata issues are covered by courses on the GENESIS cube database. The various RDCs ran courses to introduce internal users to the RDC-metadata system. At the beginning of 2009 this work was passed on to Destatis' metadata unit.
5.4 Partnerships and cooperation
Destatis participates in international METIS meetings. In addition, there has been knowledge sharing with other statistical offices. We consulted the Swiss Federal Statistical Office on general issues of metadata management (esp. implementation of the Neuchâtel model and its application in the census). Statistics Norway participated in a joint workshop with Destatis and the Bavarian State Office for Statistics and Data Processing to discuss the design of classification databases. Issues for discussion were our KlassService redesign and plans by Statistics Norway to redesign their classification server.
6. Lessons learned
- Metadata management is a communication challenge. We found two issues were particularly difficult to communicate:
  - Metadata management is tricky. Statistical data is inherently volatile. For any given data, an endless number of transformations are possible, producing an endless amount of metadata. With the spread of modern IT-systems there is hardly any limit to producing endless variations of the same dataset.
  - Metadata can be more than just documentation. The same information that is used to transform (produce) data can be used to document it and vice versa.
- As we are faced with multiple stakeholders, several isolated decisions taken by governing committees and a variety of IT-systems in place, it would probably be useful to develop a metadata strategy. Such a strategy might help the organisation to focus on important projects and provide a coordinated approach ensuring that systems are able to interact. Distributing the energy of an organisation across too many unrelated tasks easily drains away resources without delivering satisfactory results. Drafting such a strategy, however, also consumes resources and requires a deeper understanding of the problems.
- The advantages and disadvantages of a metadata model can often only be properly evaluated once an IT-system is in place. It is therefore important to learn from evaluations of existing systems.
- Considerable effort went into formulating metadata models. Having evaluated some of them, we feel that the existing models do bear some similarities. A perfect model may not exist, especially since the resulting implementation usually involves some compromise. No database can be endlessly complex. But on a more conceptual level there seems to be some convergence. Indeed there might even be a structure inherent to the metadata of (official) statistics. Thus, the quest for the "real" metadata model might be less a matter of design than of discovery.
- In a federal system, national coordination usually requires a lot of resources from all partners in the system. Understandably, international cooperation is then often seen as being of lesser importance. Despite this, international cooperation has substantially helped the metadata team at Destatis to understand the subject of metadata management. The development of IT-systems consumes a lot of resources. We feel it helps to build on existing international knowledge and that it minimises risks and maximises return on investment.