Contact person | |
---|---|
Job title | Senior Advisor Metadata |
Telephone | +46 8 506 94352 |
Statistical Information Model
Statistics Sweden has created a national version of GSIM as an overview model. Some areas of GSIM are not covered.
Detailed diagram:
Background
Statistics Sweden has a vision in line with the HLG and Eurostat 2020 visions. It focuses on creating a production environment based on a high level of standardization. The Swedish national version of GSBPM is the basis both for producing statistics and for the organisation, providing a foundation for well-defined, standardized processes and activities that use standardised methods and tools.
Following the vision, information is stored in a coordinated and effective manner in a well-structured data warehouse throughout production. The links between activities and data are supported by a common platform, where the standardized tools and services needed to carry out an activity are available via a central metadata repository, which contains the information needed to describe the data warehouse, including the tools, and to control and run the processes. These processes are continually evaluated, using the process metrics created during production, in order to improve them. Coordination at the statistical object level is supported by a registry system, in which base registers[1] interact and work well together with other statistical activities.
In the vision, all parts of the system are well integrated and cooperate to drive the business processes (production) forward in an effective, well-documented and standardized manner in which responsibilities for different parts are made clear.
The statistical data warehouse vision:
Production systems do not maintain their own local metadata repositories; all metadata associated with the statistical production process are stored in a central Metadata Repository. Common tools and methods related to the relevant parts of the statistical production process are the means to achieve reusability and coherence.
A central data warehouse requires standardized rules governing the physical storage, such as formats and lengths. There must also be rules governing how to handle updates, ownership and permissions / security.
Other key factors to achieve the objectives include relevant expertise and management support.
Sweden has a distributed statistical system with 27 authorities responsible for producing official statistics where Statistics Sweden is the national coordinator. In the future the vision should cover the entire Swedish statistical system, using a common shared model for a metadata repository.
The vision is in principle shared by several other statistical organizations around the world, and is in line with the HLG vision[1]. It follows the HLG models GSIM (Generic Statistical Information Model), a common information model for statistical production, and GSBPM (Generic Statistical Business Process Model). Both are key components in modelling the central Metadata Repository. The Swedish model is in essence an adaptation of GSIM, on which it is closely based.
The model is structured into the same four groups used by GSIM to structure the information objects. The areas are business, exchange, concepts and structures.
The present situation
An analysis of the current metadata situation at Statistics Sweden shows that the distance to achieving the vision is considerable. Currently, it is not possible to monitor, assess or control the production via the business processes. The metadata environment is fragmented: metadata for different purposes are stored separately, often closely coupled to the tools that support a particular process. Many tailor-made production systems include the required metadata, but these are stored separately in the respective production systems with no links to central systems. There are metadata in common production tools such as Triton[1] and the Statistical Databases (PC-Axis based), but they lack links to the metadata used in the preceding or following processes. The MetaPlus system was originally developed primarily as Statistics Sweden’s tool to document final observation registers (mainly microdata). It is nonetheless the metadata system at Statistics Sweden that comes closest to fulfilling the principles of a central metadata repository.
MetaPlus contains central common metadata such as classifications, variable definitions and their value domains, linked to objects and populations. The documentation in MetaPlus is fairly complete and up to date, but in many cases it has extensive quality deficiencies.
Triton's metadata system was developed specifically to support the information handled by Triton. There are no links to metadata that relate to subsequent processes. Other collection tools, such as the centralized scanning system and SIV, use their own metadata systems covering their particular purposes. They lack connections to other metadata systems.
The metadata system used in the statistical databases is the oldest metadata system still in use. It is based on PC-Axis, is tied to the structure of the statistical databases, and has no links to other metadata.
The present situation:
The vision of a central metadata repository uses the term metadata in a broad sense. The repository includes information about:
- connections to the statistical business process in its various stages
- design of the statistical programs and their detailed process steps
- process rules
- statistical variables and other variables needed to support statistics production
- populations and unit types associated with them
- value domains and classifications
- statistical program cycles
- questions
- services available
- reference metadata (documentation) and quality descriptions
- thesaurus
- connection to physical data storage
- statistical products
The central metadata repository supports process metrics and stores the related process metadata.
The central Metadata Repository is responsible for the quality, structural rules, and consistency of the metadata. It ensures that redundancy checks and versioning are carried out. Read-only copies of metadata may occur locally in production systems, e.g. for performance reasons.
The central Metadata Repository does not include:
- data and object instances
- process metrics
- services (executable IT services and manual procedures such as checklists, procedure descriptions and tutorials)
- other master data that is common to Statistics Sweden such as personnel and organization
The model of the central metadata repository and the Data warehouse and register system vision
The image below shows a schematic view of the main parts of the central Metadata Repository and how it is linked to data and to the statistical business process model.
The model of the central metadata repository and the Data warehouse and register system vision:
The central Metadata Repository includes reusable components, which can be connected to different parts of the statistical production process. They should be described in such a way that they support and control the production. This means that they need to be systematically structured in order to be machine-readable (automated).
The metadata repository does not need to be physically located in one single place or have one single common interface, but it is essential that the various parts fit together and that users do not have to re-enter information (e.g. value domains) several times ‒ reuse is a central feature.
Currently, metadata are available that actively support the Collect process (in Triton) and the Disseminate process (in the statistical databases). For Process and Analyse there is currently no cohesive active metadata repository to support production. The functionality available in MetaPlus makes it possible to use its metadata in an active way, but since MetaPlus is not comprehensive, that part of the metadata layer needs to be supplemented.
Process steps
It is essential that the metadata model includes a link to the statistical business process. This allows the model to provide a complete basis for a process based organisation that defines a set of common, organisation-wide process steps. Any statistical program will be able to select its relevant process steps with support from the metadata repository.
Process steps exist at various levels. The highest level is the statistical business process model as a whole. It comprises Process steps 1-8, and then an arbitrary number of levels, until the activity level is reached.
The generic business process steps are the basis for maintaining a common set of status codes throughout the statistical production process. A general list of business process steps (down to activities) does not exist today, and needs to be created to be made available in a central metadata repository.
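As a sketch of how such a general list could be structured so that it is machine-readable, the hierarchy can be modelled as nested process steps. The class below is an illustrative assumption, not an existing Statistics Sweden structure; the step names follow GSBPM:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessStep:
    """A node in the hierarchy of generic business process steps."""
    code: str        # e.g. "4" or "4.2"
    name: str
    substeps: list["ProcessStep"] = field(default_factory=list)

    def walk(self, level: int = 0):
        """Yield every step with its depth, top level first."""
        yield self, level
        for sub in self.substeps:
            yield from sub.walk(level + 1)

# A fragment of the top-level GSBPM steps (1-8), down towards activities:
collect = ProcessStep("4", "Collect", [
    ProcessStep("4.1", "Create frame and select sample"),
    ProcessStep("4.2", "Set up collection"),
])

for step, level in collect.walk():
    print("  " * level + f"{step.code} {step.name}")
```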
No process metrics are stored in the central metadata repository, only the process metadata. Process metadata are necessary to evaluate a specific business process activity, for example in statistical products. One example could be to measure the response inflow in 2013 for Statistics Sweden's voluntary surveys of individuals that used the SIV collection tool. Another example could be the relative size of the over-coverage (in the sample) for all enterprise surveys that cover section G of the Swedish activity classification (SNI 2007).
Attributes - Status codes
Various objects in a central metadata repository require status codes to efficiently support the construction and use of systems and system components for all parts of the statistical business process. Which status codes are required, and for which objects, needs to be investigated and discussed further.
Business process
The Business process is important in the model, since the other objects are directly or indirectly connected through it. It is the glue that ties the different parts of the metadata model to the model of the business process.
At present, the statistical production process comprises many different Business processes. These are not coordinated and do not cover the whole statistical production cycle. Statistical program, Statistical program cycle and Collection cycle are implemented in Triton, covering Collect and parts of Analyse. Disseminate has a publishing cycle, and each statistical product has a production cycle. The relationship between these cycles is not unambiguous. The MetaPlus metadata system documents final observation registers (usually microdata) using a hierarchical structure (Register, Register variant and Register version). The National accounts use the terms Computational cycle and Version to describe their Business processes. For parts of Process and Analyse there are no Business processes implemented in any standard tool at Statistics Sweden.
When data have been collected, they pass through a number of process steps, such as editing and manual examination. A correction of a value means that a new generation of the data is created. This generation has a different source than the original value. Every generation contains a reference to the operator or service that made the correction, a status code that indicates why and how the change was made, and a time stamp that records when the change was made.
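A minimal sketch of what one such generation record could contain, assuming hypothetical field names rather than Statistics Sweden's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataGeneration:
    """One generation of a datum; a correction creates a new instance."""
    value: str
    source: str           # where this value came from (differs from the original's)
    changed_by: str       # operator or service that made the correction
    status_code: str      # why and how the change was made
    changed_at: datetime  # when the change was made

original = DataGeneration("1200", "questionnaire", "collection-service",
                          "AS_REPORTED", datetime.now(timezone.utc))
corrected = DataGeneration("2100", "manual examination", "editor-42",
                           "DIGIT_TRANSPOSITION_FIXED", datetime.now(timezone.utc))
generations = [original, corrected]  # the full history is kept for traceability
```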
Parameters and rules
Parameter refers to a characteristic that can be considered constant in a given situation but can assume different values in other situations. A parameter can be used as input to a rule, which means that the rule is defined with input parameters. Parameters can be inputs to a service and control the service logic, which in turn is governed by predefined (parameterized) rules embedded in the service. Other input to a service is not regarded as parameters here; it is called technical information or support information. Parameters cannot be output from a service, but the result of a service can be used as a parameter to control another part of a process step. Parameters answer the question: "What information does a process / process step / activity need in order to function?"
Whether a process step is manual or automated, something must describe how to perform it. In the automated case this is code, e.g. SAS; in the manual case it is a text, such as a work routine description. Both code and work routine descriptions may contain rules that affect the outcome of the process step. A SAS rule may look like "if &var1. ^= '.' then &var3.=&var1.; else &var3.=&var2.;". A work routine description might look something like this: "If there is industry data from the LFS it should be used; otherwise the Short-term statistics, wages and salaries, private sector will be used as the source for the calculation".
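The same kind of rule can be made reusable by defining it with input parameters and embedding it in a service. A sketch under these assumptions (the function and source names are hypothetical):

```python
def coalesce_rule(primary, fallback):
    """Rule: use the primary value if present, otherwise the fallback.

    Mirrors the SAS and work routine examples above: if the first source
    has a value, use it; otherwise fall back to the second source.
    """
    return primary if primary is not None else fallback

def derive_industry(record, primary_source="LFS",
                    fallback_source="Short-term statistics"):
    """Service embedding the rule; the source names are its input parameters."""
    return coalesce_rule(record.get(primary_source),
                         record.get(fallback_source))

print(derive_industry({"LFS": None, "Short-term statistics": "G47"}))  # G47
```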
Generic and statistical program specific metadata
The common metadata repository can be divided into two logical parts: the generic part, which is valid for the whole organisation, and the statistical program-specific one. The generic part contains metadata that is common to the whole statistical production process and should be maintained centrally. When a specific statistical program is designed, conducted and documented, its metadata are derived from the generic metadata, creating the specific instance metadata. The metadata that are specific to a statistical program will also be shared and reused by other statistical programs.
Generic and product specific metadata:
Connecting data and metadata
The (yellow) Structures part of the metadata model provides information about how the data are physically organized and where they are stored. The key connection point is the instance. An Instance variable provides a link to the Data point where the Datum representing a variable for an individual object is stored. This link appears in both GSIM and the Swedish model, but the Swedish model also adds a Context variable (in the Concepts part), which expresses a Represented variable with a specific role, reference time and source. The Context variable can be linked to one or more Data columns (an addition to the Structures part). This link enables the model to be the basis of a metadata-driven production system.
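A sketch of this chain from concept to physical storage, with illustrative classes rather than the actual repository schema:

```python
from dataclasses import dataclass

@dataclass
class RepresentedVariable:   # Concepts: variable + unit type + value domain
    name: str
    unit_type: str
    value_domain: str

@dataclass
class DataColumn:            # Structures: where the values physically live
    table: str
    column: str

@dataclass
class ContextVariable:       # Swedish addition: a represented variable in context
    represented: RepresentedVariable
    role: str                # e.g. "measure" or "identifier"
    reference_time: str
    source: str
    columns: list[DataColumn]  # the link that enables metadata-driven production

turnover = ContextVariable(
    RepresentedVariable("Turnover", "Enterprise", "SEK, thousands"),
    role="measure", reference_time="2013M06", source="VAT register",
    columns=[DataColumn("sts_monthly", "turnover")],
)
```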
Adoption of GSIM
Statistics Sweden’s adaptation of GSIM is intended to be used in all future development projects. The main use will be to describe the inputs to and outputs from activities in the production process, but it will also work as a reference model for information and metadata requirements when creating common production tools for the statistical business process, and as an agent for content harmonization. Modernisation of statistical production is not a technical issue ‒ it covers business, methodology and IT. The main way to improve efficiency is to reuse already collected information. Content harmonization in this respect requires metadata to describe the data. This also leads to a reduction of the respondent burden.
GSIM will be incorporated in Statistics Sweden’s information architecture, and the information models will align with GSIM. The information models are made available in Statistics Sweden's common model libraries. This model library can be seen as Statistics Sweden's corporate memory. Models are built successively, constantly developed and redeveloped in response to requirements from projects, and reused in other projects.
All development projects should require involvement from enterprise architects during start-up. A model library should be made for information models in order to promote reuse and further development. The models may also be further developed during the projects and this information should then be documented in the model library after it has been quality assured by the enterprise architects. This will retain a holistic perspective, ensuring that basic modelling principles are followed. It also minimizes sub-optimization and unnecessary variations in support of the process-oriented approach.
The GSIM-based metadata model is going to be used for creating a coherent, layered documentation system for reference metadata that allows for different kinds of outputs based on user needs. This makes it possible to connect various forms of reference metadata (such as national and international/Eurostat requirements). Process metrics can be used in the reference metadata. The goal is to create a documentation system that collects metadata as they are created, making successive documentation possible during the production process ‒ documenting the activities as they are conducted. This means that the core of the documentation is created during the design process.
System overview
The model is structured into the same four areas used in GSIM to structure the information objects. The areas are business, exchange, concepts and structures.
Overview diagram:
One of the purposes of the model is to support coordination and standardization of processes. The picture shows how the production of statistics in the business process model interacts with the metadata model in each process (at any level) in the process model. A process is related to its input and output as described in the model of the central Metadata Repository. The picture shows how one process uses input and creates output as described in the model. The design of the actual process step is also controlled and shaped in accordance with the model.
Process flow:
Concepts
The green part of the model is consistent with GSIM Concepts. It describes the basic concepts in statistics linked to the statistical production process and data structure in a systematic way. These objects are the conceptual contents used as input and output of process steps. The objects include descriptions and definitions of what the statistics measure in the practical implementation. The objects used are linked to the data and can be described (documented) in reference metadata.
A Unit type is a group of objects of interest that have common characteristics. Units can be delimited in a Population, which is a Concept: a set of units of a particular unit type defined by common characteristics.
In the model, there are four types of variables: Conceptual variable, Represented variable, Context variable[1] and Instance variable.
- A Conceptual variable measures a characteristic; it is a Concept.
- A Represented variable is a Conceptual variable linked to a Unit type and a Value domain.
- A Context variable is a Represented variable linked to a specific data source and a Population.
- An Instance variable is a Context variable that has actual values for a certain unit.
A Value Domain can be a Described value domain or an Enumerated value domain. A Described value domain can be continuous or a simple describing text.
Categories are Concepts that can be used in three ways in the central Metadata Repository. They are described as types of Nodes: Category unit, Code unit or Classification unit. Categories are grouped in Nodesets. There are three such types of groups: Category set, Code list and Classification. A Category set is a set of Category units that contains the meaning of a Category without any associated representation, e.g. woman and man. In a Code list the Category meanings are combined with Code units, e.g. 1‒woman and 2‒man. A Classification is a Code list that meets the criterion that the Classification units are mutually exclusive and exhaustive at each level.
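The three Nodeset types can be illustrated as follows (a minimal sketch; the SNI 2007 fragment is abbreviated):

```python
# A Category set carries only meanings, without representations:
sex_categories = ["woman", "man"]

# A Code list pairs each category meaning with a code unit:
sex_code_list = {"1": "woman", "2": "man"}

# A Classification is a code list whose units are mutually exclusive and
# exhaustive at each level; an abbreviated two-level fragment of SNI 2007:
activity_classification = {
    "G": {"name": "Wholesale and retail trade",
          "children": {"45": "Motor vehicle trade",
                       "46": "Wholesale trade",
                       "47": "Retail trade"}},
}
```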
Business
In the blue part of the model the design of a Statistical Program is described. Here it is determined which Process steps are to be included in the Statistical Program. For each Process step it is determined which Methods and Rules are going to be used, as well as the information that the Process step requires as input in order to be executed. The results of the Process step are also identified, as are the Process metrics that are created during the Process step. It is also important to describe the order in which the different Process steps must be performed. Based on the decisions taken in the design, it is examined which IT services are best suited to perform the specified process steps.
Once the design is completed and implemented in automated or manual procedures, implementation will result in refined data and process metrics for each process step.
Exchange
The red part of the model describes how, when and why the information collected in each statistical program is supplied to external customers and users.
Structure
In the yellow part of the model (structures), the variables described in the concepts part of the model are connected to logical data units that are structured as units or dimensions and stored physically in database tables or files. A variable’s role in a certain data structure is described here: to identify, measure or have other roles.
[1] The Context variable is a Swedish addition to GSIM that expresses a Represented variable with a specific role, reference time and source (belonging to a specific population)
Process description that this GSIM implementation supports
At Statistics Sweden, we have made a design in principle for the statistical production process to clarify the input and output relevant in the various process steps. This design principle is used when the Swedish version of GSBPM is described as a process flow from establish needs to disseminate. The process flow is part of Statistics Sweden´s process architecture. In order to obtain a uniform description, it was important to describe the input/output as defined in Design and Plan, to show how it is used in Build and Test, and then how this forms the basis for running and executing the production itself. Input/output is tied to the relevant GSIM objects.
Business case
The main areas of GSIM use are:
- A metadata system supporting the whole production process
- Design-driven (or metadata-driven) production
- Development projects for a generic production environment (using a common metadata repository)
Desired situation:
Current situation:
Design-/metadata-driven and process oriented production are two sides of the same coin. Both focus on improving production efficiency through standardization, which in turn is accomplished by dividing production into components, or modules. A module can be virtually anything used in production, provided that it can be clearly defined and described, and has a clear role. The value of a module is that it can be reused. The description or definition of a module is its metadata. Depending on the module's character the description will have different content and scope.
A possible scenario
A statistical program is planned in the design step of the statistical production process, which is itself a Process step. It will be implemented as a number of Business processes (for data collection, processing, dissemination, etc.). For each Business process it is established which Unit type(s) are to be examined, processed or published, which Variables measure the properties of the objects, and how the objects are delimited to relevant Populations. For each variable, one can also specify its Value domain and, if applicable, the Question to be addressed to the respondent to provide values for the Variable. In some cases, the unit type, population and/or variable relate only to the outcome (target) of the business process (as Process output data), to the observations made (as Process input data), or to something that is only used within the business process (throughput in a Process step instance).
The user describes the metadata by creating a Business process, mainly reusing unit types, populations and variables that are already registered in the metadata repository. When needed, new ones can be created. These are then the Process inputs. In order to enable operational use of the metadata, a connection to technical (structural) information is created, indicating where in the data warehouse the information can be found.
The planning of the statistical program continues by clarifying how each business process shall be conducted. The production within the business process shall run through a number of Process steps. According to the plan, they shall be carried out in a certain order ‒ sequentially, in parallel, or according to another planned pattern ‒ which creates a process flow according to a defined Process control. Sometimes the order of the steps is binding: it is not possible to move on to the next step unless a certain condition is fulfilled, such as all process steps so far being finalized with satisfactory results. In other cases the rules can be milder, more like warnings.
The individual Process steps are standardised and can be found in the Process Support System, where they are described. The required information (Process input) is defined, as well as the expected results in the form of production data and process metrics (Process output). There are also standard flows that serve as templates for creating specific instances. A flow describes the relation between the individual steps, such as the order in which they shall be conducted, what conditions must be satisfied in order for a process step to be initiated, what happens if they are not satisfied, etc. Relations between process steps can be hierarchical, so that one step is defined as consisting of several others (process steps, activities), which in turn can have a certain internal flow. The relationship between process steps can be described in a rule (Process control) that is well defined and reusable. A rule can also refer to an internal property of a Process step.
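A minimal sketch of such a Process control rule between steps, assuming hypothetical status codes and names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepInstance:
    name: str
    status: str = "NOT_STARTED"   # coordinated status codes, cf. above

@dataclass
class ProcessControl:
    """Reusable rule governing whether a step may start."""
    condition: Callable[[list[StepInstance]], bool]
    binding: bool = True          # binding rule vs. a milder warning

def all_finished(previous: list[StepInstance]) -> bool:
    return all(s.status == "FINISHED" for s in previous)

def may_start(step: StepInstance, previous: list[StepInstance],
              control: ProcessControl) -> bool:
    if control.condition(previous):
        return True
    if not control.binding:
        print(f"warning: starting {step.name} although conditions are not met")
        return True
    return False

editing = StepInstance("Editing")
collect = StepInstance("Run collection", status="FINISHED")
print(may_start(editing, [collect], ProcessControl(all_finished)))  # True
```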
The user describes "his" flow for the business process primarily by reusing already created steps and flows. If necessary, new flows can be created ‒ if possible from a template in the Process Support System ‒ by freely assembling a flow from existing steps, or by creating entirely new process steps and putting them together.
When the user has described the flows, processes, input and output data, etc. for all Business processes planned for the statistical program, a complete tentative production flow has been created and documented. Before the plan and the documentation are finalized they have to be tested. The tests will show whether everything is in place, whether all formats are as expected, and whether everything is connected, so that it is reasonable to assume that production can be carried out according to the plan without interference. After the test, the plan and documentation are formalized and made available as part of the central Metadata Repository.
In the next step the production is conducted according to the plan. In order to start a business process, a copy of the plan is retrieved to create a local instance. Depending on the system requirements, the user gives the start command, specifies a start time, or sends orders to another operator, such as a collection unit, to start the production. The process step refers to the Services that will perform the work. They are listed in a directory that is part of the central Metadata Repository. The Service itself is not included in the metadata repository, but is part of the "Platform", the means for accessing the process oriented tools. The services perform what is described in the respective Process steps with the help of the input data and control information specified there (Process input). They produce output data and process metrics (Process output), which can be used to monitor and control production. Some process metrics will also be used as automatic control information for other process steps, functioning as Process input.
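A sketch of how a service could consume Process input and return Process output, including process metrics; all names and the editing logic are hypothetical:

```python
def editing_service(process_input):
    """Hypothetical service performing one process step, as described in the
    central Metadata Repository, given its Process input."""
    values = process_input["data"]
    threshold = process_input["parameters"]["max_value"]   # control information
    edited = [min(v, threshold) for v in values]
    metrics = {"n_records": len(values),
               "n_flagged": sum(v > threshold for v in values)}
    # Process output: refined data plus process metrics; the metrics can be
    # monitored, or fed as Process input to a later step.
    return {"data": edited, "process_metrics": metrics}

result = editing_service({"data": [10, 5000, 42],
                          "parameters": {"max_value": 1000}})
print(result["process_metrics"])   # {'n_records': 3, 'n_flagged': 1}
```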
After completed production, the user has almost complete documentation of the Statistical program. The documentation in the form of reference metadata (such as methodology and quality) needs to be completed manually, but the entire basis is available in the process metrics and the central metadata repository, which can be described as a Referential Metadata Set.
GSIM use
The ongoing coordination project for creating a production system for short-term economic statistics (KLON) is a kind of pilot project for the data warehouse strategy and provides a test bed for defining the requirements for a common metadata repository. A number of factors that will influence whether the project can meet the requirements are listed below.
Prerequisites:
- A "light version" of the central Metadata Repository: It should look and act like the outlined central Metadata Repository and perform its most essential tasks. Connecting the Triton system’s metadata to MetaPlus must be managed.
- Connecting metadata to data: It should work with the central metadata repository (the light version mentioned above) whether or not a structured data warehouse exists in the early phase.
Desired conditions:
- Coordinated status codes, including principles for their management/administration
- Principles for the standardization of parameters
Screenshot of the KLON system:
Some experiences from the project are mentioned in the section on “Lessons learned”.
Relation to other Models
Statistics Sweden uses GSIM together with the Swedish version of GSBPM (see the GSBPM case study).
For business processes that are not part of the statistical business process, Statistics Sweden has developed a process chart in line with GAMSO.
Statistics Sweden´s process chart:
Design
Principles for the central Metadata Repository:
- The metadata are stored when they are created and then re-used where they are relevant.
- The metadata shall be used actively, in the sense that they support metadata-driven production: it is through metadata that production is supported and controlled.
- Metadata that are established as common, such as classifications and standardised variables, should have the common metadata layer as their source.
- Common concepts should be used in a uniform manner and based on common definitions, which are documented in a thesaurus.
The central Metadata Repository must meet the following requirements:
- The central metadata repository shall support the handling of business processes.
- The metadata shall support version and generation management of data.
- Principles shall be provided for updates, ownership and permissions / security.
- The metadata shall be presented in several languages if required, at least in Swedish and English.
The main way we use GSIM at Statistics Sweden is to adapt our existing information architecture to GSIM. This ensures that the projects which use our information models work in accordance with GSIM, and the business does not need specific knowledge of GSIM.
Statistics Sweden´s information architecture consists of two levels ‒ object groups and detailed information models. The contents evolve within development projects, which means that there are not yet detailed information models for all object groups.
Object groups:
Licensing
New Information Objects and/or new specialisations of GSIM Information Objects
Groups of objects
Statistics Sweden´s information architecture contains several groups of objects that are used in other processes as well, including some that are shared between several processes, like Personnel and Organisation. The circled object groups are specifically used for statistical production. The object group Structure can also be used by other processes.
Highlighted object groups
Context variable
A Context variable is a Represented variable linked to a specific data source and a Population. This new object was created in order to make the link between concepts and structure.
Lessons learned
GSIM as such cannot be used directly off the shelf; it is a conceptual model, and adapting it to the NSI level requires a lot of work. It is difficult to connect the different parts of GSIM, and as soon as you start, the level of detail required is very high. This does not mean that it is not useful ‒ quite the opposite: it is extremely helpful in many cases, since it provides a solution to existing problems and can therefore be used as a reference. GSIM has proven very valuable as a foundation in discussions regarding a central metadata repository and has provided a common vocabulary in those discussions. It has been a source of inspiration in the efforts to construct a production system for short-term statistics (KLON, mentioned above). During this work a number of lessons have been learned.
Experiences in short:
- The importance of reusing metadata: For example, allowing the user to use the names and codes of variables and value domains from "Concepts" in "Business" and "Exchange", to configure both the design and what will be delivered to internal and external users.
- Harmonized metadata: Not a prerequisite, but it significantly facilitates the work, by giving the same code and name to variables and values that by definition are conceptually the same thing.
- Periodicity: Often data with different periodicities are available. A general production system needs a property that describes the periodicity of transformable input and output. The periodicity may differ between input and output; e.g. monthly data can be transformed into both quarterly and yearly data, so the output of a service can have all three periodicities (see the sketch after this list).
- Classifying variables: To aid users of the production system, and when building common services that can be used by more than one statistical program, it is useful to classify variables into different types. A variable can be of a different type depending on the context of its use.
- Traceability and reproducibility: Not only for versioning of data, but also of metadata. This is closely tied to versioning of services.
- Codes (status and "other" codes): By using the process step as an attribute it is possible to monitor where in the business process a value is used, given that the process flow is defined. This removes the need for many frequently occurring status codes. What remain are status codes for specific values; these are considered generic status codes.
- Exchanging data: To a large extent, data are exchanged between different statistical programs within an NSI. This is often done through hard-coded database-to-database solutions, or even worse, e-mail. One way to avoid that is to store data in the same structure. Harmonized metadata are necessary for this.
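A minimal sketch of the periodicity point above, assuming a simple "YYYYMmm" period key (not an actual Statistics Sweden service):

```python
from collections import defaultdict

def to_quarterly(monthly: dict[str, float]) -> dict[str, float]:
    """Aggregate monthly values (keys like '2013M04') to quarters."""
    quarters: dict[str, float] = defaultdict(float)
    for period, value in monthly.items():
        year, month = period.split("M")
        quarters[f"{year}Q{(int(month) - 1) // 3 + 1}"] += value
    return dict(quarters)

monthly = {"2013M01": 10.0, "2013M02": 12.0, "2013M03": 11.0, "2013M04": 9.0}
print(to_quarterly(monthly))   # {'2013Q1': 33.0, '2013Q2': 9.0}
```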
Suggestions for changes to GSIM