The following document charts the historical arc of metadata management strategies and developments pursued by the ABS, particularly over the past two decades, culminating in the strategy for transformation of statistical information management that we are now following.
This history is of importance to the present because the current situation for metadata management within the ABS, and some of the challenges to be faced in the future, reflect past strategies.
While this "history" document is relatively long, it is hoped that it sets out a coherent context for the current situation, the lessons learned and the future plans described throughout the case study. Various sections of the case study refer extensively to the historical, and forward looking, perspective on ABS strategy set out below. This significantly reduces the amount of context that needs to be reiterated at different places within the case study document itself.
For fun, rather than representing a rigorous framework of "metadata paleontology", the historical arc described below is broken into a number of eras.
Premetazoic Era (1905-1973)
As with other statistical agencies, ABS processes and outputs involved some degree of "metadata" management even before the term was coined formally. As the term hadn't been invented, however, the ABS didn't yet have a metadata management strategy.
Protometazoic Era (1973-1990)
During the 1980s the ABS (along with many other agencies) undertook a number of major "data dictionary" projects that assembled basic definitional and structural metadata related to the individual "data elements" collected, derived and output by the ABS. Some of these initiatives created data dictionaries that spanned multiple related surveys (eg a range of "business surveys"), and allowed definitions of common data elements to be shared and reused consistently across that set of surveys. None of these initiatives were fully corporate in scope. While these initiatives did engage subject matter statisticians (usually technically oriented ones), they tended to "grow out from" new IT capabilities, rather than from a business driven metadata strategy to which the specific IT capabilities were secondary.
Mesometazoic Era (1991-2000)
Establishing the ABSDB
At the start of the 1990s the ABS initiated a major focus on "data warehousing". Rather than supporting a series of different "stove pipe" survey specific output systems, the ABS identified many advantages in establishing an "output data warehouse" as a "single version of the truth" when sourcing output data from surveys for dissemination and for secondary use within the ABS.
In 1991 the ABS was fortunate enough to have Professor Bo Sundgren undertake a five month review which resulted in an excellent paper entitled "Towards a Unified Data and Metadata System at The Australian Bureau of Statistics". This paper envisaged three components for an "ideal" ABSDB, namely "macrodata", "microdata" and "metadata".
From the outset there was a strong focus on the metadata required to support the output data. Major repositories were developed to collect and structure metadata related to the following:
- statistical activities
- Termed "collections" by the ABS, these activities include surveys, censuses, statistical analysis of administrative data sources and statistical "compilation" activities such as preparing the national accounts.
- datasets
- These are specific structured data files, data cubes and tables associated with statistical activities. Examples include various "unit record files" and aggregate outputs.
- data items (data elements/variables)
- ISO/IEC 11179 was just a glint in someone's eye at this stage. Information about "data items" was recorded using an ABS specific data model. The two models recognise many similar underlying characteristics (eg a distinction between enumerated and non enumerated value domains), but the details of the modelling differ.
- classifications
- Once again, an ABS specific data model was used.
- statistical terms
- Captured in the ABS Glossary
Much of the metadata that started being documented in these corporate facilities to support output data had already been entered, in different forms, elsewhere in the statistical cycle, including:
- internal planning and approval documentation, public consultation documentation associated with initiating a statistical activity
- all the individual processing systems associated with a statistical activity
- either entered as metadata or "hard coded" into each system
- "Concepts, Sources, Methods" and other publications associated with that statistical activity
Assembling this "extra" metadata, which then provided little direct return for subject matter statisticians, was often regarded as an overhead. The quality of the metadata provided initially was often questionable and it often then wasn't actively maintained over time.
Nevertheless, this era established a core of metadata in common corporate repositories in accordance with a common (but ABS specific) data model. It provided a platform for all that followed.
Most of the metadata repositories developed during that time are still with the ABS, and they have evolved and been extended during subsequent years. For example, most now offer some degree of "services interface" which allows content from the repositories to be called up from within processing applications rather than needing to "jump into" a repository specific application. Nevertheless, apart from a Data Element Registry that was developed during the next decade, these applications are yet to be completely redeveloped to allow them to adopt new IT architectures and standard metadata models that have evolved since the early 1990s. The legacy from this era is now both a corporate asset and a corporate liability.
Neometazoic Era (2001-2007)
From the outset there was the notion that the metadata facilities developed during the previous era should, for many reasons, be extended to serve purposes beyond documenting output data. Major action wasn't taken on this front until around the turn of the century, however, by which time the output data warehouse was firmly established and integrated within ABS output processing and dissemination workflows.
Establishing the Corporate Metadata Repository (CMR)
Around 2001 the existing metadata facilities were recognised as the foundation of a Corporate Metadata Repository (CMR) which had an identity separate to the output data warehouse itself. (The latter was by now termed the ABS Information Warehouse (ABSIW)). While support of ABSIW metadata requirements remained an important purpose for the CMR, its mission now extended to supporting all aspects of the statistical cycle. (At the instant it was first formally constituted, of course, the CMR did not yet possess the capabilities required to fulfil many aspects of this extended mission.)
Business Statistics Innovation Program (BSIP), Input Data Warehouse (IDW) and Integrated Systems for Household Surveys (ISHS)
In February 2002 the ABS took the strategic decision to proceed with a major re-engineering initiative known as the Business Statistics Innovation Program (BSIP). The objective of BSIP was, through the use of innovative technologies and methodologies, to re-engineer the ABS's business statistics processes so as to improve the quality and relevance of its business statistics in a manner that is most efficient for both the ABS and its providers.
An early element of BSIP was a massive re-engineering and consolidation of the ABS's approach to managing "input" data, from both survey and administrative sources, related to businesses. As with the ABSDB in the 1990s, metadata was of crucial importance to this Input Data Warehouse (IDW). By this time, however, metadata standards such as ISO/IEC 11179 had become well established and accepted internationally. A defining characteristic of the new era was an emphasis by the ABS on applying standards which had emerged since the 1990s wherever, and to the extent, they could be applied in support of ABS statistical and business objectives.
The IDW, and processes associated with that data store, quickly became "the second big target" for support by the CMR.
A couple of years subsequent to initiation of the IDW project, a major project was initiated to redevelop, extend and modernise the existing environment in which household surveys were developed, initiated and processed. Once again metadata was a key consideration for this Integrated Systems for Household Surveys (ISHS) project. Part of the ISHS redevelopment included alignment with ISO/IEC 11179 and with many other aspects of the new metadata framework associated with the IDW.
Where the ABSDB replaced "output" stovepipes in the 1990s, IDW and ISHS greatly reduced the number of separate input "pipes" although they did not result in "just one channel". (A range of statistical activities remain, to the present day, outside the scope of either IDW or ISHS.) The reduction in the number of "pipes" to be supported made it easier to apply the CMR to support, in practice, end to end statistical activities within the ABS.
Data Element Registry (DER) and Questionnaire Development Tool (QDT)
The ABS Data Element Registry (DER) was developed during this era. It is integrated with relevant CMR components that predated it, such as those related to statistical activities. While its long term objective is to support definition, management and reuse of data elements through all stages of the statistical cycle, the first target was support for IDW and ISHS related requirements for data element metadata.
The ISHS project included the development of new repositories and services related to questions, question modules and collection instruments. This development, known as the Questionnaire Development Tool (QDT), was designed to integrate with the metadata related to data elements from the DER. While the repositories related to questions and collection instruments were initially built for ISHS specifically, they were designed in consultation with "architects" for the CMR and IDW with the intention that at some stage these repositories could be enhanced and extended to meet broader corporate metadata requirements related to questions and collection instruments as an integrated part of the CMR.
Strategy for End-to-End Management of ABS Metadata (2003)
The new considerations and directions evident at the dawn of this era provided impetus to the development and formalisation of the Strategy for End-to-End Management of ABS Metadata over 18 months up to November 2003.
The strategy set out:
- a model to work towards in terms of how metadata should be structured and accessed to support "end to end" purposes
- a set of metadata management principles which, if applied consistently to all new systems development work undertaken by the ABS, would lead the organisation toward that integrated metadata management environment
- processes to facilitate the application of those metadata management principles to future developments undertaken by the ABS
Twelve principles were defined as a cornerstone of the 2003 strategy.
- Manage metadata with a life-cycle focus
- All data is well supported by accessible metadata that is of appropriate quality
- Ensure that metadata is readily available and useable in the context of the client's information needs (whether the client is internal or external)
- Single, authoritative source ('registration authority') for each metadata element
- Registration process (workflow) associated with each metadata element, so that there is a clear identification of ownership, approval status, date of operation etc.
- Describe metadata flow with the statistical and business processes (alongside the data flow and business logic).
- Reuse metadata where possible for statistical integration as well as efficiency reasons (no new metadata elements are created until the designer/architect has determined that no appropriate element exists and this fact has been agreed by the relevant 'standards area')
- Capture at source and enter only once, where possible
- Capture derivable metadata automatically, where possible
- Cost/benefit mechanism to ensure that the cost to producers of metadata is justified by the benefit to users of metadata
- Variations from standards are tightly managed/approved, documented and visible
- Make metadata active to the greatest extent possible
The strategy proposed the twelve principles be applied when planning and authorising all ABS projects that provide, and/or make use of, metadata management capabilities, even those where metadata management is a secondary rather than primary objective or requirement.
Other key points in the 2003 strategy included:
- There should be an agreed conceptual metadata model linked to processes that are part of the statistical processing cycle. This linkage should be used to determine what metadata should be collected.
- The ABS metadata model should take account of, and use, international standards where possible.
- The physical implementation of the metadata model is the Corporate Metadata Repository (CMR), to be used by all ABS projects. The CMR consists of a number of shared physical databases.
- All metadata entities are managed by a 'registration authority'
- Roles and responsibilities are identified
- Data Management and Classifications Branch (DMCB) is responsible for coordination, definition and maintenance of metadata policies, procedure, systems and provides advice and consultancy to developers related to metadata matters.
- DMCB is the 'registration authority' for the CMR and ensures that other organisational units with this role for particular metadata entities understand that role, are trained and have relevant tools.
- Metadata management is part of every project and should be considered alongside resource allocations and accountabilities in the same way as business processes and data flows are considered.
- Governance of metadata management developments and the oversight of outcomes realisation is vested in line management, existing project and program boards with ABS Executive taking an ultimate corporate view.
Progressing the strategy
Major developments such as IDW and ISHS subsequently did assist in advancing these principles, and the ABS metadata management environment more generally, although not to the extent - and not with the level of coherence - originally hoped.
One issue was that the new corporate "metadata infrastructure" delivered since 2001 had featured a "service" oriented architecture (SOA) designed to allow it to be readily "plugged into" existing processing systems. Actual take up by processing systems, however, was much slower than anticipated for a number of reasons. These included:
- lack of funds (and business drivers etc) for updating existing processing systems
- existing processing systems being "monolithic" and not readily able to "take up" the corporate metadata services
- "special needs" of existing processing systems which were not fully met by the generic services and, at a minimum, required the added complexity of marrying up "corporate" metadata with local extensions
- lack of a standard way to specify information about a specific data element, classification or statistical activity (originally defined from a statistician's perspective) in a form that each element of IT infrastructure could make systematic use of
- inability of each processing system to provide metadata to describe "what it has done"
- eg even if the "input" data elements have been defined, once transformation processes are undertaken within the processing system there will be no description of the resultant derived data elements
The 2003 strategy of progressing ABS metadata management capabilities by "piggybacking" on other projects was at first characterised as "opportunistic". The ABS subsequently concluded that, at a minimum, a planned "incremental" approach was required.
As initial design and development of the DER neared completion, proposed programs of subsequent work related to the CMR were set out during 2006 and 2007 that stretched out across many phases over several years. These comprised a mixture of developing new capabilities and redeveloping and extending the older corporate metadata systems to integrate better with:
- each other
- modern systems architecture, and
- international metadata standards.
These proposed work programs were generally agreed to be "worthy" by the ABS Executive but not "compelling". "Compelling" appeared to have become a higher bar than once was the case, partly because resources for investment in local and corporate development projects were much more limited across the organisation as a whole, and partly because past investment in improving metadata management had generally not realised the expected business benefits as readily, quickly and fully as anticipated.
While the proposed programs of new work were not compelling enough to be funded, the ABS Executive did not seek to redirect existing operational funding and "close down" the CMR. Instead, each proposal drew a request to undertake further consultations and analysis and explore different ways forward.
Holisticene Epoch (2008 to Present)
The need for 2020 Vision
The strategy document from 2003, and subsequent papers, contained extensive discussion of objectives, principles, business drivers, benefits, models, proposed work programs etc related to metadata management.
In February 2008, however, the ABS Executive noted that these documents did not provide a clear and compelling "picture" of what the organisation aspired to be in the longer term. For example:
- If the ABS did achieve the end to end metadata model set out in the 2003 strategy, what would it mean practically in terms of changing and improving the way the organisation operates?
- What new capabilities would it deliver, and are they the capabilities we need most?
- How certain is the ABS that "the future" the 2003 strategy targets is both achievable and the "most appropriate" future for us to be investing in working towards?
This led to a request to develop a "2020 Vision" encapsulating longer term ABS aspirations. Having clearly defined the state the ABS aspires to reach in the longer term, the next step would be to determine the most appropriate strategy for moving forward. Compared with the 2003 Metadata Strategy, not only might the target change fundamentally, but the preferred method of achieving it might (or might not) change fundamentally from the "opportunistic" or "incremental" approach of the previous era. (The organisational and project management challenges and risks of trying to manage a "big bang" approach successfully were well recognised, alongside the challenges and limitations of an opportunistic/incremental approach.)
The practical complexities and challenges of "getting the rubber to hit the road" in terms of using metadata in an "end to end" context to drive actual ABS processes were much more fully appreciated by 2008 than they were in 2003. Some of these complexities are technical, but most relate to governance and organisational culture and, in particular, to co-ordinating and reconciling information models and requirements across ABS business processes, data and metadata repositories and processing systems.
An important emerging consideration was that the frame of reference for ABS metadata management requirements had become less and less defined by the boundaries of the organisation itself. Some of this breaking down of boundaries had been driven by the ABS itself, seeking to develop infrastructure that could be shared with other producers of statistical data within Australia in order to assist them in contributing greater volumes of higher quality content to Australia's National Statistical Service. Other government agencies within Australia had their own focus on interoperability, including metadata and standards to support interoperability, which the ABS needed to recognise. Interest in software collaboration, particularly open source software, rather than purely "home grown" systems had also broadened the focus on metadata and interoperability. The focus on collaboration in regard to software development and to making data and services available meant that the ABS was not only increasingly working alongside other statistical agencies but also seeking to interoperate with agencies whose data content was more "administrative"/"transactional", "geospatial" or "scientific research oriented" than "statistically" oriented. This need to consider integration with data models, data streams and associated metadata beyond the boundaries of the organisation added practical weight to approaches that were less ABS specific and underpinned by commonly agreed, applied and supported standards.
The considerations related to
- "getting the rubber to hit the road" in driving end to end business processes within the ABS, and
- interoperating in practice beyond the boundaries of the organisation
led to pressure for metadata capabilities to be more flexible. The two considerations informed the development, in the second half of 2008, of a proposed "SESAME Framework" (Standards Enabled Shared Active Metadata Environment). This envisaged a registry of "concepts and structures" which would allow preferred data and metadata concepts and structures (drawn from an agreed, internationally aligned, Corporate Information Model) to be clearly identified. "Legacy" concepts and structures (required by, and embedded in, existing repositories and processing systems) could be identified in the same registry (with a non preferred status), together with "structure mappings" that identified how the legacy concepts and structures related to the preferred ones. The mere act of having to explicitly identify the concepts and structures associated with each existing repository and processing system, and to establish and agree their mapping to the preferred reference model, would be a huge step forward for the ABS.
At first the framework would function primarily as a "Rosetta Stone", providing a standard reference model for mapping data and metadata content into, and out of, different environments (enclaves) within the ABS and beyond the ABS. In this way it would provide a standard "bridging" format rather than, potentially, each environment needing to define and maintain completely separate direct mappings to each other environment with which it needed to interact.
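The economics of this "Rosetta Stone" approach can be sketched in code. The following is a minimal, purely illustrative model (the class names, field names and mappings are hypothetical, not actual ABS or SESAME artefacts): a registry holds preferred and legacy structures, each legacy structure carries a mapping to the preferred reference model, and any two environments exchange records via that bridge. With a common bridge, n environments need only n mappings rather than up to n×(n-1) direct pairwise mappings.

```python
from dataclasses import dataclass, field

@dataclass
class Structure:
    """A data/metadata structure registered for one environment."""
    name: str
    status: str  # "preferred" (reference model) or "legacy"
    to_preferred: dict = field(default_factory=dict)  # legacy field -> preferred field

class Registry:
    """Registry of concepts and structures with mappings to the preferred model."""
    def __init__(self):
        self.structures = {}

    def register(self, structure: Structure):
        self.structures[structure.name] = structure

    def translate(self, source: str, target: str, record: dict) -> dict:
        """Map a record between environments via the preferred reference model."""
        src, tgt = self.structures[source], self.structures[target]
        # Step 1: map legacy field names up to the preferred model.
        bridged = {src.to_preferred.get(k, k): v for k, v in record.items()}
        # Step 2: map preferred field names down into the target's legacy names.
        from_preferred = {v: k for k, v in tgt.to_preferred.items()}
        return {from_preferred.get(k, k): v for k, v in bridged.items()}

registry = Registry()
registry.register(Structure("reference-model", "preferred"))
registry.register(Structure("legacy-idw", "legacy", {"SEX_CODE": "sex"}))
registry.register(Structure("legacy-ishs", "legacy", {"gender": "sex"}))

print(registry.translate("legacy-idw", "legacy-ishs", {"SEX_CODE": "1"}))
# -> {'gender': '1'}
```

The design point is that adding a new environment requires defining only its one mapping to the reference model; every existing environment can then exchange content with it immediately, and internal re-engineering of one environment leaves the others untouched.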
Over time, however, as repositories and processing systems were redeveloped or replaced they would adopt preferred concepts and structures wherever possible. (Previously, despite good intentions, redevelopments often led in practice to yet another different - explicit or implicit - information model.) Due to the use of standard bridging, such changes internal to a particular environment would not have large impacts on other environments within the ABS "ecosystem". In fact, such internal re-engineering would streamline interactions and lead, over time, to the corporate information model more often shaping in practice "the way things are" rather than just "the way things should be".
The SESAME Framework proposed use of SDMX to define structures, and to exchange data and metadata based on those structures, using DDI (Data Documentation Initiative) to describe unit record level datasets. The use of well supported and flexible standards, that already interoperate with other relevant standards such as ISO/IEC 11179, XBRL and ISO 19115 (geospatial metadata), should also facilitate interoperability beyond the ABS - both in terms of drawing content in from outside the ABS and disseminating content - supporting the second emerging consideration described earlier.
Working notes and the presentation for the SDMX Global Conference in 2009 provide more background on the SESAME Framework.
The three paths
Subsequent consultations were held with heads of divisions within the ABS in regard to
- the business challenges faced by the organisation over the next decade,
- business benefits from addressing them, and
- options for addressing them - one of these being implementation of an approach to statistical information management based on the SESAME Framework.
These discussions identified a range of perspectives at senior levels within the organisation and different ideas about the best way forward. The support divisions related to methodology and to technology were highly supportive of an enterprise level approach seeking to achieve end to end metadata driven processes through applying a corporately agreed, standards based framework. While some of the feedback from other divisions suggested SESAME might not be the most appropriate framework for the purpose, or that the proposal and its business benefits had not been explained clearly, more of the feedback challenged the underlying assumptions about business priorities and directions. Further tuning, and better communication, of the SESAME proposal was therefore placed on hold until the basic target could be agreed.
The three main paradigms identified in the discussions were:
- Continue developing and improving the individual business streams (those associated with IDW, ISHS, Population Census etc) with little emphasis on achieving convergence, or even bridging, across these streams at any stage in their end to end progress. This can simplify and expedite planning for local improvements, at the cost of some redundancy of effort and lack of consistency of outcomes across streams.
- "Wrap" the existing streams at the data dissemination (and possibly data collection) point with a common way of presenting (accepting) content to (from) the world outside the ABS. This can provide the appearance of coherence to the outside world, at the cost of individual business streams needing to map into (and out of) the common format(s) at a particular point in their end to end process.
- Seek to achieve end to end metadata driven processes through applying a corporately agreed, standards based framework (whether based on the SESAME proposal or otherwise).
The third option continued to be recognised almost universally as a worthy aspiration, but organisational commitment to it in practice - including the required changes to governance and culture in regard to information modelling - had not yet been confirmed.
The three paradigms related fundamentally to how the organisation sought to operate over the next decade or so (eg as a federation of separate business streams or as a unitary organisation conducting a diverse, but coherent, range of statistical activities).
By the second half of 2009 broader corporate, national and international considerations led to confirmation of the third paradigm.
Many of the relevant considerations are outlined in the paper The Case for an International Statistical Innovation Program - Transforming National and International Statistics Systems.
The paper emphasises the importance of statistical information standards in enabling the level of organisational agility and collaboration required by NSIs to address the challenges of the future.
A commitment to a standard way of describing our information using a format (SDMX/DDI is proposed) would allow us to connect our statistical process steps together more easily over time as we develop or redevelop key parts of our systems. It would also enable others to provide us with functionality more easily.
The paper (eg para 35) also refers to a number of wider enablers in the form of internationally agreed frameworks and recommendations that had recently become available. For example, the GSBPM (Generic Statistical Business Process Model) reached full maturity with the release agreed in April 2009. The GSBPM provided a common reference framework when discussing the statistical business process (denoting all activities undertaken by producers of official statistics, at both the national and international levels, which result in data outputs), including its phases and sub-processes. This greatly facilitated discussion of, and sharing of ideas and designs related to, methods and IT applications to support the statistical business process. It also facilitated discussion, and common modelling, of the statistical information required as inputs, guides and enablers for specific sub-processes as well as the statistical information output from those sub-processes. Opportunities for effective collaboration between agencies to develop shared methodological and IT solutions were greatly facilitated by the establishment of such common frameworks and agreement on standards.
The international drive to harness these opportunities can be seen, for example, in
- the establishment of the CORA initiative on Common Reference Architecture in the second half of 2009, and
- the Sharing Advisory Board (aimed at promoting favourable conditions for the sharing and joint development of software tools and components amongst national and international statistical organisations) starting to form its proposed terms of reference and work program around the same time.
In addition to this work on frameworks, including a common "industry architecture" for the production of official statistics, at a practical level there were increasing numbers of applications and utilities harnessing standards such as SDMX and DDI – many of which had been shared successfully. Examples included infrastructure developed by Eurostat, the OECD, the World Bank, the European Central Bank and UNICEF, and also by a number of development teams associated with government and university based research centres and national data archives.
The decision by the ABS Executive in October 2009 to formally confirm SDMX and DDI as the standards to form the core of the ABS's future directions and developments with regard to statistical information management represented, at that time, a bold step forward founded on emerging evidence.
The decision in October 2009 led to an intense period of planning and analysis in regard to how best to put this decision into effect. This culminated with announcement of the ABS Information Management Transformation Program (IMTP) in February 2010. Information available in the public domain in regard to IMTP so far is summarised on the linked page.
The future of metadata management in the ABS is inextricably bound to the broader "information management transformation" targeted by IMTP.
This can be seen as "integrating" metadata management within the core business of the ABS into the future.
In some discussions within the ABS (and more broadly) in the past two decades:
- "metadata management" was characterised primarily as focused on concepts and on documenting abstract standards, ideals, targets and "what should have happened" in the statistical business process, rather than on actively driving and shaping the process in practice and then presenting accurately what actually occurred and the outputs produced
- "data management" was characterised primarily as an IT discipline of little interest and importance from a core business perspective
IMTP places a focus on Statistical Information Management which aims to span the range from conceptual to operational/technical in a coherent manner and avoid an artificial separation between "data" and "metadata".
A similar theme can be seen in modern standards such as SDMX and DDI which span data and metadata, providing broad coverage and seeking to work with other recognised standards that provide more detailed coverage of a narrower range of information (eg ISO 11179 for concepts, ISO 19115 for geospatial metadata, Dublin Core for discovery metadata).
The focus on statistical information management can be seen as giving "metadata" less profile as an independent area for strategy, modelling, management etc. This doesn't imply "metadata" is seen by the ABS as somehow less valuable or important as a resource; the contrary is the case. It means "metadata" is seen as an essential aspect of core business into the future, and less usefully treated as a discipline separate from the more general focus on "statistical information".