This work package will be deemed successful if it results in strong, well-justified and internationally applicable recommendations on appropriate tools, methods and environments for processing and analysing different types of Big Data, along with a report on the feasibility of establishing a shared approach for using Big Data sources that are multi-national or for which similar sources are available in different countries.
The value of the work package, in the context of the overall goals of the project on the Role of Big Data in the Modernisation of Statistical Production and in relation to the overarching strategy of the HLG, derives from its international nature. While individual statistical organizations can experiment with the production of official statistics from Big Data sources (and many are currently doing so or have already done so), and can share their findings and methods with other organizations, this work package will be able to do the same in a more open and collaborative setting. The work package will draw on the international nature and/or international ownership and management of many Big Data sources, and will capitalise on the collective bargaining power of the statistical community acting as one in relation to such large transnational entities. The work package will contribute to the overall value of the project by providing a common methodology from the outset, precluding the need for post-hoc efforts to harmonise methodology in the future.
Successful completion of the work package will entail evaluating the feasibility of the following propositions. Insofar as the propositions are found to be feasible, the work package will demonstrate and document, in broad yet practical terms, how the actions could be achieved in statistical organizations.
- 'Big Data' sources can be obtained (in a stable and reliably replicable way), installed and manipulated with relative ease and efficiency on the chosen platform, within technological and financial constraints that are realistic reflections of the situation of national statistical offices
- The chosen sources can be processed to produce statistics (either mainstream or novel) which conform to the appropriate quality criteria (both existing and new) used to assess official statistics, and which are reliable and comparable across countries
- The resulting statistics correspond in a systematic and predictable way with existing mainstream products, such as price statistics, household budget indicators, etc.
- The chosen platforms, tools, methods and datasets can be used in similar ways to produce analogous statistics in different countries
- The different participating countries can share tools, methods, datasets and results efficiently, operating on the principles established in the Common Statistical Production Architecture.
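One way to examine the third proposition, that a statistic derived from Big Data corresponds in a systematic and predictable way with a mainstream product, is to compare the two series over the same reference periods. The Python sketch below is illustrative only: the function names and the sample figures are hypothetical and are not taken from the project. It computes the Pearson correlation between the two series and a least-squares line that maps the Big Data indicator onto the official one.

```python
from math import sqrt


def _centred_sums(xs, ys):
    """Return the means and the centred sums of squares and cross-products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    _, _, sxx, syy, sxy = _centred_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / sqrt(sxx * syy)


def linear_fit(xs, ys):
    """Least-squares slope and intercept for ys = slope * xs + intercept."""
    mean_x, mean_y, sxx, _, sxy = _centred_sums(xs, ys)
    if sxx == 0:
        raise ValueError("cannot fit a line to a constant predictor")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical monthly values, invented purely for illustration.
    big_data_index = [100.0, 100.8, 101.5, 102.9, 103.2, 104.0]
    official_index = [100.0, 100.5, 101.2, 102.1, 102.6, 103.1]
    print("correlation:", pearson(big_data_index, official_index))
    print("slope, intercept:", linear_fit(big_data_index, official_index))
```

A high correlation together with a stable slope across periods would be one indication of a predictable relationship. It would not by itself establish that the new statistic meets the quality criteria applied to official statistics.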
While the first objective is to examine these propositions (the 'proof of concept'), a second objective is to then use these findings to produce a general model for achieving the goal of producing statistics from Big Data, and to communicate this effectively to statistical organizations. Hence, all processes, findings, lessons learned and results will be recorded and will feed into work package 3 for dissemination and training activities. In particular, experiences and best practices for obtaining data will be detailed for the benefit of other organizations.
Basis for the Recommendations
The recommendations given in this annex were arrived at through the deliberations of a team of experts representing eight countries or international organizations, in consultation with the broader expert task team on Big Data.
The task team considered a wide range of alternative possibilities for tools, datasets and statistics and assessed them against various criteria. These included the following, grouped by the aspect to which they apply:
Tools:
- Whether or not the tools are open source
- Ease of use for statistical office staff
- Possibilities for interoperability and integration with other tools
- Ease of integration into existing statistical production architectures
- Availability of documentation
- Availability of online tutorials and support
- Training requirements, including whether or not a vendor-specific language has to be learned
- The existence of an active and knowledgeable user community
Statistics (the set chosen should include the following):
- At least one statistic that corresponds closely and in a predictable way with a mainstream statistic produced by most statistical organizations
- One or more short-term indicators of specific variables, or cross-sectional statistics, which permit the study of the detailed relationships between variables
- One or more statistics that represent a new, non-traditional output (i.e. something that has not generally been measured by producers of official statistics, be it a novel social phenomenon or an existing one where the need to measure it has only recently arisen)
Datasets:
- Ease of locating and obtaining data from providers
- Cost of obtaining data (if any)
- Stability (or expected stability) over time
- Availability of data that can be used by several countries, or data whose format is at least broadly homogeneous across countries
- The existence of ID variables which enable the merging of Big Data sets with traditional statistical data sources
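The last criterion, the existence of ID variables that allow Big Data sets to be merged with traditional sources, can be illustrated with a minimal Python sketch. The function name, the field names (`unit_id`, `region`, `transactions`) and the sample records are assumptions made for illustration; they do not describe any dataset considered by the task team.

```python
def merge_on_id(big_data_rows, traditional_rows, key="unit_id"):
    """Join Big Data records to traditional records that share an ID variable.

    Returns a pair (merged, unmatched_ids):
      merged        -- one combined dict per Big Data record with a match
      unmatched_ids -- IDs present in the Big Data source only
    """
    # Index the traditional source by ID for constant-time lookups.
    traditional_index = {row[key]: row for row in traditional_rows}

    merged = []
    unmatched_ids = []
    for row in big_data_rows:
        match = traditional_index.get(row[key])
        if match is None:
            unmatched_ids.append(row[key])
        else:
            # On a field-name clash the Big Data value overrides the
            # traditional one; rename fields first if both must be kept.
            merged.append({**match, **row})
    return merged, unmatched_ids


if __name__ == "__main__":
    # Hypothetical records, invented purely for illustration.
    register = [
        {"unit_id": "A1", "region": "North"},
        {"unit_id": "B2", "region": "South"},
    ]
    sensor_feed = [
        {"unit_id": "A1", "transactions": 120},
        {"unit_id": "C3", "transactions": 45},
    ]
    merged, unmatched = merge_on_id(sensor_feed, register)
    print(merged)
    print(unmatched)
```

The share of unmatched IDs is itself a useful diagnostic, since a source whose identifiers rarely match the traditional frame offers little scope for integration.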
Recommendations and Resource Requirements
The task team recommends that this work package proceed according to the items detailed in each row of the following table:
|Aspect||Recommendations||Links to further information (where applicable)|
|Processing environment||Hortonworks Hadoop distribution to be installed on a cluster provided by a volunteering statistical organization.||http://hortonworks.com/|
The Pentaho Business Analytics Suite Enterprise Edition will be deployed under a free trial license obtained for the purpose of the project (for an initial period of six months with the possibility of renewal up to one year).
Pentaho Business Analytics Suite Enterprise Edition provides a unified, interactive, and visual environment for data integration, data analysis, data mining, visualization, and other capabilities. Pentaho's Data Integration component is fully compatible with Hortonworks Hadoop and allows 'drag and drop' development of MapReduce jobs.
Additional tools such as R and RHadoop will be installed alongside the Pentaho Suite.
|Datasets & statistics to be produced with them (or feasibility of production to be demonstrated)||One or more from each of the categories below to be installed in the sandbox and experimented with for the creation of appropriate corresponding statistics:|
|Human resource requirements||A task team will need to be identified at the outset of the project, composed of experts whose time is volunteered in kind by their respective organizations for the duration of the work package. The project manager's first task will be to identify the number of members required, the requisite skills and the amount of time to be committed by task team members to enable the work to progress.|
Assemble task team to lead sandbox work
|All those interested in participating will be encouraged to do so, but a task team will be required to steer the work, ensuring objectives are pursued and processes are documented.|
Obtain and install necessary hardware, software etc.
Train the task team to ensure familiarity with the technical tools and to start collaboration between team members
|Obtain requisite datasets and undertake analyses in sandbox||July-October 2014|
|Produce a general model for producing statistics from Big Data and communicate it effectively to statistical organizations||November-December 2014|
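The recommended processing environment is built around MapReduce jobs running on Hadoop. As a rough illustration of that programming model only, the Python sketch below simulates the map, shuffle and reduce phases in memory to compute an average value per category. It does not use Hadoop, Pentaho or RHadoop, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit one (key, value) pair per input record."""
    for category, value in records:
        yield category, value


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each group of values to its mean."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_by_category(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical (category, price) observations, invented for illustration.
    observations = [("bread", 2.0), ("milk", 1.0), ("bread", 4.0)]
    print(average_by_category(observations))
```

On a real cluster the map and reduce steps would run in parallel across nodes and the framework would perform the shuffle. The sketch keeps everything in one process so that the division of work between the phases is easy to follow.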