Data Architecture for Old and New Data Sources

189. We analysed a number of examples of new and traditional data sources, generalising from the concrete examples in order to compare the general nature of newer data sources with that of the more traditional ones. Our conclusion is that the newer data sources tend to provide data that is ever closer to the actual phenomena that the NSO is trying to describe. In the traditional cases, we still needed “agents”, usually of the human kind, to provide the data. In the case of registers, i.e. data already collected for other purposes by other parties (usually organisations), we find ever more data sources that others have set up for an ever-growing variety of reasons, containing or delivering data that describes an ever-increasing number of phenomena.

190. The interesting thing is that this data seems to be ever more “raw”, “primitive”, ever more elementary. Instead of using data that is already “pre-processed”, filtered, coded, etc., we tend to tap into unfiltered, unprocessed, raw data that is more detailed by nature. Sometimes it is “unfocused”, as with satellite imagery, where a given image may be analysed to determine land usage, to estimate harvests or to count the number of solar cells on roof tops. In other cases the data is very focused, such as the output of traffic loops. These newer types of data are very often directly related to individual transactions or events (“happenings”), such as a purchase in a shop, a money transfer, a vehicle passing a certain spot on a road, a phone call being made, or goods passing the border, rather than the aggregated data that we used to collect in the past.

191. Where traditional data was purely alphanumerical, the new data is increasingly also of other types. Most of the new data is still alphanumerical, but we increasingly see other types as well: for now primarily imagery (still or video), but potentially also audio.

192. The general consequence of using such newer data is that the statistical organisation itself has to do the processing, or at least determine what processing is required to extract the desired “pure” information from the “raw” data. Execution of this processing may be outsourced to others, such as the data provider or a third party. But determining the required processing involves a preliminary stage of investigation, analysis and specification, maybe even the building of prototypes.

193. Due to the more detailed nature of the new data, the volume is increasing rapidly. So not only does the detailed data require more processing to extract its true value, there is also much more of it. Whether we thereby “hit” more units of the population under study remains to be seen.

194. Below we analyse the consequences for the architecture that derive from the differing characteristics of current and new data sources.

Table 3: Old vs New Data Sources. Each entry is shown as “current characteristic → new characteristic”, followed by its impact on the architecture (where elaborated).

Structured → Un-/semi-structured

Although the general perception might be that traditional data is structured, that impression is not necessarily correct. We should not forget that the real input to an NSI is not the output of the collection phase, but the data that “is out there in the real world”. The NSI uses an instrument (the interview, structured through a questionnaire) to select, filter and format the data from the raw source, the interviewee. So the data received from the collection phase is indeed pre-filtered, by the NSI itself, and because of their long experience NSIs have a lot of in-house knowledge about how to do this. In the case of new data sources that the NSI is (as yet) unfamiliar with, it needs to study (a sample of) the raw data and decide how to filter, format and structure it; examples are web scraping and the exploration phase for Big Data. The impact is that we need capabilities for this exploration and for designing and building the necessary instruments (a small exploration sketch is given at the end of this section).

High Latency (infrequent) → Low Latency (frequent) / Streaming

Receiving the information (much) sooner after the event happened in the real world generally means better quality. It also offers an opportunity to improve the timeliness of the published statistics. For this to happen, the internal processing must also be speeded up, up to the point of “straight-through processing”. However, the nature of statistics, aggregation, puts constraints on this: per cell, a minimum amount of relevant data must be received before the cell summary value can be calculated.

It therefore becomes a local trade-off whether to process data directly or to store it while awaiting more data (a minimal sketch of this trade-off follows the table).

Large Batch Size → Small Batch Size

Receiving small but frequent amounts of data opens the possibility of spreading the processing load over time, reducing the need for high peak capacity in the infrastructure that would otherwise stand idle for the rest of the period. However, just as with latency, there are limits to the gain that can be achieved.

Pre-filtered → Raw / Detailed

As explained above, the pre-filtering involved in primary data collection, and most likely also in collection from administrative data sources, is almost “sub-conscious” in most NSIs. Due to the amount of detail available in new data sources, the volume of data also tends to explode (hence “Big Data”). All of this data needs to be processed, so the capabilities involved clearly need to be able to handle (much) larger volumes of data. Also, as explained before, the NSI must develop new methods and instruments to select/filter, format and potentially aggregate the detailed data. In certain cases this work can then be off-loaded to parties (data providers) upstream in the value chain.

Controlled Quality (Less Noise) → Uncontrolled Quality (Noise, Bias)

As explained before, in primary data collection and to a certain extent also with administrative data sources, the NSI can control or at least influence the processes involved in the data collection, with positive effects on the quality of the data. In the case of primary data collection this includes the sample selection. Newer data sources in most cases do not allow for this up-front control: the data is what it is, and the NSI needs to take measures after the actual ingestion to manage the quality, perhaps discarding part of the data received as irrelevant or unusable due to bad quality.

In-House Processing → External Processing

Due to the large volume of data, organisations increasingly feel the need to off-load some or all of the processing involved in checking, selecting, reformatting, aggregating, etc. to parties upstream in the value chain. This requires mechanisms to express the algorithms to be applied, sometimes even the distribution of actual code, as well as mechanisms to validate and manage compliance. This has a big impact on partner collaboration. Technically, it may or may not involve cloud technologies.

Less Complex Supplier Management (internal skills sufficient) → More Complex Supplier Management (external expertise needed)

Both for understanding the meaning and usefulness of the data and for collaborating on external processing, there is a need for closer cooperation between the NSI, its data providers and any intermediate parties involved.

The complexity can arise in a number of ways: (a) legal contracts could be more difficult to agree; (b) some suppliers will want to work with the NSI as partners (e.g. supplying data for free in the expectation that they will benefit from the statistical output, such as obtaining more detailed information or obtaining it sooner).

Less Work for Standardizing / Cleansing → More Work for Standardizing / Cleansing

Because the NSI has less influence on the upstream processes, and wants to be able to use data from more and different sources, it needs to invest more in improving the quality of the data for its own purposes. As explained before, in the traditional cases most of this work is done in the preparation of the data collection instruments (the questionnaires).

This is not only a quality issue, however. It also requires a mapping exercise or a transformation process to ensure that the source data is made to comply with the standards needed for statistical processing (a small mapping sketch follows the table).

Mastering Reference Data Internally → Linking to External Reference Data

External providers of digital data most likely have their own reference data. Instead of creating and maintaining an internal copy of that reference data, NSIs may choose to re-use the external reference data by linking to it as an (external) live repository. This requires collaboration at the organisational as well as the technical level.

Storing Processed Data → Storing Raw Data (Volume and Retention Period)

Traditional Legal Framework → New Legal Framework

Accessibility, Stability / Continuity → Accessibility, Stability / Continuity

Digitized → Digital by nature

Most NSIs these days rely on digital data for their internal (automated) processing. Any data not in digital form must be digitized as part of the ingestion phase; most current CAPI, CATI and CAWI data collection practices include this. The new data sources, by their nature, deliver data that is digital from the outset, eliminating the need for digitization.

Human sources (mediated) → Machine generated (robotic sources)

There may be a need to understand the logic applied by the intelligent end-points and intermediate devices upstream, as part of understanding the meaning and value of the data provided.

Processes that provide machine-generated data will sometimes need training data in order to build the model/logic for processing the data. This process needs to be understood and reviewed periodically, so that the model can be re-trained where necessary.

Often no partnership needed → Adaptive design partnership

The only partnership involved in primary data collection is the (willing or forced) collaboration of the interviewee (the ultimate data provider). For new data sources, as explained before, closer collaboration is needed in many cases, extending into the preparatory stages of design and build. Fortunately, these data providers deliver data for most or all units, which means far fewer providers are involved per survey.
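
To make the trade-off described in the “Low Latency / Streaming” entry more concrete, the fragment below is a minimal, illustrative sketch (in Python) of an aggregator that buffers incoming event values per output cell and only releases a cell summary once a minimum number of contributions has been received. The cell key (road section and hour), the threshold and the summary function are assumptions made for the illustration, not prescriptions for how an NSI should implement straight-through processing.

    from collections import defaultdict

    class CellAggregator:
        """Buffer event-level values per output cell and release a cell
        summary only once a minimum number of contributions is present.
        Threshold, cell key and summary function are illustrative."""

        def __init__(self, min_contributions=30, summary=sum):
            self.min_contributions = min_contributions
            self.summary = summary
            self.buffer = defaultdict(list)  # cell key -> buffered values

        def add(self, cell_key, value):
            """Ingest one event; return (cell_key, summary) when the cell
            becomes publishable, otherwise None (keep buffering)."""
            self.buffer[cell_key].append(value)
            if len(self.buffer[cell_key]) >= self.min_contributions:
                return cell_key, self.summary(self.buffer.pop(cell_key))
            return None

    # Example: streaming traffic-loop events into (road section, hour) cells.
    aggregator = CellAggregator(min_contributions=3)
    events = [("A2-km57", "08:00"), ("A2-km57", "08:00"),
              ("A4-km12", "08:00"), ("A2-km57", "08:00")]
    for road, hour in events:
        cell = aggregator.add((road, hour), 1)  # each event counts one vehicle
        if cell is not None:
            print("publishable cell:", cell)

Whether a cell is released immediately or kept in the buffer is exactly the local trade-off mentioned above: a lower threshold improves timeliness, a higher one improves the stability of the published cell values.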

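The “More Work for Standardizing / Cleansing” entry implies a mapping or transformation step from supplier-specific codes to the classifications used in statistical processing. The sketch below is again purely illustrative: the supplier codes, the target classification codes and the handling of unmapped values are assumptions, not an agreed standard.

    # Illustrative mapping of supplier-specific product codes to assumed
    # classification codes. Unmapped codes are not silently dropped but
    # returned for review, so the mapping table can be extended.
    SUPPLIER_TO_STANDARD = {
        "SHOP-0017": "COICOP-01.1.1",
        "SHOP-0042": "COICOP-01.1.4",
        "SHOP-0099": "COICOP-02.1.0",
    }

    def standardise(records, mapping=SUPPLIER_TO_STANDARD):
        """Return (standardised records, records with unmapped codes)."""
        mapped, unmapped = [], []
        for record in records:
            code = mapping.get(record["product_code"])
            if code is None:
                unmapped.append(record)
            else:
                mapped.append({**record, "classification": code})
        return mapped, unmapped

    raw = [{"product_code": "SHOP-0017", "turnover": 120.0},
           {"product_code": "SHOP-9999", "turnover": 35.5}]
    ok, to_review = standardise(raw)
    print(len(ok), "standardised,", len(to_review), "to review")
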
195. Dimensions that can explain the differences:

  • The Vs: Volume, Velocity, Variety/Variability, Volatility
  • External, non-government (private) origin
  • Machine-generated data (Internet of Things)
  • No possibility to influence the data sources: sometimes unknown structure
  • Real-time/streaming phenomena
  • Quality (Veracity) aspects
  • Local/global sources, supranational data
  • Legal framework? Availability/accessibility? Ethical aspects
  • Need for partnership with external users to validate/understand/give meaning to the data
  • No labels
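
Several of the dimensions listed above (unknown structure, veracity, absence of labels) come together in the exploration step mentioned in the first entry of Table 3. As a purely illustrative sketch, the fragment below profiles a sample of raw records, here assumed to be scraped JSON objects, by reporting per field how often it occurs and which value types it contains. Such a profile is one possible starting point for deciding how to filter, format and structure a new source; the field names and the JSON-lines format are assumptions made for the example.

    import json
    from collections import Counter, defaultdict

    def profile_sample(lines):
        """Profile a sample of raw JSON records: per field, how often it
        is present and which value types it takes."""
        presence = Counter()
        types = defaultdict(Counter)
        total = 0
        for line in lines:
            record = json.loads(line)
            total += 1
            for field, value in record.items():
                presence[field] += 1
                types[field][type(value).__name__] += 1
        return {field: {"fill_rate": presence[field] / total,
                        "types": dict(types[field])}
                for field in presence}

    # Two scraped records with inconsistent typing and a missing field.
    sample = ['{"price": "9.99", "title": "Widget", "in_stock": true}',
              '{"price": 12.5, "title": "Gadget"}']
    for field, info in profile_sample(sample).items():
        print(field, info)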