(Feedback from ONS, 2 October 2017)

GSBPM as it stands seems to be mostly focused on survey data, with only limited reference to admin sources. It would be good to see more mention of commercial or open data - for example, data obtained from the private sector, or web-scraped data. As a minimum, these data sources should be mentioned throughout the document alongside survey and admin data, but ideally the stages would be amended to fit a commercial/open data context.

Some specific examples that illustrate how commercial/big data could be incorporated:

  • In a commercial/open data context, particularly where we are streaming data (for example, getting a live feed of accounts returns), the 'collect' and 'process' phases may happen as part of a single pipeline, and potentially result in 'live' updating of outputs. GSBPM as it stands doesn't necessarily preclude this, but it would be good to see it mentioned.
  • 'Sampling' (as referred to in paragraph 4.1) also takes on a different meaning in a 'big' data context - here we might actually have too much data to process, and so need to take a sample of the data already received rather than of the population to be surveyed.
  • The 'process' phase seems to be predicated on the assumption that the statistical output being produced is a simple aggregate, rather than a model describing the relationship between multiple factors or some other more complex analysis. The steps 'calculate weights' and 'calculate aggregates' should probably be revisited to make them more generic and applicable to products other than aggregates.
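
To make the point about sampling in a 'big' data context more concrete, here is a minimal sketch of one common way to draw a fixed-size uniform sample from a stream that is too large to hold in memory (reservoir sampling). It is purely illustrative and not something GSBPM prescribes; the function name reservoir_sample and the stand-in record stream are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k records from a stream.

    Hypothetical helper for illustration: the stream is read once and
    only k records are ever held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Stand-in for a live feed of records (assumption: any iterable works).
sample = reservoir_sample(range(100_000), k=1_000, seed=42)
```

Because the sample is built while the records arrive, this kind of step could sit inside the same pipeline as 'collect' and 'process', which is the situation described in the first example above.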