(From Guillaume) Suggest adding Statistical Register alongside Administrative Register in GSIM.

Or define some generalised concept of a register?

14 Comments

  1. Chris Jones

    (From Juan) Could be problematic when applied to registers for Big Data sources.

  2. Jenny Linnerud

    A Statistical Register can be defined as a register that is a regularly updated list of units and their properties that is designed for statistical purposes.

    A collection of Big Data could be a Data Resource, but it would not be a Statistical Register.

  3. user-8e470

    Meeting 24/1: ACTION: Need a more generic concept of register, then that could be extended as needed (administrative and statistical). 

    There is (potentially) a related issue with the subtypes of exchange channel. The types of exchange channel can be tricky: some seem to be tools, others are data sources. Is big data included? Big data could be considered a data resource. Big data is not always organised.

    ACTION: Francine Kalonji, Alistair Hamilton, Eva Holm - can you provide us with some information on issues implementing this?

  4. Alistair Hamilton

    While the current name "Administrative Register" creates some confusion/ambiguity, as an (incoming) Exchange Channel for us it relates to something that comes across the boundary of our organisation and can (usually does) have a Provision Agreement from whoever provides us with that input which determines what the ABS can do with the information they have provided.

    In this sense "Administrative Register" is our metadata related to an external data source (where we may not have a full description, particularly if the provider only gives us a partial extract from their underlying holdings).

    In a sense we currently use "Administrative Register" as "the input exchange channel that is not a traditional questionnaire".  (While we also have Web Scraper Channel in the ABS model, inherited from GSIM, we haven't done anything with it yet.)

    I think subclassing of (input) Exchange Channels makes sense, and it was always designed to be extensible.  We need to do detailed modelling of traditional questionnaires as an exchange channel, but there are other input channels where modelling for the Questionnaire subclass is not relevant.    
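A minimal sketch of what extensible subclassing of input Exchange Channels could look like. The class and field names below are illustrative assumptions, not GSIM definitions, and the provider and conditions in the usage lines are placeholders:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvisionAgreement:
    """Conditions agreed with a provider on how supplied data may be used."""
    provider: str
    conditions_of_use: str


@dataclass
class ExchangeChannel:
    """Base class: a means of receiving data across the organisation's boundary."""
    name: str
    provision_agreement: Optional[ProvisionAgreement] = None


@dataclass
class QuestionnaireChannel(ExchangeChannel):
    """Carries questionnaire-specific detail that other channels do not need."""
    question_ids: tuple = ()


@dataclass
class AdministrativeRegisterChannel(ExchangeChannel):
    """Metadata about an external source; the extract received may be partial."""
    extract_is_partial: bool = False


@dataclass
class WebScraperChannel(ExchangeChannel):
    """Placeholder subclass; hypothetical field for the site being scraped."""
    target_url: str = ""


# Hypothetical usage: an admin extract governed by a provision agreement.
abr_channel = AdministrativeRegisterChannel(
    name="Business registration extract",
    provision_agreement=ProvisionAgreement("ATO", "placeholder conditions"),
    extract_is_partial=True,
)
```

New subclasses can be added without touching the base class, which is the sense in which the hierarchy is extensible.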

    Deciding whether "Administrative Register" should be broad in scope - and possibly renamed - to cover an exchange channel where, eg, we get supermarket scanner data as an input to CPI - or whether that should become a separate subclass (eg "Big Data" Channel) I am not sure.

    I am not sure how much "detailed and different" modelling we should need to do for, eg, receiving scanner data from supermarkets vs receiving building approval data from local government authorities (which fit "Administrative Registers" better).  Differences might turn out to relate to, eg, modelling "streamed" vs "snapshot" input data - but "streamed" already applies to Administrative Register exchange channels in some cases.  For example, when a business changes its registration details with the Australian Taxation Office (ATO)'s Australian Business Register (ABR) we are notified so that we can update the ABS Business Register if needed.

    This is where I wonder whether Statistical Register is a new exchange channel (it could be if we are getting data from someone else's statistical register).  We have modelled the ABS Business Register (ABS BR) as a specific set of interrelated Data Structures.  We then have sets of metadata (not really in GSIM currently) that allow definition of Frames and Samples based on the ABS BR, and the sample can then be consumed by Statistical Programs to allow them to approach the selected units from the ABS BR with Questionnaires.  The ABS Business Register is not the same thing as the ATO ABR: we do a lot of profiling of significant Enterprise Groups in the design of the ABS BR that does not come directly from ATO ABR as an Administrative Register based Exchange Channel.  Similarly, updates to the ABS BR can be driven by feedback from ABS Statistical Programs that approached a business with a Questionnaire and found some detail in the ABS BR information for that unit wasn't right, or had changed.  In other words, not all updates to ABS BR as a "Statistical Register" are driven by the "Administrative Register" exchange channel we have with ATO.

    The modelling ABS has needed to add to support Registers, Frames and Samples (for businesses and addresses) is fairly specific and might be different, eg, to what would be sensible for a Nordic country.  If the requirement relates to describing "Statistical Registers" within whichever organisation is implementing GSIM to describe things - rather than something being exchanged across that organisation's boundary - then maybe (at least as a first step) an annex could be developed and added to inform on how such "Statistical Registers" can be modelled rather than trying to jump in 2018 to a fully fledged inclusion in core GSIM 1.3 modelling?

  5. Jenny Linnerud

    Would be good to have API as an Exchange Channel. In more and more cases our Information Providers are making their own APIs and then asking us to take what we need from there rather than them making an explicit delivery to us.

    Note that the Glossary group (Alice, Dan, ...) have been working on definitions of Register, Administrative Register and Statistical Register.

  6. Alistair Hamilton

    Jenny's suggestion would make a lot of sense for ABS.  It would be a more meaningful way of sub-classing Exchange Channels than adding a generic "Big Data" channel.

    For most "traditional" (external) Administrative Register exchanges ABS specifies the "extract" of data of interest (which may not be everything in the external source).  (The provider may, or may not, provide just the data requested by ABS in the format requested by ABS.)  The "extract" is typically then provided on a (negotiated) periodic basis.

    APIs change the game because traditionally APIs are designed by an external organisation for more generic collaboration and/or data sharing purposes (not based specifically on the data ABS would like reported) but are then available to be used on a "self select, self serve", streaming basis (eg various statistical experiments with social media APIs but also use of APIs provided by other government agencies) rather than relying on periodic extract generation activities undertaken by the provider. 

    One reason the ABS business transformation hasn't invested heavily in support for web scraping exchange scenarios is that we'd prefer to see these existing scenarios replaced - where possible - by API based exchange channels in future.  API based exchange channels tend to be more robust and quality assured than web scraping ones.

    Legally, also, a number of data rich sites have provisions outlawing large scale web scraping in their terms and conditions.  These conditions potentially steer consumers toward paid API access instead.  Given the benefits of API based access, these may be costs that statistical agencies are willing to pay, as opposed to continuing with "grey" web scraping.  Even more positively, in many cases statistical agencies may be able to negotiate free access to APIs given their non commercial purposes.  This then becomes another benefit of being able to specify the Provision Agreement which applies to each API based exchange channel.

  7. user-c9ea8

    Comments:

    3 linked issues: #25, #73, #46

    Incoming channels are currently:

    • Admin Register
    • Questionnaires
    • Web Scraping

    Outgoing is Product.

    Proposals:

    1. Admin register changes to broader name “Statistical Register”
      Add API as channel (Is this in addition to other types?)
    2. Create 2 subtypes of Exchange Channel (collection + dissemination)
    3. Change UML diagrams in specification to ensure only relevant relationships are shown (e.g. only show consume relationship for collection exchange channels).
  8. Alistair Hamilton

    Firstly, as an exchange channel, is "Statistical Register" actually broader than "Administrative Register"?

    1. Some data ABS sources through Administrative Registers is used only for producing statistics and not as a statistical register (eg births, deaths and marriages). 
    2. Some data we source from administrative registers is used primarily for producing statistics (eg motor vehicle registrations) but also lightly as a statistical register (eg selection frame for survey of motor vehicle use).
    3. Some data we source from administrative registers (eg external register of Australian Businesses) is used primarily to support our statistical register for business surveys but is also used directly as a source for producing statistics.

    Secondly, API is increasingly a dissemination channel for us also.  In one case, however, we have a consumer's view of the API channel and in the other case we have a provider's view of the API channel, so it would be fine if we ended up with API as a subtype (in different contexts) below both Collect and Disseminate.

  9. user-8e470

    Hi Al,

    Those are some relevant points. What would you propose as a solution?

    • Add Statistical Register as an additional channel (or just leave it as Administrative Register)?
    • Add API as additional channel?
    • Do not create the collection and disseminate subtypes so API could be used as both?
  10. Alistair Hamilton

    Hi Thérèse

    On the first point, my guess is just leave it as Administrative Register as the exchange channel (having looked at the Glossary Group's definition for Statistical Register), unless there are major cases for other statistical offices where they exchange statistical registers with each other?

    Taking the Motor Vehicle Registration Admin Data case, from this the ABS produces the Motor Vehicle Census (MVC) (pretty much straight production of statistics from Admin Sources) every year, but in some years we also run a Survey of Motor Vehicle Use (SMVU) that can use the Motor Vehicle Admin Data as a frame.

    Even in the latter case, however, our actual exchange channels are with 8 different state based motor registries that present data in slightly different formats and potentially with different conditions of use (part of the definition of an exchange channel).  We then have some internal work to convert what comes in from the motor vehicle registries into a frame for SMVU.  ABS doesn't maintain a "statistical register" for motor vehicles; we just build a frame off admin extracts from 8 motor vehicle registries (all with the same specific reference date for the extract) when we happen to decide to run SMVU.  (Given we do MVC every year, we get the 8 extracts every year and transform them into a coherent "census"; we just don't build a frame each year.)

    Similarly, what we exchange to get a refreshed list of registered businesses and their basic characteristics is an extract from an admin register with conditions of use set by the provider of that extract.  In the case of businesses, however, we do maintain a statistical register within the ABS and we add value to the admin data received by profiling large businesses etc within our statistical register.  We don't actually "exchange" the statistical register itself, however; we use it as a data and metadata resource within the ABS as required.

    In terms of the second point, on the collection side, I see two main possibilities.  I'd mildly lean toward merging Web Scraping and API into something (which would require some explanation) like "Data Harvesting".  Traditional web scraping (a live discussion in the ABS at the moment) typically "harvests" data that a company has put on the web, where the "provider" typically puts it on the web primarily for human consumption.  This leads to extra cost and risk because the scraping method needs to convert this back into consumable data for the NSO.  Also the provider may change the "human readable and accessible layout" of the data - without changing the meaning of the data - and upset the scraping method.

    Plugging into an API is straight machine to machine.  That said, the API is typically not designed specifically for providing data to the NSO, so work may still be required to reshape data from the API into the dataset an NSO wants to start processing internally, but it is a big step forward.

    Traditional Web Scraping and APIs are both "pull" methods (eg opportunities for the NSO to pull data whenever they want), whereas traditionally for both Questionnaires and Admin extracts the provider needs to fill out some data and then "push" it to the NSO.

    This would probably imply that an Admin Register accessible via (provider provided) API would be a "Data Harvesting" case because once API is the exchange channel (rather than questionnaires for individual units or "admin extracts" spanning many units) it is not too much of a concern for describing the exchange channel (but it will be for internal statistical use of what's received) whether the data source feeding the API is administrative, transactional, IoT/Big Data etc.

    This seems consistent with other Exchange Channels.  In GSIM we don't have separate exchange channels depending on "what kind of thing" is populating the questionnaire even though it may make a statistical difference whether it was self enumeration, interviewer administered, reported by a third party on behalf of a unit etc.

    For both APIs and Web Scraping as "pull" methods the starting point conditions of use are typically generic for any user of the website/API.  For "push" methods, due to the provider needing to do something to push the data to the NSO, the conditions are usually set by the NSO (under a legal framework) for questionnaires and potentially by an MoU for admin extracts.
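The push/pull contrast above can be written down as a small lookup, as a rough sketch only. The `Mode` enum, the channel labels and the helper name are assumptions made for illustration; "Data Harvesting" here is simply the set of pull channels, which is one reading of the proposal rather than an agreed definition:

```python
from enum import Enum


class Mode(Enum):
    PUSH = "push"  # the provider prepares data and sends it to the NSO
    PULL = "pull"  # the NSO fetches data whenever it wants


# channel label -> (mode, who typically sets the starting conditions of use)
CHANNELS = {
    "Questionnaire": (Mode.PUSH, "NSO, under a legal framework"),
    "Administrative extract": (Mode.PUSH, "potentially an MoU"),
    "Web scraping": (Mode.PULL, "provider, generic for any user"),
    "API": (Mode.PULL, "provider, generic for any user"),
}


def data_harvesting_candidates(channels):
    """Return the pull channels that a merged 'Data Harvesting' type would cover."""
    return sorted(name for name, (mode, _) in channels.items() if mode is Mode.PULL)


harvesting = data_harvesting_candidates(CHANNELS)
```

Under this reading an admin register exposed through a provider's API lands in the pull group, regardless of what kind of source feeds the API.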

    There can be a case made for keeping Web Scraping and API separate, however, and that seems fine too if it makes more sense to more people.

    I don't have a strong view on the third point.  I kind of lean toward having the separation of Collect and Disseminate and I don't think it matters if "API" ends up under both (from different perspectives).  That said, if we run with something like "Data Harvesting" it doesn't.

    With Dissemination the equivalent might be a split between "Packaged Data Products" and "Data Services" (once again, to some degree, a "push" vs "pull" contrast).  In the ABS case we use some APIs internally to create Packaged Data Products.  These would still be Packaged Data Products as an exchange channel (conditions of use are typically Creative Commons).  We also, however, provide external access direct to appropriate data services with appropriate conditions of use (sometimes much more detailed and sometimes not "general public access").

    Maybe for the coming version of GSIM there is no need to split out (Pre-packaged) Product from Service (which could conceivably include human delivered "customer specific services", not just machine to machine...or maybe keep it just M2M?) in which case API for dissemination would simply live as one flavour of "Product".

    Cheers

    Al

  11. Mikko Saloila

    Hi,

    We discussed this issue briefly with Essi after the meeting. To summarize, we agree with Guillaume that API is perhaps a bit too technical to replace Web scraping, and is not a separate thing from administrative register, but rather a way (or a platform) to collect and/or disseminate information. Secondly, we think that Web scraping could be replaced with Data harvesting (as Alistair suggested). To us it seems to be a somewhat broader term than Web scraping, even though as non-native English speakers we might be wrong on this. I guess the main thing for us is to also include new data sources, for instance sensor data, under this term.

    BR, Mikko and Essi

  12. user-8e470

    Meeting 15/5

    Not a lot of support for adding in the collection/dissemination subtypes.

    I think we are here with the proposal:

     

    Term: Statistical register

    Definition: A statistical register is a register that is a regularly updated list of units and their properties that is designed for statistical purposes.

    Source of definition: Based on UN NQAF Glossary (http://unstats.un.org/unsd/dnss/docs-nqaf/NQAF%20GLOSSARY.pdf)

    Explanatory text: A statistical register provides an (ideally) complete inventory of the statistical units within a specific population, and describes these units using different characteristics. One example is a business register held within a statistical office. All the statistical units in a statistical register have an identifier that makes it possible to update the statistical register with new information on the statistical units.
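To illustrate why the identifier matters in the explanatory text, here is a minimal sketch of updating a register keyed by unit identifier. The function name, identifiers and properties are all made up for the example:

```python
def update_register(register, updates):
    """Return a new register with updates merged in by unit identifier.

    Existing units have their properties overwritten where an update supplies
    a new value; identifiers not yet in the register are added as new units.
    """
    merged = {unit_id: dict(props) for unit_id, props in register.items()}
    for unit_id, props in updates.items():
        merged.setdefault(unit_id, {}).update(props)
    return merged


# Hypothetical units and properties, for illustration only.
register = {"U001": {"name": "Example Pty Ltd", "employees": 12}}
updates = {
    "U001": {"employees": 15},
    "U002": {"name": "Sample Co", "employees": 3},
}
updated = update_register(register, updates)
```

Without a stable identifier there would be no reliable way to decide whether incoming information describes an existing unit or a new one.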

    Need to define Data harvesting if agreed

  13. InKyung Choi

    Meeting 6 June

    • Exchange - does not differentiate external exchange and internal exchange
    • Data Harvesting can include Web Scraping and any kind of machine-to-machine exchange
    • Regarding Statistical Register and Administrative Register - there are big differences between them and a distinction should be made (e.g. France not only maintains statistical registers but is also mandated by law to maintain some administrative registers, hence the two need to be distinguished)

    Agreed

    1. to remove Web Scraping Channel and Scraping Process Map;
    2. to add Data Harvesting (will need definition, explanatory text, etc.); 
    3. to have both Statistical Register and Administrative Register (but flag this during global consultation for opinions)
  14. Alistair Hamilton

    Exchange - does not differentiate external exchange and internal exchange

    A key question is the extent to which GSIM intends that to be true.

    In GSIM 1.1 the channels are for collection and dissemination.  If you literally wanted to use exchange channel for all "internal exchanges" you'd need a channel "Hand Over Data" which is used every time data is passed 

    1. between Statistical Programs (eg from Retail Trade to feed into National Accounts modelling), or
    2. between a Statistical Support Program (eg an area maintaining a Statistical Register) and a SP that needs a survey frame based on the Statistical Register.

    I think adding Statistical Register would imply "2" is considered an Exchange Channel by GSIM but is it also suggesting "1" be considered an Exchange Channel?

    Basically for us ABS Business and Household Statistical Registers are Data Resources (composed of multiple data sets) from which Frames are extracted using business rules.  Application of the rules creates the Frame as a new data set.  The application of sampling via other rules then creates the Sample as a third data set from the Frame.  Ultimately a Statistical Program uses the Sample to drive an acquisition process that ends up with responses from sampled units.  The responses constitute a fourth data set.  From that data set processes and rules are applied to create new data sets (eg statistical estimates)  and so on.  In many cases a relevant subset of the processed data is passed on to National Accounts which then applies modelling across that data and other data to create new data sets and so on. 

    In general the chain of business processes, business rules and datasets works well.
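The chain of data sets described above (register, frame, sample, responses) can be sketched as a sequence of rule applications. This is an illustrative toy under stated assumptions, not the ABS implementation: the unit records, the frame rule and the function names are all hypothetical.

```python
import random

# Hypothetical register: 20 units with made-up properties.
register = [
    {
        "unit_id": f"B{i:03d}",
        "industry": "retail" if i % 2 == 0 else "mining",
        "active": i % 5 != 0,
    }
    for i in range(1, 21)
]


def extract_frame(register, industry):
    """Business rule: active units in the target industry form the frame."""
    return [u for u in register if u["active"] and u["industry"] == industry]


def draw_sample(frame, n, seed=0):
    """Sampling rule: simple random sample without replacement, seeded."""
    rng = random.Random(seed)
    return rng.sample(frame, min(n, len(frame)))


def collect_responses(sample):
    """Stand-in for the acquisition process; real responses would be filled in."""
    return [{"unit_id": u["unit_id"], "turnover": None} for u in sample]


frame = extract_frame(register, "retail")   # second data set
sample = draw_sample(frame, 3)              # third data set
responses = collect_responses(sample)       # fourth data set
```

Each step produces a new data set from the previous one, with no Exchange Channel needed between them.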

    In GSIM terms this could be seen as operating within the "Business" domain, even where business processes, and the data structures they use, may belong to different Statistical Programs, and Statistical Support Programs, as part of the value chain within the business.   

    If an NSO was operating in a "silo" manner, rather than in an integrated fashion, it would be possible to think of them as being multiple "businesses" and therefore needing an Exchange Channel to collect and disseminate data between "Business 1" and "Business 2" processing.

    This might see Statistical Registers modelled as a completely separate set of "business activities" from which a statistical program "collects" information to generate its frame and to whom it "disseminates" information when survey respondents identify changes should be made to the Statistical Register (eg because units have died or otherwise substantively changed).

    I think that is pointing in the opposite direction to Statistical Modernization, but it may reflect some current practices.

    If GSIM does want to emphasise internal exchange channels it should give guidance on when this is an appropriate construct and the potential downsides.