115. All data collection is modelled in GSIM using the Exchange Channel object, which represents the mechanism by which data comes into the statistical organization. This object is always extended into sub-classes, to describe specific sources of data collection. There is a growing emphasis on the use of non-survey data sources, as these often represent sources of data which can be realized more quickly and at lower cost. The model can be extended by adding further sub-classes to represent other sources, as required.
116. Two common forms of data collection are the use of data from administrative registers, and the collection of data by programmatically "scraping" web sites for their content. To reflect this, GSIM models two non-survey data sources - Administrative Registers and Web Scraper Channels. It also models one survey data source - Questionnaire. The following sections describe how each of these is modelled in GSIM.
A: Administrative Registers
117. The illustration below shows how GSIM can model administrative registers as data sources. The sub-type of Exchange Channel which represents an administrative register as a source is the Administrative Register object.
Figure 23. Administrative Register
118. The important information about Administrative Registers includes:
- the agreement between the statistical organization and the provider of the register data,
- the protocol for accessing the data, and
- the structure of the data to be received.
119. Each of these can be described by information objects which are inherited from the Exchange Channel by the subtype Administrative Register.
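The inheritance described above can be sketched in code. This is only an illustration and not part of GSIM: the Python class and attribute names are hypothetical simplifications, and the objects that GSIM models as separate information objects are reduced here to plain strings and tuples.

```python
from dataclasses import dataclass


@dataclass
class ExchangeChannel:
    """Hypothetical simplification of the GSIM Exchange Channel object."""
    provision_agreement: str   # agreement with the provider of the data
    protocol: str              # how the data is to be accessed
    agreed_structure: tuple    # component names of the structure to be received


@dataclass
class AdministrativeRegister(ExchangeChannel):
    """Sub-type: inherits the three items above from Exchange Channel."""
    register_name: str = ""    # illustrative attribute, not defined by GSIM


register = AdministrativeRegister(
    provision_agreement="Agreement with the register owner",
    protocol="SDMX data exchange",
    agreed_structure=("person_id", "birth_year", "region"),
    register_name="Population register",
)
```

Further sources of data could be represented in the same way, by adding further sub-classes of the same parent.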
120. The agreement between the statistical organization and the Information Provider is represented using a Provision Agreement object. This shows the relationship between our Administrative Register and the Organization with which the agreement exists (the Information Provider). There is typically an agreed structure for the data - described in the data collection agreement - but this can sometimes be different from the structure of the data actually received. The Information Provider object has a relationship to the Data Structure object. This represents the agreed structure of the information to be collected from the administrative register.
121. The Administrative Register object also inherits a relationship to a Protocol object from its parent Exchange Channel. The Protocol object captures the details of the technical process by which the register data is to be collected. This might be through the use of a standard mechanism such as an SDMX data exchange, a technical mechanism such as a query to a database, or even a manual process.
122. The Exchange Channel object also allows for its Administrative Register sub-type to link to the Data Set actually collected, which references its own Data Structure object. By comparing the collected data and its structure against the "agreed" structure, the received data can be validated. Note that if the information being collected were referential metadata, the Referential Metadata Structure object would replace the Data Structure in the diagram, and similarly the Referential Metadata Set would replace the Data Set.
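As an illustration of the validation described above, the following minimal sketch compares the structure of a received Data Set against the agreed structure. It assumes that a Data Structure can be reduced to a list of named, typed components; the class, the function and the example component names are hypothetical and are not defined by GSIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """Hypothetical stand-in for a Data Structure: (name, type) pairs."""
    components: tuple


def structure_differences(agreed: DataStructure, received: DataStructure) -> dict:
    """Report how the received structure deviates from the agreed one."""
    agreed_map = dict(agreed.components)
    received_map = dict(received.components)
    shared = set(agreed_map) & set(received_map)
    return {
        # components in the agreement but absent from the received data
        "missing": sorted(set(agreed_map) - set(received_map)),
        # components received but not covered by the agreement
        "unexpected": sorted(set(received_map) - set(agreed_map)),
        # components present in both, but with different types
        "type_mismatch": sorted(
            name for name in shared if agreed_map[name] != received_map[name]
        ),
    }


agreed = DataStructure(
    (("person_id", "string"), ("birth_year", "integer"), ("region", "string"))
)
received = DataStructure(
    (("person_id", "string"), ("birth_year", "string"), ("postcode", "string"))
)

diff = structure_differences(agreed, received)
is_valid = not any(diff.values())  # valid only if no deviations were found
```

In this example the received data would fail validation, because one agreed component is missing, one unexpected component is present, and one component has a different type.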
123. Once a Data Set is collected, we have all of the usual objects such as the Data Point, the Instance Variable, and so on. As more Data Sets are collected, these can in turn be stored in a Data Resource, which would hold all of the data coming from the Administrative Register over time.
B: Web Scraping for Data Collection
124. The second non-questionnaire data source to be modelled is a web scraper, as seen in the diagram below.
Figure 24. Web Scraping Channel
125. There will be at least a notional Provision Agreement between the statistical organization collecting the data through the web scraper and each of the organizations whose sites are being scraped (the Information Providers), even if this is only the terms and conditions of accessing the data provider's website. In many cases, Internet robots used for web scraping are blocked from websites, and there is typically contact between the scraping organization and the organization providing the data, both to make sure that access is not blocked and to learn when the website's structure might change.
126. Although perhaps trivial, the Protocol being used will need to be recorded, being either HTTP or HTTPS (by definition, the scraping tool is operating on the web).
127. Each website is scraped using a software application. Because websites vary in structure, a different software tool will often be needed for each website. Further, every time the website being scraped is structurally modified, adjustments may need to be made to the software tool. The software tools themselves are represented as Process Steps in GSIM. They are the result of a design process administered through a Statistical Program, and can be executed to programmatically collect the data.
128. The management of the mappings between each website and the software tool used to scrape it is important information to capture. It is necessary to be able to describe the software tools used to scrape websites, and their link to the websites for which they are designed. This is done using the Scraping Process Map object. This object links a Process Step and one or more Information Providers (the organizations whose sites that software tool can scrape). A set of these gives the links needed to manage the mappings between the web scraping tools and the sites from which the data is collected.
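The mapping described above could be held in a structure along the following lines. This is a sketch under stated assumptions: Process Steps and Information Providers are identified by plain strings, and the helper functions and all example names are hypothetical, not part of GSIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapingProcessMap:
    """Links one Process Step (a scraping tool) to the providers it can scrape."""
    process_step: str
    information_providers: frozenset


def tools_for_provider(maps, provider):
    """Return the scraping tools designed for the given provider's site."""
    return sorted(m.process_step for m in maps if provider in m.information_providers)


def providers_without_tool(maps, providers):
    """Return the providers that no scraping tool is currently mapped to."""
    covered = set().union(*(m.information_providers for m in maps))
    return sorted(p for p in providers if p not in covered)


maps = [
    ScrapingProcessMap("scrape_retail_prices_v2", frozenset({"Retailer A", "Retailer B"})),
    ScrapingProcessMap("scrape_job_adverts_v1", frozenset({"Job Board C"})),
]

tools = tools_for_provider(maps, "Retailer B")
unmapped = providers_without_tool(maps, ["Retailer A", "Job Board C", "Airline D"])
```

A lookup of this kind shows which tool needs adjusting when a given website changes, and which sites have no tool designed for them yet.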
129. As for the Administrative Register above, the structure of the data to be collected and the information regarding the actual data collected are captured in the Data Set and Data Structure objects.
C: Survey Data Collection
Figure 25. Questionnaire
130. Although more and more alternative data collection methods (such as Administrative Register sources) will be utilized by statistical organizations, it is envisaged that for the foreseeable future, surveys will continue to make extensive use of questionnaires for the purpose of data collection. As such, Questionnaire is included in GSIM as a subtype of Exchange Channel.
131. The Provision Agreement establishes the relationship between the Questionnaire and the Information Provider, in the form of some agreement to provide data to the collecting organization. This is sometimes (especially in the case of collections for official business statistics) a mandatory requirement specified by law.
132. Depending upon the survey it will be used for, the Questionnaire could be developed as one or more generic types. Each instance of a Questionnaire will be constructed by reference to the Questionnaire Specification. A Questionnaire could take the form of a standard Questionnaire Specification (i.e. the layout would be the same or have a relatively small number of variations) for a particular survey, or at the other extreme, the Questionnaire Specification could be tailored to each Information Provider (or Unit) selected for the survey.
133. The Questionnaire Specification will consist of a top level Questionnaire Component, which will itself be made up of lower level Questionnaire Components, built up in a hierarchical manner. Each Questionnaire Component will in turn be made up of a number of Instance Question Blocks, Instance Questions, and Instance Statements.
134. In its simplest form, a Questionnaire Specification would have a single Questionnaire Component made up of a number of simple Instance Question Blocks, Instance Questions and Instance Statements. It will also have associated Questionnaire Logic, which will govern the navigation and validation of Questions and responses within the Questionnaire Specification. The Questionnaire Logic will implement a number of Rules, which will carry out such work as the evaluation of the response data in terms of the range of acceptable values. In most cases, the Questionnaire Specification will be built up using several Questionnaire Component levels, each with its associated Questionnaire Logic.
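The hierarchy of Questionnaire Components and the range-checking Rules described above can be sketched as follows. The class layout is a hypothetical simplification: Questionnaire Logic is reduced to a list of Rules per component, Instance Question Blocks and Instance Statements are omitted, and the question names and acceptable ranges are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InstanceQuestion:
    name: str
    text: str


@dataclass
class Rule:
    """One Rule of the Questionnaire Logic, here a check on a response value."""
    question_name: str
    description: str
    check: Callable[[object], bool]


@dataclass
class QuestionnaireComponent:
    name: str
    questions: List[InstanceQuestion] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)
    children: List["QuestionnaireComponent"] = field(default_factory=list)

    def walk(self):
        """Yield this component, then all lower level components, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def validate(root: QuestionnaireComponent, responses: dict) -> List[str]:
    """Apply every Rule in the hierarchy to the responses that were given."""
    failures = []
    for component in root.walk():
        for rule in component.rules:
            value = responses.get(rule.question_name)
            if value is not None and not rule.check(value):
                failures.append(f"{component.name}: {rule.description}")
    return failures


household = QuestionnaireComponent(
    "Household",
    questions=[InstanceQuestion("hh_size", "How many people live here?")],
    rules=[Rule("hh_size", "household size must be between 1 and 30",
                lambda v: 1 <= v <= 30)],
)
person = QuestionnaireComponent(
    "Person",
    questions=[InstanceQuestion("age", "What is your age?")],
    rules=[Rule("age", "age must be between 0 and 120", lambda v: 0 <= v <= 120)],
)
top_level = QuestionnaireComponent("Top level", children=[household, person])

failures = validate(top_level, {"hh_size": 4, "age": 130})
```

Here the response to the age question falls outside its acceptable range, so the Rule attached to the lower level "Person" component reports a failure.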
135. Question Block, Question and Statement are reusable artifacts, which will be implemented in the Questionnaire Specification by means of the Instance Question Blocks, Instance Questions and Instance Statements respectively. It might be that the actual Question Blocks, Questions and Statements would be stored in some searchable library for use during the Questionnaire Specification development process.
136. A Question can take the form of a multiple question item, and can be hierarchical. A Question will have a connection to one or more Variables, and will also be associated with a Value Domain, which specifies the constraints on the values that can be assigned to the Variables in the response to the Question.
137. Different Protocols (modes of collection) would require different implementations of Questionnaire. For example, suppose one Questionnaire Specification is designed for collection via a web page, and a similar Questionnaire Specification, containing all the same Question Blocks and Questions, is intended for collection via a printed paper form. The two would be implemented in different instances of Questionnaire. Thus, where a multi-mode data collection strategy is adopted for a survey, a separate Questionnaire Specification would need to be developed for each Protocol (mode) employed, and each would be implemented in a different Questionnaire instance.
138. The navigation and validation aspects within the Questionnaire will need to be designed with the Protocol (mode of capture) in mind. For example, if the Questionnaire is to be rendered as a paper form, then the navigation will be implemented using an Instance Statement in the form of a text instruction to the Information Provider such as "If the response to gender question is 'MALE' then go to question X". If a similar Questionnaire were to be rendered as a web form, then the navigation could be automated and the Information Provider would be automatically routed to 'question X'.
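The difference between the two modes can be illustrated with a single routing rule implemented twice. This is a minimal sketch, assuming that a navigation rule can be reduced to a source question, a triggering response and a target question; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    source_question: str
    trigger_value: str
    target_question: str


def paper_instruction(rule: RoutingRule) -> str:
    """Paper mode: navigation becomes a text instruction (an Instance Statement)."""
    return (
        f"If the response to {rule.source_question} question is "
        f"'{rule.trigger_value}' then go to question {rule.target_question}"
    )


def web_next_question(rule: RoutingRule, response: str, default_next: str) -> str:
    """Web mode: navigation is automated from the response actually given."""
    return rule.target_question if response == rule.trigger_value else default_next


rule = RoutingRule(source_question="gender", trigger_value="MALE", target_question="X")

instruction = paper_instruction(rule)
routed = web_next_question(rule, "MALE", default_next="Q5")
not_routed = web_next_question(rule, "FEMALE", default_next="Q5")
```

The same rule yields printed text for the paper form, while the web form uses it to route the Information Provider directly to the next applicable question.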