Added by Thérèse Lalor on 22 October


There has been much debate about whether GSIM should differentiate between unit data and dimensional data. For GSIM v1.0, it was decided that the distinction would be retained. There were two reasons for this:

  1. The distinction is important enough to National Statistical Offices that it should be retained in the model.
  2. The proposals for how to change the model were thought to be too big a change to take on in the timeframe available.

Here are some of the relevant proposals and reviews that have been received on the topic.

GSIM Cube Dataset considerations.doc

Comments to GSIM-SDMX mapping_vdv.doc




14 Comments

  1. The following comments are specifically on the GSIM-SDMX mapping document.

    I agree with many of Vincenzo's observations but I am not entirely in agreement with the section that starts with :

    The additional GSIM purpose (in introducing the Logical Record class and in differentiating Unit and Dimensional data) seems to lie in the need of specifying that more logical records belong to the same IT object (the traditional dataset having more record types).

    I essentially agree with Vincenzo that each type of logical record could be expressed as a data flow.

    I think that by slicing a Unit Data Structure into pieces in this manner, however, some of the overall meaning and "goodness" is lost.

    The Record Relationship between Logical Records is not discussed at all in Vincenzo's paper, but this is where I believe much of the "extra goodness" - and mathematical significance - "lives".

    Rather than talking in the abstract, consider

    http://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&20370_2006.pdf&2037.0&Publication&CACF387B87CE36F3CA2575B4001A2380&&2006&13.05.2009&Latest

    If one starts with Page 42 of the document, the Unit Data Structure has three types of logical record, Dwelling, Family and Person. In terms of the record relationships supported, it is possible to connect Person with Family, Person with Dwelling and Family with Dwelling.

    It is possible to ask all sorts of questions based on these relationships. For example,

    • What percentage of dwellings contain more than one family?
    • What percentage of children 5 years old or younger live in a dwelling with no car?
    • Broken down by place of enumeration, how many single parent families with an income of less than $1000 per week who live in a rented dwelling have
      • 0 children under 8 years old
      • 1 child under 8 years old,
      • 2 children under 8 years old
      • 3 or more children under 8 years old
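    As a minimal sketch of how such questions draw on the record relationships, the Python below models the three logical record types and answers the first two questions above. The field names and sample records are invented for illustration (none of them come from the ABS file), and the identifier fields that point from one record type to another stand in for the Record Relationships. It is an illustration under those assumptions, not the ABS or GSIM implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Dwelling:
    dwelling_id: int
    cars: int


@dataclass(frozen=True)
class Family:
    family_id: int
    dwelling_id: int  # hypothetical link: Family -> Dwelling


@dataclass(frozen=True)
class Person:
    person_id: int
    family_id: int    # hypothetical link: Person -> Family
    dwelling_id: int  # hypothetical link: Person -> Dwelling
    age: int


# Invented sample records, for illustration only.
dwellings = [Dwelling(1, 0), Dwelling(2, 2), Dwelling(3, 1), Dwelling(4, 0)]
families = [Family(10, 1), Family(11, 1), Family(12, 2), Family(13, 3), Family(14, 4)]
persons = [
    Person(100, 10, 1, 34), Person(101, 10, 1, 4),
    Person(102, 11, 1, 29), Person(103, 12, 2, 3),
    Person(104, 12, 2, 41), Person(105, 13, 3, 67),
    Person(106, 14, 4, 5), Person(107, 14, 4, 38),
]


def pct_dwellings_with_multiple_families(dwellings, families):
    """Percentage of dwellings that contain more than one family."""
    families_per_dwelling = Counter(f.dwelling_id for f in families)
    multi = sum(1 for d in dwellings if families_per_dwelling[d.dwelling_id] > 1)
    return 100.0 * multi / len(dwellings)


def pct_young_children_without_car(persons, dwellings, max_age=5):
    """Percentage of children aged max_age or younger living in a dwelling with no car."""
    cars_by_dwelling = {d.dwelling_id: d.cars for d in dwellings}
    young = [p for p in persons if p.age <= max_age]
    no_car = [p for p in young if cars_by_dwelling[p.dwelling_id] == 0]
    return 100.0 * len(no_car) / len(young)


print(pct_dwellings_with_multiple_families(dwellings, families))
print(pct_young_children_without_car(persons, dwellings))
```

    Both functions have to follow a link from one record type to another, which is the information the Record Relationship carries.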

    I am not sure how Vincenzo would characterise these as mathematical functions, but it seems to me they might be seen as based on mathematical functions.

    By the time you get to aggregate data presented in a dimensional form these rich unit based relationships between different record types are typically gone.

    The answer (the results set) for any of the questions I listed above could be expressed as a simple data flow. If you know in advance the set of questions people are going to ask in terms of record relationships then I believe you can define a single logical record (it may be large and complex) in advance from which the answers can be sourced.

    Even with the relatively simple example of the ABS Census Data (I've seen a Unit Data Structure with nine logical record types and very complex relationships between them), however, there appears to be an astronomical number of questions that could be asked. It seems to me that a Unit Data Structure with Record Relationships best preserves the power and flexibility inherent in the data.

    It might be considered that the Unit Data Structure with Record Relationships is a more "internal" structure, where dimensional structures are typically used for dissemination. I suspect this is correct as a generalisation, although the ABS Census example is an example of a Confidentialised Unit Record File available for external researchers to query.  GSIM is intended to support end to end processing of data in any case, so if Unit Data Structures are more about internal definition, management and use of microdata then this is still in scope for GSIM.

  2. It seems to me that there are several issues that are mixed in the discussion re. Unit Data and Dimensional Data.

    First of all, there is the distinction between semantic structures and technical structures. In my earlier document, I tried, without explicitly calling it so, to model the semantic structure of a dimensional dataset (a hypercube). This is very different from a technical structure, that is more like the way SDMX and current GSIM do it. I would say that on the semantic level, there definitely is a clear distinction between unit data and dimensional (aggregated) data.

    Secondly (and possibly less relevant to this particular discussion), there is a distinction between the structure of a logical dataset and that of a physical dataset. As an example, one could describe the (logical) structure of a complete hypercube, with all dimensions and levels, as opposed to the (physical) structure of a selection of that hypercube, let's say a primary marginal. The same applies to the structure of a complete (logical) unit dataset vs the structure of a (physical) selection thereof: for example, a normalized relational structure in a database vs the extraction of just one of the tables (record type). The more "physical" the structure becomes, the more similar the structures of dimensional data and unit data tend to become.
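    To make the hypercube example concrete, here is a small Python sketch. The dimension names, categories and counts are invented, and "primary marginal" is taken here to mean the totals over a single dimension, which is an assumption about the intended usage. The point is only that the selection comes out as a flat two-column table, much like an extract of a single record type.

```python
from collections import defaultdict

# Hypothetical logical hypercube: person counts by (sex, age_group, region).
DIMENSIONS = ("sex", "age_group", "region")
cube = {
    ("F", "0-17", "North"): 12, ("F", "0-17", "South"): 8,
    ("F", "18+", "North"): 30, ("F", "18+", "South"): 25,
    ("M", "0-17", "North"): 10, ("M", "0-17", "South"): 9,
    ("M", "18+", "North"): 28, ("M", "18+", "South"): 27,
}


def marginal(cube, dimensions, keep):
    """Sum the cube over every dimension except `keep`."""
    position = dimensions.index(keep)
    totals = defaultdict(int)
    for cell, value in cube.items():
        totals[cell[position]] += value
    return dict(totals)


# The selection is a flat category -> value table.
for category, total in marginal(cube, DIMENSIONS, "sex").items():
    print(category, total)
```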

  3. user-8e470

    Group Discussion:

    Theoretically and yes maybe physically (depending on how the data is stored), you don't have the distinction.

    Practically, you need the distinction for semantic reasons.

    You can describe microdata as a dimensionalised cube - but is that what you want to do? There is a lot of rich detail that would be lost if the models were collapsed.

    The group present agreed that the semantic distinction is more important. There are also organizational reasons for having the distinction (e.g. confidentiality). The accessibility of the model is also a key argument for keeping the distinction.

     

  4. A GSIM design principle is to keep it as simple and concise as reasonably possible. For this reason I have challenged the distinction between unit data and aggregate data. It boils down to two questions:

    1. Can we produce clear separate definitions?
    2. Does it matter for the design of statistical production processes?

     

  5. user-8e470

    Group Discussion:

    See comments for Wim above.

    The difference is between unit and dimensional views of data. There are things that you can do with unit data that are not possible with a dimensional view of the data (see Al's example above about ABS Census data).

    The unit view is useful and necessary. Dan has an example of why this is so, relating to units in longitudinal data.

     

    Actions:

    1) Need to improve definitions of Unit and Dimensional data.

    2) Look at these objects in terms of metadata flows

     

     

     

  6. As I understood the discussion, the difference is not in the data structure, but in the way you view/use the data in the statistical process. The unit perspective looks record by record (horizontal), the dimensional view looks by variable (vertical). My apologies for the simplistic presentation, but it reflects my understanding. The same data can be viewed from both perspectives, and usually both perspectives occur in the statistical process (especially in the validation process).
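    A rough Python sketch of the two ways of looking at the same data follows, using validation as the example. The records, field names and the negative-income rule are all invented for illustration; this is one reading of "horizontal" and "vertical", not a GSIM definition.

```python
# Invented records; each dict is one unit (one row).
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 51, "income": 61000},
    {"id": 3, "age": 27, "income": -400},
]


def unit_view_check(records):
    """Horizontal: inspect each record in turn and flag suspicious units."""
    return [r["id"] for r in records if r["income"] < 0]


def variable_view_summary(records, variable):
    """Vertical: summarise one variable across all records."""
    values = [r[variable] for r in records]
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}


print(unit_view_check(records))
print(variable_view_summary(records, "age"))
```

    Both functions read the same list of records; only the direction in which they traverse it differs.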

    I wonder whether it is useful to have this distinction in our model, as in certain process steps both perspectives will have to be combined. In that case you would already have three perspectives (I need some time to imagine more).

  7. user-8e470

    2/7/13 meeting: Alistair Hamilton to progress forward

  8. user-8e470

    Alistair has written a very comprehensive document on this topic. The text will (presuming it is agreed to) be very useful to include in the GSIM 1.1 specification document.

    Here are some snippets I have pulled out (but I recommend you read at least the summary at the front of the paper as well):

    Having undertaken the analysis I am more convinced now, for conceptual and practical reasons, the current division is – on balance – appropriate.

    On the other hand, I will not assert there is an absolutely watertight and incontestable case for the division.

    It is proposed within the analysis that, not surprisingly, Unit data relates to individual Units.

    My recollection is that during the GSIM V0.8 to GSIM V1.0 process there was a common view that the distinction between Unit data and Dimensional data was not about microdata vs aggregate data.  Nevertheless, 75% of definitions in GSIM V1.0 related to DimensionalDataSets and DimensionalDataStructures mention the word “Aggregate”.

    For the purposes of this analysis, in line with long standing sources quoted below, I propose that Dimensional data refers to (sub) Populations, rather than individual Units.  When selecting this working definition, it is recognised that it is possible to have subpopulations that consist of zero or one Unit – or even engineer design of a DimensionalDataSet such that every subpopulation identified within the DimensionalDataSet is guaranteed to correspond to an individual Unit.
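    The edge cases mentioned here can be illustrated with a short Python sketch. The units, dimensions and category lists are invented for illustration. It counts how many Units fall in each subpopulation (cell) of a dimensional breakdown, and then shows the "engineered" case where the identifier itself is used as the only dimension, so that every subpopulation holds exactly one Unit.

```python
from collections import Counter
from itertools import product

# Invented units and category lists, for illustration only.
units = [
    {"unit_id": "u1", "sex": "F", "region": "North"},
    {"unit_id": "u2", "sex": "F", "region": "North"},
    {"unit_id": "u3", "sex": "M", "region": "North"},
    {"unit_id": "u4", "sex": "F", "region": "South"},
]
categories = {
    "sex": ["F", "M"],
    "region": ["North", "South"],
    "unit_id": ["u1", "u2", "u3", "u4"],
}


def cell_counts(units, dims, categories):
    """Number of units in every cell of the cross-classification of `dims`."""
    observed = Counter(tuple(u[d] for d in dims) for u in units)
    all_cells = product(*(categories[d] for d in dims))
    return {cell: observed.get(cell, 0) for cell in all_cells}


by_sex_region = cell_counts(units, ("sex", "region"), categories)
by_identifier = cell_counts(units, ("unit_id",), categories)
print(by_sex_region)
print(by_identifier)
```

    In the first breakdown the cells hold two, one, one and zero Units; in the second every cell holds exactly one.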

    It remains essential to recognise that in many cases the same “physical” set of data (eg codes and numbers stored in a relational database) could be viewed from both a Unit and a Dimensional perspective.  Para 94 of the GSIM V1.0 specification, for example, notes that “unit data” and “dimensional data” are different perspectives on data and that although not typically the case, the same set of data could be described both ways.

    Overall, these “edge cases” further highlight that the Unit vs Dimensional question is primarily about how we choose to “think about” and characterise a particular set of data, and not so much about the physical representation and implementation of the underlying data.  For me this highlights that the differentiation belongs in a conceptual (or, at a minimum, logical) characterisation of data rather than at the physical implementation level.

    If the basis for differentiation is broadly accepted, the GSIM Implementation Group may wish to consider whether the possible “tightening” of definitions for GSIM V1.1 can be pursued over coming weeks and during the Sprint.

     

     

  9. user-8e470

    16/10: Text forwarded to colleagues in Eurostat. Comment given tomorrow.

     

  10. user-b7160

    I also find Alistair's presentation of the issue very compelling.

  11. user-07a97

    At the previous Implementation Group call there was a discussion on simplifying the Data Structure model by removing some of the abstract classes. I offered to provide one. However, there is also a debate on whether unit record data and dimensional data can be combined in a single model. So, rather than remove these abstract classes, I have used them to represent a simple model for data structure. This model supports both unit data and dimensional data structures. Note that I have used the current abstract classes to represent this, but in reality these would be concrete classes and the current concrete classes to support unit data and dimensional data structures would be deleted.

    Alternatively, you could leave all the classes as they are and have this as an additional diagram that shows a high level "abstract" view of the two sub models.

    Note that I have added UML Constraints (text shown in brackets on the association) to constrain the use of the association where it is only applicable to unit data. I have also removed the associations between Data Structure and the individual types of structure component (Identifier, Attribute, Measure) and made the association to Data Structure Component (this class is already in the model and the three types of structure component are already subclasses of it). This makes the model a little simpler.

     

     

     

  12. I prefer not to vote, but to finalise the discussion. I think we have reached consensus on the fact that there is no clear-cut distinction between unit data and dimensional data. Both perspectives can be valid, depending on the purpose of the analysis or process. Technically it depends on viewing the identifier as an informationless number or as a classification.

  13. user-8e470

    Discussion 29/10:

    Vote shows majority of group are in favour of keeping distinction in.

    Al's proposal is unit data is for units, dimensional data is for sub populations. If we agree to this, then there are a number of changes to be made in the model. 

    We agree to the tenet of it; the exact changes, in terms of relationships and definitions, are to be worked out at the sprint. For example, we will need to review the relationship between datum and unit. We also need to tighten up all the definitions of dataset-related objects.

    There is already some text in the spec document (para 94) which describes the perspectives issue, i.e. data can be both. This should be reviewed to ensure it is adequate.

     

  14. user-8e470

    From Tjalling Gelsema:

    For those interested, even after the verdict, please find below my reaction to Alistair’s latest contribution to the discussion.

     

    I wouldn’t prefer an “incontestable case for the division”. There is however a scientific case for the following assertions: 1) there is a difference between unit data and dimensional data, and this difference does not rely on “perspectives”; 2) there are more kinds of data than those two, and; 3) all can be captured by the same model (without loss of expressiveness). See my 2012 article published in the JOS. If I’m not mistaken, I think that many of the arguments expressed there are in correspondence with the ‘functional’ arguments found in Vincenzo’s paper.

     

    So, I don’t think that “in many cases the same “physical” set of data […] could be viewed from both a Unit and Dimensional perspective”, at least not if those perspectives are supposed to be meaningful. The way I see it, it can be shown that in only some cases such a set could be viewed from both perspectives (viz. if that data set has only one column that could qualify as a dimension). Also, I don’t think that “the Unit and Dimensional question is primarily about how we choose to “think about” and characterise a particular set of data”: one of the main points of the JOS article being that dimensional and unit data are essentially (intrinsically) different.

     

    So, Alistair’s point of view seems to be that the difference lies ‘in the eyes of the beholder’ of the data set. That raises a question: why then does he need a “formalization”? I would think that any (natural, meaningful) formalization of the concepts of unit and dimensional data is in contradiction with the subjective approach to classifying a data set. The reason that, in this case, it is not a contradiction is that the ordered pair and the triple formalizing object and statistical characteristics is not a formalization at all: it is just an introduction of notation. Whether this notation is natural, convenient and suitable is not answered satisfactorily in the guidelines, nor is the formal meaning of the notation given.