Cube Dataset & Data Structure models
Examples of Cube Datasets
- One single figure, e.g. the total population of the Netherlands on January 1st, 2012
- The total population of the Netherlands, broken down by sex: two figures, no total.
- One level only, e.g. the total population of the Netherlands on Jan 1st, 2012 by Province (but without the total number for the whole of the Netherlands)
- As 3, but including the total for the Netherlands
- One of the cubes from the set of (60) cubes of the Census 2011. Ref Regulations (EC) No 763/2008 and 1201/2009
- The Statline table “Population, key figures” as shown in the Annex
- A cross-section or “slice” out of the Statline table, e.g. only the time series showing the “green pressure” percentages by year.
In the IDP specification team we agreed that the whole set of 60 cubes of the Census 2011 is not considered a Cube, but a set of (related) cubes. We did not model the specific relation between cubes, but other constructs within GSIM may be used for that purpose.
From the explanatory text of the information object Unit Data Point:
For example (1212123, 43) could be the age in years on the 1st of January 2012 of a person (Unit) with the social security number 1212123. The social security number is an identifying variable for the person whereas the age, in this example, is a variable measured on the 1st of January 2012. The value can be obtained directly from the Unit or indirectly via a process of some kind.
Where is the information “on the 1st of January 2012” held in the model? I.e. on which information object, which attribute? Is it part of the definition of the (Instance, Represented) Variable? Or is it an attribute of the Dataset or a DataAttribute of the UnitDataPoint itself? What if this DataPoint becomes part of a larger Dataset that contains data about other moments in time, like a time series?
Some questions here are:
- What happens (to the metadata) when a Dataset is split (by selection)?
- What happens (to the metadata) when a Dataset is combined with another Dataset (join, append, etc.) where DataPoints do not “overlap”?
- What happens when two Datasets are combined that contain the same DataPoints (overlap), but different quality or different source? (I.e. same id, same measure, different Data Attribute)
- What happens (to the metadata) when data is aggregated?
Analysis of the examples
In example 1, there are no (apparent) dimensions, there’s just one measure, and one cell (Data Point): the total number. This cell is “identified” by the label on the dataset, which suggests that in fact this “cube” is a selection or cross-section from a larger cube having dimensions: region (selection: “The Netherlands”, level: “Country”) and Time (selection: “Jan 1st, 2012”). It also suggests that it is an aggregate (“total population”), which prompts the question about underlying details such as given by the next examples.
Example 6 shows that the data may be percentages (e.g. “demographic pressure”). (So what …..?) It also shows that the various “topics” do not seem to have a common underlying level in the Dataset. This makes the dataset just a convenient “packaging” of data that describe related phenomena, rather than a structurally and content-wise coherent set of data.
The Census cubes seem to be structured more or less as envisioned by the GSIM Cube Data Structure. But the GSIM Data Structure contains nothing about the value domains of the data (neither for dimensions, nor for measures). Neither does it say anything about the leveling of the dimensions. The GSIM Data Structure therefore is not sufficient for constructing a query other than “give me the whole thing”, or at best: give me the marginals with this dimension “left out” (which may be interpreted as “totaled” or rolled up using some other operation, e.g. averaging).
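As a minimal sketch of this point (all names and categories here are illustrative, not GSIM terms), the per-level value domains are exactly what a Data Structure would have to expose before a client can construct anything more specific than “give me the whole thing”:

```python
# Hypothetical structure description (illustrative, not a GSIM construct):
# each dimension lists its classification, its levels, and the value domain
# (set of categories) per level.
structure = {
    "dimensions": {
        "region": {
            "classification": "NL-Region",
            # Without these per-level value domains, a client cannot name
            # the categories it wants in a selection.
            "levels": {"Country": ["NL"], "Province": ["Groningen", "Utrecht"]},
        },
        "time": {"classification": "Year", "levels": {"Year": ["2011", "2012"]}},
    },
    "measures": {"population": {"value_domain": "non-negative integer"}},
}

def categories(structure, dim, level):
    """Return the value domain of one level of a dimension --
    the information a query builder needs to specify a selection."""
    return structure["dimensions"][dim]["levels"][level]

print(categories(structure, "region", "Province"))  # ['Groningen', 'Utrecht']
```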
Microdata vs Aggregated data
Using the following example, it can be seen that from the column structure alone, one cannot (always) make the distinction between microdata (describing units in a population) and aggregated data (summarizing data about a population on some higher level of granularity).
The table on the left is in fact a microdata table giving details about individual companies, whereas the table on the right is aggregated data giving the average turnover per company by activity and by size class.
In order to say something about the nature of the data in the dataset, information is needed above and beyond the column structure (i.e. the variables) of the dataset. One needs to know what is represented by the rows (records). In other words, one needs to know the domain of the function that resulted in the image represented by the dataset.
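The ambiguity can be made concrete with a toy sketch (all company figures are made up for illustration): two tables with an identical column structure, where only knowledge of what a row represents distinguishes micro from aggregated data.

```python
# Both tables share the same column structure (variables):
columns = ("activity", "size_class", "turnover")

micro = [           # one row per individual company (unit)
    ("retail", "small", 120),
    ("retail", "small", 80),
]
aggregate = [       # one row per (activity, size_class) cell: average turnover
    ("retail", "small", 100),
]

# The distinction lives outside the column structure, in what the rows
# (records) represent -- the domain of the function behind the dataset:
domain_of = {"micro": "individual companies",
             "aggregate": "sub-populations by activity and size class"}
```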
The question is whether this type of information is to be considered part of “structural” information (and therefore needs to be part of the GSIM Data Structure). Since it is necessary for a correct interpretation of the data, we would strongly argue that in fact it is. However, although we feel we are close, at this point in time we do not yet have a satisfactory modeling solution for this problem.
Why alternative models?
The reason I believe there is a need for another way of modeling is that I am not satisfied with the current information models. In my opinion, the current models (Cube Dataset and Data Structure)
- Do not reflect the true nature of a Cube, which is structured for slicing and dicing, drill-down/up and pivoting, let alone for (further) aggregation (roll-up). For instance, the current Data Structure does not contain the information necessary for the construction of queries against the Cube. It does not recognize the leveling of the dimensions and does not recognize the fact that a cube usually contains multiple levels (aggregates and marginals) in one dataset.
- Do not specify the “boundaries” of the data contained within a Cube Dataset. A selection out of a Cube Dataset may have the same Data Structure as the cube that it was taken from. A DataFlow that exposes a time series has the same structure as the individual Datasets that contain the time slices that are produced one by one over time and gradually “added” to the DataFlow (or maybe a new accumulative time series Dataset).
- Do not show or explain the relation between micro-data and aggregate (by which I do not mean the process model or even aggregation operations but the structural information such as: this population figure was aggregated from this micro data. Conceptually, the relation is there, even if the data is kept separated in different datasets).
- Do not recognize the fact that an aggregate may be considered micro data by the user (each of the examples above may be considered micro data or cross-sectional, partial cubes by Eurostat).
- Show, but do not explain the need for, a difference in structure between Cube and Unit data.
The alternative models
Some of the shortcomings of the current model may be due only to the way we constructed the views. There are relationships in the GSIM model that allow navigation to other types of meta information, but this additional data is apparently not considered part of a Data Structure.
This leads me to the following questions:
- What is in fact a Data Structure?
- What is the purpose of a Data Structure?
- What does it need to contain in order to fulfill that purpose?
One of the things I believe a Data Structure must allow the user to do is to understand the data sufficiently to be able to construct queries against it. Queries are specifications for sub-sets of the data contained in the Dataset. For a user to be able to specify such subsets, he/she must know the values (categories) of the various levels of the dimensions. If the user is only interested in a part of the data, e.g. a certain period or time slice or a certain regional cross-section, he/she must be able to indicate the desired slice by the correct values. These are categories in the underlying levels (sub-aggregates or micro data).
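A small sketch of this (toy region and population figures, purely illustrative): specifying a slice amounts to naming categories at known levels, which the user can only do if the structure told him/her which categories exist.

```python
rows = [
    # (region, year, population) -- toy figures for illustration only
    ("Groningen", "2011", 580000),
    ("Groningen", "2012", 582000),
    ("Utrecht",   "2012", 1250000),
]

def slice_(rows, region=None, year=None):
    """Select a cross-section by category values. Writing this call
    presupposes knowing the value domains of the dimension levels."""
    return [r for r in rows
            if (region is None or r[0] == region)
            and (year is None or r[1] == year)]

print(slice_(rows, year="2012"))          # the 2012 time slice (two rows)
print(slice_(rows, region="Utrecht"))     # the Utrecht cross-section
```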
The use of aggregated, dimensional data as microdata by a user of a dataset prompts the question whether that user is still considering the data in the same context as meant by the supplier. We tend to believe that there is a change in population. Example: data given as the summary of the population of the Netherlands (a set of persons or households) may be used by Eurostat as information about the Netherlands, i.e. properties of a member of a set of countries. This is a more general observation: users and suppliers may have different viewpoints with respect to the data, and the question arises whether that has an impact on the way the user perceives the structure of the data. We tend to believe that such a change in view of necessity brings about a transformation of the structure and of the data itself.
An attempt at definition
A Cube contains data pertaining to a Population. A Cube has dimensions. Each dimension is associated with a Classification Scheme or possibly a Classification. A Classification has Categories that are organized into Levels. Each Level contains a non-overlapping (disjunctive) and exhaustive set of Categories.
A Strict Cube is a Cube that contains data for a crossing of dimensions where each dimension is restricted to just one Level of its Classification Scheme.
A Marginal is a Strict Cube that has at least one dimension whose Categories belong to a Level higher than that of the underlying Strict Cube. A Marginal therefore is a role; it is a kind of relationship between two Strict Cubes.
A Primary Marginal is a Marginal where there is just one dimension that is “rolled up” into its next higher level. A Higher Level Marginal is a Marginal where one or more dimensions are rolled up to a higher Level in the Classification.
A Strict Cube has as many Primary Marginals as it has dimensions.
A Strict Cube may be a Primary Marginal with respect to more than one underlying Strict Cube (see figure).
A Strict Cube and all its Marginals are defined on one and the same Population, although described (summarized) on different levels of granularity.
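The definitions above can be sketched as follows (toy categories and counts, not real data): a Strict Cube as cells keyed by one category per dimension, and a Primary Marginal produced by rolling one dimension up to its parent level.

```python
# A Strict Cube: every dimension restricted to one Level of its
# classification (here: Province x Sex), with toy counts as the measure.
cube = {
    ("Groningen", "M"): 290, ("Groningen", "F"): 292,
    ("Utrecht",   "M"): 620, ("Utrecht",   "F"): 630,
}

# Parent relation between Levels: Province -> Country.
parent = {"Groningen": "NL", "Utrecht": "NL"}

def primary_marginal(cube, dim_index, parent):
    """Roll exactly one dimension up to its next higher level
    (here by summation), yielding a Primary Marginal."""
    out = {}
    for key, value in cube.items():
        new_key = tuple(parent[k] if i == dim_index else k
                        for i, k in enumerate(key))
        out[new_key] = out.get(new_key, 0) + value
    return out

print(primary_marginal(cube, 0, parent))
# {('NL', 'M'): 910, ('NL', 'F'): 922}
```

With two dimensions there are two such roll-ups, matching the assertion that a Strict Cube has as many Primary Marginals as dimensions.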
Question: If a dimension is rolled up to its highest level (“Total”), does it disappear from the Strict Cube? Such a dimension cannot be rolled up any further.
A Cube Dataset may contain many Strict Cubes, that may or may not be related through Marginal relationships.
Question: Is there any reason to make a distinction between (a) Cube Datasets that contain Strict Cubes that are all related through Marginal relationships (whether present in the Dataset or not) and (b) Cube Datasets that contain Strict Cubes that do not (all) have such relationships? Is there any reason to forbid the second type of Cube Datasets (a more or less “random” packaging of Cubes)?
Remark: The Statline Table shown in the annex shows a number of Marginals, but the underlying Strict Cube is not part of the Dataset. So, although the Marginals are all defined on the same Population, the relationship in the data is lost and may only be known through the metadata. This example is still considered to be of type (a).
Assertion: A selection from a Cube Dataset is again a Cube Dataset. (Proof?)
Assertion: A selection from a Strict Cube such as a cross-section results in another Strict Cube, but changes the Population that is being described. (Proof?)
Question: Does a selection across the Time-dimension constitute a “break” in the sense that the Population is changed? If not, in what respect is the Time Dimension different from other types of Dimensions?
Within one Cube, a dimension may refer to multiple Classification Schemes under one Classification. It must be clear which Classification Scheme is valid for what part of the dimension. Usually, the validity is related to the time dimension, which creates a relation between dimensions.
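A sketch of this validity relation (scheme names and years are hypothetical): which Classification Scheme applies is determined by the value on the time dimension.

```python
# Validity of Classification Schemes under one Classification, expressed
# as (scheme, first_year, last_year) intervals -- illustrative names only.
validity = [
    ("NL-Municipalities-2011", 2011, 2011),
    ("NL-Municipalities-2012", 2012, 9999),
]

def scheme_for(year):
    """Return the Classification Scheme valid for a given year; the lookup
    itself is the relation between the time dimension and the region dimension."""
    for scheme, first, last in validity:
        if first <= year <= last:
            return scheme
    raise LookupError(f"no scheme valid in {year}")

print(scheme_for(2012))  # NL-Municipalities-2012
```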
Remark: Contrary to the proposal above, it may be helpful to define a “supercube” as the crossing of all dimensions, across all levels, and define the strict cubes and Marginals as selections from this supercube.
A micro data (Unit Data) record contains an identifying key (possibly consisting of more than one element or variable), one or more classifying categories (dimensions) and zero or more characterizing variables (measures).
Examples of data in the categorizing variables: date of birth, sex, address
- Summary variables in the aggregates are different from the variables in the unit data. Income of a person or household is different from the average or total income of a sub-population (persons aggregated by region or by age class), even if their value domains are the same.
- Each aggregation is the mapping of a variable on the lower level onto a variable on the higher level, a function.
- Categorizing variables (dimensions) in the aggregates are different from the corresponding variables in the micro data. Each level in a classification has its own value domain (the set of categories for that Level). E.g. an address is not the same variable as the region in the various levels (each with its own set of categories) of the region classification scheme. And a date of birth (or even the age of a person) is not the same as an age (class)?
- As a result, a dimension is not the role of a single (represented) variable, but in fact a set of (represented) variables, one for each level in the classification scheme.
- When counting units (number of) in an aggregate, in fact a new variable is introduced, although it can be reasoned that this variable on the unit level is already implicitly present, as a characteristic of the unit.
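The points above can be sketched in a few lines (toy ages and class boundaries, purely illustrative): the unit-level variable (age in years) is mapped onto the level-specific variable (age class), which has its own value domain, and counting introduces a new variable on the aggregate level.

```python
def age_class(age):
    """The aggregation mapping: from the unit-level value domain (years)
    into the Level's value domain (age classes)."""
    return "0-19" if age < 20 else "20-64" if age < 65 else "65+"

ages = [7, 34, 70, 41]        # toy unit data

counts = {}
for a in ages:
    c = age_class(a)          # a different represented variable per level
    counts[c] = counts.get(c, 0) + 1   # counting: a newly introduced variable

print(counts)  # {'0-19': 1, '20-64': 2, '65+': 1}
```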
The identifying part of the unit data disappears in the aggregates. The unit is “assimilated” in the sub-population. The unit-key is only relevant for the coupling of micro data (linking distributed parts of the logical unit description). Note that distinguishing one unit from another is not the purpose of a key; a key is only used for identification of a unit. A record separator (e.g. CrLf) is sufficient for distinguishing one record from another.
Ref the question about top level dimensions that seem to disappear out of the Cube.
Consequences for the GSIM model
- A Variable can be defined as a mapping, a function (ref “The Organization of Information in a Statistical Office”), and is therefore associated with both an input (domain) and an output (codomain) (NL: resp. “Domein” and “Bereik”). For micro data, the domain is the (set of Units within the) Population, the codomain is the Value Domain. For an aggregate, the domain is the Value Domain of the Represented Variable on the lower level (either microdata or aggregate), the codomain is the Value Domain of the aggregate Represented Variable itself. The relationship with the underlying Population that the aggregate describes therefore is an indirect one, but still part of the definition of the Variable.
- A dimension is not the role of a single Represented Variable. A dimension is the application of a Classification Scheme for the purpose of summarizing information about a Population. A dimension has as many Represented Variables associated with it as the Classification Scheme has Levels.
- A measure is not the role of a single Represented Variable. A measure has as many Represented Variables as there are separate Marginals associated with the lowest level Cube. Each Marginal has its own unique set of Represented Variables. Not necessarily because it has a different Value Domain (avg age in years of persons in the Population is still a positive real or integer number), but because of the mapping.
- The Represented Variables associated with a dimension or a measure are all based on the same Variable, on the same Concept (and on the same Population). It therefore seems unavoidable to include the Variable in the Data Structure. If the use of Represented Variable for dimensions still is deemed necessary, their Value Domains must be defined on the sets of Categories in the Classification Schemes (the Levels).
- The scope of Data Structure must be extended to include all object types and relations that are necessary for the interested parties to be able to understand (in detail) the structure and content of a dataset. The Data Structure must be sufficient for someone interested to construct a query against the data without having to rely on other kinds of information.
- The Data Structure should be defined in terms of Strict Cubes and Marginals. For each dimension, it should include Classifications, Classification Schemes (with indication of validity of each Scheme in relation to the part of the Dataset for which it is valid).
- Unit Data and Cube Data should be related in a more functional way. Also, as a structure, Unit Data becomes a special case rather than a separate case.
- UML diagrams still to be developed. Some of the concepts described before may turn out hard to model (at least I do not see a simple solution) ….
- A Value Domain is associated with a Concept once it is applied to a Population Type. Example: Sex of a person vs gender of a noun (NL both “geslacht”) or age of a mayfly (NL: “eendagsvlieg”) vs age of a person vs age of the universe.
- How to deal with “ragged” classifications, i.e. classifications where certain branches have “missing” levels? Example: countries with and without a ‘state’ level. A special case of this is the combination of cells due to statistical disclosure control (non-disclosure).
- How to deal with ‘shared’ dimensions, i.e. dimensions that are used in multiple cubes and datasets. Common examples are: Time and Region. These have been standardized in Statline. Ref Wikipedia: ‘Conformed Dimension’
- How to deal with Measures that cannot be aggregated for all dimensions? Ref Wikipedia: Non-additive or Semi-additive measures.
- How to deal with parallel classifications (sometimes considered non-hierarchical classifications) like region classifications where (at least in the Netherlands) water board districts often cross province boundaries.
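The first consequence above (a Variable as a function with a domain and codomain) can be sketched as follows; class and attribute names are illustrative, not GSIM object types.

```python
class Variable:
    """Illustrative: a variable as a mapping with an explicit domain
    (input) and codomain (output)."""
    def __init__(self, name, domain, codomain, mapping):
        self.name = name
        self.domain = domain      # a Population, or a lower-level value domain
        self.codomain = codomain  # the variable's own value domain
        self.mapping = mapping    # the function itself

# Micro level: the domain is the Population (persons), the codomain
# is the value domain (years of age).
age = Variable("age", "Persons in NL", "years",
               mapping=lambda person: person["age"])

# Aggregate level: the domain is the value domain of the lower-level
# variable, so the link to the Population is indirect, via `age`.
avg_age = Variable("average age", "sets of ages (years)", "years",
                   mapping=lambda ages: sum(ages) / len(ages))

print(avg_age.mapping([30, 40, 50]))  # 40.0
```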
On the nature of data
A number of review comments on GSIM V0.8 have to do with the distinction that has been made between Unit data and Cube data, where the first has been associated with microdata and the second with “aggregated” data. It is contended that non-aggregated (micro) data can be structured as Cube data. It seems therefore that there is a distinction between the nature of data and the way that data is structured or represented.
In this chapter we discuss the true nature of data and thereby try to discover the inherent similarities and differences between aggregated and non-aggregated data.
Non-aggregated (micro) data
- Is about individual units in the population
May even be about still smaller pieces of information than we are actually interested in, like individual sales transactions where we want the turnover (in a certain period). Adding up or otherwise calculating the observation value we are interested in is not considered “aggregation” (but what do we call this, then?). Ref Wikipedia: Degenerate dimensions.
(Is the more detailed data about a different Population, different Unit or just multiple events or properties of the unit under observation?)
Aggregated data
- Is about clusters of units, sub-populations (defined by the Cartesian product of dimensions)
- Is summarized data, calculated (estimated) from the unit observations in the cluster
- Therefore, by necessity, needs dimensions and classifications. Only microdata that contains categorical variables can be aggregated.
A population is defined by describing the properties that distinguish members of the population from those outside. Usually, the definition includes properties that members must have, but non-members must not have.
As all members of a population have these defining properties in common (otherwise they would not be a member), it is normally not deemed necessary to include these properties in the description of individual members. However, if a population is divided into sub-populations, the sub-populations are defined using properties that the members of the original population have, and that are available in the descriptions of each individual member. But as soon as the sub-population is formed, the particular properties that are added (as additional discriminating properties) to the definition of the sub-population, are no longer relevant for describing the members. Which means that these properties move from the unit level to the population level and are included in the (meta)data of the population.
This “moving up” of properties is confusing. It makes things difficult, for instance when combining datasets that describe different populations into a new population. Example: collecting data from different regions (countries) into a dataset describing a bigger region. All of a sudden, a property of the original populations needs to “descend” from the definition of the population to the description of the individual units.
Usually, the properties in question are part of a classification. In the sub-population, the classification used is “partial” in that it ignores the higher levels, those describing the world outside of the population of relevance.
In order to tackle this “problem”, maybe classifications need to be extendable, and the definition of population based on the dimensions and their classifications?
The structural dimensions of a cube have to do with the nature and meaning of the data. In addition, dimensions may be introduced for other purposes, for instance Measure dimensions and Attribute dimensions. These serve physical efficiency only and have no place in a conceptual model. They do not play a role in identifying the cells in the Cube. A Measure dimension is often used in a relational star schema, in order to handle a varying number of Measures in a fact record. It causes the different variables to be handled as key-value pairs. A drawback is the duplication of the foreign keys (dimensions) for each value. An Attribute dimension is used for the same reason and helps distinguish the attribute from the measure.
In SDMX, it is common practice to use a “Measure dimension” in cases where there is more than one Measure in a dataset. This may be seen as a simple trick to handle a varying number of measures (a problem in relational databases), by exchanging “columns” for “rows”. This has no impact on the meaning and value of the data, but in fact it does change the structure of the dataset. The resulting “observation value” column becomes a very awkward type of variable, since data type and unit of measure may vary with the value of the category in the Measure dimension. Like SDMX, the current GSIM does not dictate this kind of usage, but neither does it warn against it.
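A toy sketch of this trick (region, year and figures are illustrative, not real data): the same two measures in “wide” form and in “long” form after introducing a Measure dimension.

```python
# Wide form: two genuine measures, each with its own data type and unit.
wide = [
    {"region": "NL", "year": "2012", "population": 16700000, "growth_pct": 0.4},
]

# Long form: columns traded for rows via a Measure dimension.
long = [
    {"region": "NL", "year": "2012", "measure": "population", "value": 16700000},
    {"region": "NL", "year": "2012", "measure": "growth_pct", "value": 0.4},
]
# Note the duplicated dimension keys (region, year) on every value row, and
# that 'value' no longer has a single data type or unit of measure -- the
# "awkward variable" the text refers to.
```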
It is clear that there is a practical need to be able to attach additional information to the data in datasets. The current GSIM deliberately handles this in a different way than SDMX does. But the current model is too vague and leaves room for incorrect implementation and misuse. GSIM should be modeled to promote a correct and standardized way of handling attribute information. This includes being able to determine the correct type and meaning of each attribute. In particular, GSIM should prevent Attributes being used as Measures and vice versa. By not being strict about this, GSIM may lead to diverging practices and thereby hinder instead of help. It should be made clear in which areas GSIM is in danger of being obscure.
Annex: Example from Statline (Population, key figures)
The following screenshot shows an actual Statline table, as shown to the user on the screen.
This may be more of a presentation issue: the separation of “Topics” and “Periods” does not (necessarily) reflect the distinction between “measures” and “dimensions”. “Topics” are not “measures”.
Surprisingly, this table does not give any regional breakdowns …
Taken from: Tjalling Gelsema, “The Organization of Information in a Statistical Office”, Journal of Official Statistics, Vol. 28, No. 3, 2012, pp. 413-440.