Issue #4: Represented variable

A question on how to interpret Represented Variable:

Represented Variable is defined as "A combination of a characteristic of a population to be measured and how that measure will be represented" with the example "The pair (Number of Employees, Integer), where "Number of Employees" is the characteristic of the population (Variable) and "Integer" is how that measure will be represented (Value Domain)." If your Variable is "Industry" and your Value Domain is "Level 1 of NACE 2007", will the pair then be (Industry, alphanumeric) or (Industry, NACE 2007 - Level 1)?

3 Comments

user-8e470
Meeting 10/1:
Maybe add a new example for represented variable to help. We think the current example is not helpful.
INSEE thinks the second option given by Anne Gro is right based on how they have implemented it. It is a reference to the subset of the codelist.
ACTION: Dan Gillman - to provide new example. Guillaume Duffes to send email to Dan.
- 10 Jan, 2018
user-8e470
Alistair Hamilton - we are still looking for a better example. Is there something you could propose from real ABS use? Thanks
- 06 Jul, 2018
Alistair Hamilton
Hi Thérèse Lalor.
ABS has exactly implemented Anne Gro's second possibility by including the ability to reference specific levels of a node set within an Enumerated Value Domain (EVD) for an RV.
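A minimal sketch of what "referencing a specific level of a node set within an EVD" could look like. This is not the ABS or INSEE implementation; the class names, fields, and the four-node excerpt of the classification are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    code: str
    label: str
    level: int  # 1 = top level of the classification


@dataclass(frozen=True)
class EnumeratedValueDomain:
    node_set: str   # name of the classification being referenced
    level: int      # the level of the node set this domain is restricted to
    nodes: tuple    # all nodes of the node set (hypothetical excerpt here)

    def codes(self):
        # Only codes at the referenced level are valid values.
        return [n.code for n in self.nodes if n.level == self.level]


@dataclass(frozen=True)
class RepresentedVariable:
    variable: str
    value_domain: EnumeratedValueDomain


# Illustrative excerpt only, not the full classification.
NACE_EXCERPT = (
    Node("A", "Agriculture, forestry and fishing", 1),
    Node("01", "Crop and animal production, hunting and related service activities", 2),
    Node("C", "Manufacturing", 1),
    Node("10", "Manufacture of food products", 2),
)

# The pair (Industry, NACE 2007 - Level 1), i.e. the second option in the question.
industry_l1 = RepresentedVariable(
    "Industry", EnumeratedValueDomain("NACE 2007", 1, NACE_EXCERPT)
)
print(industry_l1.value_domain.codes())
```

Under this reading, "alphanumeric" would only describe the technical type of the codes, whereas the Value Domain identifies which codes are valid.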
In regard to the current example more generally, one of the reasons it is not as simple as you'd first think to give a "universal" example of an RV is how they tend to be used for different purposes in different stages of statistical production.
A trivial-sounding example for us would be a standard RV "Sex of Person [1,2,3]". In this case we have the Variable (Sex of Person) and the representation (1=Male, 2=Female, 3=Other). If we actually had a case where we wanted to hold the alternative coding [M, F, X] in the database as Code Values, then we'd have a separate RV, using the alternative representation, from the same Variable.
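The point that one Variable can give rise to several RVs can be sketched as follows. The class and field names are hypothetical, and the mapping of M/F/X to labels is an assumption (the comment only gives the codes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentedVariable:
    variable: str  # the characteristic being measured
    codes: tuple   # (code value, label) pairs making up the representation


sex_numeric = RepresentedVariable(
    "Sex of Person", (("1", "Male"), ("2", "Female"), ("3", "Other"))
)
# Assumed labels for the alternative coding [M, F, X].
sex_alpha = RepresentedVariable(
    "Sex of Person", (("M", "Male"), ("F", "Female"), ("X", "Other"))
)

# Same Variable, different representation: two distinct RVs.
same_variable = sex_numeric.variable == sex_alpha.variable
same_rv = sex_numeric == sex_alpha
print(same_variable, same_rv)
```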
One slight hesitation on the wording of the current example is that at the "unit" stage we actually want to record Sex of Person for persons in a wide range of populations. For some of the core (people or business) demographic RVs we want to record the information for any unit of the relevant Unit Type regardless of what populations specific surveys target (as long as the population is composed from units of that Unit Type).
In other cases some aspects of the population specification influence our choice (eg we do not collect "Industry of Main Job of Person" for unemployed persons, or persons not in the labour force, but we would collect it for "employed" units in a range of populations that differ in terms of other aspects of their specification).
At the unit level each individual record is showing, for the "Sex of Person [1,2,3]" RV, the encoded measure of that characteristic for that unit.
(If we are not measuring the characteristic for the unit itself but, eg, "Sex of Partner of Person [1,2,3]" or "Sex of first born child of Person [1,2,3]" these are different RVs.)
At the level of aggregate estimates an RV does hold a defining characteristic for sub-populations - more similar to the current GSIM example.
So we can look at the unemployment rate (measure) for males of a certain age at a certain location in a certain reference period (three other identifying dimensions) vs females of the same age at the same location in the same reference period. At this aggregate level it also makes sense to be able to talk about the unemployment rate of "Persons" separately from the breakdown into rates for Males, Females and Others.
In the aggregate context, therefore, we tend to reference "total" codes with appropriate labels (eg "Persons", "All Ages" (or "All Ages 15+"), "Australia") as part of our RVs. (We don't for reference period because we don't believe we're yet at the end of days.)
An example that others (eg INSEE) may handle more neatly than ABS is
Here the unit is the business that has employees and currently ABS would see "Taxable Weekly Earnings of Male Full Time Employees of Business" and "Taxable Weekly Earnings of Female Employees of Business" as two RVs with Numeric Value Domains (NVDs).
At the aggregate estimate level we will see Average Weekly Earnings split by Males vs Females (as an RV).
In other words, at the point of creating estimates we basically transform from 2 RVs with NVDs for data acquisition to 2 RVs (1 EVD and 1 NVD) for dissemination.
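A rough sketch of that reshaping, from two numeric RVs per business to one enumerated RV (sex) plus one numeric RV (earnings). The column names, the sex codes and all figures are invented for illustration. The sketch only totals earnings by sex; deriving an average would also need employee counts, which are not modelled here.

```python
# Unit-level (acquisition) records: one numeric RV per sex, per business.
unit_records = [
    {"business": "B1", "earnings_male": 5200.0, "earnings_female": 4800.0},
    {"business": "B2", "earnings_male": 3000.0, "earnings_female": 3500.0},
]

# Hypothetical mapping from each acquisition RV to a code of the sex EVD.
ACQUISITION_TO_SEX = {"earnings_male": "1", "earnings_female": "2"}


def to_dissemination(records):
    """Collapse two numeric columns into (sex code, value) rows."""
    totals = {}
    for rec in records:
        for column, sex_code in ACQUISITION_TO_SEX.items():
            totals[sex_code] = totals.get(sex_code, 0.0) + rec[column]
    # One row per sex code: an EVD-valued dimension plus an NVD-valued measure.
    return [{"sex": code, "earnings": total} for code, total in sorted(totals.items())]


rows = to_dissemination(unit_records)
print(rows)
```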
The above reminds me that RVs with NVDs also need more detailed representation information, similar to Anne Gro's second take. For us it would be important that the two input RVs for Average Weekly Earnings identify that they are being measured in $ (as opposed to, eg, measured in $'000, where we would expect the numbers as stored in the database to be 1,000 times smaller).
This is where the current example in GSIM in regard to number of employees may seem to have some ambiguity. It is a count, so we don't expect to see a Unit of Measure such as $ or $'000 or "hours" or "ha" as we see for other NVDs. "integer" is also perhaps implying that the count is expressed in single units (rather than, eg, Estimated Residential Population where our counts are expressed in '000).
Therefore the current example could be mistakenly read as simply a technical (eg RDBMS or XML) data type (where, in theory, I could use the same technical data type to store my whole number figures for taxable gross weekly earnings) even though for the current GSIM example it is probably intended to be a shorthand expression of the "representation details" from a statistical perspective. Could a different RV with NVD example be chosen such that this mix-up is not possible, due to Unit of Measure and/or Order of Magnitude considerations in regard to statistical representation?
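To illustrate the distinction between technical data type and statistical representation, here is a small sketch in which two Numeric Value Domains share the data type "integer" but differ in order of magnitude. The class, its fields and the stored figure are hypothetical, not taken from GSIM or any ABS system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericValueDomain:
    data_type: str        # technical type, eg as used in an RDBMS or XML schema
    unit_of_measure: str  # statistical unit, eg "$" or "hours"
    multiplier: int       # order of magnitude: 1 for units, 1000 for '000

    def actual_value(self, stored):
        # Interpret a stored figure in terms of single units.
        return stored * self.multiplier


in_dollars = NumericValueDomain("integer", "$", 1)
in_thousands = NumericValueDomain("integer", "$", 1000)

stored = 1500  # the same stored integer under both domains
print(in_dollars.actual_value(stored), in_thousands.actual_value(stored))
```

The two domains are indistinguishable by data type alone, which is the ambiguity described above for the pair (Number of Employees, Integer).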
- 10 Jul, 2018
