The GSIM v1.0 Production group has a mismatch between the number of subtypes for Process Input and Process Output.

Process Input            | Process Output
-------------------------+-------------------
Parameter Input          | Process Metric
Transformable Input      | Transformed Output
Process Support Input    | ?

Separately, at the last GSIM IG meeting, the issue of a missing process output object was also raised. It was felt that an object is needed for outputs that are neither Transformed Outputs (e.g. statistical data or metadata) nor Process Metrics (e.g. measures of the performance or effectiveness of the process step), but are instead the kinds of outputs that could later serve as Process Support Inputs. For reference, Process Support Input is defined as follows:

"56. A Process Support Input influences the work performed by the Process Step, and therefore influences its outcome, but does not correspond to a Parameter Input or a Transformable Input. Examples could include:

  • Code List which will be used to check whether the codes recorded in one dimension of a dataset are valid.
  • An auxiliary Data Set which will influence imputation for, or editing of, a primary dataset which has been submitted to the process step as the Transformable Input."

This could be accommodated within the v1.0 style framework by adding a new object, Process Support Output, which would be the output analogue to Process Support Input.

In the "GSIM Level 2" proposed for discussion, IMF contributors dealt with this issue somewhat differently. The concept of Process Metric is replaced with a concept of Process Measure, which is defined as recording metrics associated with the Process Step and logging work and knowledge objects used in the Process Step and work and knowledge objects created by the Process Step. This model assumes that the "work and knowledge" outputs created need not necessarily be transformations of the transformable inputs, but may be some type of reusable or otherwise informative byproduct. These byproducts may fit an existing structured object from elsewhere within GSIM, or may be unstructured explanatory information (which would fit within the Referential Metadata object in this model).

Arofan also had a number of ideas on the call and offered to comment on this issue.

4 Comments

  1. user-8e470

    From Arofan:

    At the CSPA sprint it became clear that we are missing some needed granularity regarding the outputs from various processes (in this case, editing processes). The specific issue was that a “data set” was created by a process which identifies cases which are implausible, and which may then be subject to inspection or even imputation. The main data set is not edited by such a process – what is generated is a report which only identifies cases and variables which are in violation of some validation rule. We have no way to distinguish the report from the actual data set itself, which in this case is not transformed, only inspected. But from a GSIM perspective, the report is another data set (it is not a process metric).

    This also emphasizes the problem with units being attached to variables. The unit for the main data set is individuals, households, etc. – normal stuff – but the units for the report “data set” are the variables in the main data set. This doesn’t really match the definitions which we have (it’s a stretch, anyway). We will want to look at this part of GSIM more closely, I think.

  2. Summary

    Adding Process Support Output would change the general concept of Process Inputs and Process Outputs within GSIM.  This would be a major change, and should only proceed if absolutely certain it is appropriate and necessary.

    It appears more likely that:

    1. The current three-way subtyping of Process Input and two-way subtyping of Process Output should be removed – leaving only Process Input and Process Output.
      • The current subtyping has some ambiguity and is not sufficient in its own right to describe the specific role a particular GSIM Object has in regard to a particular Process Step.
      • It may be that distinctions regarding the roles of inputs and outputs in regard to a Process Step are more appropriately handled through attributes rather than the current subclasses.
    2. The general concept of Process Inputs and Process Outputs as currently modelled in GSIM needs to be more clearly explained and illustrated in the V2.0 documentation.
    3. The use of “Transformed” in “Transformed Output” is causing some confusion and, if any subtypes remain, this subtype should be renamed.

    Early discussions within the ABS would favour (1.) (removing subtyping) rather than (3.).

    Background

    Process Inputs and Process Outputs can be thought of as characterising the relationship an object defined elsewhere in GSIM has with a particular Process Step.

    • In the design of a particular process step, the type of GSIM object required as an input, or produced as an output, will be specified.
      • For example, it might be a dimensional dataset that conforms with a particular structure definition (or it might be “any dimensional dataset”)
    • For each instance of the Process Step being executed, a particular object (rather than “type of GSIM object”) will have been input or output
      • For example, a specific dataset will have been input and/or output.

    An alternative way of thinking about it may be that Process Inputs and Process Outputs describe the “slots” that things go into, and come out of, in regard to a Process Step.  At design time we specify what type of thing can go into each (input) slot and come out of each (output) slot.  At execution time we record what actually goes into, or comes out of, each slot.
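The “slots” idea can be sketched in code. This is only an illustration under stated assumptions: the class names (`Slot`, `ProcessStepDesign`, `ProcessStepExecution`), the step name `AutoEdit`, the slot names and the type label `DimensionalDataSet` are all hypothetical and are not GSIM objects or attributes. The design-time object says what type of thing each slot accepts; the execution-time object records what actually filled each slot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Slot:
    """Design-time description of one slot (hypothetical, not a GSIM object)."""
    name: str
    direction: str       # "input" or "output"
    accepted_type: str   # type of GSIM object the slot may hold


@dataclass
class ProcessStepDesign:
    """Design time: which slots exist and what each one may hold."""
    name: str
    slots: dict = field(default_factory=dict)  # slot name -> Slot

    def add_slot(self, slot: Slot) -> None:
        self.slots[slot.name] = slot


@dataclass
class ProcessStepExecution:
    """Execution time: which object actually went into or came out of each slot."""
    design: ProcessStepDesign
    bindings: dict = field(default_factory=dict)  # slot name -> object id

    def bind(self, slot_name: str, object_id: str, object_type: str) -> None:
        slot = self.design.slots[slot_name]  # KeyError if the slot was never designed
        if object_type != slot.accepted_type:
            raise TypeError(
                f"slot {slot_name!r} accepts {slot.accepted_type}, got {object_type}"
            )
        self.bindings[slot_name] = object_id


# Design time: say what may go in and come out.
design = ProcessStepDesign("AutoEdit")
design.add_slot(Slot("data_in", "input", "DimensionalDataSet"))
design.add_slot(Slot("data_out", "output", "DimensionalDataSet"))

# Execution time: record what actually did.
run = ProcessStepExecution(design)
run.bind("data_in", "D1", "DimensionalDataSet")
run.bind("data_out", "D2", "DimensionalDataSet")
print(run.bindings)  # {'data_in': 'D1', 'data_out': 'D2'}
```

Note that nothing in the sketch needs a subtype of input or output: each slot is identified by its name and described by what it accepts.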

    A Transformed Output (maybe a poor name) is basically an output that forms part of the reason for existence of the particular Process Step in the first place.

    A Process Metric, however, measures/records some aspect of the performance of the Process Step.  A (rational) user would not define a Process Step, however, simply to measure its performance without it producing anything of value.

    (Note: If Process Step Y were “Tune an existing Process Step X” then the Transformed Output from Y might be analysis of how the Process Metrics produced from Process Step X vary under different conditions.  Process Step X, however, is about producing an intermediate or final output required by a statistical production process rather than about producing Process Metrics.)

    Concern about Process Support Output being added as a subtype

    The concern about adding Process Support Output is that if the current Process Step (e.g. X) has produced the output, then it will already be, for example, a Transformed Output with respect to Process Step X.

    Later on, the same GSIM object might be used as a Process Support Input for Process Step Y (and also a Transformable Input for Process Step Z).

    If we want to understand the flow (lifecycle and usage) for a particular object then we look across its use in various Process Steps.  For example, if we are interested in Object 123 then we note it was a Transformed Output from Process Step X and used (in different ways) by Process Steps Y and Z.

    I am not sure there would be a case for including in GSIM a new object which represents the usage of a particular GSIM object across all Process Steps.  As noted above, at the implementation level you could readily get the information by querying a registry which contains GSIM based metadata related to all process steps.
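As a rough sketch of the registry query mentioned above, assuming a registry that simply holds one record per use of an object by a Process Step: the `Usage` record, the `REGISTRY` list and the `usages_of` function are hypothetical names for illustration, and object 456 is invented filler data. The rows for Object 123 follow the example in the text.

```python
from collections import namedtuple

# One record per use of an object by a Process Step (hypothetical registry row).
Usage = namedtuple("Usage", ["object_id", "process_step", "role"])

REGISTRY = [
    Usage("123", "X", "Transformed Output"),
    Usage("123", "Y", "Process Support Input"),
    Usage("123", "Z", "Transformable Input"),
    Usage("456", "X", "Transformable Input"),  # invented filler row
]


def usages_of(object_id, registry=REGISTRY):
    """Return (process_step, role) pairs for one object, in registry order."""
    return [(u.process_step, u.role) for u in registry if u.object_id == object_id]


print(usages_of("123"))
# [('X', 'Transformed Output'), ('Y', 'Process Support Input'), ('Z', 'Transformable Input')]
```

The lifecycle of Object 123 falls out of the per-step records, so no separate “usage across all Process Steps” object is needed to answer the question.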

    Arofan’s example

    This example shows usage does need to be clarified, particularly in regard to Transformable Input.

    Take a very simple case (not Arofan’s) of a dataset that is going to have values “auto-edited” in some way within a Process Step.

    The dataset (D1) that contains the values to be auto-edited would typically be the Transformable Input.  The dataset (D2) that contains values after auto-editing would be the Transformed Output.

    D1 and D2 in this case would be expected to have the same Structure Definition.  Whether D2 is considered to “be” D1, however, may be an implementation issue.  At a minimum D2 might be considered a new version of D1.  In other words, D2 is the “transformed” version of D1.

    Under some circumstances, however, an implementer might want to consider D1 and D2 to be completely separate datasets rather than one being an updated version of the other.  As far as I know, that different implementation approach would also be entirely consistent with GSIM.

    If, however, D2 were given a set of additional attributes recording the extent to which each value had been edited, then D2 would quite possibly have a different structure definition to D1.  Under most interpretations, then, D2 could not structurally be a “version” of D1.
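The structural point can be shown with a deliberately simplified sketch. It assumes a structure definition can be reduced to a set of component names, which is far cruder than a real structure definition; the `DataSet` class, `same_structure` function and the component names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    object_id: str
    components: frozenset  # simplified stand-in for a structure definition


def same_structure(a: DataSet, b: DataSet) -> bool:
    """True when both datasets have exactly the same components."""
    return a.components == b.components


d1 = DataSet("D1", frozenset({"unit_id", "income"}))

# Plain auto-edit: values change, structure does not.
d2_edited = DataSet("D2", frozenset({"unit_id", "income"}))

# Auto-edit that also records how much each value was edited.
d2_flagged = DataSet("D2", frozenset({"unit_id", "income", "income_edit_extent"}))

print(same_structure(d1, d2_edited))   # True
print(same_structure(d1, d2_flagged))  # False
```

Only in the first case could D2 be treated structurally as a new version of D1; whether an implementer actually does so remains an implementation choice.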

    Most people, however, might still consider D1 a Transformable Input in the above scenario because it fairly directly provides the “basis” for the Transformed Output.

    Cases where D2 is “more and more” different to D1 can be envisaged.  Eventually D1 is still influencing what D2 looks like but D2 is less easily recognised as a “transformation” of D1.  At some point along this spectrum, D1 probably starts being described as a Process Support Input rather than a Transformable Input.

    In Arofan’s example, I would tend to consider the “main dataset” a Process Support Input.  It is a spectrum, however, because the cases in the Transformed Output will be drawn from the “main dataset”.

    If a version of last month’s “main dataset” were also being provided to help identify anomalies then it would also be a Process Support Input.  The relationships of this month’s “main dataset” and last month’s “main dataset” to the content of the Transformed Output are (probably) not the same.

    As a lot of the above discussion seems to be about implementation and interpretation, maybe the sub-typing of Inputs and Outputs is actually a step too far?

    I am not sure, however, why Arofan says it is not possible to distinguish the “main dataset” from the “report”.  Firstly, I am not certain the “report” will be an input, unless the process is updating an existing report.  (The structure definition for the report might be a Process Support Input?)  Even if the Report is an Input as well as an Output, it will go into a different “slot” compared with the “main dataset” (and it will have a different ID as an object).  This is another reason why the subclassing of inputs may not be a priority - for design and execution we still want to know more about the individual “slots”.  GSIM as a conceptual model should not provide a detailed catalogue of all the attributes used to describe a slot, but I think the concept of slots remains important.

    I have left Arofan’s point on units attached to variables (now Unit Types?) to a different discussion.

  3. Of possible relevance for Arofan’s “report”?

    Evaluation Assessment 

    Business 

    A type of Assessment that evaluates the process outputs of a statistical activity based on a formalized methodological framework.

    The evaluation can be done in regard to various characteristics of the output, for example its quality, the efficiency of the production process, its conformance to a set of requirements, etc. The result of an Evaluation Assessment can lead to the creation of a Statistical Need: in this case, the Statistical Need will reference the Evaluation Assessment for traceability and documentary purposes.

  4. user-8e470

    Discussion 27/8:

    Remove the subtypes, so that we have only Process Inputs and Process Outputs. Add a type attribute whose value can be any other GSIM object or something else (e.g. a parameter?). We need some examples in the text to help people, perhaps from the architecture work?

    Current subtypes are causing confusion. A number of examples of this happening - IMF, Metadata Flows, Architecture project.
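One possible reading of the 27/8 proposal, sketched with Arofan's editing example. Everything here is an assumption made for illustration: the attribute name `object_type`, its values, the object identifiers and the helper `of_type` are hypothetical and not part of GSIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessInput:
    object_id: str
    object_type: str  # a kind of GSIM object, or something else such as "parameter"


@dataclass(frozen=True)
class ProcessOutput:
    object_id: str
    object_type: str


def of_type(items, object_type):
    """Return the ids of the inputs or outputs carrying the given type value."""
    return [item.object_id for item in items if item.object_type == object_type]


# A process that inspects the main dataset and reports implausible cases.
inputs = [
    ProcessInput("main_dataset", "DataSet"),
    ProcessInput("plausibility_threshold", "parameter"),
]
outputs = [
    ProcessOutput("implausible_cases_report", "DataSet"),
]

print(of_type(inputs, "DataSet"))   # ['main_dataset']
print(of_type(outputs, "DataSet"))  # ['implausible_cases_report']
```

The main dataset and the report are both data sets, yet they stay distinguishable because one is an input and the other an output, each with its own identifier, without relying on subtypes such as Transformable Input or Transformed Output.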