A. Information Logistics Modes of Operation

45. Information Logistics has two basic modes of operation. Traditionally, NSOs have gathered all the data they need for statistical production, either capturing it via questionnaires and other collection instruments or acquiring it from other organizations (admin data), and then physically storing it on the NSO’s premises. In other words, data used to be moved to where the processing capabilities were. More recently, with the advent of Big Data, cloud platforms and IoT, NSOs have started to consider more distributed data and processing approaches, in which data stays in place (where it was produced, or in the owner’s environment) and the processing is moved to where the data is. The reasons for this change are not only technical (e.g. minimizing data movement) but also contractual, political and regulatory (e.g. privacy concerns, legal implications, national and transnational regulations). Gartner has named these two modes of operation: the centralized data and processing mode is called “collect”, while the decentralized data and processing mode is called “connect”. [1]

46. CSDA capabilities are flexible enough to support both information logistics modes of operation. The main capabilities that need to be aware of where the data resides, and of whether the scenario is collect or connect, are Publication (within Information Sharing and Exchange) and Persistence (within Information Logistics). In particular, Channel Management will have to create channels in the mode specified by the SLAs defined in Relationship Management; these channels are then configured and operated by Channel Configuration & Operation.
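
As a purely illustrative sketch (in Python), a channel definition could carry the mode agreed in the SLA, so that Channel Configuration & Operation knows whether to schedule a data transfer (collect) or dispatch processing to the data owner (connect). The channel names, field names (mode, sla, endpoint) and the open_channel function below are assumptions made for this example and are not part of the CSDA specification.

    # Hypothetical channel definitions; field names and endpoints are illustrative only.
    channels = [
        {
            "name": "household-survey",
            "mode": "collect",                 # data is moved to the NSO
            "sla": {"frequency": "monthly", "max_latency_hours": 48},
            "endpoint": "sftp://nso.example.org/inbound/household-survey",
        },
        {
            "name": "mobile-network-events",
            "mode": "connect",                 # processing is moved to the data
            "sla": {"frequency": "daily", "max_latency_hours": 2},
            "endpoint": "https://operator.example.com/secure-compute",
        },
    ]

    def open_channel(definition):
        """Use the SLA-specified mode to decide whether the channel schedules a
        data transfer (collect) or dispatches a processing job (connect)."""
        if definition["mode"] == "collect":
            return "schedule transfer from " + definition["endpoint"]
        return "dispatch processing job to " + definition["endpoint"]

    for channel in channels:
        print(channel["name"], "->", open_channel(channel))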

47. In terms of implementation, simplifying data logistics in a connect-mode scenario will likely increase the complexity of data processing, since processing will have to be shipped to where the data resides and its results integrated back for downstream consumption. The trade-offs of each approach, and the degree of decentralization required, need to be evaluated on a use-case basis. For example, Data Transformation and Data Integration might be optimally implemented in a centralized way to serve traditional analytics and statistical production based on data collected by surveys via questionnaires (collect mode). In other situations, these capabilities can be implemented in a decentralized manner (connect mode), for instance when the volume of admin data, or the number of sources (e.g. IoT), could create information logistics issues the NSO may want to avoid.
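
To make the trade-off concrete, the following sketch contrasts the two modes for a simple aggregation: in collect mode all microdata is moved to a central store and processed there; in connect mode a small processing function is shipped to each source and only the partial results travel back for integration. The DataSource class, its methods and the figures used are assumptions made purely for illustration.

    from statistics import mean

    class DataSource:
        """Stand-in for a remote data holder (survey file, admin register, IoT feed)."""

        def __init__(self, name, records):
            self.name = name
            self._records = records          # the data stays at the owner's site

        def fetch_all(self):
            """Collect mode: the full dataset is moved to the NSO."""
            return list(self._records)

        def run_locally(self, func):
            """Connect mode: processing is shipped to the data; only the
            (much smaller) result travels back."""
            return func(self._records)

    sources = [
        DataSource("admin-register", [12.0, 15.5, 14.2]),
        DataSource("iot-feed", [13.1, 12.8, 15.0, 14.9]),
    ]

    # Collect: centralise all microdata, then process it in one place.
    central_store = [record for source in sources for record in source.fetch_all()]
    collect_result = mean(central_store)

    # Connect: push the aggregation to each source and integrate the partial
    # results downstream (a weighted mean of the per-source means).
    partials = [(source.run_locally(mean), source.run_locally(len)) for source in sources]
    connect_result = sum(m * n for m, n in partials) / sum(n for _, n in partials)

    # Both modes yield the same statistic; what differs is how much data moves.
    assert round(collect_result, 9) == round(connect_result, 9)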

48. Decentralization also affects metadata management. Metadata describes where the data resides, how the channels operate, how the data relates to the rest of the ecosystem, who accesses it and how business processes use it. In addition, the implementation of Metadata & Schema Linkage becomes more complex when data and its metadata live in entirely different environments.
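
For example, in a connect scenario a metadata record held in the NSO’s registry might describe data that never leaves the owner’s environment; the record and field names below are hypothetical and do not come from the CSDA metadata model.

    # Hypothetical metadata record for a dataset accessed in connect mode.
    dataset_metadata = {
        "dataset": "mobile-network-events",
        "resides_at": "data owner's environment (connect mode)",  # where the data lives
        "channel": {"mode": "connect", "owner": "telecom-operator-A"},
        "schema": "https://registry.example.org/schemas/network-event/v2",  # hypothetical reference
        "accessed_by": ["methodology-unit", "big-data-team"],
        "used_by_processes": ["population-mobility-statistics"],
    }

    def schema_for_remote_job(metadata):
        """Metadata & Schema Linkage across environments: remote processing must
        resolve the schema even though the data never reaches the NSO."""
        return metadata["schema"]

    print(schema_for_remote_job(dataset_metadata))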

B. Data Management Styles for Modern Data Integration Challenges

49. Modern business requirements demand a more proactive and coordinated data integration and data provisioning strategy, founded on a portfolio-based approach to data integration that encompasses:

  • Bulk/batch: This involves single-pass or multi-pass/multi-step processing of the entire contents of a data file, after an initial input or read of the file has been completed from a given source or from multiple sources. All processing takes place on multiple records within the data integration application before the records are released to any other data-consuming application.
  • Message-oriented data movement: This operates on a single record delivered as an encapsulated object, which may carry an internally defined structure (e.g. XML), an externally defined structure (e.g. electronic data interchange) or no defined structure at all, and which delivers data for action to the data integration process.
  • Data replication: This involves the simple copying of data from one location to another, always into a physical repository. Replication can serve as a basis for all other types of data integration, but it specifically does not change the form, structure or content of the data it moves.
    • Change data capture (CDC) is a form of data replication that delivers the capability to identify, capture and extract modified data from the data source and apply this changed data throughout the enterprise in near real time. Because it deals only with data changes, CDC minimizes the resources required for ETL processes (a minimal sketch of this pattern follows the list).
  • Data synchronization: This can utilize any other form of data integration, but focuses on establishing and maintaining consistency between two separate, independently managed create, read, update, delete (CRUD) instances of a shared, logically consistent data model, for an operational data consistency use case. Data synchronization also detects and resolves data collisions, as it can establish embedded decision rules for resolving them.
  • Data virtualization: This involves the use of logical views of data, which may or may not be cached in various forms within the data integration application server or the systems/memory managed by that server. Data virtualization may or may not include redefinition of the sourced data.
  • Streaming/event data delivery: This involves datasets that have a consistent content and structure over long periods of time and large numbers of records, and that effectively report status changes for the connected device or application, or continuously update records with new values. Streaming/event data delivery includes the ability to incorporate event models, inferred row-to-row integrity, and variations of either with alternative outcomes that may or may not be aggregated and/or parsed into separate event streams from the same continuous stream. The logic for this approach is embedded in the data stream processing code (a sketch of this pattern also follows the list).
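
The following minimal sketch illustrates change data capture as described above: only the rows that changed since the previous extraction are identified and propagated. The in-memory snapshots and the capture_changes function are assumptions made for illustration; a production CDC tool would typically read the source database’s transaction log rather than compare snapshots.

    previous = {1: {"name": "Alice", "city": "Oslo"},
                2: {"name": "Bob", "city": "Bergen"}}

    current = {1: {"name": "Alice", "city": "Trondheim"},   # updated row
               2: {"name": "Bob", "city": "Bergen"},        # unchanged row
               3: {"name": "Carol", "city": "Tromsø"}}      # inserted row

    def capture_changes(before, after):
        """Return only the inserts, updates and deletes between two snapshots."""
        changes = []
        for key, row in after.items():
            if key not in before:
                changes.append(("insert", key, row))
            elif before[key] != row:
                changes.append(("update", key, row))
        for key in before:
            if key not in after:
                changes.append(("delete", key, None))
        return changes

    # Downstream ETL only has to apply these few changes instead of
    # reprocessing the whole table.
    for operation, key, row in capture_changes(previous, current):
        print(operation, key, row)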
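
A similar minimal sketch for streaming/event data delivery shows records of consistent structure arriving continuously, with the event model and the inferred row-to-row integrity embedded in the stream-processing code. The device readings, threshold and detect_status_changes function are fabricated for illustration only.

    def detect_status_changes(stream, threshold=30.0):
        """Emit an event whenever a device's status relative to the threshold
        changes (including its first reported status); row-to-row integrity is
        inferred by remembering each device's previous state."""
        last_state = {}                                  # last known state per device
        for device, value in stream:
            is_above = value > threshold
            if last_state.get(device) is not is_above:   # state change -> event
                yield device + ": " + ("above" if is_above else "below") + " threshold"
            last_state[device] = is_above

    readings = [("sensor-1", 28.0), ("sensor-1", 31.5),
                ("sensor-2", 25.0), ("sensor-1", 29.0)]

    for event in detect_status_changes(readings):
        print(event)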


[1] Gartner, Modern Data Management Requires a Balance Between Collecting Data and Connecting to Data; https://www.gartner.com/doc/3818366/modern-data-management-requires-balance

 
