The following classification was developed by the Task Team on Big Data in June 2013. Comments and feedback are welcome (notify us).



1. Social Networks (human-sourced information): this information is the record of human experiences, previously recorded in books and works of art, and later in photographs, audio and video. Human-sourced information is now almost entirely digitized and stored everywhere from personal computers to social networks. Data are loosely structured and often ungoverned.

  1100. Social Networks: Facebook, Twitter, Tumblr etc.

  1200. Blogs and comments

  1300. Personal documents

  1400. Pictures: Instagram, Flickr, Picasa etc.

  1500. Videos: YouTube etc.

  1600. Internet searches

  1700. Mobile data content: text messages

  1800. User-generated maps

  1900. E-Mail


2. Traditional Business systems (process-mediated data): these processes record and monitor business events of interest, such as registering a customer, manufacturing a product, taking an order, etc. The process-mediated data thus collected is highly structured and includes transactions, reference tables and relationships, as well as the metadata that sets its context. Traditional business data is the vast majority of what IT managed and processed, in both operational and BI systems. It is usually structured and stored in relational database systems. (Some sources belonging to this class may fall into the category of "Administrative data".)

  21. Data produced by Public Agencies

      2110. Medical records

  22. Data produced by businesses

      2210. Commercial transactions

      2220. Banking/stock records

      2230. E-commerce

      2240. Credit cards


3. Internet of Things (machine-generated data): derived from the phenomenal growth in the number of sensors and machines used to measure and record the events and situations in the physical world. The output of these sensors is machine-generated data, and from simple sensor records to complex computer logs, it is well structured. As sensors proliferate and data volumes grow, it is becoming an increasingly important component of the information stored and processed by many businesses. Its well-structured nature is suitable for computer processing, but its size and speed are beyond traditional approaches.

  31. Data from sensors

      311. Fixed sensors

         3111. Home automation

         3112. Weather/pollution sensors

         3113. Traffic sensors/webcam

         3114. Scientific sensors

         3115. Security/surveillance videos/images

      312. Mobile sensors (tracking)

         3121. Mobile phone location

         3122. Cars

         3123. Satellite images

  32. Data from computer systems

      3210. Logs

      3220. Web logs
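
Because the codes are hierarchical (a code's parent is the longest shorter code that it starts with, e.g. 3122 sits under 312, 31 and 3), the classification can be held in a simple lookup table. The following is a minimal sketch, not part of the Task Team's work: the dictionary name and the helper functions `parent_of`, `path_to` and `descendants` are hypothetical, and the prefix rule is an assumption inferred from the numbering above.

```python
# Codes and labels transcribed from the classification above.
CLASSIFICATION = {
    "1": "Social Networks (human-sourced information)",
    "1100": "Social Networks: Facebook, Twitter, Tumblr etc.",
    "1200": "Blogs and comments",
    "1300": "Personal documents",
    "1400": "Pictures: Instagram, Flickr, Picasa etc.",
    "1500": "Videos: YouTube etc.",
    "1600": "Internet searches",
    "1700": "Mobile data content: text messages",
    "1800": "User-generated maps",
    "1900": "E-Mail",
    "2": "Traditional Business systems (process-mediated data)",
    "21": "Data produced by Public Agencies",
    "2110": "Medical records",
    "22": "Data produced by businesses",
    "2210": "Commercial transactions",
    "2220": "Banking/stock records",
    "2230": "E-commerce",
    "2240": "Credit cards",
    "3": "Internet of Things (machine-generated data)",
    "31": "Data from sensors",
    "311": "Fixed sensors",
    "3111": "Home automation",
    "3112": "Weather/pollution sensors",
    "3113": "Traffic sensors/webcam",
    "3114": "Scientific sensors",
    "3115": "Security/surveillance videos/images",
    "312": "Mobile sensors (tracking)",
    "3121": "Mobile phone location",
    "3122": "Cars",
    "3123": "Satellite images",
    "32": "Data from computer systems",
    "3210": "Logs",
    "3220": "Web logs",
}


def parent_of(code):
    """Return the longest proper prefix of `code` that is itself a code, or None."""
    # Assumption: hierarchy is encoded purely by code prefixes.
    for end in range(len(code) - 1, 0, -1):
        if code[:end] in CLASSIFICATION:
            return code[:end]
    return None


def path_to(code):
    """Return the chain of codes from the top-level class down to `code`."""
    path = []
    current = code
    while current is not None:
        path.append(current)
        current = parent_of(current)
    return list(reversed(path))


def descendants(code):
    """Return every code below `code` in the hierarchy, sorted."""
    return sorted(c for c in CLASSIFICATION if c != code and c.startswith(code))


if __name__ == "__main__":
    for c in path_to("3122"):
        print(c, CLASSIFICATION[c])
```

Keeping the codes as strings rather than integers makes the prefix test straightforward and avoids treating 21 and 2110 as unrelated numbers.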





3 Comments

  1. user-06905

    I'm not certain where it fits, but transportation statistics (as well as inter- and intra-national trade statistics and travel statistics) can be augmented with GPS sensor information not only from cars but from virtually all modes of transportation (trucks, trains, airplanes and ships). Perhaps we can expand 3122 to include these other forms of transportation/travel/trade data.

  2. In the hope of contributing something, and only if you think it could be useful to you, I would like to share this taxonomy of Big Data sources. It was proposed for use in the Quality Framework and, as I see it, has much in common with your work:

    Using Big Data differs from using data stored in traditional databases, and the difference depends on the nature of the data. We can characterize five types of sources:

    1. Sensors/meters and activity records from electronic devices: This kind of information is produced in real time. The number and periodicity of the observations will vary: sometimes they depend on a time interval, sometimes on the occurrence of some event (for example, a car passing through a camera's field of view), and sometimes on manual operation (which, strictly speaking, is the same as the occurrence of an event). The quality of this kind of source depends mostly on the sensor's capacity to take accurate measurements in the way expected.

    2. Social interactions: This is data produced by human interactions through a network such as the Internet. The most common is the data produced in social networks. This kind of data has both qualitative and quantitative aspects that are of interest to measure. Quantitative aspects are easier to measure than qualitative ones: the former involve counting observations grouped by geographical or temporal characteristics, while the quality of the latter relies mostly on the accuracy of the algorithms applied to extract meaning from the contents, which are commonly unstructured text written in natural language. Examples of analyses made from this data are sentiment analysis, trending-topic analysis, etc.

    3. Business transactions: Data produced as a result of business activities can be recorded in structured or unstructured databases. When it is recorded in structured databases, the most common problem in analyzing that information and deriving statistical indicators is the large volume of information and the rate of its production: such data is sometimes produced at a very fast pace, and thousands of records can be produced in a second when big companies like supermarket chains are recording their sales. But this kind of data is not always produced in formats that can be stored directly in relational databases. An electronic invoice is an example: it has more or less a structure, but to put the data it contains into a relational database we need to apply some process to distribute that data across different tables (in order to normalize the data in accordance with relational database theory), and it may not be in plain text (it could be a picture, a PDF, an Excel record, etc.). One problem here is that the process takes time and, as previously said, the data may be produced too fast, so we would need different strategies for using the data: processing it as it is without putting it in a relational database, discarding some observations (by which criteria?), using parallel processing, etc. The quality of information produced from business transactions is tightly related to the capacity to get representative observations and to process them.

    4. Electronic files: These are unstructured documents, statically or dynamically produced, which are stored or published as electronic files, such as Internet pages, videos, audio, PDF files, etc. They can have contents of special interest that are difficult to extract; different techniques could be used, such as text mining, pattern recognition, and so on. The quality of our measurements will rely mostly on the capacity to extract and correctly interpret all the representative information from those documents.

    5. Broadcasts: This refers mainly to video and audio produced in real time. Getting statistical data from the contents of this kind of electronic data is for now too complex and requires great computational and communications power. Once the problems of converting "digital-analog" contents to "digital-data" contents are solved, we will face complications in processing it similar to those found with social interactions.

     

  3. Any classification of types of Big Data really needs consideration by the UN Expert Group on International Statistical Classifications, as this is potentially an issue that should have an agreed international approach. The discussion above already highlights issues of scope and of what the concept to be classified should be. Both are interesting and good examples.