ACADEMIC YEAR 2013 - 2014
Abstract of the Dissertation
The explosion in the amount of data, known as the data deluge, is forcing
many scientific and technological fields to redefine themselves, as Big Data
establishes itself in every environment as a potential source of data.
A few years ago, official statistics institutions started to open up to external
data sources such as administrative data. The advent of Big Data is
introducing important innovations: the availability of additional external data
sources, of previously unknown dimensions and questionable consistency,
poses new challenges to the institutions of official statistics, imposing a general
rethinking that involves tools, software, methodologies and organizations.
The relative newness of the field of study on Big Data requires first of all an
introduction phase, for addressing the problems of definition and for defining
the areas of technology involved and the possible fields of application.
The challenges that the use of Big Data poses to institutions that deal with
official statistics are then presented in detail, after a brief discussion of the
relationship between the new "data science" and statistics.
Although the field is at an early stage, there is already a limited but growing
body of practical experience in the use of Big Data as a data source for
statistics by public (and private) institutions. The review of these experiences
can serve as a stimulus to address in a more conscious and organized way the
challenges that the use of this data source poses to all producers of official
statistics.
The worldwide spread of data sources (web, e-commerce, sensors) has also
prompted the statistical community to take joint action to tackle the complex
set of methodological, technical and legal problems. And so many national
statistical institutes along with the most prestigious international
organizations have initiated joint projects that will develop in the coming
years to address the complex issues raised by Big Data for statistical
methodology and computer technology.
List of Figures
Chapter 1 Introduction
1.1 Data sources for official statistics
1.2 Research questions and Thesis contribution
1.3 Methodology
1.4 Structure of the Thesis
Chapter 2: Big Data definition and characteristics
2.1 Data deluge
2.2 Big Data definitions
2.3 Big Data technologies
2.3.1 MPP Massively Parallel Processing
2.3.2 NoSQL (column oriented databases)
2.3.3 Hadoop and MapReduce
2.3.4 Hadoop ecosystem
2.3.5 Big Data and the Cloud
2.3.6 Visual analytics, Big Data and visualization
2.4 Applications enabled by Big Data
2.5 Considerations on Technology and their usage in Official Statistics
Chapter 3: Big Data usage in Official Statistics - challenges
3.1 Data science and statistics
3.2 Challenges posed to official statistics
3.3 Remarks from early experiences
Chapter 4: First applications of Big Data in statistics.
4.1 First experiences in Google
4.2 Prices
4.2.1 Billion Prices Project and PriceStats
4.2.2 Netherlands experience
4.2.3 Other experiences on Prices
4.3 Traffic data
4.3.1 Netherlands
4.3.2 Colombia
4.4 Social media data
4.5 Mobile phones data
4.5.1 Estonia
4.5.2 New Zealand
4.6 Data about Information Society
4.6.1 Eurostat
4.6.2 Italy
4.7 Considerations on first examples of Big Data usage in Official Statistics
Chapter 5: International cooperation on Big Data in Official Statistics
5.1 High Level Group
5.2 MSIS 2013
5.3 Task Team
5.3.1 Project proposed by the task team
5.4 Sandbox subproject
5.4.1 Goals of Sandbox
5.4.2 Specific objectives
5.4.3 Basis for the Recommendations
5.4.4 Recommendations and Resource Requirements
5.5 Recommendations on Big Data coming from international cooperation
Conclusions and future actions
Bibliography
List of Figures
Figure 1 Methodology schema
Figure 2 Growth of global storage 2005 - 2015
Figure 3 The four "V" for Big Data
Figure 4 Massively Parallel Processing configuration for Big Data
Figure 5 Oracle RAC configuration for Big Data
Figure 6 A bit of fun on non-relational databases
Figure 7 Still a bit of fun on relationships and NoSQL
Figure 8 Schema for Columnar Databases
Figure 9 Graphical definition for graph Databases
Figure 10 Graphical representation of CAP Theorem with NoSQL tools
Figure 11 HDFS Hadoop File System schema
Figure 12 Hadoop way of working
Figure 13 Schema for MapReduce way of working
Figure 14 MapReduce algorithm example
Figure 15 Google Trends: searches on "Cloud Computing" and "Big Data"
Figure 16 Tableplot usage for Netherlands Census data
Figure 17 Tableplot usage in statistical production process
Figure 18 Comparison between Google Flu Trends estimate and US data on influenza
Figure 19 Google Flu Trends updated model after H1N1 pandemic
Figure 20 Seasonal Correlation between Winter and the searches for "Pizzoccheri" word
Figure 21 Comparison between Argentina Inflation rate and Price Index computed by Pricestats
Figure 22 Comparison of official US inflation and Pricestats computed one
Figure 23 Comparison of official Argentina inflation and Pricestats computed one
Figure 24 Price for international flights by days before departure
Figure 25 Comparison between Tweets on Prices and Rice Price level
Figure 26 Schema for Netherlands Data Warehouse for Traffic Information
Figure 27 Dutch traffic time profile
Figure 28 Dutch traffic: normalized number of vehicles for three length categories
Figure 29 Comparison between sentiment analysis from social media and consumer confidence
Figure 30 New Zealand: usage of mobile phones for dates close to the quake
Figure 31 Schema for Istat analysis of Big Data coming from the Internet
Figure 32 Schema of international groups working on statistical standards
Figure 33 Web page of MSIS 2013 meeting
Figure 34 Web page for Big Data Inventory
Chapter 1 Introduction
"OECD Glossary of Statistical Terms - Administrative Source Definition." Web. December 2013.
frequency at which they are produced have led to the concept of 'Big Data'
[Mayer-Schonberger 2013].
Many statistical organizations have already started to investigate the
possibility of using Big Data as a source to complement and support official
statistics.
The use of Big Data in official statistics presents many challenges, falling
into the following categories:
a. Legislative, i.e. with respect to the access and use of data
b. Privacy, i.e. managing public trust and acceptance of data re-use and its
link to other sources
c. Financial, i.e. potential costs of sourcing data vs. benefits
d. Management, e.g. policies and directives about the management and
protection of the data
e. Methodological, i.e. data quality and suitability of statistical methods
f. Technological, i.e. issues related to information technology.
5. Which are the first acquisitions and which are the open research fields
to be addressed?
In line with the research questions, we tried to help advance research by
providing the following contributions:
1. Suggestions on the means to integrate Big Data technologies with
technologies already used in Statistical organisations;
2. Summary of examples of Big Data use in Official Statistics;
3. Possible solutions to challenges coming from Big Data usage in Official
Statistics;
4. Recommendations on the possible Big Data usage inside Official
Statistics, describing actions to be taken to maximize results and risky
actions to avoid;
5. Description of international activities recently started to address the
issues coming from Big Data with related open research fields.
1.3 Methodology
In this section the methodological approach is described, explaining how the
research work is carried out in order to answer the proposed research
questions. Figure 1 shows the steps of the methodology.
The first phase is the selection of the research domain and the evaluation of
context knowledge. Here the context starts from the analysis of the current data
managed by Official Statistics institutions, mainly data coming from surveys
derived from questionnaires submitted to large samples of populations and
the use of statistical data derived from administrative sources (Administrative
Although traditional techniques permit the production of good quality data,
they raise problems related to many factors: the cost of traditional surveys;
the time needed for statistical production in a world that increasingly demands
timeliness for policy decisions; the increase in refusals to participate in
surveys due to the growing statistical burden; the increasing difficulty of
using CATI techniques, due to the growing number of people who do not have
a telephone or want to keep their number confidential; and the cost and labor
required to transform administrative data into quality statistical data.
In this context, Big Data can provide an additional source of investigation
thanks to their availability and to the fact that they can produce relevant and
timely information. Big Data provide such information on topics like the
connectivity of people, their mobility, the prices of e-commerce transactions,
online job searches, transactions in the property market and some of
people's value orientations. It is a very rich reservoir of information that
Official Statistics can also try to use: in practice, a third source of data that
cannot be ignored, besides traditional survey data and administrative data.
The use of Big Data in official statistics, however, poses problems of a
technological nature on the one hand, and on the other hand requires changes
in skills and new considerations about the significance of the
information produced. In fact, statisticians need to select, from the large
amount of existing data, only those that are actually useful from the
perspective of official statistics. The main issue is how Official Statistics
institutions can use Big Data while guaranteeing the data quality assured today
by traditional statistical methods, reducing costs, improving timeliness and
avoiding risks for privacy and security and assuring veracity and reliability of
We therefore describe the large increase in data availability (the so-called
data deluge) due to the growing adoption of the Web and to the development of
the Internet of Things. Then we list the classifications available in the
literature and the taxonomies of data that fall into the category of Big Data.
In order to use this valuable source of data, the new problems arising in
information technology cannot be ignored. We will therefore analyze the
main issues that Big Data pose in storage and processing, to identify the
technological challenges that must be resolved before Big Data can routinely
be used as a source of data. This will allow an examination of the hardware
and software technologies available today to deal with Big Data.
Figure 1 Methodology schema
Then the first recent experiences of Big Data usage to improve official
statistics will be analyzed. The uses made so far allow us to provide
a set of suggestions and considerations about the methods, organizational
activities and technologies that Statistical Institutions can adopt to start using
Big Data as a way to improve the quality of official statistics, followed by the
description of internationally coordinated activities that are addressing
the research fields still open.
The data exist, and so do the technologies; what is perhaps missing is a
decision-making process, together with an integration of competences between
traditional approaches to the production of statistics and the opportunities
offered by Big Data. It is necessary to overcome the natural distrust of
traditional producers, helping them to address the real problems regarding the
reliability, accuracy and representativeness of the data, but also to use new
skills to take advantage of an enormous wealth of data from which information
can be extracted at low cost.
It is a process that official statistics has already gone through, when
institutions started to use administrative sources, which today are a fully
established part of the production of statistical data. At least at the
beginning, Big Data will probably be used mainly to supplement and enrich the
traditional sources of statistical data, rather than to replace them entirely.
Chapter 2: Big Data definition and characteristics
In this chapter we start with an analysis of the explosion in the quantity of
information in recent years; then we list the most promising definitions of the
term Big Data and describe the most important technologies that make
their usage possible. Finally we give some considerations about the usage of
traditional statistical software tools together with Big Data technologies, the
problem of the current skills of the staff of statistical institutions, and some
issues about IT infrastructure.
our personal archives are exceeding the terabyte (one trillion bytes), we are
beginning to use higher multiples.
In the following table we give the units used to measure data, from the
most familiar ones to those that are essential to measure Big Data.
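As a rough illustration of these units, the following sketch (not part of the original study) converts a raw byte count into decimal multiples, from kilobyte to yottabyte. The helper name `format_bytes` is hypothetical, and decimal prefixes (powers of 1000) are assumed, consistent with a terabyte being one trillion bytes.

```python
# Decimal (SI) multiples of the byte; each unit is 1000 times the previous one.
# Assumption: decimal prefixes, so that 1 TB = 10**12 bytes.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(format_bytes(10**12))      # one trillion bytes
print(format_bytes(3 * 10**15))  # three quadrillion bytes
```

Binary multiples (powers of 1024) could be used instead by changing the divisor; the choice only matters when the exact size is relevant.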
All of this data would be useless if we couldn't store it, and that's where
Moore's Law [Schaller 1997] comes in. Following this law, which states that
the number of transistors on integrated circuits doubles every two years, since
the early 80s processor speed has increased from 10 MHz to 3.6 GHz, an
increase of 360 times (not counting increases in word length and number of
cores). But we've seen much bigger increases in storage capacity, on every
level. RAM has moved from $1,000/MB to roughly $25/GB, a price
reduction of about 40,000 times, and all this together with the reduction in
size and increase in speed. The first gigabyte disk drives appeared in 1982,
weighing more than 100 kilograms; now terabyte drives are consumer equipment,
and a 32 GB microSD card weighs about half a gram. Whether you look at bits per
gram, bits per dollar, or raw capacity, storage availability has grown faster
than CPU speed.
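The two ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch based on the figures in the text; taking 1 GB as 1024 MB is an assumption (with 1000 MB the RAM ratio is exactly 40,000), and the variable names are illustrative.

```python
import math

# Processor clock speed quoted in the text: from 10 MHz to 3.6 GHz.
cpu_ratio = 3.6e9 / 10e6

# RAM price quoted in the text: from $1,000 per MB to about $25 per GB.
# Assumption: 1 GB = 1024 MB, so $1,000/MB corresponds to $1,024,000/GB.
ram_ratio = (1000 * 1024) / 25

# Number of doublings (or halvings of price) needed to reach each ratio.
cpu_doublings = math.log2(cpu_ratio)
ram_doublings = math.log2(ram_ratio)

print(f"CPU speed increase: {cpu_ratio:.0f} times ({cpu_doublings:.1f} doublings)")
print(f"RAM price reduction: {ram_ratio:.0f} times ({ram_doublings:.1f} halvings)")
```

Expressed as doublings, the RAM price has halved almost twice as many times as the clock speed has doubled, which is the point the paragraph makes about storage outpacing CPU speed.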
Not only is the size of data growing, but the nature of these data is also
changing, mainly with the wide usage of social media and of services offered
via mobile phones. The bulk of this information can be called "data exhaust",
in other words, the digitally trackable or storable actions, choices, and
preferences that people generate as they go about their daily lives.3
But how big is "big"? Even though Big Data environments of petabyte
size are reported, a survey [Devlin 2012] shows that a significant
sample of companies considers "big" a daily flow of managed (loaded,
analyzed, processed) data greater than 1 TB (normally between 1 TB and
750 TB), while a common Big Data management system usually ranges
between 110 GB and 300 TB.
"Word Spy - Data Exhaust." Web. December 2013.
2.2 Big Data definitions
Here we will try to list some of the most used definitions of Big Data.
Defining Big Data is not an easy task: Opentracker, a Web company
specialized in web-tracking and offering a service called
(Big)Data-as-a-Service, collected more than thirty definitions of Big Data4.
We can start from the classic Wikipedia5 definition, "Big Data is a collection
of data sets so large and complex that it becomes difficult to process using
on-hand database management tools or traditional data processing
applications", but many authors adhere to the "Four Vs" definition that points
to the four characteristics of Big Data, namely volume, variety, velocity, and
veracity.6
"Definitions of Big Data." Opentracker. Web. December 2013.
"Big Data." Wikipedia. Wikimedia Foundation, Web. December 2013.
"What is Big Data." 8digits blog. Web. December 2013.
geography, and is smaller than the petabytes and zettabytes often
referenced. Many companies consider datasets between one terabyte
and one petabyte to be Big Data. Still, everyone can agree that whatever
is considered high volume today, will be even higher tomorrow;
Variety (different types of data and data sources): variety is about
managing the complexity of multiple data types, including structured,
semi-structured and unstructured data. Organizations need to integrate
and analyze data from a complex array of both traditional and non-
traditional information sources, from within and outside the enterprise.
With the explosion of sensors, smart devices and social media
technologies, data is being generated in countless forms, including text,
web data, tweets, sensor data, audio, video, click streams, log files and
Velocity (data in motion): the speed at which data is created, processed
and analyzed continues to accelerate. Higher velocity is due to both the
real-time nature of data creation, and the need to incorporate streaming
data into business processes. Today, data is continually being generated
at a rate that is impossible for traditional systems to capture, store and
analyze. For time-sensitive processes such as multi-channel instant
marketing, data must be analyzed in real time to be of value to the
Veracity (data uncertainty): it refers to the level of reliability associated
with certain types of data. The quest for high data quality is an
important Big Data requirement and challenge, but even the best data
cleansing methods cannot remove the inherent unpredictability of some
data, like the weather, the economy, or a customer's buying decisions.
The need to acknowledge and plan for uncertainty is a dimension of Big
Data that has been introduced as executives try to better understand the
uncertain world around them.
The above mentioned TechAmerica Foundation Big Data Commission [Mills
2012] gave this definition: "Big Data is a term that describes large volumes
of high velocity, complex and variable data that require advanced techniques
and technologies to enable the capture, storage, distribution, management,
and analysis of the information."
In the document "Big Data: The next frontier for innovation, competition,
and productivity" [Manyika 2011], the McKinsey Global Institute uses this
definition: "Big Data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and analyze. This
definition is intentionally subjective and incorporates a moving definition of
how big a dataset needs to be in order to be considered Big Data - i.e., we
don't define Big Data in terms of being larger than a certain number of
terabytes (thousands of gigabytes).
We assume that, as technology advances over time, the size of datasets that
qualify as Big Data will also increase. Also note that the definition can vary
by sector, depending on what kinds of software tools are commonly available
and what sizes of datasets are common in a particular industry. With those
caveats, Big Data in many sectors today will range from a few dozen
terabytes to multiple petabytes (thousands of terabytes)."
Tim O'Reilly gave a very short definition that perhaps includes all the others: "Big
Data is what happened when the cost of storing information became less than
the cost of making the decision to throw it away."
IDC defines them as follows: "Big Data technologies describe a new generation
of technologies and architectures, designed to economically extract value from
very large volumes of a wide variety of data, by enabling high-velocity
capture, discovery, and/or analysis" [Gantz 2011].
Many other definitions focus on the sources of Big Data, trying to list the
different types of existing data.
One possible taxonomy, coming from a consulting company [Devlin 2012]:
1. Human-sourced information: all information ultimately originates
from people. This information is the highly subjective record of human
experiences, previously recorded in books and works of art, and later in
photographs, audio and video. Human-sourced information is now
almost entirely digitized and electronically stored everywhere from
tweets to movies. Structuring and standardization, for example
modeling, defines a common version of the truth that allows the
business to convert human-sourced information to more reliable
process-mediated data. This starts with data entry and validation in
operational systems and continues with the cleansing and reconciliation
processes as data moves to Business Intelligence (Social Networks
2. Process-mediated data: business processes record and monitor
business events of interest, such as registering a customer,
manufacturing a product, taking an order, etc. The process-mediated
data thus collected is highly structured and includes transactions,
reference tables and relationships, as well as the metadata that sets its
context. Process-mediated data has long been the vast majority of what
IT managed and processed, in both operational and BI systems
(Traditional Business systems);
3. Machine-generated data: the output of sensors and machines
employed to measure and record the events and situations in the
physical world is machine-generated data, and from simple sensor
records to complex computer logs, it is well structured and considered
to be highly reliable. As sensors proliferate and data volumes grow, it is
becoming an increasingly important component of the information
stored and processed by many businesses. Its well-structured nature is
amenable to computer processing, but its size and speed is often beyond
traditional approaches, such as the enterprise data warehouse, for
handling process-mediated data; standalone high-performance relational
and NoSQL databases are regularly used (Internet of Things).
The UNECE task team on Big Data (see chapter 5) proposed another taxonomy
in 2013:
1. Social Networks (human-sourced information): this information is
the record of human experiences, previously recorded in books and
works of art, and later in photographs, audio and video. Human-sourced
information is now almost entirely digitized and stored everywhere
from personal computers to social networks. Data are loosely structured
and often ungoverned. Subcategories:
   Social Networks: Facebook, Twitter, Tumblr etc.
   Blogs and comments
   Personal documents
   Pictures: Instagram, Flickr, Picasa etc.
   Videos: Youtube etc.
   Internet searches
   Mobile data content: text messages
   User-generated maps
2. Traditional Business systems (process-mediated data): these
processes record and monitor business events of interest, such as
registering a customer, manufacturing a product, taking an order, etc.
The process-mediated data thus collected is highly structured and
includes transactions, reference tables and relationships, as well as
the metadata that sets its context. Traditional business data is the vast
majority of what IT managed and processed, in both operational and
Business Intelligence systems. Usually structured and stored in
relational database systems (Some sources belonging to this class
may fall into the category of "Administrative data").
   Data produced by Public Agencies
      Medical records
   Data produced by businesses
      Commercial transactions
      Banking/stock records
      Credit cards
3. Internet of Things (machine-generated data): derived from the
phenomenal growth in the number of sensors and machines used to
measure and record the events and situations in the physical world.
The output of these sensors is machine-generated data, and from
simple sensor records to complex computer logs, it is well structured.
As sensors proliferate and data volumes grow, it is becoming an
increasingly important component of the information stored and
processed by many businesses. Its well-structured nature is suitable
for computer processing, but its size and speed is beyond traditional
   Data from sensors
      Fixed sensors
         o Home automation
         o Weather/pollution sensors
         o Traffic sensors/webcam
         o Scientific sensors
         o Security/surveillance videos/images
      Mobile sensors (tracking)
         o Mobile phone location
         o Cars
         o Satellite images
   Data from computer systems
      Web logs
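To make the structure of this taxonomy concrete, the following minimal sketch encodes the three classes and the example sources listed above as a lookup table. The names `TAXONOMY` and `classify` are hypothetical and serve only as an illustration; they are not part of the UNECE proposal.

```python
# The three top-level classes of the taxonomy, each mapped to the
# example sources listed in the text (names abbreviated where the
# text gives brand examples).
TAXONOMY = {
    "Social Networks (human-sourced information)": [
        "Social Networks", "Blogs and comments", "Personal documents",
        "Pictures", "Videos", "Internet searches",
        "Mobile data content", "User-generated maps",
    ],
    "Traditional Business systems (process-mediated data)": [
        "Medical records", "Commercial transactions",
        "Banking/stock records", "Credit cards",
    ],
    "Internet of Things (machine-generated data)": [
        "Home automation", "Weather/pollution sensors",
        "Traffic sensors/webcam", "Scientific sensors",
        "Security/surveillance videos/images", "Mobile phone location",
        "Cars", "Satellite images", "Web logs",
    ],
}


def classify(source: str) -> str:
    """Return the top-level class of a source, or 'Unclassified' if not listed."""
    for category, sources in TAXONOMY.items():
        if source in sources:
            return category
    return "Unclassified"


print(classify("Traffic sensors/webcam"))
print(classify("Credit cards"))
```

A flat mapping like this loses the intermediate levels (for example fixed versus mobile sensors); a nested dictionary would be needed to preserve them.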
Another taxonomy, somewhat transversal to the previous ones and coming from the
international statistical community, can be found in the HLG document on Big Data7:
1. Administrative (arising from the administration of a program, be it
governmental or not), e.g. electronic medical records, hospital visits,
insurance records, bank records, food banks;
2. Commercial or transactional: (arising from the transaction between
two entities), e.g. credit card transactions, on-line transactions
(including from mobile devices);
3. From sensors, e.g. satellite imaging, road sensors, climate sensors;
4. From tracking devices, e.g. tracking data from mobile telephones,
"UNECE Statistics Wikis." What Does Big Data Mean for Official Statistics? Web. December 2013.
5. Behavioral, e.g. online searches (about a product, a service or any other
type of information), online page view;
6. Opinion, e.g. comments on social media.
2.3 Big Data technologies
Among the technologies enabling the use of Big Data, there are three
fundamental technological strategies for storing and providing fast access to
large data sets:
1. Improved hardware performance and capacity: use faster CPUs, use
more CPU cores (which requires parallel/threaded operations to take
advantage of multi-core CPUs), increase disk capacity and data transfer
throughput, and increase network throughput (MPP);
2. Reducing the size of data accessed: data compression and data
structures that, by design, limit the amount of data required for queries
(e.g., bitmaps, column-oriented databases) (NoSQL);
3. Distributing data and parallel processing: put data on more disks to
parallelize disk I/O, put slices of data on separate compute nodes that
can work on these smaller slices in parallel, and use massively distributed
architectures with emphasis on fault tolerance and performance
monitoring, with higher-throughput networks to improve data transfer
between nodes (Hadoop and MapReduce).
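The third strategy can be illustrated with a minimal Python sketch. It is only an analogy under stated assumptions: the function names (split_into_slices, partial_sum, parallel_total) are hypothetical, threads on a single machine stand in for separate compute nodes, and neither fault tolerance nor performance monitoring is modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(records, n_slices):
    """Distribute records round-robin over n_slices lists (the 'compute nodes')."""
    slices = [[] for _ in range(n_slices)]
    for i, record in enumerate(records):
        slices[i % n_slices].append(record)
    return slices


def partial_sum(data_slice):
    """Work done independently on one slice of the data."""
    return sum(data_slice)


def parallel_total(records, n_slices=4):
    """Sum all records by combining the partial results of each slice."""
    slices = split_into_slices(records, n_slices)
    # Threads are an assumption made for brevity; a real cluster would run
    # each slice on a different machine.
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        partials = list(pool.map(partial_sum, slices))
    return sum(partials)


total = parallel_total(list(range(1, 101)))
```

The point of the sketch is the shape of the computation: each slice is processed without reference to the others, and only the small partial results have to travel between workers.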
As keywords for these three classes of technologies we can use MPP
(Massively Parallel Processing), NoSQL (Not Only SQL), and Hadoop and
MapReduce.
Figure 5 Oracle RAC configuration for Big Data
MPP databases have the advantages of being able to scale simply by adding
hardware and of using standard SQL, so that they can be easily integrated
with ETL (extract/transform/load) [Vassiliadis 2002], visualization, and
display tools, without requiring the introduction of new skills.
A variety of database types today fall within the general NoSQL category;
the most important are the following:
Key-value systems, using a hash table with a unique key and a pointer to
a data item [Seeger 2009]. Key-value databases do not require a schema
(unlike RDBMSs) and offer great flexibility and scalability, but they do
not offer ACID (Atomicity, Consistency, Isolation, Durability) capability,
and they require implementers to think about data placement, replication,
and fault tolerance, as these are not expressly controlled by the technology
itself. Key-value databases are not typed and most of the data is stored
as strings. These include Memcached, Dynamo and Voldemort.
Amazon's S3 uses Dynamo as its storage mechanism. Also used
extensively is Riak9, an open-source fault-tolerant key-value NoSQL
"Riak Docs." Riak. Web. December 2013.
Columnar systems, used to store and process very large amounts of
data distributed over many machines [Schindler 2012]. Relational
databases are row oriented, as the data in each row of a table is stored
together. In a columnar, or column-oriented, database the data of each
column is stored together, across rows. It is very easy to add columns, and
they may be added row by row, offering great flexibility, performance, and
scalability. When you have volume and variety of data, you might want to
use a columnar database. It is very adaptable; you simply continue to add
columns. The most important example is Google's BigTable, where rows are
identified by a row key with the data sorted and stored by this key.
BigTable has served as the basis of a number of NoSQL systems,
including Cassandra (open sourced by Facebook), Hadoop's HBase
and Hypertable;
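The difference between row-oriented and column-oriented storage can be made concrete with a small Python sketch. It is an in-memory illustration only, not the storage format of BigTable or of any other product; the sample records and the helper to_columns are hypothetical.

```python
# Row-oriented layout: the values of each record are stored together.
rows = [
    {"id": 1, "region": "North", "sales": 120},
    {"id": 2, "region": "South", "sales": 80},
    {"id": 3, "region": "North", "sales": 50},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for index, record in enumerate(records):
        for name, value in record.items():
            # A column first seen in a later row is padded with None so that
            # position i in every column still refers to record i.
            column = columns.setdefault(name, [None] * index)
            column.append(value)
        for column in columns.values():
            if len(column) <= index:
                column.append(None)  # this record has no value for the column
    return columns


columns = to_columns(rows)

# A query on one attribute reads a single column, not every full record.
total_sales = sum(columns["sales"])

# Adding a column leaves the existing columns untouched.
columns["discount"] = [0.0, 0.1, 0.0]
```

Because records need not share the same set of columns, a new attribute can appear "row by row", which is the flexibility the text refers to.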
Figure 9 Graphical definition for graph Databases
As per the CAP theorem [Brewer 2000], there are three primary concerns you
must balance when choosing a data management system: consistency,
availability, and partition tolerance. The theorem states that a distributed
computer system can simultaneously guarantee at most two of them:
Consistency means that each client always has the same view of the
data;
Availability means that all clients can always read and write;
Partition tolerance means that the system works well across physical
network partitions.
Partition) and AC (Availability-Consistency).
One thing that has become clear is that there is no single solution to Big Data
problems. Instead, there are a variety of different database models emerging
that are more specialized and suitable for handling specific types of
problems. For example, the columnar databases that have been popular
recently are designed for high speed access to data on a distributed basis, and
work well with MapReduce.
Document databases, such as MongoDB and CouchDB, work better with
documents, and incorporate features for high-speed, high-volume processing
of document objects. Graph databases are specialized for graph data, and
key-value databases are another high-speed processing format that is
suitable for large data sets with relatively simple characteristics.
In the early 2000s, some engineers at Google looked into the future and
determined that while their current solutions for applications such as web
crawling, query frequency and so on were adequate for most existing
requirements, they were inadequate for the complexity they anticipated as the
web scaled to more and more users. These engineers defined a new
programming model in which the work is distributed across inexpensive
computers connected on the network in the form of a cluster.
Distribution alone was not a sufficient answer. This distribution of work must
be performed in parallel for the following three reasons:
The processing must be able to expand and contract automatically;
The processing must be able to proceed regardless of failures in the
network or the individual systems;
Developers must be able to create services that are easy for other
developers to use. Therefore, this approach must be independent of
where the data and computations are executed.
MapReduce [Dean 2004] is a programming model and an associated
implementation for processing and generating large data sets. The term
MapReduce refers to two separate and distinct tasks. The first is the map job,
which takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The
reduce job takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce job is always performed after the map job.
Putting the Map and Reduce functions to work efficiently requires an
algorithm too. The standard steps of a MapReduce workflow are something
like this:
1. Start with a large number of data or records
2. Iterate over the data
3. Use the map function to extract something of interest and create an
output list
4. Organize the output list to optimize for further processing
5. Use the reduce function to compute a set of results
6. Produce the final output.
Figure 13 Schema for MapReduce way of working
By analogy to SQL, the map function is like the GROUP BY clause and the
reduce function is like an aggregate function (e.g., SUM or COUNT) in an
aggregate query.
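The six steps can be sketched in a few lines of Python using the classic word-count example. This is a single-process illustration of the programming model, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the input records are hypothetical, and a real framework would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into a list of (key, value) tuples."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Organize the map output: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    mapped = []
    for record in records:                # steps 1-2: iterate over the data
        mapped.extend(map_phase(record))  # step 3: map
    grouped = shuffle(mapped)             # step 4: organize the output list
    # steps 5-6: reduce each group and produce the final output
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data", "Big Data and official statistics", "data"])
```

In terms of the SQL analogy, the key emitted by map_phase plays the role of the grouping column, and the sum in reduce_phase plays the role of COUNT.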
To better understand the MapReduce way of working, you can think of map and
reduce tasks as the way a census was conducted in Roman times, where the
census bureau would dispatch its people to each city in the empire. Each
of input data with a series of operators that transform the input data and
produce the desired output;
Sqoop13 (SQL-to-Hadoop) is a tool that offers the capability to extract
data from non-Hadoop data stores, transform the data into a form usable
by Hadoop, and then load the data into HDFS. This process is called
ETL, for Extract, Transform, and Load. While getting data into Hadoop
is critical for processing using MapReduce, it is also critical to get data
out of Hadoop and into an external data source for use in other kinds of
Zookeeper [Hunt 2010] is Hadoop's way of coordinating all the
elements of distributed applications, managing process synchronization,
configuration management and messaging between and among the nodes
(across racks).
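The Extract, Transform, Load sequence mentioned for Sqoop can be illustrated with a minimal Python sketch. It does not use Sqoop or HDFS: an in-memory SQLite table stands in for the non-Hadoop data store, an in-memory text buffer stands in for a file on HDFS, and the table name, columns and function names are hypothetical.

```python
import csv
import io
import sqlite3


def extract(connection):
    """Extract: read rows from a relational (non-Hadoop) data store."""
    query = "SELECT id, name, amount FROM orders ORDER BY id"
    return connection.execute(query).fetchall()


def transform(rows):
    """Transform: normalize each row into the flat record the target expects."""
    return [(row_id, name.strip().upper(), f"{amount:.2f}")
            for row_id, name, amount in rows]


def load(records, target):
    """Load: write tab-delimited lines to the target file-like object."""
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    writer.writerows(records)


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")
connection.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, " alice ", 10.5), (2, "bob", 3.0)])

target = io.StringIO()  # stands in for a file on HDFS
load(transform(extract(connection)), target)
```

The same three functions, applied in the opposite direction, would describe the export of results from the cluster to an external store.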
The Cloud has emerged as a principal facilitator of Big Data, both at the
infrastructure and at the analytic levels [Agrawal 2011]. The Cloud offers a
range of options for Big Data analysis in both public and private cloud
settings. On the infrastructure side, Cloud provides options for managing and
accessing very large data sets as well as for supporting powerful
infrastructure elements at relatively low cost.
The Cloud is particularly well suited to Big Data operations. The virtual,
adaptable, flexible, and powerful nature of Cloud certainly lends itself to the
enormous and shifting environment(s) of Big Data. Cloud architectures
consist of arrays of virtual machines that are ideal for the processing of very
large data sets, to the extent that processing can be segmented into numerous
parallel processes. This affinity was discovered at an early stage of Cloud
development, frequently leading directly to development of Hadoop clusters
that could be used for analytics.
"sqoop". Web. December 2013.
Figure 15 Google Trends: searches on "Cloud Computing" and "Big Data"
Figure 16 Tableplot usage for Netherlands Census data
Figure 17 Tableplot usage in statistical production process
The challenges that Big Data poses to visualization tools are listed
below [Zhang 2012]:
Semi- and Unstructured Data. The increasing speed of data
generation brings both opportunity and challenge. In particular, more
and more semi- or unstructured data are generated on-line or off-line. A
large number of data analysis and visualization techniques are available
for analyzing structured data, but methods for modeling and visualizing
semi- or unstructured data are still underrepresented. An effective
Visual Analytics system often needs to be able to handle both, and
ideally integrate the analysis of both types of data for supporting
decision making;
Advanced Visualization. Many commercial products seem slow to
integrate innovative visualization techniques. In particular, some big
software vendors tend to focus on only a small number of standard
visualization techniques such as line charts, bar charts and scatter plots,
which have limited capability in handling large complex data. The
success of more advanced products demonstrates the possibility and
benefit of transferring technical advances developed by academic
research into industrial products;
Customizable Visualization. Given the same data and visualization
technique, different parameter settings may lead to totally different
visual representations and give people different visual impressions.
Designing customizable visualization functions gives the user the
freedom to change visual parameter settings and more opportunity to
gain insight from the visualization;
Real Time Analysis. More and more data are generated in real time on
the Internet (e.g. online news streams, Twitter streams, weblogs) or by
equipment or devices (e.g. sensors, GPS, satellite cameras). If analysis
is applied appropriately, these data provide rich information resources
to many tasks. Therefore, improving analytical capability to handle
such data is a development opportunity in current commercial products.
We expect to see more functionality in this respect in the future;
Predictive Analysis. The demand for predictive modeling is increasing,
especially in the business domain, but only very few systems support
predictive analysis. Even in those systems that support predictive
analysis, not many predictive modeling methods are implemented.
Two lists follow: the first covers the business applications that can
be enabled by Big Data, the second the possible uses of Big Data in public
administration [Cohen 2009] [LaValle 2011] [Brown 2011] [Manyika 2011].
Types of business applications that can be directly enabled by Big Data:
Marketing: revenue generation and business model development,
particularly in retail and consumer packaged goods, where there is
direct or indirect interaction with large consumer markets, moves to a
new level;
Cost containment in real time becomes viable as electronic event
monitoring (from automobiles to smartphones), fraud detection in
financial transaction data and similar applications expand to include
larger volumes of messages, often of smaller size or value, on
ever-shorter timescales. Big Data analysis techniques on streaming data,
before or without storing it on disk, have become the norm, enabling
faster reaction to specific problems before they escalate into major
situations;
Real-time forecasting becomes possible as utilities, such as water and
electricity supply and telecommunications, move from measuring
consumption on a macro- to a micro-scale using pervasive sensor
technology and Big Data processes to handle it;
Tracking of physical items by manufacturers, producers and distributors
- everything from food items to household appliances and from parcel
post to container shipping - through distribution, use and even disposal
drives deep optimization of operational processes and enables improved
customer experiences. People, as physical entities, are also subject to
tracking for business reasons or for surveillance;
Reinventing business processes through innovative use of sensor-
generated data offers the possibility of reconstructing entire industries.
Automobile insurance, for example, can set premiums based on actual
behavior rather than statistically averaged risk. The availability of
individual genomic data and electronic medical records presents the
medical and health insurance industries with significant opportunities,
not to mention ethical dilemmas.
As far as the public sector is concerned, there are many possibilities to use Big
Data to address the mission of government:
Healthcare Quality and Efficiency: health expenditures represent a
growing component of gross domestic product and chronic diseases,
such as diabetes, are increasing in prevalence and consuming a greater
percentage of healthcare resources. The increased use of electronic
health records (EHRs) coupled with new analytics tools presents an
opportunity to mine information for the most effective outcomes across
large populations. Using carefully de-identified information,
researchers can look for statistically valid trends and provide
assessments based upon true quality of care;
Healthcare Early Detection: Big Data in health care may involve using
sensors in the hospital or home to provide continuous monitoring of key
biochemical markers, performing real-time analysis on the data as it
streams from individual high-risk patients to central systems. The
analysis system can alert specific individuals and their chosen health
care provider if the analysis detects a health anomaly requiring a visit
to their provider, or an emergency event about to happen. This has the
potential to extend and improve the quality of millions of citizens' lives;
Transportation: Big Data has the potential to transform transportation in
many ways. Traffic jams in many countries waste energy, contribute to
global warming and cost individuals time and money. Distributed
sensors on handheld devices, on vehicles, and on roads can provide
real-time traffic information that is analyzed and shared. This
information, coupled with more autonomous features in cars can allow
drivers to operate more safely and with less disruption to traffic flow;
Education: Big Data can have a profound impact on education. For
example, through in-depth tracking and analysis of on-line student
learning activities (with fine-grained analysis down to the level of
mouse clicks), researchers can ascertain how students learn and the
approaches that can be used to improve learning. This analysis can be
done across thousands of students rather than through small isolated
Fraud Detection - Healthcare Benefits Services: Big Data can transform
improper payment detection and fundamentally change the risk and
return perceptions of individuals that currently submit improper,
erroneous or fraudulent claims. This challenge is an opportunity to
explore a use case for applying Big Data technologies and techniques,
to perform unstructured data analytics on medical documents to
improve efficiency in mitigating improper payments. Automating the
improper payment process and utilizing Big Data tools, techniques and
governance processes would result in greater improper payment
prevention or recovery;
Fraud Detection - Tax Collection: By increasing the ability to quickly
spot anomalies, government collection agencies can lower the tax gap
(the difference between what taxpayers owe and what they pay
voluntarily) and profoundly change the culture of those that would
consider attempting improper tax filings. Big Data offers the ability to
improve fraud detection and uncover noncompliance at the time tax
returns are initially filed, reducing the issuance of questionable refunds;
Weather: the ability to better understand changes in the frequency,
intensity, and location of weather and climate can benefit millions of
citizens and thousands of businesses that rely upon weather, including
farmers, tourism, transportation, and insurance companies. Weather and
climate-related natural disasters result in tens of billions of dollars in
losses every year and affect the lives of millions of citizens. New
sensors and analysis techniques hold the promise of developing better
long term climate models and nearer term weather forecasts;
New ways of combining information: Big Data provides an opportunity
to develop thousands of new innovative ways of developing solutions
for citizens. For example, Big Data may be useful in helping
unemployed job seekers find new opportunities by combining their job
qualifications with an analysis and mining of available job opportunities
that are posted on the Internet (e.g., company Websites, commercial
2.5 Considerations on Technology and their usage in
Official Statistics
search at different communities (academic, public and private) to identify
where data scientists are and connect them to the area of official statistics.
To meet these needs, statistical organizations may be interested in recruiting
people from experimental physics, researchers in physical or social sciences,
or other fields with strong data and computational focus. There are also
opportunities for statistical organizations to work with academics and
organizations that could provide the necessary expertise.
In the long term, perhaps the official statistical community could organise
data science training and lectures using existing Advanced Schools in
Statistics in connection with Big Data players (like Google, Facebook,
Apache foundation) that would lead to a kind of certification in data science.
Considerations on IT infrastructure - As far as the organization
of IT in statistical institutions is concerned, if the volume and velocity
of the data are significantly greater than in traditional processing,
consideration needs to be given to the cost-benefit of further enhancing the
IT infrastructure once an understanding of Big Data processing has been
developed. The key bottleneck points with respect to volume and velocity
can be:
the capacity of the NSI to receive the data (bandwidth)
the capacity of the NSI to catalogue and organise the data for Big data
processing environment
the capacity of the Big data processing environment to process the
buckets of data in a sufficiently timely manner.
Two possible options to consider for Big Data processing are
outsourcing and/or downsampling.
Given the volumes of data involved, consideration should be given to
whether it is necessary to retain data in any form when they are no longer
required. If there is a requirement to store the raw data, then some options
that could be considered to reduce the volume of data to be stored include
sampling or simply retaining a moving window of data (say the most recent
weeks/months of data).
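Both retention options can be sketched in Python. The example below is a minimal illustration, assuming records arrive one at a time in chronological order; the names reservoir_sample and MovingWindow are hypothetical, and reservoir sampling is only one of several possible sampling schemes, chosen here because it does not require knowing the size of the stream in advance.

```python
import random
from collections import deque
from datetime import date, timedelta


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample


class MovingWindow:
    """Retain only the records of the most recent `days` days."""

    def __init__(self, days):
        self.span = timedelta(days=days)
        self.records = deque()

    def add(self, day, record):
        # Assumes records arrive in chronological order.
        self.records.append((day, record))
        while self.records and self.records[0][0] <= day - self.span:
            self.records.popleft()   # drop records that left the window


sample = reservoir_sample(range(1000), k=10)

window = MovingWindow(days=7)
for offset in range(10):
    window.add(date(2013, 12, 1) + timedelta(days=offset), offset)
```

In both cases the storage requirement is bounded in advance (k records, or the records of a fixed number of days), regardless of how much data flows through.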
Chapter 3: Big Data usage in Official Statistics -
This chapter starts by analyzing the relationship between data science and
statistics. Then the main challenges that Big Data poses to Official Statistics
are listed, grouped into legislative, privacy, financial, management,
technological and methodological. Finally, we list some remarks highlighted
by the first international experiences.
In recent years, following the explosion of Big Data, many studies started
using the term Data Science, referring to an emerging area of work
concerned with the collection, preparation, analysis, visualization,
management, and preservation of large collections of information.
Data science includes a family of disciplines, one of the most important of
which is statistics. In the opinion of Kirk Borne, an influential data scientist
[Borne 2013], some Big Data users are tempted to do without the key tenets
of statistical reasoning. One reason for this may be that Big Data offers a
convenient path around statistical rigor, since there are so many possible
results and discoveries in large data collections that there is apparently no
need to use the mathematical complexity of statistics.
People could think that if they now have enough data to do 1000-fold cross-
validation or 1000-variable models with millions of training samples, then
statistics must become irrelevant.
Borne mentions four foundational statistical truisms (obvious, self-evident
truths) that are at risk in the age of Big Data:
1. Correlation does not imply Causation - Everyone knows this, but
many choose to ignore it. People may think that this fundamental tenet
of statistics is no longer an important concept when working with Big
Data, since huge numbers of correlations can be discovered now in
massive data collections, and some of these correlations must have a
causal relationship, which should be good enough. The search for
patterns, trends, correlations, and associations in data without
preconceived models is one of the major use cases of Big Data
[McAfee 2012]: correlation mining and discovery. In fact, finding
causes to observed effects would truly be a gold mine of value for any
business, science, government, healthcare, or security group that is
analyzing Big Data;
2. Sample variance does not go to zero, even with Big Data -
Researchers are familiar with the concept of statistical noise and how
noise decreases as sample size increases. But sample variance is not the
same thing as noise [Allison 2001]. The former is a fundamental
property of the population, while the latter is a property of the
measurement process. The final error in our predictive models is likely
to be irreducible beyond a certain threshold: this is the intrinsic sample
variance. For complex multivariate models, the bigger the sample, the
more accurate will be your estimate of the variance in different
parameters representing the population. It might be one of the
fundamental characteristics of the population. In any domain, as you
collect more data on the various members of the population, you can
make better and better estimates of the fundamental statistical
characteristics of the population, including the variance in each
property across the different classes. Statistical truism #2 fulfills one of
the big promises of Big Data: obtaining the best-ever statistical
estimates of the parameters describing the data distribution of the
3. Sample bias does not necessarily go to zero, even with Big Data -
The tendency to ignore this principle of statistics occurs particularly
when we have biased data collection methods or when our models are
under-fitted, which is usually a consequence of poor model design and
thus independent of the quantity of data in hand. As Albert Einstein
said: models should be as simple as possible, but no simpler. In the
era of Big Data, it is still feasible to settle for a simple predictive model
that ignores many relevant patterns in the data collection. Another
situation in which bias does not evaporate as the data sample gets larger
occurs when correlated factors are present in an analysis that incorrectly
assumes statistical independence. Statistical truism #3 warns us that just
because we have Big Data does not mean that we have properly applied
those data to our modeling efforts;
4. Absence of Evidence is not the same as Evidence of Absence - In the
era of Big Data, we easily forget that we haven't yet measured
everything. Even with the prevalence of data everywhere, we still
haven't collected all possible data on a particular subject. Consequently,
statistical analyses should be aware of and make allowances for missing
data (absence of evidence), in order to avoid biased conclusions. On the
contrary, "evidence of absence" is a very valuable piece of information,
if you can prove it. A dramatic example of a failure to appreciate this
statistical concept is the Shuttle Challenger disaster in 1986, when
engineers assumed that the lack of evidence of O-ring failures during
cold weather launches was equivalent to evidence that there would be
no O-ring failure during a cold-weather launch [Casella 1999]. This is
an extreme case, but neglect of statistical truism #4 is still an example
of fallacious reasoning in the era of Big Data that we should avoid.
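The second truism can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical population that is normally distributed with mean 50 and standard deviation 10 (so a variance of 100), and the function name describe is an assumption of the example. It shows that the uncertainty of the estimated mean shrinks as the sample grows, while the sample variance does not go to zero but settles around the population value.

```python
import random
import statistics


def describe(sample_size, seed=42):
    """Draw a sample and return its variance and the standard error of its mean."""
    rng = random.Random(seed)
    # Hypothetical population: normal, mean 50, standard deviation 10.
    sample = [rng.gauss(50, 10) for _ in range(sample_size)]
    variance = statistics.variance(sample)            # estimates sigma^2 = 100
    standard_error = (variance / sample_size) ** 0.5  # uncertainty of the mean
    return variance, standard_error


for n in (100, 10_000, 50_000):
    variance, standard_error = describe(n)
    print(f"n={n:>6}  sample variance={variance:7.2f}  "
          f"standard error of the mean={standard_error:.4f}")
```

The standard error falls roughly with the square root of the sample size, whereas the variance column only becomes a more precise estimate of a quantity that is a property of the population itself.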
"UNECE Statistics Wikis." High-Level Group for the Modernisation of Statistical Production and
Services. Web. December 2013.
The first major analysis focuses on changes to the political role of the NSOs
(National Statistical Organizations) that may result from the growing
influence of private producers of Big Data. We can read in the document24:
Big Data has the potential to produce more relevant and timely statistics
than traditional sources of official statistics. Official statistics has been based
almost exclusively on survey data collections and acquisition of
administrative data from government programs, often a prerogative of
National Statistical Organizations (NSOs) arising from legislation. But this is
not the case with Big Data where most data are readily available or with
private companies. As a result, the private sector may take advantage of the
Big Data era and produce more and more statistics that attempt to beat
official statistics on timeliness and relevance.
It is unlikely that NSOs will lose the "official statistics" trademark but they
could slowly lose their reputation and relevance unless they get on board.
One big advantage that NSOs have is the existence of infrastructures to
address the accuracy, consistency and interpretability of the statistics
produced. By incorporating relevant Big Data sources into their official
statistics process NSOs are best positioned to measure their accuracy, ensure
the consistency of the whole systems of official statistics and providing
interpretation while constantly working on relevance and timeliness. The role
and importance of official statistics will thus be protected.
Indeed, the use of Big Data as an additional source to be integrated
alongside traditional survey data and administrative data is a must for
National Statistical Institutes [Balbi 2013], not only for reasons of
competition with the private sector, but also because of the costs associated
with traditional data collection techniques and the increasing levels of
non-response due to the burden of compiling a questionnaire, even when it is
offered through advanced modes (web surveys).
Obviously, the potential use of Big Data for statistical purposes is subject
to a set of challenges, due not only to their particular characteristics (high
volume, velocity and variability), but also to the fact that their origin and
generation mode are often completely outside NSOs' control. These challenges
are:
"UNECE Statistics Wikis." What Does Big Data Mean for Official Statistics? Web. December 2013.
1. Legislative: are Big Data accessible to NSOs, and under what conditions?
a. Legislation in some countries provides the right to access data both to
public and private organizations while other countries guarantee the
right of access only to public entities. This can restrict access to
some types of Big Data;
b. As noted by the ESSNet AdminData25, the right of NSOs to access
admin data, established in principle by the law, often is not
adequately supported by specific obligations for the data holders;
c. Even if legislation grants the access to all types of data, the way to
demonstrate the statistical purpose for accessing the data may be
different from country to country.
2. Privacy: in accessing and processing Big Data, what assurances exist on
the protection of confidentiality?
a. Definitions may vary from country to country but privacy can be
defined as freedom from unauthorized intrusion26. The problem
with Big Data is that the users of services and devices generating the
data are usually unaware that they are generating data.
3. Financial: access to Big Data often has a cost, maybe lower than
that of statistical data, but sometimes considerable:
a. There are probably costs to acquire Big Data held by the private
sector, especially if legislation is silent on the financial modalities of
acquisition of external data. And so NSOs must balance quality (i.e.
relevance, timeliness, accuracy, coherence, accessibility and
interpretability) against costs and reduction in response burden. The
potential benefits should exceed costs, because new information
coming with Big Data could increase the efficiency of government
b. A report prepared by the TechAmerica Foundation's Federal Big Data
Commission in the United States [Mills 2012] states that the success
of transformation to Big Data lies in understanding a specific
"The Use Of Administrative And Accounts Data For Business Statistics." ESSNET Admin Data
Wiktionary. Web. December 2013.
"Privacy." Merriam-Webster. Web. December 2013. http://www.merriam-
agency's critical business imperatives and requirements, developing
the right questions to ask and understanding the art of the possible,
and taking initial steps focused on serving a set of clearly defined
use cases. This approach can certainly be valid also for NSO
4. Management: what is the impact on the organization of an NSI when Big
Data become an important source of data?
a. Big Data for official statistics means more information coming to
NSOs. This information must be subject to NSOs' policies and
directives on the management and protection of the information;
b. The skills required of data scientists [Davenport 2012] are not easy
to find inside the official statistics community. The NSOs should
perform in-house and national scans (academic, public and private
sector communities) to identify where data scientists are and try to
involve them in the area of official statistics.
5. Technological: what paradigm shift is required in Information Technology
in order to start using Big Data?
a. Collecting data [Parise 2012] in real time or near real time (often
through the use of standard Application Programming Interfaces -
API) maximizes the potential of data, opening new opportunities for
combining administrative data with high-velocity data coming from
other sources, such as commercial data (credit card
transactions, on-line transactions, sales, etc.), tracking devices
(cellular phones, GPS, apps), physical sensors (traffic,
meteorological, pollution, energy, etc.), social media (Twitter,
Facebook, etc.) and search engines (online searches, online page
b. The Big Data paradigm shift in data collection makes it possible
to collect and integrate many types of data from many
different sources. Combining data sources to produce new
information is a further interesting challenge for the near future.
6. Methodological: what is the impact of the use of Big Data (in
combination with or as a substitute for statistical data) on the
consolidated methods of data collection, processing and dissemination?
a. Representativity is the fundamental issue with Big Data [Tufecki
2013]. The difficulty in defining the target population, survey
population and survey frame undermines the traditional way in which
official statisticians think. With a traditional survey, statisticians
identify a target population, build a survey frame, draw a sample,
collect the data, etc. With Big Data all these phases can be completely
skipped, and they are no longer under the responsibility of the
statistician. This requires statisticians to completely change their
way of thinking, since the characteristics of the data should be
considered exogenous;
b. Another issue is both technological and methodological. When more
and more data is being analysed, traditional statistical methods,
developed for the analysis of small samples, run into trouble; in the
simplest case they are just not fast enough. Hence the need for new
methods and tools:
i. Methods to quickly uncover information from massive amounts
of data available, such as visualisation methods and data, text
and stream mining techniques, that are able to make Big Data
small [Dunne 2013];
ii. Methods capable of integrating the information uncovered in
the statistical process, such as linking at massive scale and
statistical methods specifically suited for large datasets.
Methods need to be developed that rapidly produce reliable
results when applied to very large datasets.
c. The use of Big Data for official statistics triggers a need for new
techniques [LaValle 2011]. Methodological issues that these
techniques need to address are:
i. Measures of quality of outputs produced from external data
supply. The dependence on external sources clearly limits the
range of measures that can be reported;
ii. Limited application and value of externally-sourced data;
iii. Difficulty of integrating information from different sources to
produce high-value products.
3.3 Remarks from early experiences
We list below some issues to be addressed when using Big Data in
official statistics. Many of the following issues come from the first
experiences at CBS (the Dutch National Statistical Institute).
Data exploration: typically Big Data sets are made available to NSIs,
rather than designed by them. Thus data contents and structure need to
be understood prior to using the data for analysis. This is called data
exploration, often involving visualisation methods [Zykopoulos 2012].
Recently some visualisation methods have emerged that are particularly
suited to Big Data. Examples are tableplots [Tennekes 2013] for data
with many variables, and 3D heatmaps [Liu 2013] to study variability
in multivariate continuous data. Data exploration tries to reveal data
structure and assess data quality including exposure of errors,
anomalies and missing data;
Missing data: despite the enormous amounts of data generated each
day, data coming from sensors often has gaps. Official
statisticians need to find ways to cope with the missing data problem and
simultaneously reduce the amount of data to a manageable level.
Missing data was experienced in two different case studies: traffic loop
detection data and social media data [Daas 2012]; this may be due to
server downtime and/or network outages. A possible way to overcome it
is to focus on statistical modelling able to cope with missing data, and
on the development of information extraction and aggregation methods;
Volatility and resolution: data coming from sensors can fluctuate
considerably from minute to minute. These fluctuations are caused by
real changes in the phenomenon but, from a statistical point of view,
may not be very informative as they occur at too high a resolution.
Similarly, sentiment analysis on a daily basis may suffer from volatility
that is not seen at weekly or monthly intervals [O'Connor 2010]. It is
therefore necessary to develop statistical methods able to cope with
volatile behavior. Possibilities under consideration at CBS are moving
averages and advanced filtering techniques;
Representativity/Selectivity: the analyses in the first Dutch experiences
apply to traffic on roads equipped with traffic loop sensors, and to
sentiment analysis of people who post Dutch messages on social media
web sites. These are subpopulations of, respectively, all traffic on Dutch
roads and all people in the Netherlands. The subpopulations covered
by these Big Data sources are not target populations for official
statistics. Therefore data are likely to be selective, not representative of
a relevant target population. Representativity of Big Data could be
assessed through careful comparison of characteristics of the covered
population and the target population. This can be problematic, as often
there are no characteristics readily available to conduct such
comparison. For example, little is known about the people posting on
social media. Often only their user name is known but not their age or
gender. In situations where at least some background information is
available, the selectivity issue can be assessed, and addressed if
necessary. This could be achieved through predictive modeling, using a
wide variety of algorithms known from statistical learning and from
application of data mining methods in official statistics [Buelens 2012];
Long-term stability may be a problem when using Big Data. Typically,
statistics for policy making and evaluation are required for extended
periods of time, often covering many years. The Big Data sources
encountered so far seem subject to frequent modifications, possibly
compromising their long term use;
Privacy and data ownership are other issues that need to be addressed,
as many potential Big Data sources are collected by non-governmental
organizations, a situation that may not be covered by existing
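The moving averages mentioned above under "Volatility and resolution", combined with the gaps described under "Missing data", can be illustrated with a minimal sketch. It is not the method used at CBS: the function name, the window length and the example counts are all assumptions made for illustration. The average is centred and simply skips missing observations inside the window.

```python
from typing import List, Optional


def centered_moving_average(
    values: List[Optional[float]], window: int = 5
) -> List[Optional[float]]:
    """Smooth a series with a centred moving average, skipping missing values.

    `None` marks a missing observation (for example a sensor outage).
    A position is left as `None` only if its whole window is missing.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed: List[Optional[float]] = []
    for i in range(len(values)):
        # Keep only the observed values that fall inside the window.
        start = max(0, i - half)
        neighbours = [v for v in values[start:i + half + 1] if v is not None]
        smoothed.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return smoothed


# Hypothetical vehicle counts per minute; None marks a minute lost to an outage.
counts = [12, 15, None, 40, 14, 13, None, None, 16]
print(centered_moving_average(counts, window=3))
```

A wider window gives a smoother series at the cost of resolution, which is the trade-off described above for minute-level sensor data.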
Chapter 4: First applications of Big Data in
"Flu Trends." Google. Web. December 2013.
Figure 18 Comparison between Google Flu Trends estimate and US data on influenza
Figure 19 Google Flu Trends updated model after H1N1 pandemic
After the experience of Flu Trends, Google released a more general website,
named Google Correlate, an online, automated method for query selection
that does not require such prior knowledge [Mohebbi 2012].
Using Correlate, given a temporal or spatial pattern of interest, users can
determine which queries best mimic the data. These search queries can then
be used to build an estimate of the true value of the phenomenon. This
approach has also been used to produce accurate models of the home
refinance rate in the United States.
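The idea of finding the queries that best mimic a target series can be sketched as a ranking by correlation. This is only an illustration under stated assumptions: the use of the Pearson coefficient, the function names and the toy series are not taken from Google Correlate, whose actual selection method is described in [Mohebbi 2012].

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information on co-movement
    return cov / sqrt(var_x * var_y)


def rank_queries(
    target: Sequence[float], query_series: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (query, correlation) pairs, best-matching query first."""
    scores = [(query, pearson(target, series)) for query, series in query_series.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weekly values of a phenomenon and of three search queries.
target = [1.0, 2.0, 3.0, 4.0]
queries = {
    "query_a": [2.0, 4.0, 6.0, 8.0],
    "query_b": [4.0, 3.0, 2.0, 1.0],
    "query_c": [1.0, 1.0, 1.0, 1.0],
}
for name, score in rank_queries(target, queries):
    print(name, round(score, 2))
```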
In addition, Google researchers showed that spatial patterns in real-world
activity and temporal patterns in web search query activity can both surface
interesting and useful correlations: users can supply their own data and find
search patterns that correspond to real-world trends.
As an example, the next figure shows the correlation between wintertime
and searches for the word "pizzoccheri", a typical winter dish of
northern Italy28.
Figure 20 Seasonal Correlation between Winter and the searches for "Pizzoccheri" word
"Pizzoccheri." Wikipedia. Wikimedia Foundation Web. December 2013.
4.2 Prices
"Eurostat E-commerce Statistics." Statistics Explained RSS. Web. December 2013.
"The Billion Prices Project @ MIT." The Billion Prices Project MIT RSS. Web. December 2013.
one. The biggest difference is in the basket, which, when BPP started, was
simply composed of all the prices that could be collected online.
The online measure has the advantage that it comes out daily, allowing
researchers to examine daily price changes. The results of the BPP index are
very close to the US CPI (Consumer Price Index), collected by the US Bureau of
Labor Statistics31 at 23,000 retailers in 90 US cities, at a cost of over 200
million dollars a year. The official CPI is based on a number of prices five
times lower than that of the BPP project. The New York Times commented on the
news: "Data on prices, once monopolized by government gatekeepers, are now up
for grabs."32
The project was managed as a research project at MIT. The research uses
web-collected elementary data to study prices [Cavallo 2011b]. The current
research fields are:
Pricing Behavior: What drives price stickiness around the world?
How much can be explained by current and past inflation? How
much by competition and structure of industries? Are prices
synchronized between commodities and between countries?
Daily Inflation and Asset Prices: How well do daily inflation indexes
across countries and sectors match official statistics? What are
the links between daily inflation, asset prices, and inflation
Pass-Through: How much do prices adjust internally when the
exchange rate, or the international price of commodities, changes?
Green Markups: What premium is paid in stores for green or
organic products? Storing data from multinational retailers, they
can compute premium differences - for exactly the same items - in
different places.
The MIT project shows that in macroeconomics, too, the availability of (big)
data coming from the Web allows scientists to open new research fields.
"CPI News Releases." U.S. Bureau of Labor Statistics. Web. December 2013.
"The Real-time Inflation Calculator." The New York Times online. Web. December 2013.
The most famous use of BPP data was an alternative inflation index for
Argentina, which is updated on a daily basis on the website Inflacion
Verdadera33 [Cavallo 2009]. The website showed that the price index
provided by the Argentine government did not represent the actual change in
prices. For the first time, data coming from a National Statistical Institute
were contested on the basis of Big Data obtained from the web. Figure 21
shows the difference between the official CPI and the price index calculated
at MIT.
Figure 21 Comparison between Argentina Inflation rate and Price Index computed by
In 2010 Rigobon and Cavallo, together with other colleagues, founded
PriceStats34, a company with the objective of bringing the academic research
to market [deHaan 2013].
PriceStats uses the following flow to generate its indexes:
"The Billion Prices Project @ MIT. Web. December 2013.
"PriceStats ." PriceStats . December 2013.
Scraping: PriceStats uses web scraping technologies to monitor online
prices every day. Web scraping is the process of automatically
collecting information from the web by converting unstructured data
(often in HTML format) into structured datasets that can be stored and
analyzed. PriceStats uses a combination of commercial and custom
scraping solutions to address the complexities of monitoring prices
across thousands of retailers;
Online Retailers: a key step of the PriceStats approach is to identify the
best retailers to use for inflation measurement. PriceStats selects
retailers with large market shares, in relevant cities, that sell both
offline and online. In most countries, their data covers key economic
sectors such as food, clothing, electronics, furniture and energy. Some
categories, mainly services, are not covered, but this is not a
problem for the goal of detecting the main changes in inflation trends.
Services are usually quite stable, not the main source of volatility, and
can be indirectly monitored through items with similar price behavior;
Processing: once the data collection is complete, PriceStats runs
automatic procedures to ensure that the data can be used for inflation
measurement. The data is first structured and cleaned so that it can be
used in a consistent manner. Price data is recorded in a standard format
and then categorized across economic sectors, and a set of
performance statistics is automatically calculated to evaluate the
quality of the data;
Inflation Statistics: finally, daily inflation statistics are computed using
advanced econometric techniques and leveraging official weights as
much as possible.
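The last step of the flow above can be illustrated with a deliberately simple sketch. The source does not describe the econometric techniques or the weighting used by PriceStats, so the code below swaps in a textbook alternative: an unweighted, chained Jevons index, i.e. the geometric mean of the day-to-day price relatives of items observed on both days. The function name and the example prices are hypothetical.

```python
from math import exp, log
from typing import Dict, List


def jevons_chain_index(
    daily_prices: List[Dict[str, float]], base: float = 100.0
) -> List[float]:
    """Chain a daily index from matched-item price relatives.

    `daily_prices[t]` maps an item identifier to its scraped price on day t.
    Only items present (with positive prices) on two consecutive days are used.
    """
    index = [base]
    for previous, current in zip(daily_prices, daily_prices[1:]):
        matched = [
            item for item in current
            if item in previous and previous[item] > 0 and current[item] > 0
        ]
        if not matched:
            index.append(index[-1])  # no matched items: carry the index forward
            continue
        # Geometric mean of price relatives, computed in logs for stability.
        mean_log_relative = sum(
            log(current[item] / previous[item]) for item in matched
        ) / len(matched)
        index.append(index[-1] * exp(mean_log_relative))
    return index


# Hypothetical scraped prices for three consecutive days.
days = [
    {"milk": 1.00, "bread": 2.00},
    {"milk": 1.10, "bread": 2.20},
    {"milk": 1.10, "tv": 500.0},  # bread disappeared, a new item appeared
]
print(jevons_chain_index(days))
```

Matching items across consecutive days is one simple way to handle products that enter or leave the online catalogue; a production index would also apply expenditure weights, which are omitted here.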
PriceStats inflation indices show that online prices are a successful measure
of inflation, despite online price sources being different to those of official
inflation estimates. Although online and offline prices may have different
nominal values, price changes tend to follow similar trends. Since inflation
statistics measure price changes, online prices are a great way to measure
The following chart shows how closely PriceStats tracks the Consumer
Price Index data released by the US government (Figure 22).
"PriceStats." State Street Global Exchange Research Risk Indices. Web. December 2013.
Figure 23 Comparison of official Argentina inflation and Pricestats computed one
One challenge when using internet robots is to keep them working when
websites change. These changes may vary from simple layout and style
changes to a complete makeover in terms of technology and communication
structure. CBS has also tried to quantify the time required to modify the robot
as a result of changes to the Web sites, concluding that the method of data
collection remained affordable, even taking into account the necessary
adjustments to the robots.
After the experiences with oil and flight prices, CBS focused on housing
market sites. Since these sites have large amounts of information presented in
harmonized ways, they are primary candidates for automated data collection.
Since the beginning of 2011, data on housing prices for one province
in the Netherlands have been collected from three separate internet sites.
"Statistical Data." National Bureau of Statistics of China. Web. December 2013.
"China Turns to Big Data to Gauge Inflation." - Xinhua. Web. December 2013.
companies include e-commerce giant Alibaba41; leading search engine
Baidu42; China United Network Communications43, the country's
second-largest mobile operator; and the FANYA Metal Exchange44, one of the
largest spot trading and investing platforms for rare metals.
Ma Jiantang, head of NBS, said at the signing ceremony of the strategic
partnership that the era of producing, sharing and using data is coming: "Big
Data will become the foundation of government management, social
management and macro-economic control". "This is only our first step," he
said. "We will cooperate with more companies in the future."
Huang Linli, a senior analyst with Baidu, said her company has some natural
advantages in developing Big Data technology: "Netizens' search requests
exceed 5 billion a day on The data generated from those searches
will be very valuable to the government for making predictions on the
economy, as well as other sectors."
4.3.1 Netherlands
One of the first National Statistical Institutes to start experimenting with
Big Data in official statistics was Statistics Netherlands (CBS45). Here we
present the Dutch experience with the analysis of traffic loop detection data
[Daas 2013].
In the Netherlands there is a central authority collecting data on traffic:
the National Data Warehouse for Traffic Information (NDW46). Real-time traffic
data gives a picture of the current traffic situation on the roads. Every minute,
data from more than 20,000 measuring sites in the Netherlands is collected
"Manufacturers, Suppliers, Exporters & Importers from the World's Largest Online B2B Marketplace-" Alibaba. December 2013.
"." . Web. December 2013.
"About Us." _China Unicom. Web. December 2013.
"." . Web. December 2013.
"CBS - Home." CBS - Home. Web. December 2013.
"Home." - Nationale Databank Wegverkeersgegevens. Web. December 2013.
by the database and distributed to users of the data within 75 seconds. The
following data are covered:
Traffic flow
Realised travel time
Estimated travel time
Traffic speed
Vehicle classes
"The R Project for Statistical Computing." The R Project for Statistical Computing Web. December
Next the number of vehicles in various length categories (small,
medium-sized and large) was studied. Figure 28 illustrates the difference in
driving behavior. The small vehicles have clear morning and evening
rush-hour peaks at 8 am and 5 pm respectively, in line with the overall
profile. The medium-sized vehicles have both an earlier morning and an earlier
evening rush-hour peak, at 7 am and 4 pm respectively. The large vehicle
category has a clear morning rush-hour peak around 7 am and displays a more
distributed driving behavior during the remainder of the day. A remarkable
feature is the decrease in the relative number of medium-sized and large
vehicles detected at 8 am. This may be caused by drivers of medium-sized and
large vehicles deliberately avoiding the morning rush-hour peak of the small
Figure 28 Dutch traffic: normalized number of vehicles for three length categories
In Figure 28 the normalized number of vehicles is broken down by class of
vehicle length, after correcting for missing data. Small (<= 5.6 metres),
medium-sized (> 5.6 and <= 12.2 metres) and large vehicles (> 12.2 metres)
are shown in black, dark grey and grey, respectively.
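The length classes given above can be turned into a small classification and normalisation sketch. The thresholds (5.6 and 12.2 metres) come from the text; everything else is an assumption made for illustration: the input layout of (hour, length) pairs, the function names, and the choice to normalise each class by its own daily total. The correction for missing data mentioned above is not reproduced.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def length_class(length_m: float) -> str:
    """Assign a vehicle to a length class using the thresholds in the text."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


def normalized_hourly_profile(
    detections: Iterable[Tuple[int, float]]
) -> Dict[str, Dict[int, float]]:
    """For each class, return the share of that class's vehicles seen per hour.

    `detections` yields (hour_of_day, vehicle_length_in_metres) pairs.
    """
    counts: Dict[str, Counter] = {
        "small": Counter(), "medium": Counter(), "large": Counter()
    }
    for hour, length_m in detections:
        counts[length_class(length_m)][hour] += 1
    profile: Dict[str, Dict[int, float]] = {}
    for cls, by_hour in counts.items():
        total = sum(by_hour.values())
        # Normalise by the class total so that classes of different size are comparable.
        profile[cls] = (
            {hour: n / total for hour, n in sorted(by_hour.items())} if total else {}
        )
    return profile


# Hypothetical loop detections: (hour, length in metres).
sample = [(8, 4.2), (8, 4.5), (17, 4.0), (7, 9.0), (7, 16.5)]
print(normalized_hourly_profile(sample))
```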
4.3.2 Colombia
The National Roads Institute of Colombia uses GPS data to improve traffic
circulation and to serve as input for transport statistics. With this method, cars
do not have to stop at toll stations: an electronic tracking device installed in
the vehicle is read when it enters the toll.
The tracking device also contains all the information concerning the vehicle,
which complements that of the National Single Transit Register (Registro
Único Nacional de Tránsito48), an information system for managing
centralized and validated information regarding cars, drivers, traffic
accidents, traffic insurance and public transport companies.
This new method has already enhanced control of traffic flows and has led to
the strengthening of transport statistics.
"RUNT Colombia." RUNT / Colombia. Web. December 2013.
CBS then manually analysed a sample of the tweets. The manual
classification showed that the majority of the non-hashtag messages belonged
to the "other" group (51%); these kinds of messages are referred to as
"pointless babble" in some studies. Apart from these messages, the
non-hashtag tweets in the sample were predominantly found to be
related to the themes Spare time, Sport, Media and Work. The results of this
first study reveal that topics of potential interest for official
statistics are discussed on Twitter. Topics for which Twitter messages could
provide information from an official statistics point of view are mainly those
related to work and politics.
Another potential use of social media messages is sentiment analysis [Java
2007]. CBS started an experiment in which access to over 1.6 billion messages
written in Dutch, from a large number of social media sites, was obtained
through the Coosto49 infrastructure. Messages were sourced from the largest
social media sites (Twitter, Facebook, Hyves, Google+, and LinkedIn) but also
from numerous public blogs and forums. Researchers used June 2010 as the
starting date and August 2012 as the end date. With a query language and a web
interface, messages were selected from the database. The sentiment of each
message was automatically determined by counting the number of positive and
negative words, following the approach described in Golder and Macy
[Golder 2011], and messages were classified as positive, negative or neutral.
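The word-counting classification described above can be sketched in a few lines. The sketch rests on assumptions: the two tiny Dutch word lists are placeholders rather than the lexicon actually used, and the tie rule (equal counts give "neutral") is one plausible reading of the description, not a documented detail of the CBS experiment.

```python
import re
from typing import Set

# Hypothetical miniature lexicons; a real study would use a full sentiment lexicon.
POSITIVE_WORDS: Set[str] = {"goed", "mooi", "blij"}
NEGATIVE_WORDS: Set[str] = {"slecht", "boos", "verdrietig"}


def classify_sentiment(
    message: str,
    positive: Set[str] = POSITIVE_WORDS,
    negative: Set[str] = NEGATIVE_WORDS,
) -> str:
    """Classify a message by comparing counts of positive and negative words."""
    words = re.findall(r"\w+", message.lower())
    n_positive = sum(word in positive for word in words)
    n_negative = sum(word in negative for word in words)
    if n_positive > n_negative:
        return "positive"
    if n_negative > n_positive:
        return "negative"
    return "neutral"  # no sentiment words, or as many positive as negative


for text in ["Wat een mooi en goed idee", "Slecht weer vandaag", "Ik ga naar huis"]:
    print(text, "->", classify_sentiment(text))
```

Aggregating such labels by month (for example, the share of positive minus the share of negative messages) would give a series that can be compared with survey-based indicators.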
CBS tried to link the sentiment in social media with consumer sentiment in
the Netherlands. They started by testing a wide range of words presumed to be
correlated with consumer sentiment, such as "buy" and "mortgage". This proved
very difficult, because some words were hardly used and others showed no clear
or stable dependence. Then they tried another approach: using very general
terms. Interestingly, this general approach worked quite well.
Queries returned very large numbers of messages for the period studied: around
600 million for the Dutch articles and 1.2 billion for the 10 most frequently
used Dutch words. The overall sentiment of these messages (sometimes called
"the mood of the nation") was analysed. The monthly sentiment for the period
June 2010 - August 2012 derived from Dutch social media messages was
found to correlate very strongly (0.83) with the officially determined monthly
"Online Radar." Coosto. Web. December 2013.
Dutch consumer confidence50 and with the sentiment for the attitude towards
the economic climate (0.88). Both official indicators are based on a sample
survey in the Netherlands in which 1,500 people are interviewed each month.
Figure 29 Comparison between sentiment analysis from social media and consumer confidence
"Consumer Confidence Survey." CBS. Web. December 2013.
4.5 Mobile phones data
4.5.1 Estonia
Since 2009 the Central Bank of Estonia has been using state-level inbound
and outbound tourism statistics (trips, nights spent) based on mobile
positioning data and calibrated with official accommodation and travel
statistics [Ahas 2008]. The monthly data flow is used in the calculation of the
national balance of payments. The need arose when border surveys were
cancelled due to financial cutbacks.
Using mobile positioning data in statistics has several positive aspects, such
as speed of data collection, digital format of data, large sample and high
penetration of phones in most societies. Mobile data also has several
shortcomings that we have to keep in mind when interpreting the results. One
of the weaknesses is that we do not know the exact motivations and relations
lying behind those visits. The most important question is related to sampling:
often mobile phones are not used by lower income groups in foreign
countries due to roaming costs. Calling behavior can also be connected with
cultural differences, such as calling regulations and traditions.
Another problem that arises when using mobile positioning data is its
quantitative structure: we know the locations of calls (dots), but we do not
know who is really making the calls, what kind of visit he/she is on, and what
kind of transportation he/she is using. The huge amount of quantitative data
also poses a problem for data processing and cleaning; the databases are too
large to be managed using traditional software and data preparation options.
The following generic issues for mobile data usage in official statistics are
reported in [Karlberg 2013]:
Privacy concerns: although there has been a cultural shift, with people
being increasingly willing to share, or even actively disseminate
(Facebook) their personal data, large-scale provision of mobile
positioning data to government agencies could be perceived as an
invasion of privacy;
Data protection legislation: there are a number of EU-level
instruments with different aims. The production of official statistics
could possibly be considered as a statutory basis for which processing
of personal data (such as mobile positioning data) could take place.
Moreover, different European countries have different national
transpositions of the EU directives;
Data provider reluctance: while mobile network operators have an
interest in this initiative, there are issues concerning (i) the
maintenance of business secrets; (ii) the direct costs of providing data;
(iii) the effects of the data extraction workload on the real-time systems;
(iv) the opportunity cost of giving away their data for free to National
Statistical Institutes;
Technological barriers: as there are a number of providers in each
country, a technical solution for merging data from
different operators into one single analysis data set needs to be found.
This particularly concerns at which step anonymisation should take
Standardisation: technical platforms and data formats may differ
between operators, but the data provided must be standardised in
terms of format, content and frequency (temporal granularity). The
format, content and attributes of the data should be stable over time.
The algorithms used should also be the same across operators, and the issue
of sampling and representativity needs to be tackled;
Provision frequency: the frequency with which operators transmit data
to NSIs (near-real time, daily, weekly, monthly etc.) should be defined.
Scalability and speed of processing are also an issue: the processing
time should be independent of the size of the operator.
4.5.2 New Zealand
Figure 30 New Zealand: usage of mobile phones for dates close to the quake
4.6 Data about Information Society
4.6.1 Eurostat
4.6.2 Italy
4.7 Considerations on first examples of Big Data usage in
Official Statistics
Here we list some considerations derived from first examples of Big Data
usage by statistical organizations:
Automated or assisted data collection - Following first experiences in data
collection from the internet, it is useful to distinguish between (i) automated
data collection, i.e. collection of data from internet sites with many similar
items approached with internet robots that run without user interaction and
(ii) robot-assisted data collection for collecting data from internet data
sources with only a few items. For the second category, it's important to assist
the data collector to check for changes in data on internet sites. Both
automated and robot-assisted data collection have proven to be viable options
for official statistics. Automated data collection can result in more detailed
data compared to data collected in traditional ways, which may be used to
validate our work, to improve efficiency or to reduce response burden. Also,
these kinds of collection methods may be used to study phenomena in a
completely new way. Robot-assisted data collection appeared to be useful to
collect prices from many different internet sites in an efficient way.
Netiquette - When statistical institutions use the internet to collect data they
have to respect internet etiquette (netiquette): many sites include a file named
robots.txt in their root to specify whether and how crawlers should treat that
site, and usually all statistical institutions respect this robot exclusion
protocol. Also, in order not to negatively influence site performance, the
standard suggestion is to configure robots to behave politely, respecting a
commonly accepted waiting time of one second between requests. In addition, to
operate as transparently as possible, robots should identify themselves as
belonging to an official statistics agency via the user-agent string.
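The three netiquette rules above (respect robots.txt, wait between requests, identify the robot) can be sketched with Python's standard `urllib.robotparser` module. This is an illustrative outline, not the robot framework used by any statistical office: the user-agent string, the example URLs and the `fetch` callable are hypothetical, and the actual HTTP request is left to the caller.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.robotparser import RobotFileParser

# Hypothetical identification string for the statistical office's robot.
USER_AGENT = "ExampleStatisticsOfficeBot/0.1 (+https://stats.example.org/robot)"
DEFAULT_DELAY_SECONDS = 1.0  # the commonly accepted waiting time from the text


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser) -> float:
    """Use the site's Crawl-delay if it asks for more than the default second."""
    declared = parser.crawl_delay(USER_AGENT)
    if declared is None:
        return DEFAULT_DELAY_SECONDS
    return max(DEFAULT_DELAY_SECONDS, float(declared))


def crawl(
    urls: Iterable[str],
    parser: RobotFileParser,
    fetch: Callable[[str, Dict[str, str]], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch only permitted URLs, identifying the robot and pausing in between."""
    delay = polite_delay(parser)
    pages: Dict[str, str] = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # the robot exclusion protocol forbids this page
        pages[url] = fetch(url, {"User-Agent": USER_AGENT})
        sleep(delay)
    return pages


# Example with an inline robots.txt (no network access needed).
rules = build_parser("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
print(rules.can_fetch(USER_AGENT, "https://shop.example.org/products/1"))
print(rules.can_fetch(USER_AGENT, "https://shop.example.org/private/x"))
```

Passing `fetch` and `sleep` as parameters keeps the politeness logic separate from the HTTP library, so the same loop can be exercised without touching a real site.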
Communicating with site owners - One suggestion is to study the
possibilities of internet sources for a while first, and then to start
communicating with website owners, because they know their data better
than anyone else. Many website owners may offer to send the data directly
from their back-office system. If possible, it is better to opt for the direct
connection rather than the robot solution, as it is expected to be more stable
and may even contain more interesting data such as numbers of items sold (as
in scanner data).
Another reason to communicate with data owners is that they may already
have an API (application programming interface) available to give
partners access to their data. In some cases in the Netherlands, owners opened
their API to Statistics Netherlands, which started using it in combination
with the generic robot framework to access the data.
Website identification - Another important issue is the identification of the
websites to use for collecting data. For instance, to collect data for the
Consumer Price Index you have to know which sites offer reliable
data, which sites offer original content and which replicate the
content of others, how easy it is to read the data, which variables are
available and how comparable they are across different sites. In addition, you
have to know how the volume of data grows and how volatile the data are.
This is typical of internet data collection. Unlike more traditional data
sources (administrative sources, questionnaires), where data characteristics
are known by the delivering organisation or, in the case of questionnaires,
controlled by the statistics office, statistics offices do not control internet
data. It is more like observational data, where you have to work with the data
you can get.
Legal framework - Additional legal steps are needed to enable the
production of official statistics using big data. The current legislative
framework for statistics in many countries does not cover access and use of
big data, both within government and from private sector. So it is particularly
difficult to gain access to the big data collected and kept by other parties.
Furthermore, a privacy framework is needed that sets the ground rules for
how big data sets can be combined, protected, shared, exposed, analysed and
retained. This would address the significant issue of public trust in the
appropriate use by government of the personal data of individuals. It is
important to maintain public trust: individuals must be sure that their
personal information will be well protected. For example, in the area of
mobile telephone location data, even if identification is suppressed, people
will still be highly concerned about the transfer of such information from the
mobile providers to other parties like NSIs. Similarly, mobile device
providers need guarantees that privacy rights will not be violated when they
turn over their data to the Government.
Errors management - Big Data errors can occur at various stages. Usually
when statistical institutions manage surveys, they apply the Total Survey
Error framework [Biemer 2012]: this methodology could be reviewed to determine
how it could be applied to Big Data.
Some types of errors could be source-specific; others would potentially apply
to all sources (i.e. construct validity, coverage error, measurement error, error
due to imputation, non-response error). Sampling error would apply to the
specific cases where sampling techniques are used. A few examples of the
types of errors are given below, distinguished according to the different big data
Human beings: measurement error (mode effect, respondent related
factors involving cognitive aspects and/or the will to declare the true
Information system: lack of consistency in the construct used in
different files;
Machines/sensors: measurement errors caused by malfunction, misuse,
Data cleaning - when processing Big Data, the processing steps should
include a preliminary reception function, where data are first verified and
pre-treated, followed by a more in-depth analysis where erroneous data,
outliers, and missing values are flagged to be processed.
All types of big data sources can potentially suffer from partial non-response
(missing values for specific variables). For data cleaning, the following
points should therefore be considered:
Knowledge about the data and the associated metadata is a key factor in
the development of efficient processing methods;
Given the overall size of the data, outliers may not be as influential
as in traditional statistical processing;
While imputation methods are well known for numerical data in
traditional statistics, little experience is yet available in imputing text
strings and, moreover, other types of unstructured data.
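The in-depth step described above, in which erroneous data, outliers and missing values are flagged for later processing, can be sketched as follows. The source does not prescribe a flagging rule, so the choice made here is an assumption: a robust modified z-score based on the median absolute deviation (MAD), with the conventional constant 0.6745 and cut-off 3.5. The function name and the example readings are hypothetical.

```python
from statistics import median
from typing import List, Optional


def flag_records(values: List[Optional[float]], threshold: float = 3.5) -> List[str]:
    """Label each value as 'missing', 'outlier' or 'ok'.

    Outliers are detected with the modified z-score
    0.6745 * |x - median| / MAD, which is not distorted by the
    extreme values it is trying to find.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return ["missing"] * len(values)
    centre = median(observed)
    mad = median([abs(v - centre) for v in observed])
    flags: List[str] = []
    for v in values:
        if v is None:
            flags.append("missing")
        elif mad > 0 and 0.6745 * abs(v - centre) / mad > threshold:
            flags.append("outlier")
        else:
            flags.append("ok")
    return flags


# Hypothetical sensor readings with one gap and one implausible value.
readings = [10, 11, None, 9, 10, 500]
print(list(zip(readings, flag_records(readings))))
```

Flagging rather than deleting keeps the decision on how to treat each record (imputation, exclusion, manual review) for a later processing step.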
Unstructured data - The huge increase in the availability of unstructured
textual data requires statistical institutes to increase investments in tools able
to analyze textual data coming from the web. These tools, ranging from web
scraping to textual analysis and text mining, should become part of the
standard toolbox of statisticians and data analysts.
Chapter 5: International cooperation on Big Data in
Official Statistics
In this final chapter the initiatives taken on Big Data by international
bodies and the most advanced national statistical offices are described.
Special attention is given to the HLG project, which aims to create a shared
environment in which to experiment in a coordinated way with new methods and
new software tools on large statistical archives. Then we list suggestions for
the steps to be followed when a statistical organization starts dealing with
Big Data. Finally we indicate some recommendations, which we believe
would be important to follow in the field of international work on Big Data.
"UNECE Statistics Wikis." High-Level Group for the Modernisation of Statistical Production and
Services. Web. December 2013.
"Conference of European Statisticians (CES) - UNECE." Conference of European Statisticians (CES)
- UNECE. Web. December 2013.
In 2011 the HLG released its strategy to implement the vision for
modernizing official statistics [59], which was endorsed by the Conference of
European Statisticians (CES) in June 2011. In this document the HLG addresses
the issue of the so-called data deluge: "The main theme of the vision was
that statistical organizations are confronted with accelerating change in
society and the way that data are produced and used within the information
industry. Official statistics faces all of the opportunities and threats that
accompany a data deluge."
In its vision the HLG then analyses the key factors that define the role of
official statistics. They include Quality, Trust, User needs, Strategic
partnerships, Common standards and others. In the chapter dedicated to
"Rationalising processes" there is a proposal: "(d) Develop new
methodologies to reflect the changes in data acquisition and the dramatic
increase of the volume of data available, for example, on topics such as noise
and error reduction in large data sets, pattern recognition and other
methodological tools appropriate for 'Big Data'".
[59] "UNECE Statistics Wikis." HLG Strategy. Web. December 2013.
Following this proposal, in March 2013 the HLG published the paper "What
does Big Data mean for Official Statistics?" [60]. The paper opens with this
sentence: "this paper will seek to address two fundamental questions, i.e. the
What and the How: What subset of Big Data should National Statistical
Organisations (NSOs) focus on given the role of official statistics, and How
can NSOs use Big Data and overcome the challenges it presents?"
An important initial consideration compares the potential of Big Data to
produce relevant and timely statistics with that of the traditional sources of
official statistics. In the past, official statistics were based on survey data
collections and on administrative data from government programs. With the
great change brought on by Big Data, however, most data are now available on
the Web or through private companies, so the private sector, using Big Data,
can produce statistics that are timelier and more relevant than official ones.
On the other hand, National Statistical Offices have long experience in
managing large amounts of data and in addressing their accuracy, consistency
and interpretability. Integrating Big Data sources into official statistics
processes can be the best way for NSOs to protect the role of official
statistics while working on relevance and timeliness.
The paper contains an analysis of Big Data definitions, a list of the
challenges posed to official statistics and a set of examples of possible uses
of Big Data in official statistics. It closes with a paragraph of
recommendations:
- During the next two years there is a need to identify a few pilot
projects that will serve as proof of concept with the participation of a
few countries collaborating. The results could then be presented to the
- Statistical organisations are encouraged to address formally Big Data
issues (methodological and technological challenges) in their work
programmes by undertaking research and pilot projects in selected
[60] "UNECE Statistics Wikis." What Does Big Data Mean for Official Statistics? Web. December 2013.
- New exploration and analysis methods are required: Visualization
methods, Text mining, and High Performance Computing;
- Successful use cases on Big Data analytics tools and solutions should
be brought to the attention of the international statistical community;
- National examples of collaboration of NSOs with private data source
owners in this field, addressing some of the issues of granting NSOs
privileged access to private sourced Big Data should be part of the
priority actions;
- To use Big Data, statisticians are needed with an analytical mind-set, an
affinity for IT (e.g. programming skills) and a determination to extract
valuable knowledge from data. These so-called "data scientists" can
be derived from various scientific disciplines;
- In the short to medium terms, NSOs should develop the necessary
internal analytical capability through specialised training and
international collaboration;
- There's a need for the drafting of guidelines / principles for the
effective use of Big Data for purposes of official statistics;
- HLG should ensure that the outputs of several future activities on Big
Data (dedicated session, workshops and seminars) are effectively
coordinated and communicated at the strategic level.
"Meeting on the Management of Statistical Information Systems." - UNECE. Web. December 2013.
- Based on experiments so far, it is likely that Big Data will require new
methods and infrastructures, and new ways of defining and measuring
- A classification of types of Big Data is needed;
- This is an ideal time to start collaborating on Big Data, as we don't yet
have systems in place. An architecture for Big Data is needed, as well
as collaboration with the wider information industry;
- We have common issues in the use of Big Data, so we need
mechanisms to work together to find solutions. This should be a priority
issue for the HLG;
- It is important to take a multidisciplinary approach to Big Data,
currently different groups are all looking at this issue from their own
- Agreeing a common classification of the different types of Big Data
should be an early priority;
- We need a concrete project to produce specific statistics from Big Data,
and to find real solutions;
- A virtual task team should be set up to define the issues and formulate a
- Processing time needs to be improved. Taking over a year to produce
census results is no longer acceptable. We should aim for real-time
processing, as is increasingly the case in the commercial sector;
- The fact that Big Data are often not stored permanently is an issue
because resulting statistics may not be reproducible, which has
implications for quality management;
- The size of Big Data should not be a major issue as storage and
processing capabilities are constantly increasing;
- We need to identify and develop the knowledge and skills necessary to
use Big Data. New skill-sets will be needed;
- Legal and licensing issues need to be addressed, particularly regarding
consistency of data supply;
- Big Data provides an opportunity for the CSPA project to deliver useful
outputs where nothing currently exists.
Interestingly, many speakers emphasized the close link between the theme of
Big Data and standardization initiatives on architectures based on the
so-called plug-and-play model (CSPA [62]).
In May 2013 a temporary task team [63] was set up, composed of members
from thirteen countries or international organizations. The task team,
working virtually through teleconferences and sharing documents on the
wiki, started to work out the key issues with using Big Data for
official statistics, identify priority actions and formulate a project proposal.
As preliminary activities, the temporary task team undertook two tasks.
[62] "UNECE Statistics Wikis." Common Statistical Production Architecture Home. Web. December 2013.
[63] "UNECE Statistics Wikis." Members of the Task Team. Web. December 2013.
The first was the formulation of a classification scheme for Big Data sources
(see Chapter 2) and the identification of the attributes of these sources
relevant to their use in the production of official statistics.
The second was the development of an inventory [64] containing structured and
searchable information about actual and planned use of Big Data in statistical
organizations. The inventory aims to include the key resources that will be
most valuable for the official statistics community. Where a resource already
exists somewhere on the Internet, the inventory holds links to resources and
documentation, to avoid duplication and reduce problems of version control.
[64] "UNECE Statistics Wikis." Big Data Inventory. Web. December 2013.
The "owner" of the inventory is the High-Level Group on the Modernisation
of Statistical Production and Services, on behalf of the international statistical
For each Big Data source the inventory collects the following information:
- Country/Organization name
- Contact person for the case study (name, email, telephone)
- Type of Big Data used
- Project description
- National or international scope of data source
- Public/private source
- Data access framework
- Payment for data (Yes/No/Fees)
- Data access
- Statistical domain and use of data
- Degree of progress in use of the Big Data source
- Tools and methods for processing
- Privacy and confidentiality issues
- Links and attachments
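To make the idea of "structured and searchable information" concrete, the
fields listed above could be represented as a simple record type. The sketch
below is only an illustration in Python: the class name `InventoryEntry`, the
attribute names, the `search` helper and the sample entry are hypothetical and
do not describe how the UNECE inventory is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InventoryEntry:
    """One inventory record; attribute names are hypothetical renderings
    of the fields listed in the text."""
    organization: str
    contact_person: str
    big_data_type: str
    project_description: str
    scope: str                  # "national" or "international"
    source_ownership: str       # "public" or "private"
    data_access_framework: str
    payment_for_data: str       # "Yes", "No" or "Fees"
    data_access: str
    statistical_domain: str
    progress: str
    tools_and_methods: str
    privacy_issues: str
    links: List[str] = field(default_factory=list)


def search(entries, **criteria):
    """Return the entries whose attributes equal all the given criteria."""
    return [
        e for e in entries
        if all(getattr(e, name) == value for name, value in criteria.items())
    ]


# Invented placeholder entry, used only to show how a search would work.
entries = [
    InventoryEntry(
        organization="Example NSO",
        contact_person="(name, email, telephone)",
        big_data_type="Transactional data",
        project_description="Placeholder description",
        scope="national",
        source_ownership="private",
        data_access_framework="Placeholder agreement",
        payment_for_data="No",
        data_access="Placeholder",
        statistical_domain="Price statistics",
        progress="Planned",
        tools_and_methods="Placeholder",
        privacy_issues="Placeholder",
    )
]
private_sources = search(entries, source_ownership="private")
```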
The project proposed to HLG by the task team has three main objectives:
- To provide guidance for statistical organizations to identify the main
possibilities offered by Big Data and to act upon the main strategic and
methodological issues that Big Data poses for the official statistics
- To demonstrate the feasibility of efficient production of official
statistics using Big Data sources, and the possibility to replicate these
approaches across different national contexts;
- To facilitate the sharing across organizations of knowledge, expertise,
tools and methods for the production of statistics using Big Data
The work for the project will start from the major challenges listed in the
HLG paper "What does Big Data mean for Official Statistics?" (see 4.1), i.e.
legislative, privacy, financial, management, methodological and
technological. The task team identified a variety of issues and
expressed them in the form of questions:
- How can we assess the suitability of Big Data sources for the
production of official statistics?
- How can we actually benefit from the increased timeliness offered by
many Big Data sources?
- Can we identify best practices for the major methodological issues
relating to Big Data? E.g.:
  o Methods for reducing data volume
  o Methods for noise reduction
  o Methods for ensuring confidentiality and avoiding disclosure
  o Methods for obtaining information on statistical concepts (text
    mining, classification methods, etc.)
  o Methods for determination of population characteristics, e.g.
    correlating words used by social media users with certain
    demographic characteristics
- Should Big Data be treated as homogeneous, or do they require
different treatment according to the role they play in the production of
official statistics?
  o Experimental uses
  o Complementing existing statistics e.g. benchmarking and validity
  o Supplementing existing sources, permitting the creation of entirely
    new statistics
  o Replacing existing sources and methods
- Are there 'quick wins', applicable beyond Big Data, such as data
storage, technology, advanced analytics, methods and models which
could transform our thinking in relation to the production of official
statistics more generally?
- How should statistical organizations react to the novel idea that in a Big
Data world there are no 'bad' data (they all tell us something)?
- How can organizations mitigate the risk of a data source ceasing to
exist, or changing substantially, when it is outside the control of the
- How can Big Data be combined with survey data? How can we manage
the transition from statistical data production based on surveys to
production based substantially on Big Data?
- Do we need a research question before exploring a Big Data source, or
should we just experiment to see what is possible?
- What becomes of the time series in a world where data sources and uses
may become more transient?
- How will institutional structures need to change in order to support the
use of Big Data and ensure its quality and the quality of resulting
The output will take the form of recommendations, good practices and
guidelines, developed through broad consultation of experts throughout the
official statistics community, and coordinated by expert task teams. The
material will be collated in a Web environment such as a wiki, allowing the
guidelines to function as a 'living document' that can be updated in a timely
way.
The sandbox will form the practical element of the project, aimed at
proving concepts in two related strands:
(a) Statistics:
- the possibility of producing reliable statistics from novel sources,
including the ability to produce statistics which correspond with
existing 'mainstream' products, such as price statistics;
- the cross-country applicability of new analytical techniques and
sources, such as the analysis of data from social networking websites.
(b) Tools:
- the efficiency of various software tools for large-scale processing and
- the applicability of the Common Statistical Production Architecture
(CSPA) to the production of statistics using Big Data sources.
A web-accessible environment for the storage and analysis of large-scale
datasets will be created and used as a 'sandbox' for collaboration across
participating institutions. Some internationally-relevant datasets will be
obtained and installed in this environment, with the goal of exploring the
tools and methods needed for statistical production and the feasibility of
producing Big Data-derived statistics. Simple configurations with tools and
data will, whenever possible, be released in 'virtual machines' that partners
will be able to download in order to test them within their own technical
More details on the Sandbox subproject are given in 5.4 below.
In such projects it is very important to ensure that the conclusions reached
are shared broadly throughout the statistical world and beyond.
This will be done through a variety of means, including:
a) establishing and maintaining an online infrastructure for documentation
and information-sharing on the UNECE wiki, including detailed
documentation from work packages 1 and 2;
b) preparation of electronic demonstrations of tools and results, for
example in the form of presentations and videos which can be
disseminated widely. Identification of existing electronic resources and
online training materials is also included in this strand;
c) a workshop in which the results of the Sandbox subproject will be
presented to members of the various expert groups involved in the
HLG's modernisation programme.
Definition of success
Overall, this project will be successful if it results in an improved
understanding within the international statistical community of the
opportunities and issues associated with using Big Data for the production of
official statistics. More detailed success criteria are:
- A consistent international view of Big Data opportunities,
challenges and solutions, documented and released through a public
web site;
- Shared recommendations on appropriate tools, methods and
environments for processing and analysing different types of Big Data,
and a report on the feasibility of establishing a shared approach for
using Big Data sources;
- An exchange of knowledge and ideas between interested organizations and
a set of standard training materials.
5.4 Sandbox subproject
The Sandbox will evaluate the feasibility of the following propositions, and it
will demonstrate and document how the actions would be achievable in
statistical organizations:
1. 'Big Data' sources can be obtained (in a stable and replicable way),
installed and manipulated with relative ease and efficiency on the chosen
platform, within technological and financial constraints that are realistic
reflections of the situation of national statistical offices;
2. The chosen sources can be processed to produce statistics which conform
to the usual quality criteria used to assess official statistics, and which are
comparable across countries;
3. The resulting statistics correspond in a systematic and predictable way
with existing mainstream products, such as price statistics, household
budget indicators, etc.;
4. The chosen platforms, tools, methods and datasets can be used in similar
ways to produce analogous statistics in different countries;
5. The different participating countries can share tools, methods, datasets and
results efficiently, operating on the principles established in the Common
Statistical Production Architecture.
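Proposition 3 asks whether statistics derived from Big Data correspond "in a
systematic and predictable way" with existing mainstream products. One simple
first check, among many possible ones, is the correlation between the two
series over the same periods. The Python sketch below is an assumption-laden
illustration: the function name `pearson` and both index series are invented
for the example and are not results of the Sandbox project.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


# Invented monthly index values: an official price index and a
# hypothetical index computed from a Big Data source.
official_index = [100.0, 100.4, 100.9, 101.1, 101.8, 102.3]
big_data_index = [100.0, 100.6, 101.0, 101.5, 102.1, 102.9]
r = pearson(official_index, big_data_index)
```

A high correlation alone does not establish that the two series measure the
same concept; it only indicates that a more detailed comparison is worthwhile.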
While the first objective is to examine these propositions (the 'proof of
concept'), a second objective is to then use these findings to produce a
general model for achieving the goal of producing statistics from Big Data,
and to communicate this effectively to statistical organizations. So, all
processes, findings, lessons learned and results will be recorded for
dissemination and training activities. In particular, experiences and best
practices for obtaining data will be detailed for the benefit of other
The task team considered a wide range of alternative possibilities for tools,
datasets and statistics and assessed them against various criteria. These
included the following:
- Whether or not the tools are open source
- Ease of use for statistical office staff
- Possibilities for interoperability and integration with other tools
- Ease of integration into existing statistical production architectures
- Cost and licences
- Availability of documentation
- Availability of online tutorials and support
- Training requirements, including whether or not a vendor-specific
language has to be learned
- The existence of an active and knowledgeable user community
- Ease of locating and obtaining data from providers
- Cost of obtaining data (if any)
- Stability (or expected stability) over time
- Availability of data that can be used by several countries, or data whose
format is at least broadly homogeneous across countries
- The existence of identification variables which enable the merging of
Big Data sets with traditional statistical data sources
- At least one statistic that corresponds closely and in a predictable way
with a mainstream statistic produced by most statistical organizations
- One or more short term indicators of specific variables or
cross-sectional statistics which permit the study of the detailed
relationships between variables
- One or more statistics that represent a new, non-traditional output (i.e.
something that has not generally been measured by producers of official
statistics, be it a novel social phenomenon or an existing one where the
need to measure it has only recently arisen)
The following are the recommendations from the task team to be followed in
setting up the "sandbox":
1. Processing environment: Hortonworks [65] Hadoop distribution to be
installed on a cluster provided by a volunteering statistical organization;
2. Processing tools/software: the Pentaho Business Analytics [66] Suite
Enterprise Edition will be deployed under a free trial license obtained
for the purpose of the project. Pentaho Business Analytics Suite
provides a unified, interactive, and visual environment for data
integration, data analysis, data mining, visualization, and other
capabilities. Pentaho's Data Integration [67] component is fully compatible
with Hortonworks Hadoop and allows 'drag and drop' development of
MapReduce jobs. In addition to the Pentaho suite, tools such as R [68] and
RHadoop [69] will be installed;
3. Datasets and statistics to be produced (or feasibility of production to be
demonstrated): one or more from each of the categories below have to
be installed in the sandbox and experimented with for the creation of
appropriate corresponding statistics:
a. Transactional sources (from banks/telecommunications
providers/retail outlets) to enable the recreation of standard official
statistics in the easiest possible way, minimizing as far as possible
potential obstacles to access;
b. Sensor data sources;
c. Social network sources, image or video-based sources, other
less-explored sources (to enable the creation of 'new' statistics);
4. Human resource requirements: a task team will be identified at the
outset of the project, composed of experts whose time is volunteered in
kind by their respective organizations for the duration of the work
package. The project manager's first task will be to identify the number
of members required, the requisite skills and the amount of time to be
committed by task team members to enable the work to progress.
[65] "We Do Hadoop. ^on Windows Too!" Hortonworks. Web. December 2013.
[66] "Business Analytics from Pentaho - Leader in Business Analytics." Pentaho. Web. December 2013.
[67] "Data Integration | Pentaho Business Analytics Platform." Pentaho. Web. December 2013.
[68] "The R Project for Statistical Computing." The R Project for Statistical Computing. Web. December 2013.
[69] "Public RevolutionAnalytics / RHadoop." GitHub. Web. December 2013.
The sandbox will be hosted at the Irish Centre for High-End Computing
(ICHEC [70]), whose mission is to provide High-Performance Computing
resources, support, education and training for researchers in third-level
institutions. ICHEC is a partner in the European High Performance
Computing (HPC) service PRACE [71] (Partnership for Advanced Computing in
Europe) [Prace 2012].
ICHEC will assist the task team to implement the Big Data environment for
the testing and evaluation of Hadoop work-flows and associated data analysis
application software.
The hardware on which the sandbox system is to be based is a High
Performance Computing Linux cluster hosted in the National University of
Ireland, Galway. It is composed of 60 compute nodes, each of which has two
Xeon quad-core processors, 48GB of RAM and a 1TB local disk. Each node
is connected to two networks: one for accessing the shared Lustre
filesystem and for high-performance communications, and a Gigabit
Ethernet network for management tasks. In addition, a 20TB shared
filesystem is available to all nodes.
ICHEC will dedicate 20 compute nodes to the sandbox project, enabling a
Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS
distributed storage. In addition, the Hortonworks Data Platform Hadoop
distribution will be installed on the cluster, as well as the R and Pentaho
application suites. User accounts for up to 30 participants of the sandbox
project will be created to allow remote access to the system.
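The cluster totals quoted above follow directly from the per-node
specification (two quad-core processors, 48GB of RAM and a 1TB local disk per
node, 20 nodes dedicated to the sandbox). The short Python sketch below
reproduces the arithmetic. The replication factor is an assumption added for
illustration: 3 is the usual HDFS default, and the text does not state the
value configured for the sandbox, so the "usable" figure is indicative only.

```python
# Per-node figures taken from the hardware description in the text.
NODES = 20
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 4
RAM_GB_PER_NODE = 48
DISK_TB_PER_NODE = 1

total_cores = NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR  # 160 cores
total_ram_gb = NODES * RAM_GB_PER_NODE                           # 960 GB
raw_hdfs_tb = NODES * DISK_TB_PER_NODE                           # 20 TB raw

# Assumption (not stated in the text): each HDFS block is stored 3 times.
ASSUMED_REPLICATION_FACTOR = 3
usable_hdfs_tb = raw_hdfs_tb / ASSUMED_REPLICATION_FACTOR
```

Under that assumption, the 20TB of distributed storage would hold roughly a
third as much unique data, which is worth keeping in mind when selecting
datasets for the sandbox.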
"Ireland's High-Performance Computing Centre." Ireland's High-Performance Computing Centre.
Web. December 2013.
"PRACE Research Infrastructure." PRACE Research Infrastructure. Web. December 2013.
5.5 Recommendations on Big Data coming from
international cooperation
Conclusions and future actions
For methodological issues, other work-groups, coordinated by United Nations
agencies, are working to define a quality framework that gives statisticians
the same assurance of quality for big data that is available today for
traditional data sources.
Finally, for organizational and legal issues, national and international
agencies are defining actions by which statistical data providers will offer
solutions that give a new role to official statistics, a role that will allow
them to navigate safely in the data deluge.
