BIG DATA
A Practical Guide To Transforming The Business of Government
Leadership
Steve Mills (Co-Chair), Senior Vice President and Group Executive, IBM
Steve Lucas (Co-Chair), Global Executive Vice President and General Manager, Database and Technology, SAP
Vice President Global Public Sector, Amazon Web Services
Chief Technology Officer, Science, Technology and Engineering Group, Wyle
In recent years, federal, state and local government agencies have struggled to navigate the tidal wave of sheer volume, variety, and velocity of data that is created within their own enterprise and across the government ecosystem. As this tidal wave has swept across government, “Big Data” has arisen as the new ubiquitous term. Everyone is talking about Big Data, and how it will transform government, both in Washington and beyond the Beltway. Looking past the excitement, however, questions abound. What is Big Data? What capabilities are required to keep up? How do you use Big Data to make intelligent decisions? How will agencies effectively govern and secure huge volumes of information, while protecting privacy and civil liberties? Perhaps most importantly, what value will it really deliver to the government and the citizenry it serves?

In order to answer these questions and to provide guidance to our federal government’s senior policy and decision makers, the TechAmerica Foundation Big Data Commission relied upon its diverse expertise and perspectives, input from government representatives, and previous reports. The Commission’s mandate was to demystify the term “Big Data” by defining its characteristics, describe the key business outcomes it will serve, and provide a framework for policy discussion.

Although there clearly is intense focus on Big Data, there remains a great deal of confusion regarding what the term really means, and more importantly, the value it will provide to government agencies seeking to optimize service outcomes and innovate. This confusion may be due in part to the conversation being driven largely by the information technology community versus the line of business community, and therefore centering primarily on technology. This report approaches Big Data from the perspective of the key mission imperatives government agencies must address, the challenges and the opportunities posed by the explosion in data, and the business and inherent value Big Data can provide. The report breaks the discussion down into five chapters:

1. Big Data Definition & Business/Mission Value
2. Big Data Case Studies
3. Technical Underpinnings
4. The Path Forward: Getting Started
5. Public Policy
• Big Data is a phenomenon defined by the rapid acceleration in the expanding volume
of high velocity, complex, and diverse types of data. Big Data is often defined along
three dimensions -- volume, velocity, and variety.
• Government leaders should strive to understand the “Art of the Possible” enabled by advances in techniques and technologies to manage and exploit Big Data. Examples of use cases and live case studies are critical in understanding the potential of Big Data.
• While Big Data is transformative, the journey towards becoming Big Data “capable” will be iterative and cyclical,
versus revolutionary.
• Successful Big Data initiatives seem to start not with a discussion about technology, but rather with a burning
business or mission requirement that government leaders are unable to address with traditional approaches.
• Successful Big Data initiatives commonly start with a specific and narrowly defined business or mission
requirement, versus a plan to deploy a new and universal technical platform to support perceived future
requirements. This implies not a “build it and they will come” transformative undertaking, but rather a “fit for
purpose” approach.
• Successful initiatives seek to address the initial set of use cases by augmenting current IT investments, but do so
with an eye to leveraging these investments for inevitable expansion to support far wider use cases in subsequent
phases of deployment.
• Once an initial set of business requirements have been identified and defined, the leaders of successful initiatives
then assess the technical requirements, identify gaps in their current capabilities, and then plan the investments to
close those gaps.
• Successful initiatives tend to follow three “Patterns of Deployment” underpinned by the selection of one Big Data
“entry point” that corresponds to one of the key characteristics of Big Data – volume, variety and velocity.
• After completing their initial deployments, government leaders typically expand to adjacent use cases, building
out a more robust and unified set of core technical capabilities. These capabilities include the ability to analyze
streaming data in real time, the use of Hadoop or Hadoop-like technologies to tap huge, distributed data sources,
and the adoption of advanced data warehousing and data mining software.
5. Explore which data assets can be made open and available to the public to help spur innovation outside the agency. Consider leveraging programs like the Innovation Corps offered by the National Science Foundation, or the Start-Up America White House initiative.

5. Provide further guidance and greater collaboration with industry and stakeholders on applying the privacy and data protection practices already in place to current technology and cultural realities.
In recent years, federal, state, and local governments have come to face a tidal wave of change as a result of the drastic increase in the sheer volume, variety, and velocity of data within their own enterprise and across the government ecosystem. For example, in 2011, 1.8 zettabytes of information were created globally, and that amount is expected to double every year. This volume of data is the equivalent of 200 billion two-hour HD movies, which one person could watch for 47 million years straight. The impact of this phenomenon on business and government is immediate and inescapable.
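As a rough back-of-the-envelope check on these figures (the roughly 9 GB assumed per two-hour HD movie is our illustrative assumption, not a number from the report):

\[
\frac{1.8\times10^{21}\ \text{bytes}}{9\times10^{9}\ \text{bytes per movie}} \approx 2\times10^{11}\ \text{movies},
\qquad
\frac{2\times10^{11}\ \text{movies}\times 2\ \text{hours}}{24\times365\ \text{hours per year}} \approx 4.6\times10^{7}\ \text{years},
\]

which is consistent with the 200 billion movies and roughly 47 million years of continuous viewing cited above.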
Because of the Internet and the influx of information from sources embedded throughout the fabric of our government, agencies will continue to struggle with managing large streams of data. Our government has access to a constant barrage of data from sensors, satellites, social media, mobile communications, email, RFID, and enterprise applications. As a result, leaders are faced with capturing, ingesting, analyzing, storing, distributing, and securing the data, and transforming it into meaningful, valuable information.
Since 2000, the amount of information the federal government captures has increased exponentially. In 2009, the U.S. Government produced 848 petabytes of data1 and U.S. healthcare data alone reached 150 exabytes2. Five exabytes (5 x 10^18 bytes) of data would contain all words ever spoken by human beings on earth. At this rate, Big Data for U.S. healthcare will soon reach zettabyte (10^21 bytes) scale and, eventually, yottabyte (10^24 bytes) scale.
Yet the mind-boggling volume of data that the federal government receives makes information overload a fundamental challenge. Within this expansion of data there exists new information that either has not been discoverable, or simply did not exist before. The question is how to effectively capture new insight. Big Data, properly managed, modeled, shared, and transformed, provides an opportunity to extract new insights and make decisions in a way simply not possible before now. Big Data provides the opportunity to transform the business of government by providing greater insight at the point of impact, ultimately better serving the citizenry, society, and the world.
1 Source: IDC, US Bureau of Labor Statistics, McKinsey Global Institute Analysis
2 Roger Foster, “How to Harness Big Data for Improving Public Health,” Government Health IT, April 3, 2012, at http://www.govhealthit.com/news/how-harness-big-data-improving-public-health
• How do I secure and govern it?
• How do I improve cross-organizational information sharing for broader connected intelligence?
• How do I build trust in the data, through greater understanding of provenance and lineage tied back to validated, trusted sources?

“Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.”
Veracity: the quality and provenance of received data. The quality of Big Data may be good, bad, or undefined due to data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations; data-based decisions require traceability and justification.
Although the Big Data challenge is daunting, it is not insurmountable, and the opportunity is compelling. There are many possibilities and approaches to managing and leveraging Big Data to address the mission of government, inclusive of stakeholder requirements. Ultimately, this report provides guidance for a framework to extract, analyze, and use Big Data for effective decision making. This framework serves as the baseline for a continuous feedback process to improve upon the outcomes identified, or to eliminate programs that are not delivering the desired outcomes.
The potential applications of Big Data described below serve to illustrate the “Art of the Possible” in the
potential value that can be derived. These applications are consistent with the recently published Digital
Government Strategy.3 They require a customer focus and the ability to reuse and leverage data in
innovative ways.
Healthcare
The ability to continuously improve quality and efficiency in the delivery of healthcare while reducing
costs remains an elusive goal for care providers and payers, but also represents a significant opportunity
to improve the lives of everyday Americans. As of 2010, national health expenditures represent 17.9%
of gross domestic product, up from 13.8% in 2000.4 Coupled with this rise in expenditures, certain
chronic diseases, such as diabetes, are increasing in prevalence and consuming a greater percentage of
healthcare resources. The management of these diseases and other health-related services profoundly
affects our nation’s well-being.
Big Data can help. The increased use of electronic health records (EHRs) coupled with new analytics
tools presents an opportunity to mine information for the most effective outcomes across large populations.
Using carefully de-identified information, researchers can look for statistically valid trends and provide
assessments based upon true quality of care.
Big Data in health care may involve using sensors in the hospital or home to provide continuous monitoring of key biochemical markers, performing real-time analysis on the data as it streams from individual high-risk patients to a HIPAA-compliant analysis system. The analysis system can alert specific individuals and their chosen health care provider if it detects a health anomaly that requires a visit to the provider or signals a “911” event about to happen. This has the potential to extend and improve the quality of millions of citizens’ lives.
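The report does not prescribe an implementation, but the kind of check such a monitoring system runs can be sketched. The following is a minimal, hypothetical Python sketch: the marker names, normal ranges, window size, and alerting rules are illustrative assumptions, not clinical guidance or any vendor's API.

```python
from collections import deque
from statistics import mean

# Hypothetical normal ranges for two vital signs (illustrative only).
NORMAL_RANGE = {"heart_rate": (60, 100), "spo2": (92, 100)}
WINDOW = 20  # number of recent readings considered per patient and marker


def make_monitor():
    """Return a function that ingests one reading at a time and returns an alert or None."""
    history = {}  # (patient_id, marker) -> recent values

    def ingest(patient_id, marker, value):
        window = history.setdefault((patient_id, marker), deque(maxlen=WINDOW))
        window.append(value)
        low, high = NORMAL_RANGE[marker]
        # Alert immediately if the latest reading is outside the normal range.
        if not low <= value <= high:
            return f"ALERT {patient_id}: {marker}={value} outside {low}-{high}"
        # Early warning if the recent average drifts toward either limit.
        if len(window) == WINDOW:
            avg = mean(window)
            if avg < low * 1.05 or avg > high * 0.95:
                return f"WARN {patient_id}: {marker} trending toward limit (avg={avg:.1f})"
        return None

    return ingest


if __name__ == "__main__":
    monitor = make_monitor()
    stream = [("p1", "heart_rate", v) for v in (72, 75, 74, 78, 120)]
    for reading in stream:
        alert = monitor(*reading)
        if alert:
            print(alert)
```

A production system would of course sit behind a HIPAA-compliant ingestion pipeline and use clinically validated models rather than fixed thresholds.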
3 http://www.whitehouse.gov/sites/default/files/omb/egov/digital-government/digital-government-strategy.pdf
4 http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/tables.pdf
Transportation
Through improved information and autonomous features, Big Data has the potential to transform transportation in many ways.
The nemesis of many American drivers, traffic jams waste energy, contribute to global warming and cost individuals time and
money. Distributed sensors on handheld devices, on vehicles, and on roads can provide real-time traffic information that is
analyzed and shared. This information, coupled with more autonomous features in cars can allow drivers to operate more
safely and with less disruption to traffic flow. This new type of traffic ecosystem, with increasingly connected “intelligent cars,”
has the potential to transform how we use our roadways.5
Education
Big Data can have a profound impact on American education and our competitiveness in the global economy. For example,
through in-depth tracking and analysis of on-line student learning activities – with fine grained analysis down to the level of
mouse clicks – researchers can ascertain how students learn and the approaches that can be used to improve learning.
This analysis can be done across thousands of students rather than through small isolated studies.6 Courses and teaching
approaches, online and traditional, can be modified to reflect the information gleaned from the large scale analysis.
Improper Payments
Big Data can transform improper payment detection and fundamentally change the risk and return perceptions of individuals
that currently submit improper, erroneous or fraudulent claims. For example, a significant challenge confronting the Centers
for Medicare and Medicaid Services (CMS) is managing improper payments under the Medicare Fee-For-Service Program
(FFS). The FFS distributes billions of dollars in estimated improper payments.7 Currently, contractors and employees identify
improper payments by selecting a small sample of claims, requesting medical documentation from the provider who submitted
the claims, and manually reviewing the claims against the medical documentation to verify the providers’ compliance with
Medicare’s policies.
This challenge is an opportunity to explore a use case for applying Big Data technologies and techniques, to perform
unstructured data analytics on medical documents to improve efficiency in mitigating improper payments. Automating
the improper payment process and utilizing Big Data tools, techniques and governance processes would result in greater
improper payment prevention or recovery. Data management and distribution could be achieved through an image
classification workflow solution to classify and route documents. Data analytics and data intelligence would be based
on unstructured document analysis techniques and pattern matching expertise.
The benefit is that the culture of submitting improper payments will be changed. Big Data tools, techniques and governance
processes would increase the prevention and recovery dollar value by evaluating the entire data set and dramatically
increasing the speed of identification and detection of compliance patterns.
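As a simplified illustration of the unstructured-document screening and pattern matching described above, the sketch below flags claim documents for manual review. The rule names, regular expressions, and sample claims are hypothetical and are not CMS policy; real systems would combine many more signals, including image classification and cross-claim analytics.

```python
import re

# Hypothetical screening rules: each pattern flags text that warrants manual review.
REVIEW_RULES = {
    "missing_signature": re.compile(r"\b(unsigned|signature (missing|not on file))\b", re.I),
    "duplicate_billing": re.compile(r"\bduplicate (claim|charge|billing)\b", re.I),
    "undocumented_service": re.compile(r"\bno supporting documentation\b", re.I),
}


def screen_claim(document_text):
    """Return the names of the screening rules this claim document triggers."""
    return [name for name, pattern in REVIEW_RULES.items() if pattern.search(document_text)]


if __name__ == "__main__":
    claims = {
        "C-1001": "Office visit billed twice; duplicate claim submitted for 2012-03-04.",
        "C-1002": "Routine visit, signature on file, supporting documentation attached.",
    }
    for claim_id, text in claims.items():
        hits = screen_claim(text)
        print(claim_id, "route for review" if hits else "pass", hits)
```

The point of the sketch is the workflow change described above: every claim document is screened automatically, and only flagged claims are routed for costly manual review.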
5 http://www.forbes.com/sites/toddwoody/2012/09/19/automakers-on-the-road-to-self-driving-cars/
6 http://www.nytimes.com/2012/03/29/technology/new-us-research-will-aim-at-flood-of-digital-data.html
7 http://www.mcknights.com/gao-medicare-spent-48-billion-on-improper-fee-for-service-payments/article/208592/
Cyber Security
Government agencies face numerous challenges associated with protecting themselves against cyber attacks, such as managing the exponential growth in network-produced data, database performance issues due to the lack of ability to scale to capture this data, and the complexity in developing and applying analytics for fraud to cyber data. Agencies continue to look at delivering innovative cyber analytics and data-intensive computing solutions. Cyber intelligence and other machine-generated data are growing beyond the limits of traditional database and appliance vendors. Therefore, requirements exist for fast data ingestion, data sharing, and collaboration.

Weather and Climate
The ability to better understand changes in the frequency, intensity, and location of weather and climate can benefit millions of citizens and thousands of businesses that rely upon weather, including farmers, tourism, transportation, and insurance companies. Weather and climate-related natural disasters result in tens of billions of dollars in losses every year and affect the lives of millions of citizens. Much progress has been made in understanding and predicting weather, but it is far from perfect. New sensors and analysis techniques hold the promise of developing better long-term climate models and nearer-term weather forecasts.
Case Studies and Use Cases

National Archives and Records Administration (NARA) – Electronic Records Archive
  Underpinning Big Data technologies: Metadata, submission, access, repository, search and taxonomy applications for storage and archival systems
  Big Data metrics: Petabytes; terabytes/sec; semi-structured
  Initial Big Data entry point: Warehouse optimization; distributed information management
  Public/user benefits: Provides the Electronic Records Archive and Online Public Access systems for US records and documentary heritage

TerraEchos – Perimeter Intrusion Detection
  Underpinning Big Data technologies: Streams analytic software; predictive analytics
  Big Data metrics: Terabytes/sec
  Initial Big Data entry point: Streaming and data analytics
  Public/user benefits: Helps organizations protect and monitor critical infrastructure and secure borders

Royal Institute of Technology of Sweden (KTH) – Traffic Pattern Analysis
  Underpinning Big Data technologies: Streams analytic software; predictive analytics
  Big Data metrics: Gigabits/sec
  Initial Big Data entry point: Streaming and data analytics
  Public/user benefits: Improve traffic in metropolitan areas by decreasing congestion and reducing traffic accident injury rates

Vestas Wind Energy – Wind Turbine Placement & Maintenance
  Underpinning Big Data technologies: Apache Hadoop
  Big Data metrics: Petabytes
  Initial Big Data entry point: Streaming and data analytics
  Public/user benefits: Pinpointing the optimal location for wind turbines to maximize power generation and reduce energy cost

University of Ontario Institute of Technology (UOIT) – Medical Monitoring
  Underpinning Big Data technologies: Streams analytic software; predictive analytics; supporting relational database
  Big Data metrics: Petabytes
  Initial Big Data entry point: Streaming and data analytics
  Public/user benefits: Detecting infections in premature infants up to 24 hours before they exhibit symptoms

National Aeronautics and Space Administration (NASA) – Human Space Flight Imagery
  Underpinning Big Data technologies: Metadata, archival, search and taxonomy applications for tape library systems; GOTS
  Big Data metrics: Petabytes; terabytes/sec; semi-structured
  Initial Big Data entry point: Warehouse optimization
  Public/user benefits: Provide industry and the public with some of the most iconic and historic human spaceflight imagery for scientific discovery, education and entertainment

AM Biotechnologies (AM Biotech) – DNA Sequence Analysis for Creating Aptamers
  Underpinning Big Data technologies: Cloud-based HPC genomic applications; transportable data files
  Big Data metrics: Gigabytes; 10 DNA sequences compared
  Initial Big Data entry point: Streaming data & analytics; warehouse optimization; distributed information management
  Public/user benefits: Creation of unique aptamer compounds to develop improved therapeutics for many medical conditions and diseases

National Oceanic and Atmospheric Administration (NOAA) – National Weather Service
  Underpinning Big Data technologies: HPC modeling; data from satellites, ships, aircraft and deployed sensors
  Big Data metrics: Petabytes; terabytes/sec; semi-structured; exaFLOPS; petaFLOPS
  Initial Big Data entry point: Streaming data & analytics; warehouse optimization; distributed information management
  Public/user benefits: Provide weather, water, and climate data, forecasts and warnings for the protection of life and property and enhancement of the national economy

Internal Revenue Service (IRS) – Compliance Data Warehouse
  Underpinning Big Data technologies: Columnar database architecture; multiple analytics applications; descriptive, exploratory, and predictive analysis
  Big Data metrics: Petabytes
  Initial Big Data entry point: Streaming data & analytics; warehouse optimization; distributed information management
  Public/user benefits: Provide America's taxpayers top quality service by helping them to understand and meet their tax responsibilities and enforce the law with integrity and fairness to all

Centers for Medicare & Medicaid Services (CMS) – Medical Records Analytics
  Underpinning Big Data technologies: Columnar and NoSQL databases; Hadoop under evaluation; EHR on the front end, with legacy structured database systems (including DB2 and COBOL)
  Big Data metrics: Petabytes; terabytes/day
  Initial Big Data entry point: Streaming data & analytics; warehouse optimization; distributed information management
  Public/user benefits: Protect the health of all Americans and ensure compliant processing of insurance claims
The case studies represent systems that have been in production. Given their maturity, some
case studies identify existing and new challenges that will require even greater evolution
of their IT infrastructures and technology to handle the level of compliance, retention and
accuracy required by current and new policies and by public and economic needs.
These new challenges create new use cases for needed Big Data solutions. These new use
cases demonstrate how Big Data technologies and analytics are enabling new innovations
in scientific, health, environmental, financial, energy, business and operational sectors. That
said, the men and women involved with the systems described in these case studies and use
cases are persevering. They are committed to the positive outcomes that Big Data solutions
offer: better protection, increased collaboration, deeper insights, and new prosperity. The
use cases will demonstrate that using the current and emerging technologies for Big Data
(including cloud-enabled Big Data applications) will drive new solutions to deliver insights
and information that benefits both the government and the public, thus enabling real-time
decision making.
The National Archives and Records Administration (NARA) has been charged with providing the Electronic Records Archive (ERA) and Online Public Access systems for U.S. records and documentary heritage. As of January 2012, NARA manages about 142 terabytes (TB) of information (124 TB of which is managed by ERA), representing over 7 billion objects, incorporating records from across the federal agency ecosystem, Congress and several presidential libraries, reaching back to the George W. Bush administration. It sustains over 350 million annual online visits for information. These numbers are expected to dramatically increase as agencies are mandated to use NARA in FY2012.

In addition to ERA, NARA is currently struggling to digitize over 4 million cubic feet of traditional archival holdings, including about 400 million pages of classified information scheduled for declassification, pending review with the intelligence community. Of that backlog, 62% of the physical records stored run the risk of never being preserved.

The NARA challenge represents the very essence of Big Data: how does the agency digitize this huge volume of unstructured data, provide straightforward and rapid access, and still effectively govern the data while managing access in both classified and declassified environments?
NARA has adopted an approach that puts it on the path to developing the Big Data capability required to address its challenge. This approach combines traditional data capture, digitizing, and storage capabilities with advanced Big Data capabilities for search, retrieval, and presentation, all while supporting strict security guidelines. Dyung Le, Director of ERA Systems Engineering, writes, “It is best that the Electronic Records Archive be built in such a way so as to fit in a technology ecosystem that can evolve naturally, and can be driven by the end users in ways that naturally ride the technology waves.” 8 The result is faster ingestion and categorization of documents, an improved end user experience, and dramatically reduced storage costs. NARA notes the importance of record keeping in its drive toward electronic adoption and the cost benefit. It states, “Electronic records can be duplicated and protected at less cost than paper records.” 9
8 http://ddp.nist.gov/workshop/ppts/01_05_Dyung_Le%20US_DPIF_NIST%20Digital%20Preservation%20Workshop.pdf
9 http://www.archives.gov/records-mgmt/policy/prod1afn.html
Researchers at KTH, Sweden’s leading technical university, wanted to gather in real time a wide array of data that might affect traffic patterns, in order to better manage congestion. This real-time sensor data includes GPS readings from large numbers of vehicles, radar sensors on motorways, congestion charging, weather, visibility, and more. The challenge was collecting the wide variety of data at high velocity and assimilating it in real time for analysis.

Collected data now flows into commercial off-the-shelf (COTS) streams analytics software, a software tool that analyzes large volumes of streaming, real-time data, both structured and unstructured. The data is then used to help intelligently identify current conditions, estimate how long it would take to travel from point to point in the city, offer advice on various travel alternatives, such as routes, and eventually help improve traffic in a metropolitan area.
• Uses diverse data, including GPS locations, weather conditions, speeds and flows
from sensors on motorways, incidents and roadworks
• Enters data into the Streams Analytics software, which can handle all types of data,
both structured and unstructured
• Handles, in real time, the large traffic and traffic-related data streams to enable
researchers to quickly analyze current traffic conditions and develop historical
databases for monitoring and more efficient management of the system
The result has been a decrease in traffic congestion and accidents in the target cities. KTH is now looking to expand the
capability to support routing of emergency services vehicles.
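The KTH deployment is built on commercial streams analytics software, so the code below is not that product; it is only a generic Python sketch of the underlying idea of turning streaming probe readings into a point-to-point travel-time estimate. The segment names, lengths, and speeds are invented for illustration.

```python
from collections import defaultdict

# Hypothetical road segments and their lengths in kilometers.
SEGMENT_KM = {"E4-north": 4.2, "Essingeleden": 5.8, "Centralbron": 1.1}


def average_speeds(probe_readings):
    """probe_readings: iterable of (segment_id, speed_km_h) pairs from GPS probes."""
    totals = defaultdict(lambda: [0.0, 0])
    for segment, speed in probe_readings:
        totals[segment][0] += speed
        totals[segment][1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}


def route_travel_time(route, speeds):
    """Estimated minutes to traverse the route at the current average speeds."""
    return sum(SEGMENT_KM[seg] / max(speeds.get(seg, 30.0), 1.0) * 60 for seg in route)


if __name__ == "__main__":
    readings = [("E4-north", 62), ("E4-north", 58), ("Essingeleden", 35),
                ("Essingeleden", 31), ("Centralbron", 18)]
    speeds = average_speeds(readings)
    route = ["E4-north", "Essingeleden", "Centralbron"]
    print(f"Estimated travel time: {route_travel_time(route, speeds):.1f} minutes")
```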
Since 1979, this Danish company has been engaged in the development, manufacture, sale, and maintenance of wind power
systems to generate electricity. Today, Vestas installs an average of one wind turbine every three hours, 24 hours a day,
and its turbines generate more than 90 million megawatt-hours of energy per year, enough electricity to supply millions of
households.
Making wind a reliable source of energy depends greatly on the placement of the wind turbines used to produce electricity in
order to optimize the production of power against wear and tear on the turbine itself.
For Vestas the process of establishing a location starts with its wind library, which combines data from global weather systems
with data collected from existing turbines. Data is collected from 35,000 meteorological stations scattered around the world
and from Vestas’s turbines. The data provides a picture of the global flow scenario, which in turn leads to mesoscale models
that are used to establish a huge wind library that can pinpoint the weather at a specific location at a specific time of day.
The company’s previous wind library provided detailed information in a grid pattern with each grid measuring 27x27 kilometers
(about 17x17 miles). Using computational fluid dynamics models, Vestas engineers can then bring the resolution down even
further—to about 10x10 meters (32x32 feet)—to establish the exact wind flow pattern at a particular location. However, in any
modeling scenario, the more data and the smaller the grid area, the greater the accuracy of the models. As a result, Vestas
wanted to expand its wind library more than 10 fold to include a larger range of weather data over a longer period of time.
To succeed, Vestas uses one of the largest supercomputers worldwide, along with a new Big Data modeling solution, to
slice weeks from data processing times and support 10 times the amount of data for more accurate turbine placement
decisions. Improved precision provides Vestas customers with greater business case certainty, quicker results, and
increased predictability and reliability in wind power generation.
The rapid advance of medical monitoring technology has done wonders to improve patient outcomes. Today, patients
routinely are connected to equipment that continuously monitors vital signs, such as blood pressure, heart rate and
temperature. The equipment issues an alert when any vital sign goes out of the normal range, prompting hospital staff to take
action immediately.
Many life-threatening conditions do not reach critical level right away, however. Often, signs that something is wrong begin
to appear long before the situation becomes serious, and even a skilled and experienced nurse or physician might not be
able to spot and interpret these trends in time to avoid serious complications. One example of such a hard-to-detect problem
is nosocomial infection, which is contracted at the hospital and is life threatening to fragile patients such as premature
infants. According to physicians at the University of Virginia, an examination of retrospective data reveals that, starting
12 to 24 hours before any overt sign of trouble, almost undetectable changes begin to appear in the vital signs of infants
who have contracted this infection. Although the information needed to detect the infection is present, the indication is very
subtle; rather than being a single warning sign, it is a trend over time that can be difficult to spot, especially in the fast-paced
environment of an intensive care unit.
The University of Ontario Institute of Technology partnered with researchers from a prominent technology firm that was extending a new stream-computing platform to support healthcare analytics. The result was Project Artemis, a highly flexible platform that aims to help physicians make better, faster decisions regarding patient care for a wide range of conditions.
The earliest iteration of the project is focused on early detection of nosocomial infection by watching for reduced heart rate
variability along with other indications.
Project Artemis is based on Streams analytic software. An underlying relational database provides the data management
required to support future retrospective analyses of the collected data.
The result is an early warning that gives caregivers the ability to proactively deal with potential complications, such as detecting infections in premature infants up to 24 hours before they exhibit symptoms.
As the nucleus of the nation’s astronaut corps and home to International Space Station (ISS) mission operations, NASA
Johnson Space Center (JSC) plays a pivotal role in surpassing the physical boundaries of Earth and enhancing technological
and scientific knowledge to benefit all of humankind. NASA JSC manages one of the largest imagery archives in the world
and has provided industry and the public with some of the most iconic and historic human spaceflight imagery for scientific
discovery, education and entertainment. If you have seen it at the movies or on TV, JSC touched it first.
NASA’s imagery collection of still photography and video spans more than half a century: from the early Gemini and Apollo missions to the Space Station. This imagery collection currently consists of over 4 million still images, 9.5 million feet of 16mm motion picture film, over 85,000 video tapes, and files representing 81,616 hours of video in analog and digital formats.
Eight buildings at JSC house these enormous collections and the imagery systems that collect, process, analyze, transcode,
distribute and archive these historical artifacts. NASA’s imagery collection is growing exponentially, and its sheer volume of
unstructured information is the essence of Big Data.
NASA’s human spaceflight imagery benefits the public through the numerous organizations that create media content for
social and public consumption. It is also used by the scientific and engineering community to avoid costly redesigns and to
conduct scientific and engineering analysis of tests and mission activities conducted at NASA JSC and White Sands Test
Facility.
NASA has developed best practices, technologies, and processes to collect, process, analyze, transcode, distribute, and archive these collections. Key lessons learned have revolved around data lifecycle management and NASA’s Imagery Online records management tool, which contributes to the state of the art in records management.
Technology Underpinnings
Introduction
Powerful advances in new information management and business analytics technologies, such as MapReduce frameworks (Hadoop and Hadoop-like technologies), stream analytics, and massively parallel processing (MPP) data warehouses, have been proven in deployment to support government agencies in harnessing value from the increased volume, variety and velocity that characterize Big Data. This section describes the different capabilities provided by these technologies, and how they are combined in different ways to provide unique solutions. No single technology is required for a “Big Data solution” – none is a “must have” – as initial Big Data deployments are unique to the individual agency’s business imperatives and use cases. The technologies can be placed, however, into the context of a broader enterprise Big Data model. The model below highlights the ecosystem of technologies that can be used to support Big Data solutions, coupling legacy investments to new technologies. As a result, the technologies listed in the model are not all new, but can be used as part of the solution set (see Figure 1).
[Figure 1. Big Data Enterprise Model: data sources such as social sites, video/pictures, mobile devices, and documents feed an integration layer connecting the legacy IT ecosystem to the core technologies – real-time analytics and streaming, MapReduce frameworks, and data warehouses/databases – supported by accelerators (text, statistics, financial, geospatial, acoustic) and an infrastructure layer providing provisioning, workflow, cloud/virtualization, security, job tracking and scheduling, storage infrastructure, data compression, configuration management, and administration tools.]
A suitable technology infrastructure is a key prerequisite for embarking on a successful Big Data strategy. The proper
infrastructure is required to leverage the data that originates from the varied applications and data sources. Many government
agencies operate a diverse collection of systems based on a wide variety of technology architectures. Adjustments in data
center infrastructure may be necessary to implement a Big Data platform. For example, additional dedicated data storage
may be necessary to manage massive amounts of unstructured information. Network bandwidth is also a concern in current environments.
• Imagery and video mining, marking, monitoring, and alerting capability or interfacing support
• Identify and prioritize the information projects that deliver the most value

A Big Data information plan ensures that the major components of the governance plan are working in unison. There are four components to an effective information plan.
In practice, Big Data technologies can be integrated to provide a comprehensive solution for government IT leadership. Big
Data solutions are often used to feed traditional data warehousing and business intelligence systems. For example, Hadoop
within a Big Data model can be used as a repository for structured, semi-structured (e.g., log file data), and unstructured
data (e.g., emails, documents) that feeds an OLAP data warehouse. The analytics and visualization tools pull data from the
OLAP data warehouse and render actionable information through reports. However, the fundamental step of cleansing the
data prior to loading the OLAP data warehouse is absolutely critical in developing “trusted information” to be used within the
analytics.
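To make the "cleanse before load" step concrete, here is a minimal sketch in the style of a Hadoop Streaming job, where Hadoop pipes raw input lines to a mapper on standard input and feeds the sorted mapper output to a reducer. The log layout, field positions, and cleansing rules below are assumptions for illustration, and the shuffle/sort is simulated locally so the sketch runs standalone; in a real job the mapper and reducer would be shipped as separate scripts to the Hadoop Streaming jar.

```python
from itertools import groupby

# Mapper: parse a semi-structured web log line, drop malformed records, and emit
# a key/value pair of the form "url<TAB>1" (the log layout here is an assumption).
def map_line(line):
    parts = line.split()
    if len(parts) < 7 or not parts[5].isdigit():
        return None                      # cleanse: skip malformed records
    ip, _, _, _, url, status, size = parts[:7]
    if status.startswith("4") or status.startswith("5"):
        return None                      # cleanse: ignore error responses
    return f"{url}\t1"

# Reducer: sum the counts for each URL, producing rows ready to load into a warehouse.
def reduce_pairs(sorted_pairs):
    for url, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield url, sum(int(v) for _, v in group)

if __name__ == "__main__":
    raw_logs = [
        "10.0.0.1 - - [01/Oct/2012] /benefits 200 5120",
        "10.0.0.2 - - [01/Oct/2012] /benefits 200 4096",
        "corrupted line",
        "10.0.0.3 - - [01/Oct/2012] /claims 500 0",
    ]
    mapped = filter(None, (map_line(line) for line in raw_logs))
    pairs = sorted(kv.split("\t") for kv in mapped)   # stand-in for Hadoop's shuffle/sort
    for url, count in reduce_pairs(pairs):
        print(url, count)
```

The cleansed, aggregated output of a job like this is the kind of "trusted information" that can then be loaded into the OLAP data warehouse for reporting.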
Technologies are evolving to provide government IT leadership choices of characteristics and costs. For example, NoSQL
and Hadoop technologies include the ability to scale horizontally without a pre-defined boundary. These technologies
may run on commodity hardware or can be optimized with high-end hardware technology tuned specifically to support Big
Data. Similarly, NoSQL and Hadoop have different characteristics and capabilities than traditional RDBMSs and analysis
tools. The ideal enterprise data warehouse has been envisaged as a centralized repository for 25 years, but the time has
come for a new type of warehouse to handle Big Data. MapReduce, Hadoop, in-memory databases and column stores
don’t make an enterprise data warehouse obsolete. The new enterprise data warehouse will leverage all of these software
technologies in the RDBMS or via managed services. This “logical data warehouse” requires realignment of practices and a
hybrid architecture of repositories and services. Software alone is insufficient — it will demand the rethinking of deployment
infrastructures as well.
Scalable analytics using software frameworks can be combined with storage designs that support massive growth for cost-
effectiveness and reliability. Storage platform support for Big Data can include multi-petabyte capacity supporting potentially
billions of objects, high-speed file sharing across heterogeneous operating systems, application performance awareness and
agile provisioning. Agencies must consider data protection and availability requirements for their Big Data. In many cases
data volumes and sizes may be too large to back up through conventional methods. Policies for data management will need
to be addressed as well; the nature of many Big Data use cases implies data sharing, data reuse and ongoing analysis.
Security, privacy, legal issues such as intellectual property management and liability, and retention of data for historical
purposes need to be addressed.
Cloud Computing Considerations
Consistent with federal CIO policy, Big Data presents federal IT leadership with options for deployment infrastructure that can
include cloud computing for development, test, integration, pilots and production. Cloud computing approaches can allow for
faster deployment, more effective use of scarce IT resources, and the ability to innovate more quickly. Innovation is enabled
through the dynamic use of low cost virtual environments that can be instantiated on demand. This allows organizations to
succeed quickly or fail quickly and incorporate their lessons learned. As with any deployment option, cloud computing should
be evaluated against the ability to meet application and architecture requirements within the cost structure of the organization.
[Figure: Notional information flow – source data and applications (including streaming data, social network data, and other sources) move through data preparation (supported by a metadata repository) and data transformation (data mining and statistics, new algorithms) into business intelligence and decision support, delivering insight to analysts and users.]
A notional flow of information can be a helpful approach in understanding how the technologies underpinning the Big
Data Enterprise Model can come together to meet government’s needs. This flow follows a simple Understand, Cleanse,
Transform and Exploit model. At the end of the day, the key is to map available data sources through the analytic
process to the target use cases.
The first step in any information integration and transformation initiative – Big Data or otherwise – is to identify and understand the relevant sources of data, their degree of volume, variety and velocity, and their level of quality. This understanding helps determine the degree of difficulty in accessing the data, the level of transformation required, and the core Big Data technologies that will enable the agency to manage and exploit it.
Data Preparation
Once an agency understands data sources in the context of the target use case, it can begin to define the method and manner of the data preparation required to support the target use case. For example, unstructured data may require a simple pass through for direct analysis or it may be filtered, cleaned and used for downstream processing. Structured information – such as addresses, phone numbers, and names – may require standardization and verification. Specifics depend on operational and business requirements.
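A minimal sketch of this kind of standardization and verification step is shown below; the US phone-number convention and the simple name rule are assumptions chosen for illustration, and production data preparation would typically rely on dedicated data-quality tooling and reference data.

```python
import re


def standardize_phone(raw):
    """Normalize a US phone number to the form 202-555-0173; return None if it fails verification."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading US country code
    if len(digits) != 10:
        return None                  # incomplete or malformed: flag for review
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"


def standardize_name(raw):
    """Collapse whitespace and apply simple title casing."""
    return " ".join(part.capitalize() for part in raw.split())


if __name__ == "__main__":
    print(standardize_phone("(202) 555-0173"))    # -> 202-555-0173
    print(standardize_phone("555-0173"))           # -> None (fails verification)
    print(standardize_name("  john  Q  PUBLIC "))  # -> John Q Public
```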
Data Transformation
Once the data fueling a Big Data use case has been cleansed and verified, agencies may consider additional transformation
to extract insight and enhance its value. Data may be available for direct analysis (e.g., through Hadoop) or may need
additional processing before it can be fully leveraged for the intended use case. Unstructured data, for instance, may
be broken down and rendered in a structured format – an agency may want to perform entity extraction to associate
organizational or individual names with specific documents. Further, agencies may seek to aggregate data sources, or to
understand the non-obvious relationships between them. The goal is trusted information – information that is accurate,
complete, and insightful – such that every nugget of information has been extracted from the base data.
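As a toy illustration of the entity-extraction idea mentioned above, the sketch below associates organization names with documents using a fixed dictionary; the organization list and documents are invented, and real entity extraction would use trained statistical or rule-based extractors rather than exact string matching.

```python
import re

# Hypothetical dictionary of organizations of interest.
KNOWN_ORGS = ["Centers for Medicare & Medicaid Services", "National Archives", "NOAA"]
ORG_PATTERN = re.compile("|".join(re.escape(org) for org in KNOWN_ORGS))


def extract_orgs(doc_id, text):
    """Return (doc_id, set of organization names mentioned in the document)."""
    return doc_id, set(ORG_PATTERN.findall(text))


if __name__ == "__main__":
    documents = {
        "memo-17": "NOAA forwarded the forecast data to the National Archives for retention.",
        "memo-18": "No external organizations are referenced here.",
    }
    index = dict(extract_orgs(d, t) for d, t in documents.items())
    print(index)  # memo-17 maps to {'NOAA', 'National Archives'}; memo-18 to an empty set
```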
Once trusted information has been established, agencies can then use the broadest range of analytic tools and techniques
to exploit it. These tools and techniques range from the most basic business intelligence capabilities, to more sophisticated
predictive analytics, to anomaly detection, content analytics, sentiment analytics, imagery, aural analytics and biometrics.
Once the data is brought into the Big Data environment, the critical step is to process it to glean new insights. For example,
Hadoop can be used to analyze unstructured data residing on many distributed compute instances or business intelligence
tools can be used to analyze a structured data warehouse.
Analysts/Visualization
The final step in the information supply chain is to deliver the new insights created in the previous steps in a manner that most
effectively supports the business requirements and use cases. This implies one of two approaches, depending on the users
and the business requirement. Either the collection of new insights is provided through a visualization or collaboration tool
that enables the users to explore the information, ask questions and uncover new insights; or, the information is simply used
to fuel existing work process applications to improve existing processes. Either way, the user needs to be provisioned with
data that meets the Big Data needs.
So what have we learned? In our discussion with leaders from across the government
ecosystem, and examining the case studies, five central themes have emerged -- each
of which coincidentally is well aligned with established engineering best practices:
• Velocity: Use cases requiring both a high degree of velocity in data processing and real-time decision making tend to require Streams as an entry point.
• Volume: Government leaders struggling with the sheer volume in the data they
seek to manage, often select as an entry point a database or warehouse architecture
that can scale out without pre-defined bounds.
• Variety: Those use cases requiring an ability to explore, understand and analyze
a variety of data sources, across a mixture of structured, semi-structured and
unstructured formats, horizontally scaled for high performance while maintaining low
cost, imply Hadoop or Hadoop-like technologies as the entry point.
4. Identify gaps: Once an initial set of business requirements has been identified and defined, government IT leaders assess their technical requirements and ensure consistency with their long-term architecture. Leaders should identify gaps against the Big Data reference architecture described previously, and then plan the investments to close the gaps.

5. Iterate: From these Phase I deployments, clients typically expand to adjacent use cases, building out a more robust and unified Big Data platform. This platform begins to provide capabilities that cut across the expanding list of use cases, and provides a set of common services to support an ongoing initiative – beyond the adjacent core capabilities themselves (e.g., streams, Hadoop, and warehousing).
Based on these observations, the Big Data Commission recommends the following five-step cyclical approach to successfully take advantage of the Big Data opportunity. These steps are iterative versus serial in nature, with a constant closed feedback loop that informs ongoing efforts. Critical to success is to approach the Big Data initiative from a simple definition of the business and operational imperatives that the government organization seeks to address, and a set of specific business requirements and use cases that each phase of deployment will support. At each phase, the government organization should consistently review progress, the value of the investment, the key lessons learned, and the potential impacts on governance, privacy, and security policies. In this way, the organization can move tactically to address near-term business challenges, but operate within the strategic context of building a robust Big Data capability.
Define – Define the Big Data opportunity, including the key business and mission challenges, the initial use case or set of use cases, and the value Big Data can deliver
• Identify key business challenges, and potential use cases to address
• Identify areas of opportunity where access to Big Data can be used to better serve the citizenry, the mission, or reduce costs
• Ask whether Big Data holds a unique promise to satisfy the use case(s)
• Identify the value of a Big Data investment against more traditional analytic investments, or doing nothing
• Create your overall vision, but chunk the work into tactical phases (time to value within a 12-18 month timeframe)
• Don’t attempt to solve all Big Data problems in the initial project – seek to act tactically, but in the strategic context of your key business imperatives

Assess – Assess the organization’s currently available data and technical capabilities against the data and technical capabilities required to satisfy the defined set of business requirements and use cases
• Assess the use case across velocity, variety and volume requirements, and determine if they rise to the level of a Big Data initiative, versus a more traditional approach
• Assess the data and data sources required to satisfy the defined use case, versus current availability
• Assess the technical requirements to support accessing, governing, managing and analyzing the data, against current capability
• Leverage the reference architecture defined in the report above to identify key gaps
• Develop an ROI assessment for the current phase of deployment (ROI used conceptually, as the ROI may be better services for customers/citizens and not necessarily a financial ROI)

Plan – Select the most appropriate deployment pattern and entry point, design the “to be” technical architecture, and maintain the flexibility to leverage the investment to accommodate subsequent business requirements and use cases
• Identify the “entry point” capability as described in the section above
• Identify successful outcomes (success criteria)
• Develop an architectural roadmap in support of the selected use case or use cases
• Identify any policy, privacy and security considerations
• Build out the Big Data platform as the plan requires, with an eye toward flexibility and expansion
• Deploy technologies with both the flexibility and performance to scale to support subsequent use cases and corresponding data volume, velocity and variety

Review – The government agency continually reviews progress, adjusts the deployment plan as required, and tests business process, policy, governance, privacy and security considerations
• This is a continual process that cuts across the remainder of the roadmap steps
• Throughout the assess and planning stages, continually review plans against set governance, privacy, and security policies
• Assess Big Data objectives against current federal, state and local policy
• At each stage, assess ROI, and derive lessons learned
• Review the deployed architecture and technologies against the needs of the broader organization – both to close any gaps, and to identify adjacent business areas that might leverage the developing Big Data capability
• Move toward Big Data transformation in a continual iterative process
Public Policy
Accelerating Uptake of Big Data in the Federal Government
As the federal government moves to leverage Big Data, it must look closely at current
policies and determine whether they are sufficient to ensure it can maximize its promise.
Issues as varied as procurement processes for acquiring Big Data solutions, research and
development funding and strategies, and workforce development policies will have a dramatic
effect on the success or limitations of federal Big Data efforts. Furthermore, understanding
and addressing citizens’ expectations on privacy and security is critical for government
to implement Big Data solutions successfully. The government should evaluate Big Data
policy issues with a view toward removing the unnecessary obstacles to Big Data use, and
driving specific actions to accelerate the use and dissemination of the government’s data
assets. In this section of the report, we address specific strategic policy areas where careful
consideration by policy makers can greatly accelerate uptake of Big Data efforts.
The culture of information sharing and decision making needs to grow to include Big Data analysis. Without providing decision makers the policy, the incentives to use Big Data for insights and predictions, and the guidance on the use and sharing of information, government will not realize the tremendous value Big Data can offer.
The best private companies today are relentlessly data-driven. They have led the way in
industry in measuring their performance carefully and in using hard quantitative information to
guide plans and decisions. Unfortunately, in the federal government, daily practice frequently
undermines official policies that encourage sharing of information both within and among
agencies and with citizens. Furthermore, decision-making by leaders in Congress and the
Administration often is accomplished without the benefit of key information and without
using the power of Big Data to model possible futures, make predictions, and fundamentally
connect the ever increasing myriad of dots and data available. As recognized in a recent
Government Accountability Office (GAO) report,10 Congress may miss opportunities to use
performance information produced by agencies. Such data could be leveraged to enact
targeted law and policy. So too, the Administration may miss the chance to leverage real-
time data as it fulfills program missions. Both branches of government stand to benefit from
enhancing policies directed at the use of Big Data approaches as part of their daily routine.
10 Managing for Results: A Guide for Using the GPRA Modernization Act to Help Inform Congressional Decision Making (gao.gov/assets/600/591649.pdf).
• Name a single official both across government and within each agency
to bring cohesive focus and discipline to leveraging the government’s data
assets to drive change, enhance performance, and increase competitiveness.
To be certain the value of data is properly harnessed across organizational
boundaries, Big Data should not be seen as the exclusive domain of the IT
organization or CIO.
These efforts should take a strategic view of the importance of leveraging Big
Data for new ways of optimizing operations. It will also be important to clearly
define roles and responsibilities to ensure that the discipline of Big Data is not
overlooked, but rather is integrated into daily agency operations and decision-
making.
11 www.fcc.gov/data/chief-data-officers
Education and Workforce Development

The Commission recommends that the OSTP further develop a national research and development strategy for Big Data that encourages research into new techniques and tools, and that explores the application of those tools to important problems across varied research domains. Such research could be facilitated, and cost could be optimized, by considering the establishment of experimental laboratories within which agencies could explore and test new Big Data technologies without the need to spend funds on their own. This strategy should not be limited to the major research organizations in the federal government, but rather focus on the roles and skill sets of all levels of government and create incentives for the private sector to innovate and develop transformative solutions. Among the areas such a strategy should address:

B. New management and analytic tools to address increasing data velocity and complexity;

C. Novel privacy-enhancing and data management technologies;

D. Development of new economic models to encourage data sharing and monetization of Big Data solutions and collaborative data science efforts;

E. Continued research and development of advanced computing technologies that can effectively process not only the vast amounts of data being continually generated, but also the various types of data.
Privacy Issues
As the earlier best practice section reveals, good data protection practices are in place, as evidenced by the existing policies and industry best practices related to privacy and security by design, context, accountability, and transparency. The Commission does not believe that realizing the promise of Big Data requires the sacrifice of personal privacy. Accordingly, education and execution on existing privacy protections and procedures must be part of any Big Data project.

There are over 40 laws12 that provide various forms of privacy protections. Some, such as the Privacy Act of 1974, provide protections to personal data used by government; others are more sector-specific, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Financial Modernization Act (Gramm-Leach-Bliley Act of 1999), which concern health and financial information, respectively. Further guidance from OMB, and greater collaboration with industry and stakeholders, on applying these protections to current technology and cultural realities could help accelerate the uptake of Big Data. Specifically, we recommend that OMB strongly collaborate with industry and advocacy groups and issue additional guidance that addresses:

1. Ways to simplify the understanding of privacy management obligations through the issuance of one guiding set of principles and practices to be applied by all agencies. Along with the streamlining of access to what management practices are required, this guidance also should explicitly note the penalties for violating privacy requirements.

2. The government’s use of data that is not part of a Privacy Act System of Records.

3. The growing concern about the use, aggregation, sharing, retention, and deletion of data, with an eye toward identifying best practices that may be needed to realize Fair Information Practice Principles (FIPPs).

4. The need to recognize and promote citizen confidence by communicating clearly that, for many Big Data projects, the data being aggregated are non-regulated, de-identified information with no need to re-identify to derive benefits. Of particular note should be the fact that data sets can be used without personally identifiable information (PII) and still yield positive results (e.g., disease spread predictions).

5. The use of clear public policies regarding notice and access for Big Data to enable choice and transparency, as articulated by the FIPPs. Along these lines, stakeholders could explore a “layered notice” approach, where basic information is available for all, with notice/consent/access requirements scaled to instances where more information is sought.
12 http://healthit.hhs.gov/portal/server.pt/gateway/PTARGS_0_11113_911059_0_0_18/Federal%20Privacy%20Laws%20Table%202%2026%2010%20Final.pdf
Removing Barriers to Use through Procurement Efficiencies
The acquisition process for IT increasingly involves multiple buying vehicles with low
minimum government commitments, compelling industry to participate through numerous,
duplicative contracting channels to maximize the ability to win government business.
A recent Bloomberg study indicates that the number of multiple award contracts in the
federal government has more than doubled from 2007 to 2011.13 This process continues
to inflate administrative costs with no appreciable value added, and threatens to reduce
incentives to participate in the government market.
Big Data continues to evolve rapidly, driven by innovation in the underlying technologies,
platforms, and analytic capabilities for handling data, as well as changes in user behavior.
Given this rapid evolution, coupled with the scarcity of funds competing for public sector
initiatives, the proliferation of contract vehicles, especially at the agency level, creates
potential problems for its adoption by government. The process- and cost-intensive
nature of the current contracting trend creates a barrier to industry vendors offering
cutting edge solutions, and at the same time, it hinders federal government efforts to
deploy Big Data capabilities in standard approaches across the landscape of disparate
federal agencies.
The government should avoid Big Data contract vehicle duplication by promoting an
express preference for the use of existing successful government-wide vehicles for Big
Data solutions and use of the minimum number of buying vehicles necessary to fulfill its
needs, including the Federal Supply Schedules program. Although that program would
have to be modified to address some shortcomings, which can restrict its use in the
acquisition of complex solutions, it could supplement GWACs, and together, these existing
contract vehicles could be leveraged fully before any consideration is given to creating
new contracting vehicles. While having the right contracting vehicles is important, the
government model for buying technology also needs to be updated to allow for a shift,
when appropriate, from a capital expenditure (CapEx) to an operating expense (OpEx)
model to pay for usage.
The government has within its management purview the means to craft buying
solutions for Big Data without turning to new contracting vehicles that will add cost and
administrative burden to the implementation of Big Data solutions. Channeling Big
Data solutions through the minimum number of contract vehicles necessary would
allow maximum integration, strategic sourcing, governance, and standardization.
13 “The Future of Multiple-Award Contracting” Conference, Presentation by Brian Friel, Bloomberg Government, July 18, 2012.
Conclusion
We live in an exciting time, when the scale and scope of value that data can bring is coming to an inflection point, set to
expand greatly as the availability of Big Data converges with the ability to affordably harness it. Hidden in the immense
volume, variety and velocity of data that is produced today is new information – facts, relationships, indicators and
pointers -- that either could not be practically discovered in the past, or simply did not exist before. This new information,
effectively captured, managed, and analyzed, has the power to change every industry including cyber security, healthcare,
transportation, education, and the sciences.
To make data a strategic asset that can be used to better achieve mission outcomes, data should be included in the strategic
planning, enterprise architecture, and human capital of each agency. These precepts are embodied in the Digital Government Strategy, a primary component of which is to “unlock the power of government data to spur innovation across our nation and improve the quality of services for the American people.”
Within this report, we have outlined the steps each government agency should take toward adopting Big Data solutions,
including the development of data governance and information plans. The Commission recommends that the OSTP further
develop a national research and development strategy for Big Data that encourages research into new techniques and tools,
and that explores the application of those tools to important problems across varied research domains. We recommend that
the OMB strongly collaborate with industry and advocacy groups and issue additional guidance that addresses privacy issues.
Because of the importance of data in the digital economy, the Commission encourages each agency to follow the FCC’s
decision to name a Chief Data Officer. To generate and promulgate a government-wide data vision, to coordinate activities,
and to minimize duplication, we recommend appointing a single official within the OMB to bring cohesive focus and discipline
to leveraging the government’s data assets to drive change, improve performance, and increase competitiveness.
By following these recommendations, the early successes we described at NARA, NASA, NOAA, the IRS, CMS, and the Department of Defense can be expanded and leveraged across government to reduce cost, increase transparency, and enhance the effectiveness of government, ultimately better serving the citizenry, society, and the world.
• George Strawn, National Coordination Office for Networking and Information Technology
Research and Development
• Wendy Wigen, National Coordination Office for Networking and Information Technology
Research and Development
*The Commission especially thanks these individuals for their significant contributions to the development of the Commission’s work.