
CHAPTER 5

Data and Knowledge Management


CHAPTER OUTLINE
5.1 Managing Data
5.2 The Database Approach
5.3 Big Data
5.4 Data Warehouses and Data Marts
5.5 Knowledge Management
5.6 Appendix: Fundamentals of Relational Database Operations

LEARNING OBJECTIVES
5.1 Discuss ways that common challenges in managing data can be
addressed using data governance.
5.2 Identify and assess the advantages and disadvantages of
relational databases.
5.3 Define Big Data and explain its basic characteristics.
5.4 Explain the elements necessary to successfully implement and
maintain data warehouses.
5.5 Describe the benefits and challenges of implementing
knowledge management systems in organizations.
5.6 Understand the processes of querying a relational database,
entity-relationship modeling, and normalization and joins.

Opening Case
Data from a Pacemaker Leads to an Arrest

On September 19, 2016, a fire broke out at Ross Compton’s home in
Middletown, Ohio. Compton informed law enforcement that he
awoke to find his home on fire. Authorities noted that before
Compton escaped from his house, he packed some of his clothes
and other belongings in a suitcase. He put anything that did not fit
in his suitcase in other bags. He also took his computer and the
charger for his pacemaker. (A pacemaker is a small device placed
in the chest or abdomen to help control abnormal heart rhythms.)
He then broke through a window with his cane and threw his
suitcase and bags out of it. The blaze ended up causing $400,000
of damage.
Fire department investigators determined that the fire originated
from multiple locations within the house, leading them to conclude
that the fire resulted from arson. In addition, investigators said
Compton’s house smelled of gasoline and that his account of what
happened was inconsistent with the available evidence.
Investigators had to make the case for arson and insurance fraud.
They realized that they had a way to corroborate how much
Compton was exerting himself, and for how long, before he
escaped the fire: Compton’s pacemaker.
Once police learned about Compton’s pacemaker, they obtained a
search warrant for the data recorded on it. They believed that the
data would reveal his heart rate and cardiac rhythms before,
during, and after the fire. Medical technicians downloaded the data
—the same data that would routinely be retrieved from a
pacemaker during an appointment with a physician—from the
device. Law enforcement officials then subpoenaed those records.
The data did not corroborate Compton’s version of the events that
occurred that night.
Authorities alleged that the data revealed that Compton was awake
when he claimed to be sleeping. In addition, a cardiologist
reviewed the data and concluded that Compton’s medical
condition made it unlikely that he would have been able to collect,
pack, and remove numerous large and heavy items from the house,
exit his bedroom window, and carry the items to the front of his
residence during the short period of time that Compton had
indicated to the authorities.
In late January 2017, a Butler County, Ohio, grand jury indicted
Compton on felony charges of aggravated arson and insurance
fraud for allegedly starting the fire. Compton pleaded not guilty to
the charges the following month.
Compton’s defense attorney filed a motion to suppress the
pacemaker data evidence as an unreasonable seizure of Compton’s
private information. Prosecutors argued that police have
historically obtained personal information through search
warrants and that doing so for a pacemaker should not be treated
differently. For example, law enforcement can use legally obtained
blood samples and medical records as evidence.
Investigators have also recently used data from other smart
devices, such as steps counted by activity trackers and queries
made to smart speakers, to establish how a crime was committed.
In Connecticut, for example, Richard Dabate was charged with
murdering his wife after police built a case based, in part, on the
victim’s Fitbit data. (See the opening case in Chapter 8.) Despite
Compton’s attorney’s arguments to the contrary, Butler County
Judge Charles Pater held that the data from Compton’s pacemaker
could be used against him in his upcoming trial.
Compton had been free on his own recognizance since his
indictment, but he did not appear for a pretrial hearing in Butler
County Common Pleas Court. Judge Charles Pater revoked
Compton’s bond. In late July 2018, Compton was back in police
custody.
This case was the first in which police obtained a search warrant
for a pacemaker. If a person is dependent on an embedded medical
device, should the device that keeps him alive also be allowed to
incriminate him in a crime? After all, the Fifth Amendment of the
U.S. Constitution protects a person from being forced to
incriminate himself or herself. When Ross Compton had a
pacemaker installed, he, as an individual, had a constitutional right
to remain silent. However, the electronic data stored in his
pacemaker eventually led to his arrest and subsequent indictment
on charges of arson and insurance fraud.
In Compton’s case, the data were obtained from a device inside his
body rather than a device in his home or worn on his wrist. In
March 2019, the Butler County judge ruled that using pacemaker
data was not stealing personal information. That is, data are not
considered more protected or more private by virtue of their personal
nature or where they are generated or stored. Defense attorneys are
appealing the judge’s decision to the 12th District Court of Appeals.
A new trial date for Compton has not been set.
The more connected, convenient, and intelligent our devices
become, the more they have the potential to expose the truth.
There is nothing new about consumers using tracking and
recording devices. Each day, all of us leave revealing data trails—
called “data exhaust”—and prosecutors are realizing how valuable
these data can be in solving crimes. In addition, new data sources
are constantly emerging. For example, autonomous (self-driving)
cars may record our speed, distance traveled, and locations. Homes
can tell which rooms are occupied, and smart appliances such as
refrigerators can track our daily routines. Significantly, all of these
different types of data can be used as incriminating evidence.
These technologies represent a new frontier for criminal litigation
and law enforcement. In a time when consumers constantly reveal
intimate data, perhaps privacy is becoming a thing of the past.
The Compton case may be one of the first Internet of Things (see
Chapter 8) prosecutions, but it certainly will not be the last. Since
Compton’s arrest, Ohio police departments have used similar data
in two homicide investigations. If other courts accept Judge Pater’s
ruling on the admissibility of Compton’s pacemaker data, then
consumers might have to accept the reality that using smart
technologies may cause them to forfeit whatever is left of their
privacy.
Sources: Compiled from L. Pack, “Is Using Pacemaker Data ‘Stealing
Personal Information’? Judge in Middletown Arson Case Says No,”
Middletown Journal-News, March 7, 2019; L. Pack, “Arson Suspect in
Unique Case Featuring Pacemaker Data Is Back in Custody,” Middletown
Journal-News, July 24, 2018; L. Pack, “His Pacemaker Led to Arson
Charges. Then He Failed to Show Up for Court, so Police Are Looking for
Him,” Middletown Journal-News, March 6, 2018; D. Paul, “Your Own
Pacemaker Can Now Testify Against You in Court,” Wired, July 29, 2017; G.
Ballenger, “New Form of Law Enforcement Investigation Hits Close to the
Heart,” Slate, July 19, 2017; L. Pack, “2 More Investigations Where
Middletown Police Used Pacemaker Data,” Middletown Journal-News,
July 14, 2017; M. Moon, “Judge Allows Pacemaker Data to be Used in
Arson Trial,” Engadget, July 13, 2017; “Man’s Pacemaker Data May Sink
Him in Court,” Newser, July 12, 2017; L. Pack, “Judge: Pacemaker Data Can
Be Used in Middletown Arson Trial,” Middletown Journal-News, July 11,
2017; C. Hauser, “In Connecticut Murder Case, a Fitbit Is a Silent Witness,”
New York Times, April 27, 2017; D. Boroff, “Data from Pacemaker Used as
Evidence against Ohio Man Accused of Burning Down His House,” New
York Daily News, February 8, 2017; C. Wootson, “A Man Detailed His
Escape from a Burning House. His Pacemaker Told Police a Different
Story,” Washington Post, February 8, 2017; J. Wales, “Man’s Pacemaker
Used to Track and Charge Him with Crime,” AntiMedia, February 7, 2017;
D. Reisinger, “How a Pacemaker Led Police to Accuse Someone with
Arson,” Fortune, February 7, 2017; “Cops Use Pacemaker Data to Charge
Homeowner with Arson, Insurance Fraud,” CSO Online, January 30, 2017;
L. Pack, “Data from Man’s Pacemaker Led to Arson Charges,” Middletown
Journal-News, January 27, 2017; and L. Pack, “Middletown Homeowner
Facing Felonies for House Fire,” Middletown Journal-News, January 26,
2017.

Questions
1. The Electronic Frontier Foundation released a statement that
Americans should not have to make a choice between health
and privacy (www.eff.org/issues/medical-privacy). Do
you agree with this statement? Why or why not? Support your
answer.
2. As we noted in Chapter 1, you are known as Homo conexus
because you practice continuous computing and are
surrounded with intelligent devices. Look at Chapter 3 on
ethics and privacy, and discuss the privacy implications of
being Homo conexus.
Introduction
Information technologies and systems support organizations in
managing—that is, acquiring, organizing, storing, accessing, analyzing,
and interpreting—data. As noted in Chapter 1, when these data are
managed properly, they become information and then knowledge.
Information and knowledge are invaluable organizational resources
that can provide any organization with a competitive advantage.
So, just how important are data and data management to
organizations? From confidential customer information to intellectual
property, to financial transactions to social media posts, organizations
possess massive amounts of data that are critical to their success. Of
course, to benefit from these data, organizations need to manage these
data effectively. This type of management, however, comes at a huge
cost. According to Symantec’s (www.symantec.com) State of
Information Survey, digital information costs organizations worldwide
more than $1 trillion annually. In fact, it makes up roughly half of an
organization’s total value. The survey found that large organizations
spend an average of $40 million annually to maintain and use data,
and small-to-medium-sized businesses spend almost $350,000.
This chapter examines the processes whereby data are transformed
first into information and then into knowledge. Managing data is
critical to all organizations. Few business professionals are
comfortable making or justifying business decisions that are not based
on solid information. This is especially true today, when modern
information systems make access to that information quick and easy.
For example, we have information systems that format data in a way
that managers and analysts can easily understand. Consequently,
these professionals can access these data themselves and then analyze
the data according to their needs. The result is useful information.
Managers can then apply their experience to use this information to
address a business problem, thereby producing knowledge.
Knowledge management, enabled by information technology, captures
and stores knowledge in forms that all organizational employees can
access and apply, thereby creating the flexible, powerful “learning
organization.”
Organizations store data in databases. Recall from Chapter 1 that a
database is a collection of related data files or tables that contain data.
We discuss databases in Section 5.2, focusing on the relational
database model. In Section 5.6, we take a look at the fundamentals of
relational database operations.
Clearly, data and knowledge management are vital to modern
organizations. But, why should you learn about them? The reason is
that you will play an important role in the development of database
applications. The structure and content of your organization’s
database depend on how users (meaning you) define your business
activities. For example, when database developers in the firm’s MIS
group build a database, they use a tool called entity-relationship (ER)
modeling. This tool creates a model of how users view a business
activity. When you understand how to create and interpret an ER
model, then you can evaluate whether the developers have captured
your business activities correctly.
Keep in mind that decisions about data last longer, and have a broader
impact, than decisions about hardware or software. If decisions
concerning hardware are wrong, then the equipment can be replaced
relatively easily. If software decisions turn out to be incorrect, they can
be modified, though not always painlessly or inexpensively. Database
decisions, in contrast, are much harder to undo. Database design
constrains what an organization can do with its data for a long time.
Remember that business users will be stuck with a bad database
design, while the programmers who created the database will quickly
move on to their next projects.
Furthermore, consider that databases typically underlie the enterprise
applications that users access. If there are problems with
organizational databases, then it is unlikely that any applications will
be able to provide the necessary functionality for users. Databases are
difficult to set up properly and to maintain. They are also the
component of an information system that is most likely to receive the
blame when the system performs poorly and the least likely to be
recognized when the system performs well. This is why it is so
important to get database designs right the first time—and you will
play a key role in these designs.
You might also want to create a small, personal database using a
software product such as Microsoft Access. In that case, you will need
to be familiar with at least the basics of the product.
After the data are stored in your organization’s databases, they must
be accessible in a form that helps users to make decisions.
Organizations accomplish this objective by developing data
warehouses. You should become familiar with data warehouses
because they are invaluable decision-making tools. We discuss data
warehouses in Section 5.4.
You will also make extensive use of your organization’s knowledge
base to perform your job. For example, when you are assigned a new
project, you will likely research your firm’s knowledge base to identify
factors that contributed to the success (or failure) of previous, similar
projects. We discuss knowledge management in Section 5.5.
You begin this chapter by examining the multiple challenges involved
in managing data. You then study the database approach that
organizations use to help address these challenges. You turn your
attention to Big Data, which organizations must manage in today’s
business environment. Next, you study data warehouses and data
marts, and you learn how to use them for decision making. You
conclude the chapter by examining knowledge management.

5.1 Managing Data


All IT applications require data. These data should be of high quality,
meaning that they should be accurate, complete, timely, consistent,
accessible, relevant, and concise. Unfortunately, the process of
acquiring, keeping, and managing data is becoming increasingly
difficult.


The Difficulties of Managing Data


Because data are processed in several stages and often in multiple
locations, they are frequently subject to problems and difficulties.
Managing data in organizations is difficult for many reasons.
First, the amount of data increases exponentially with time. Much
historical data must be kept for a long time, and new data are added
rapidly. For example, to support millions of customers, large retailers
such as Walmart have to manage many petabytes of data. (A petabyte
is approximately 1,000 terabytes, or about one quadrillion bytes; see
Technology Guide 1.)
Data are also scattered throughout organizations, and they are
collected by many individuals using various methods and devices.
These data are frequently stored in numerous servers and locations
and in different computing systems, databases, formats, and human
and computer languages.
Another problem is that data are generated from multiple sources:
internal sources (for example, corporate databases and company
documents); personal sources (for example, personal thoughts,
opinions, and experiences); and external sources (for example,
commercial databases, government reports, and corporate websites).
Data also come from the Web in the form of clickstream data.
Clickstream data are the data that visitors and customers produce when
they visit a website and click on hyperlinks (described in Chapter 6).
Clickstream data provide a trail of the users’ activities on the website,
revealing user behavior and browsing patterns.
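To make this concrete, the short Python sketch below shows a few hypothetical clickstream records and the trail they form. The field names are illustrative rather than a standard; real clickstream formats vary by website and analytics tool.

```python
# Hypothetical clickstream records for one visitor; real schemas vary.
clickstream = [
    {"visitor_id": "v-1001", "page": "/home",           "referrer": "search engine",  "time": "10:02:11"},
    {"visitor_id": "v-1001", "page": "/products/tv-42", "referrer": "/home",          "time": "10:02:45"},
    {"visitor_id": "v-1001", "page": "/cart",           "referrer": "/products/tv-42", "time": "10:03:20"},
]

# Reconstruct the visitor's trail through the website from the click sequence.
trail = " -> ".join(event["page"] for event in clickstream)
print(trail)  # /home -> /products/tv-42 -> /cart
```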
Adding to these problems is the fact that new sources of data such as
blogs, podcasts, tweets, Facebook posts, YouTube videos, texts, and
RFID tags and other wireless sensors are constantly being developed,
and the data these technologies generate must be managed. Also, the
data becomes less current over time. For example, customers move to
new addresses or they change their names; companies go out of
business or are bought; new products are developed; employees are
hired or fired; and companies expand into new countries.
Data are also subject to data rot. Data rot refers primarily to problems
with the media on which the data are stored. Over time, temperature,
humidity, and exposure to light can cause physical problems with
storage media and thus make it difficult to access the data. The second
aspect of data rot is that finding the machines needed to access the
data can be difficult. For example, it is almost impossible today to find
8-track players for playing and listening to music. Consequently, a
library of 8-track tapes has become relatively worthless, unless you
have a functioning 8-track player or you convert the tapes to a more
modern medium such as CDs and DVDs.
Data security, quality, and integrity are critical, yet they are easily
jeopardized. Legal requirements relating to data also differ among
countries as well as among industries, and they change frequently.
Another problem arises from the fact that, over time,
organizations have developed information systems for specific
business processes, such as transaction processing, supply chain
management, and customer relationship management. Information
systems that specifically support these processes impose unique
requirements on data; such requirements result in repetition and
conflicts across the organization. For example, the marketing function
might maintain information on customers, sales territories, and
markets. These data might be duplicated within the billing or
customer service functions. This arrangement can produce
inconsistent data within the enterprise. Inconsistent data prevent a
company from developing a unified view of core business information
—data concerning customers, products, finances, and so on—across
the organization and its information systems.
Two other factors complicate data management. First,
federal regulations—for example, the Sarbanes–Oxley Act of 2002—
have made it a top priority for companies to better account for how
they manage information. Sarbanes–Oxley requires that (1) public
companies evaluate and disclose the effectiveness of their internal
financial controls and (2) independent auditors for these companies
agree to this disclosure. The law also holds CEOs and CFOs personally
responsible for such disclosures. If their companies lack satisfactory
data management policies and fraud or a security breach occurs, then
the company officers could be held liable and face prosecution.
Second, companies are drowning in data, much of which are
unstructured. As you have seen, the amount of data is increasing
exponentially. To be profitable, companies must develop a strategy for
managing these data effectively.
An additional problem with data management is Big Data. Big Data
are so important that we devote the entire Section 5.3 to this topic.

Data Governance
To address the numerous problems associated with managing data,
organizations are turning to data governance. Data governance is
an approach to managing information across an entire organization. It
involves a formal set of business processes and policies that are
designed to ensure that data are handled in a certain, well-defined
fashion. That is, the organization follows unambiguous rules for
creating, collecting, handling, and protecting its information. The
objective is to make information available, transparent, and useful for
the people who are authorized to access it, from the moment it enters
an organization until it becomes outdated and is deleted.
One strategy for implementing data governance is master data
management. Master data management is a process that spans all
of an organization’s business processes and applications. It provides
companies with the ability to store, maintain, exchange, and
synchronize a consistent, accurate, and timely “single version of the
truth” for the company’s master data.
Master data are a set of core data, such as customer, product,
employee, vendor, geographic location, and so on, that span the
enterprise’s information systems. It is important to distinguish
between master data and transactional data. Transactional data,
which are generated and captured by operational systems, describe the
business’s activities, or transactions. In contrast, master data are
applied to multiple transactions, and they are used to categorize,
aggregate, and evaluate the transactional data.
Let’s look at an example of a transaction: You (Mary Jones) purchased
one Samsung 42-inch LCD television, part number 1234, from Bill
Roberts at Best Buy, for $2,000, on April 20, 2017. In this example,
the master data are “product sold,” “vendor,” “salesperson,” “store,”
“part number,” “purchase price,” and “date.” When specific values are
applied to the master data, then a transaction is represented.
Therefore, the transactional data would be, respectively, “42-inch LCD
television,” “Samsung,” “Bill Roberts,” “Best Buy,” “1234,” “$2,000,”
and “April 20, 2017.”
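The distinction can be sketched in a few lines of Python. The table and field names below (including the salesperson identifier) are illustrative assumptions; the point is that master data describe the core things the business deals with, while a transactional record ties specific values of those things to a single event.

```python
# Master data: core reference data that span the enterprise's systems.
products = {"1234": {"description": "Samsung 42-inch LCD television",
                     "vendor": "Samsung"}}
salespeople = {"S-77": {"name": "Bill Roberts", "store": "Best Buy"}}

# Transactional data: one record captured by an operational system for one event.
transaction = {
    "part_number": "1234",       # points to master product data
    "salesperson_id": "S-77",    # points to master salesperson data
    "customer": "Mary Jones",
    "purchase_price": 2000.00,
    "date": "2017-04-20",
}

# Master data let many such transactions be categorized and aggregated consistently.
product = products[transaction["part_number"]]
seller = salespeople[transaction["salesperson_id"]]
print(product["description"], "sold by", seller["name"], "at", seller["store"])
```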
An example of master data management is the city of Dallas, Texas, which
implemented a plan for digitizing the city’s public and private records,
such as paper documents, images, drawings, and video and audio
content. The master database can be used by any of the 38 government
departments that have appropriate access. The city is also integrating
its financial and billing processes with its customer relationship
management program. (You will learn about customer relationship
management in Chapter 11.)
How will Dallas use this system? Imagine that the city experiences a
water-main break. Before it implemented the system, repair crews had
to search City Hall for records that were filed haphazardly. Once the
workers found the hard-copy blueprints, they took them to the site
and, after they examined them manually, decided on a plan of action.
In contrast, the new system delivers the blueprints wirelessly to the
laptops of crews in the field, who can magnify or highlight areas of
concern to generate a rapid response. This process reduces the time it
takes to respond to an emergency by several hours.
Along with data governance, organizations use the database approach
to efficiently and effectively manage their data. We discuss the
database approach in Section 5.2.

Before you go on …
1. What are some of the difficulties involved in managing data?
2. Define data governance, master data, and transactional
data.
5.2 The Database Approach
From the mid-1950s, when businesses first adopted computer
applications, until the early 1970s, organizations managed their data
in a file management environment. This environment evolved because
organizations typically automated their functions one application at a
time. Therefore, the various automated systems developed
independently from one another, without any overall planning. Each
application required its own data, which were organized in a data file.


A data file is a collection of logically related records. In a file
management environment, each application has a specific data file
related to it. This file contains all of the data records the application
requires. Over time, organizations developed numerous applications,
each with an associated, application-specific data file.
For example, imagine that most of your information is stored in your
university’s central database. In addition, however, a club to which you
belong maintains its own files, the athletics department has separate
files for student athletes, and your instructors maintain grade data on
their personal computers. It is easy for your name to be misspelled in
one of these databases or files. Similarly, if you move, then your
address might be updated correctly in one database or file but not in
the others.
Using databases eliminates many problems that arose from previous
methods of storing and accessing data, such as file management
systems. Databases are arranged so that one set of software programs
—the database management system—provides all users with access to
all of the data. (You will study database management systems later in
this chapter.) Database systems minimize the following problems:
Data redundancy: The same data are stored in multiple locations.
Data isolation: Applications cannot access data associated with
other applications.
Data inconsistency: Various copies of the data do not agree.
Database systems also maximize the following:
Data security: Because data are “put in one place” in databases,
there is a risk of losing a lot of data at one time. Therefore,
databases must have extremely high security measures in place to
minimize mistakes and deter attacks.
Data integrity: Data meet certain constraints; for example, there
are no alphabetic characters in a Social Security number field.
Data independence: Applications and data are independent of
one another; that is, applications and data are not linked to each
other, so all applications are able to access the same data.
Figure 5.1 illustrates a university database. Note that university
applications from the registrar’s office, the accounting department,
and the athletics department access data through the database
management system.

FIGURE 5.1 Database management system.


A database can contain vast amounts of data. To make these data more
understandable and useful, they are arranged in a hierarchy. We take a
closer look at this hierarchy in the next section.
The Data Hierarchy
Data are organized in a hierarchy that begins with bits and proceeds
all the way to databases (see Figure 5.2). A bit (binary digit)
represents the smallest unit of data a computer can process. The term
binary means that a bit can consist only of a 0 or a 1. A group of eight
bits, called a byte, represents a single character. A byte can be a letter,
a number, or a symbol. A logical grouping of characters into a word, a
small group of words, or an identification number is called a field. For
example, a student’s name in a university’s computer files would
appear in the “name” field, and her or his Social Security number
would appear in the “Social Security number” field. Fields can also
contain data other than text and numbers, such as an image or any
other type of multimedia. Examples are a driver’s photograph stored in a
motor vehicle department’s licensing database and a voice sample stored
in a field to authorize access to a secure facility.

FIGURE 5.2 Hierarchy of data for a computer-based file.


A logical grouping of related fields, such as the student’s name, the
courses taken, the date, and the grade, comprises a record. In the
Apple iTunes Store, a song is a field in a record, with other fields
containing the song’s title, its price, and the album on which it
appears. A logical grouping of related records is called a data file or a
table. For example, a grouping of the records from a particular
course, consisting of course number, professor, and students’ grades,
would constitute a data file for that course. Continuing up the
hierarchy, a logical grouping of related files constitutes a database.
Using the same example, the student course file could be grouped with
files on students’ personal histories and financial backgrounds to
create a student database. In the next section, you will learn about
relational database models.
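Before moving on, the hierarchy can be made concrete with a brief, hypothetical Python sketch; the attribute values shown are invented for illustration.

```python
# A hypothetical walk up the data hierarchy; the values are invented.
print(format(ord("A"), "08b"))  # 01000001 -- one byte (eight bits) for one character

# Fields: logical groupings of characters.
name_field = "Sally Adams"
major_field = "Accounting"

# Record: a logical grouping of related fields.
student_record = {"name": name_field, "major": major_field, "gpa": 3.7}

# Data file (table): a logical grouping of related records.
student_file = [student_record,
                {"name": "John Jones", "major": "Marketing", "gpa": 2.9}]

# Database: a logical grouping of related files.
university_database = {"students": student_file, "courses": []}
print(len(university_database["students"]), "student records in the file")
```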

The Relational Database Model


A database management system (DBMS) is a set of programs
that provide users with tools to create and manage a database.
Managing a database refers to the processes of adding, deleting,
accessing, modifying, and analyzing data that are stored in a database.
An organization can access these data by using query and reporting
tools that are part of the DBMS or by utilizing application programs
specifically written to perform this function. DBMSs also provide the
mechanisms for maintaining the integrity of stored data, managing
security and user access, and recovering information if the system
fails. Because databases and DBMSs are essential to all areas of
business, they must be carefully managed.
There are a number of different database architectures, but we focus
on the relational database model because it is popular and easy to use.
Other database models—for example, the hierarchical and network
models—are the responsibility of the MIS function and are not used by
organizational employees. Popular examples of relational databases
are Microsoft Access and Oracle.
Most business data—especially accounting and financial data—
traditionally were organized into simple tables consisting of columns
and rows. Tables enable people to compare information quickly by row
or column. Users can also retrieve items rather easily by locating the
point of intersection of a particular row and column.
The relational database model is based on the concept of
two-dimensional tables. A relational database generally is not one big table
—usually called a flat file—that contains all of the records and
attributes. Such a design would entail far too much data redundancy.
Instead, a relational database is usually designed with a number of
related tables. Each of these tables contains records (listed in rows)
and attributes (listed in columns).
To be valuable, a relational database must be organized so that users
can retrieve, analyze, and understand the data they need. A key to
designing an effective database is the data model. A data model is a
diagram that represents entities in the database and their
relationships. An entity is a person, a place, a thing, or an event—such
as a customer, an employee, or a product—about which an
organization maintains information. Entities can typically be
identified in the user’s work environment. A record generally describes
an entity. An instance of an entity refers to each row in a relational
table, which is a specific, unique representation of the entity. For
example, your university’s student database contains an entity called
“student.” An instance of the student entity would be a particular
student. Thus, you are an instance of the student entity in your
university’s student database.
Each characteristic or quality of a particular entity is called an
attribute. For example, if our entities were a customer, an employee,
and a product, entity attributes would include customer name,
employee number, and product color.
Consider the relational database example about students diagrammed
in Figure 5.3. The table contains data about the entity called
students. As you can see, each row of the table corresponds to a single
student record. (You have your own row in your university’s student
database.) Attributes of the entity are student name, undergraduate
major, grade point average, and graduation date. The rows are the
records on Sally Adams, John Jones, Jane Lee, Kevin Durham, Juan
Rodriguez, Stella Zubnicki, and Ben Jones. Of course, your university
keeps much more data on you than our example shows. In fact, your
university’s student database probably keeps hundreds of attributes on
each student.
FIGURE 5.3 Student database example.
Every record in the database must contain at least one field that
uniquely identifies that record so that it can be retrieved, updated, and
sorted. This identifier field (or attribute) is called the primary key.
For example, a student record in a U.S. university would use a unique
student number as its primary key. (Note: In the past, your Social
Security number served as the primary key for your student record.
However, for security reasons, this practice has been discontinued.) In
Figure 5.3, Sally Adams is uniquely identified by her student ID of 111-
12-4321.
In some cases, locating a particular record requires the use of
secondary keys. A secondary key is another field that has some
identifying information but typically does not identify the record with
complete accuracy. For example, the student’s major might be a
secondary key if a user wanted to identify all of the students majoring
in a particular field of study. It should not be the primary key,
however, because many students can have the same major. Therefore,
it cannot uniquely identify an individual student.
A foreign key is a field (or group of fields) in one table that uniquely
identifies a row of another table. A foreign key is used to establish and
enforce a link between two tables. We discuss foreign keys in Section
5.6.
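The following sketch, written with Python’s built-in sqlite3 module, illustrates these three kinds of keys using the student example. The table layout, the registration table, and the attribute values are illustrative assumptions rather than the exact design shown in Figure 5.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")   # ask SQLite to enforce foreign keys

# Primary key: a field that uniquely identifies each student record.
cur.execute("""
    CREATE TABLE student (
        student_id TEXT PRIMARY KEY,
        name       TEXT,
        major      TEXT,
        gpa        REAL
    )""")

# A secondary key (here, major) helps locate records but is not unique,
# so it is implemented as an ordinary index rather than a primary key.
cur.execute("CREATE INDEX idx_student_major ON student (major)")

# Foreign key: a field in one table that identifies a row of another table,
# establishing and enforcing a link between the two tables.
cur.execute("""
    CREATE TABLE registration (
        course_number TEXT,
        student_id    TEXT REFERENCES student (student_id),
        grade         TEXT
    )""")

cur.execute("INSERT INTO student VALUES ('111-12-4321', 'Sally Adams', 'Accounting', 3.7)")
cur.execute("INSERT INTO registration VALUES ('MIS 101', '111-12-4321', 'A')")

# Retrieve a record by its primary key, then by a secondary key.
print(cur.execute("SELECT name FROM student WHERE student_id = '111-12-4321'").fetchone())
print(cur.execute("SELECT name FROM student WHERE major = 'Accounting'").fetchall())
```

The first query returns exactly one student because the primary key is unique; the second can return many students, which is why a secondary key such as major cannot by itself identify an individual record.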
Organizations implement databases to efficiently and effectively
manage their data. There are a variety of operations that can be
performed on databases. We look at three of these operations in detail
in Section 5.6: query languages, normalization, and joins.
As we noted earlier in this chapter, organizations must manage huge
quantities of data. Such data consist of structured and unstructured
data and are called Big Data (discussed in Section 5.3). Structured
data is highly organized in fixed fields in a data repository such as a
relational database. Structured data must be defined in terms of field
name and type (e.g., alphanumeric, numeric, and currency).
Unstructured data is data that does not reside in a traditional
relational database. Examples of unstructured data are e-mail
messages, word processing documents, videos, images, audio files,
PowerPoint presentations, Facebook posts, tweets, snaps, ratings and
recommendations, and Web pages. Industry analysts estimate that 80
to 90 percent of the data in an organization is unstructured. To
manage Big Data, many organizations are using special types of
databases, which we also discuss in Section 5.3.
Because databases typically process data in real time (or near real
time), it is not practical to allow users access to the databases. After
all, the data will change while the user is looking at them! As a result,
data warehouses have been developed to allow users to access data for
decision making. You will learn about data warehouses in Section 5.4.

Before you go on …
1. What is a data model?
2. What is a primary key? A secondary key?
3. What is an entity? An attribute? An instance?
4. What are the advantages and disadvantages of relational
databases?

5.3 Big Data


We are accumulating data and information at an increasingly rapid
pace from many diverse sources. In fact, organizations are capturing
data about almost all events, including events that, in the past, firms
never used to think of as data at all—for example, a person’s location,
the vibrations and temperature of an engine, and the stress at
numerous points on a bridge—and then analyzing those data.


Organizations and individuals must process a vast amount of data that
continues to increase dramatically. According to IDC (a technology
research firm; www.idc.com), the world generates over one
zettabyte (10²¹ bytes) of data each year. Furthermore, the amount of
data produced worldwide is increasing by 50 percent each year.
As recently as the year 2000, only 25 percent of the stored information
in the world was digital. The other 75 percent was analog; that is, it
was stored on paper, film, vinyl records, and the like. By 2019, the
amount of stored information in the world was more than 98 percent
digital and less than 2 percent nondigital.
As we discussed at the beginning of this chapter, we refer to the
superabundance of data available today as Big Data. Big Data is a
collection of data that is so large and complex that it is difficult to
manage using traditional database management systems. (We
capitalize Big Data to distinguish the term from large amounts of
traditional data.)
Essentially, Big Data is about predictions (see Predictive Analytics in
Chapter 12). Predictions do not come from “teaching” computers to
“think” like humans. Instead, predictions come from applying
mathematics to huge quantities of data to infer probabilities. Consider
the following examples:
The likelihood that an e-mail message is spam
The likelihood that the typed letters teh are supposed to be the
The likelihood that the direction and speed of a person jaywalking
indicate that he will make it across the street in time to avoid
getting hit by a vehicle, meaning that a self-driving car needs to slow
down only slightly
Big Data systems perform well because they contain huge amounts of
data on which to base their predictions. Moreover, these systems are
configured to improve themselves over time by searching for the most
valuable signals and patterns as more data are input.
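To see what “applying mathematics to data to infer probabilities” looks like in miniature, consider the spam example from the list above. The counts below are invented; a real spam filter would estimate them from millions of labeled messages and combine many such signals.

```python
# Invented counts; a real system would estimate them from millions of messages.
spam_with_word = 930     # messages labeled spam that contain the word "prize"
total_with_word = 1000   # all messages that contain the word "prize"

p_spam_given_word = spam_with_word / total_with_word
print(f"P(spam | message contains 'prize') = {p_spam_given_word:.2f}")  # 0.93

# With more data, the estimate improves, and more signals (other words,
# senders, links) can be combined, which is why Big Data systems get
# better at predicting as more data are input.
```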

Defining Big Data


It is difficult to define Big Data. Here, we present two descriptions of
the phenomenon. First, the technology research firm Gartner
(www.gartner.com) defines Big Data as diverse, high-volume, high-
velocity information assets that require new forms of processing to
enhance decision making, lead to insights, and optimize business
processes. Second, The Big Data Institute (TBDI;
www.thebigdatainstitute.wordpress.com/) defines Big Data as
vast datasets that do the following:
Exhibit variety
Include structured, unstructured, and semistructured data
Are generated at high velocity with an uncertain pattern
Do not fit neatly into traditional, structured, relational databases
Can be captured, processed, transformed, and analyzed in a
reasonable amount of time only by sophisticated information
systems
Big Data generally consists of the following:
Traditional enterprise data—examples are customer information
from customer relationship management systems, transactional
enterprise resource planning data, Web store transactions,
operations data, and general ledger data.
Machine-generated/sensor data—examples are smart meters;
manufacturing sensors; sensors integrated into smartphones,
automobiles, airplane engines, and industrial machines;
equipment logs; and trading systems data.
Social data—examples are customer feedback comments;
microblogging sites such as Twitter; and social media sites such as
Facebook, YouTube, and LinkedIn.
Images captured by billions of devices located throughout the
world, from digital cameras and camera phones to medical
scanners and security cameras.
Let’s take a look at a few specific examples of Big Data:
Facebook’s 2.4 billion users upload more than 350 million new
photos every day. They also click a “Like” button or leave a
comment more than 5 billion times every day. Facebook’s data
warehouse stores more than 300 petabytes of data and Facebook
receives 600 terabytes of incoming data per day.
The 1.3 billion users of Google’s YouTube service upload more
than 300 hours of video per minute. Google itself processes on
average more than 63,000 search queries per second.
In July 2019, industry analysts estimated that Twitter users sent
some 550 million tweets per day.
The Met Office is the United Kingdom’s national weather and
climate service. The Office provides tailored weather and
environmental forecasts and briefings to the government and the
public. The Office also advises various organizations: for example, it
tells wind farms where to place their turbines and tells airports exactly
how much deicer to spray on a plane so that none is wasted.
Characteristics of Big Data
Big Data has three distinct characteristics: volume, velocity, and
variety. These characteristics distinguish Big Data from traditional
data.
1. Volume: We have noted the huge volume of Big Data. Consider
machine-generated data, which are generated in much larger
quantities than nontraditional data. For example, sensors in a
single jet engine can generate 10 terabytes of data in 30 minutes.
(See our discussion of the Internet of Things in Chapter 8.) With
more than 25,000 airline flights per day, the daily volume of data
from just this single source is incredible. Smart electrical meters,
sensors in heavy industrial equipment, and telemetry from
automobiles compound the volume problem.
2. Velocity: The rate at which data flow into an organization is
rapidly increasing. Velocity is critical because it increases the
speed of the feedback loop between a company, its customers, its
suppliers, and its business partners. For example, the Internet
and mobile technology enable online retailers to compile histories
not only on final sales but also on their customers’ every click and
interaction. Companies that can quickly use that information—for
example, by recommending additional purchases—gain a
competitive advantage.
3. Variety: Traditional data formats tend to be structured and
relatively well described, and they change slowly. Traditional data
include financial market data, point-of-sale transactions, and
much more. In contrast, Big Data formats change rapidly. They
include satellite imagery, broadcast audio streams, digital music
files, Web page content, scans of government documents, and
comments posted on social networks.
Irrespective of their source, structure, format, and frequency, Big Data
are valuable. If certain types of data appear to have no value today, it is
because we have not yet been able to analyze them effectively. For
example, several years ago when Google began harnessing satellite
imagery, capturing street views, and then sharing this geographical
data for free, few people understood its value. Today, we recognize
that such data is incredibly valuable because analyses of Big Data yield
deep insights. We discuss analytics in detail in Chapter 12. IT’s About
Business 5.1 provides an example of Big Data with a discussion of data
from connected cars.

IT’s About Business 5.1


Data from Connected Vehicles Is Valuable

A connected vehicle is equipped with Internet access and usually
with a wireless local area network (LAN). These features enable the
vehicle to share Internet access with other devices inside as well as
outside the vehicle.
Connected vehicle technology is becoming standard in many new
cars. Semiautonomous driving is a reality, with advanced driver-
assistance systems enhancing safety. Also, fully autonomous
vehicles, which can operate without any human intervention, are
already in service on roadways.
Research firm Gartner (www.gartner.com) forecasts that there
will be 250 million connected cars on global roads by 2020. The
combination of new car features as well as technology added to
existing cars could increase that number to 2 billion by 2025. IHS
Automotive (www.ihs.com) estimates that the average connected
car will produce up to 30 terabytes of data each day.
Consider that when you start a late-model car today, it updates
more than 100,000 data variables, from the pressure in your tires
to pollution levels in your engine’s exhaust. The onboard
navigation system tracks every mile you drive and remembers your
favorite route to work. Your car can help you avoid traffic jams and
find a place to park. Some of today’s cars even collect data on the
weight of each occupant in the car.
There are many opportunities to use real-time, streaming data
from connected cars. McKinsey (www.mckinsey.com) estimates
that in-car data services could generate more than $1 trillion of
annual revenue by 2030. Let’s look at five areas where connected
car data will be particularly valuable.

The driving experience.


Programmable seats, preset radio stations, driver information
centers, heads-up displays, adaptive cruise control, lane departure
notification, automatic parallel parking, and collision avoidance
are all features in many of today’s newer cars. But, what if the car
learned the driver’s preferences? For example, a driver might want
to listen to the news in the morning on the way to work to be
informed and then to classical music on the way home to relax.
Perhaps on most Fridays, the driver takes a detour from her usual
route home to meet a friend at a coffee shop. So now, when she
gets behind the wheel to go to work on Friday, the radio is set to
public radio. After work, the car’s radio is set to classical music.
Along the way, the navigation system warns her of an accident
along her normal route, and it recommends an alternate route to
get to the coffee shop.

Driver well-being.
Connected cars have functions that involve the driver’s ability and
fitness to drive. These apps include fatigue detection, automatic
detection of inebriated drivers, and the ability of the car to
summon medical assistance if the driver is unable to do so.
Another interesting app involves the use of in-car cameras to
detect distracted driving, such as texting while driving.

Location-based services.
Data from each car can be used to provide location-based offers.
For example, the driver above could receive an offer for a discount
on her favorite cup of coffee as she arrived at the coffee shop.

Quality and reliability.


Today, social media have shifted power to consumers, who have
highlighted the importance of vehicle quality and reliability.
These two issues are key elements in building a strong automotive
brand reputation and customer loyalty. However, no matter how
excellent the manufacturing process, automobiles are complex
machines, and issues often appear in the field. Automobile
manufacturers use data from connected cars to find and resolve
these issues quickly. In this way, these companies will reduce
warranty costs and enhance their brand and customer loyalty.
Using data from sensors, the cars themselves can predict
needed service and maintenance and notify the driver. The sensor
data can also enable dealerships to perform remote diagnostics to
assess when a service or repair is needed and to identify the needed
parts. When the customer arrives for an appointment, the
technician with the appropriate skills is available with the correct
parts to perform the service and to minimize any customer
inconvenience. This process will also optimize inventory levels in
dealerships and minimize inventory-carrying costs.
This sensor data provides valuable insights regarding the
performance and health of the vehicle: for example, how, when,
and where the vehicle is driven; the driver’s driving style and
preferences; and many other variables. Analyzing these data can
help to provide a better, safer driving experience while enhancing
vehicle quality and reliability.

Infotainment (Information + Entertainment).


The average American spends about 8 percent of each day (roughly
two hours) in his or her car. Infotainment has existed in
cars since the 1930s. Consider the following audio features in cars:
Radios began to appear in cars in the 1930s.
FM radio appeared in the 1950s.
Eight-track tape players appeared in the mid-1960s.
The cassette deck debuted in the 1970s.
The first factory CD player appeared in 1985.
Satellite radio appeared in the 1990s.
Today, drivers can connect almost any digital device that plays
audio and video to their car’s infotainment system. For example,
how much do parents appreciate built-in movie players for their
children in the backseat? Mobile Wi-Fi connectivity has brought
streaming services for news and entertainment into the car as well.
There are many ways to capitalize on the integration of
infotainment data and vehicle data. For example, companies use
data on location, length of average trip, miles driven per week,
number of passengers, date, and/or day of week to provide various
offerings.
Recommended content: Short-form content may be suggested
when you are taking the children to baseball practice and long-
form content suggested when you drive to the beach. Short-
form content is short in length, such as video clips, listicles
(short articles composed entirely of a list of items), and blog
posts fewer than 1,000 words. This type of content can appeal
to users’ limited attention spans. Long-form content consists
of longer articles (typically between 1,200 and 20,000 words)
that contain greater amounts of content than short-form
content. These articles often take the form of creative
nonfiction or narrative journalism.
Mobile payments: Pay-per-view movies and sporting events
(for backseat passengers).
Intelligent messaging: You can access breaking news, traffic,
and weather reports based on actual travel patterns.
Live content transfer: When you have arrived at your
destination, you can continue watching on your tablet,
smartphone, or television.
Many companies are competing in the connected services
marketplace. For example, Weve (www.weve.com/about) is a
leading provider of mobile marketing and commerce in the United
Kingdom. The firm combines and analyzes real-time data streams
from 17 million mobile users for intelligent messaging, targeted
marketing offers, and mobile payments. These data streams
include location data, purchase history, daily routines, and social
data that users choose to make public.
All of the automakers are worried about Google and Apple. These
two technology companies are developing cars and self-driving
technology, and both are leaders in connected services. For
example, Google’s Waze is a leading in-car app. Waze is an
advertising-supported and crowdsourced program that offers
congestion-avoiding directions. Waze examines personal data in
smartphones with users’ permission and sends pop-up ads to the
screens of its 50 million global users.
Ford (www.ford.com), BMW (www.bmw.com), General
Motors (GM; www.gm.com), and other automakers have
deployed systems that can host, and limit, in-car apps produced by
competitors. These systems allow drivers to plug in Apple’s and
Google’s competing vehicle screen operating systems, CarPlay and
Android Auto, respectively, without giving the two technology
companies access to drivers’ personal information or vehicle
diagnostics. For example, Ford’s offering in this area, which Toyota
also uses, is called AppLink. The app enables the car to access 90
phone apps without using CarPlay or Android Auto as an
intermediary.
Other companies are making efforts to develop in-car services. As
one example, in 2016, BMW, Daimler, and Volkswagen teamed up
to purchase Nokia’s digital mapping business, called Here
(www.here.com/en), for $3.1 billion. This acquisition gave these
companies a platform for location-based services and, eventually,
for mapping capabilities for self-driving cars. GM’s OnStar
(www.onstar.com) mobile information subscription service
offers dashboard-delivered coupons for Exxon and Mobil gas
stations as well as the ability to book hotel rooms. Mercedes-Benz’s
concierge service Mbrace
(www.mbusa.com/mercedes/mbrace) can route a driver
around traffic or bad weather. Both OnStar and Mbrace cost about
$20 per month. Alibaba Group (www.alibaba.com), whose
YunOS (operating system) connects phones, tablets, and
smartwatches, is also working on deals with Chinese automakers to
operate with vehicles.
Significantly, interest in car data is not limited to technology and
automotive companies. The insurance industry also values access
to data about driving habits. The insurance companies could then
use these data to charge higher rates for drivers who exceed speed
limits and lower rates for those who drive safely. Along these lines,
GM and Ford offer drivers an app that calculates a driver score that
could reduce their insurance rates if they exhibit safe habits.
Finally, IBM (www.ibm.com) is planning to become a single
point of contact for all parts of the automotive industry. The
company has signed a contract with BMW that will connect BMW’s
CarData platform to IBM’s Bluemix platform-as-a-service cloud
computing application (see Technology Guide 3). The idea is that
IBM will host and analyze data from connected cars and then send
the data to third parties—with drivers’ consent—when required.
Early applications can link your car to your local insurance agent,
dealership, and automotive repair shop.
Sources: Compiled from A. Ross, “The Connected Car ‘Data Explosion’:
The Challenges and Opportunities,” Information Age, July 10, 2018; “The
Connected Car, Big Data, and the Automotive Industry’s Future,” Datameer
Blog, February 26, 2018; M. Spillar, “How Big Data Is Paving the Way for
the Connected Car,” Hortonworks Blog, January 23, 2018; L. Stolle, “Is
Data from Connected Vehicles Valuable and, if So, to Whom?” SAP Blogs,
June 21, 2017; C. Hall, “BMW’s Connected-Car Data Platform to Run in
IBM’s Cloud,” Data Center Knowledge, June 16, 2017; D. Cooper, “IBM
Will Put Connected Car Data to Better Use,” Engadget, June 14, 2017; N.
Ismail, “The Present and Future of Connected Car Data,” Information Age,
May 17, 2017; R. Ferris, “An ‘Ocean of Auto Big Data’ Is Coming, Says
Barclays,” CNBC, April 26, 2017; D. Newcomb, “Connected Car Data Is the
New Oil,” Entrepreneur, April 17, 2017; M. McFarland, “Your Car’s Data
May Soon Be More Valuable than the Car Itself,” CNN, February 7, 2017; L.
Slowey, “Big Data and the Challenges in the Car Industry,” IBM Internet of
Things Blog, January 12, 2017; P. Nelson, “Just One Autonomous Car Will
Use 4,000 GB of Data/Day,” Network World, December 7, 2016; S. Tiao,
“The Connected Car, Big Data, and the Automotive Industry’s Future,”
Datameer, October 27, 2016; M. DeBord, “Big Data in Cars Could Be a
$750 Billion Business by 2030,” Business Insider, October 3, 2016; D.
Welch, “The Battle for Smart Car Data,” Bloomberg BusinessWeek, July 18-
24, 2016; D. Booth, “It’s Time for Automatic Alcohol Sensors in Every Car,”
Driving, May 6, 2016; “The Connected Vehicle: Big Data, Big
Opportunities,” SAS White Paper, 2016; G. Krueger, “Connected Vehicle
Data: The Cost Challenge,” Federal Department of Transportation, 2015;
and www.ihs.com, www.bmw.com, accessed June 27, 2019.

Questions
1. Describe several other uses (other than the ones discussed in
this case) for data from connected cars.
2. Would data from connected cars be considered Big Data? Why
or why not? Support your answer.

Issues with Big Data


Despite its extreme value, Big Data does have issues. In this section,
we take a look at data integrity, data quality, and the nuances of
analysis that are worth noting.

Big Data Can Come from Untrusted Sources.


As we discussed earlier, one of the characteristics of Big Data is
variety, meaning that Big Data can come from numerous, widely
varied sources. These sources may be internal or external to an
organization. For example, a company might want to integrate data
from unstructured sources such as e-mails, call center notes, and
social media posts with structured data about its customers from its
data warehouse. The question is: How trustworthy are those external
sources of data? For example, how trustworthy is a tweet? The data
may come from an unverified source. Furthermore, the data itself,
reported by the source, may be false or misleading.

Big Data Is Dirty.


Dirty data refers to inaccurate, incomplete, incorrect, duplicate, or
erroneous data. Examples of such problems are misspelling of words
and duplicate data such as retweets or company press releases that
appear multiple times in social media.
Suppose a company is interested in performing a competitive analysis
using social media data. The company wants to see how often a
competitor’s product appears in social media outlets as well as the
sentiments associated with those posts. The company notices that the
number of positive posts about the competitor is twice as great as the
number of positive posts about itself. This finding could simply be a
case of the competitor pushing out its press releases to multiple
sources—in essence, blowing its own horn. Alternatively, the
competitor could be getting many people to retweet an announcement.

Big Data Changes, Especially in Data Streams.


Organizations must be aware that data quality in an analysis can
change, or the actual data can change, because the conditions under
which the data are captured can change. For example, imagine a utility
company that analyzes weather data and smart-meter data to predict
customer power usage. What happens when the utility analyzes these
data in real time and it discovers that data are missing from some of
its smart meters?

Managing Big Data


Big Data makes it possible to do many things that were previously
much more difficult: for example, to spot business trends more rapidly
and accurately, to prevent disease, and to track crime. When Big Data
is properly analyzed, it can reveal valuable patterns and information
that were previously hidden because of the amount of work required to
discover them. Leading corporations, such as Walmart and Google,
have been able to process Big Data for years, but only at great expense.
Today’s hardware, cloud computing (see Technology Guide 3), and
open-source software make processing Big Data affordable for most
organizations.
For many organizations, the first step toward managing data was to
integrate information silos into a database environment and then to
develop data warehouses for decision making. (An information silo is
an information system that does not communicate with other, related
information systems in an organization.) After they completed the first
step, many organizations turned their attention to the business of
information management—making sense of their rapidly expanding
data. In recent years, Oracle, IBM, Microsoft, and SAP have spent
billions of dollars purchasing software firms that specialize in data
management and business analytics. (You will learn about business
analytics in Chapter 12.)
In addition to using existing data management systems, today many
organizations also employ NoSQL databases to process Big Data.
Think of them as “not only SQL” (structured query language)
databases. (We discuss SQL in Section 5.6.)
As you have seen in this chapter, traditional relational databases such
as Oracle and MySQL store data in tables organized into rows and
columns. Recall that each row is associated with a unique record, and
each column is associated with a field that defines an attribute of that
record.
In contrast, NoSQL databases can manipulate structured and
unstructured data as well as inconsistent or missing data. For this
reason, NoSQL databases are particularly useful when working with
Big Data. Many products use NoSQL databases, including Cassandra
(www.cassandra.apache.org), CouchDB
(www.couchdb.apache.org), and MongoDB
(www.mongodb.org). Below we consider three examples of NoSQL
databases in action: Temetra, Hadoop, and MapReduce.
Temetra, founded in 2002, stores and manages data from 12.5 million
utility meters across the United Kingdom. The unpredictable nature of
the type and volume of the data the firm receives convinced it to
deploy the Riak NoSQL database from Basho. Temetra collects data
from sources ranging from people keying in meter readings to large-
scale industry users who send data automatically every 15 minutes.
Temetra initially supported its customers with an SQL database.
Eventually, however, it became unable to effectively manage that
volume of data. The company deployed the NoSQL database to handle
the volume of data and to acquire the capability to add new data types.
In essence, NoSQL enabled Temetra to store data without a rigid
format.
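A minimal sketch of what that flexibility looks like follows. It uses plain Python dictionaries as a stand-in for a document-style NoSQL collection; the meter fields and values are hypothetical rather than Temetra's actual data model.

```python
# Plain Python dictionaries standing in for a document-style NoSQL collection.
# Documents need not share the same fields; names and values are hypothetical.
meter_readings = []

# A reading keyed in manually by a residential customer.
meter_readings.append({
    "meter_id": "R-1001",
    "reading": 4521,
    "entered_by": "customer_portal",
})

# An automated industrial reading that arrives with extra fields.
meter_readings.append({
    "meter_id": "I-2040",
    "reading": 98231.7,
    "interval_minutes": 15,
    "voltage": 229.8,
    "firmware": "3.2.1",
})

# Queries simply ignore fields that a given document does not have.
automated = [doc for doc in meter_readings if "interval_minutes" in doc]
print(len(automated), "automated readings")
```

A relational table, in contrast, would force both records into one fixed set of columns, and every new field would require a schema change.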
Hadoop (www.hadoop.apache.org) is not a type of database.
Rather, it is a collection of programs that allow people to store,
retrieve, and analyze very large datasets using massively parallel
processing. Massively parallel processing is the coordinated
processing of an application by multiple processors that work on
different parts of the application, with each processor utilizing its own
operating system and memory. As such, Hadoop enables users to
access NoSQL databases, which can be spread across thousands of
servers, without a reduction in performance. For example, a large
database application that could take 20 hours of processing time on a
centralized relational database system might take only a few minutes
when using Hadoop’s parallel processing.
MapReduce refers to the software procedure of dividing an analysis
into pieces that can be distributed across different servers in multiple
locations. MapReduce first distributes the analysis (map) and then
collects and integrates the results back into a single report (reduce).
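The pattern can be illustrated with a toy, single-machine sketch in Python; in Hadoop the mapped chunks would actually be processed on different servers, and the sample documents below are invented.

```python
from collections import defaultdict

# Toy, single-machine illustration of the MapReduce pattern. In Hadoop, each
# chunk would be mapped on a different server; the documents are invented.
documents = [
    "bolts washers bolts",
    "nuts bolts screws",
    "washers nuts nuts",
]

def map_phase(chunk):
    """Map: emit a (key, 1) pair for every word in one chunk of data."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(mapped_pairs):
    """Reduce: collect the pairs and sum the counts for each key."""
    totals = defaultdict(int)
    for key, count in mapped_pairs:
        totals[key] += count
    return dict(totals)

# Distribute the map step (simulated here with a loop), then integrate the
# partial results into a single report.
partial_results = [pair for chunk in documents for pair in map_phase(chunk)]
print(reduce_phase(partial_results))  # e.g., {'bolts': 3, 'washers': 2, ...}
```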
Google has been developing a new type of database, called Cloud
Spanner, for some time. In February 2017, the company released
Cloud Spanner to the public as a service. IT’s About Business 5.2
addresses this database.

IT’s About Business 5.2


Cloud Spanner, Google’s Global Database

Systems that encompass hundreds of thousands of computers and
multiple data centers must precisely synchronize time around the
world. These systems involve communication among computers in
many locations, and time varies from computer to computer
because precise time is difficult to keep.
Services such as the Network Time Protocol aimed to provide
computers with a common time reference point. However, this
protocol worked only so well, primarily because computer network
transmission speeds are relatively slow compared to the processing
speeds of the computers themselves. In fact, before Google
developed Spanner, companies found it extremely difficult, if not
impossible, to keep databases consistent without constant, intense
communication. Conducting this type of communication around
the world took too much time.
For Google, the problem of time was critical. The firm’s databases
operate in data centers that are dispersed throughout the world.
Therefore, Google could not ensure that transactions in one part of
the world matched transactions in another part. That is, Google
could not obtain a truly global picture of its operations. The firm
could not seamlessly replicate data across regions or quickly
retrieve replicated data when they were needed. Google’s engineers
had to find a way to produce reliable time across the world.
To resolve this problem, Google developed the Cloud Spanner,
which is the firm’s globally distributed NewSQL database. Spanner
stores data across millions of computers located in data centers on
multiple continents. Even though Spanner stretches around the
world, the database functions as if it is in one place.
NewSQL is a class of modern relational database management
system that attempts to provide the same scalable performance of
NoSQL systems for online transaction processing (OLTP) functions
while still maintaining the ACID guarantees of a traditional
database system. The ACID guarantees include the following:
Atomicity: Each transaction is all or nothing. If one part of the
transaction fails, then the entire transaction fails, and the
database is left unchanged.
Consistency: Any transaction will bring the database from one
valid state to another.
Isolation: The concurrent execution of transactions results in a
system state that would be obtained if transactions were
executed sequentially (one after another).
Durability: Once a transaction has been committed, it will
remain so, even in the event of power loss, system crashes, or
errors.
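To make the atomicity guarantee concrete, the following minimal sketch uses Python's built-in sqlite3 module (a single-node database, not Spanner); the account names and amounts are hypothetical. Both updates inside the transaction succeed or fail together.

```python
import sqlite3

# Single-node illustration of atomicity: debit one account and credit another
# inside one transaction. Table, accounts, and amounts are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("advertiser_a", 500.0), ("advertiser_b", 500.0)])
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, roll back on error
        conn.execute("UPDATE accounts SET balance = balance - 100 "
                     "WHERE name = 'advertiser_a'")
        # If anything fails before the transaction ends, the debit above
        # is undone as well, leaving the database unchanged.
        conn.execute("UPDATE accounts SET balance = balance + 100 "
                     "WHERE name = 'advertiser_b'")
except sqlite3.Error:
    print("Transaction failed; database left unchanged")

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```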
Cloud Spanner offers companies the best of both traditional
relational databases and NoSQL databases—that is, transactional
consistency with easy scalability. However, Spanner is not a simple
scale-up relational database service: Cloud SQL is that Google
product. Spanner is not a data warehouse: BigQuery is that Google
product. Finally, Spanner is not a NoSQL database: BigTable is
that Google product.
Google can change company data in one part of Spanner—for
example, running an ad or debiting an advertiser’s account—
without contradicting changes made on the other side of the
planet. In addition, Spanner can readily and reliably replicate data
across multiple data centers in multiple parts of the world, and it
can seamlessly retrieve these copies if any individual data center
goes down. In essence, Spanner provides Google with consistency
across continents.
So, how did Google solve the time problem by creating Spanner?
Google engineers equipped Google’s data centers with global
positioning system (GPS; see Chapter 8) receivers and atomic
clocks. The GPS receivers obtain the time from various satellites
orbiting the earth, while the atomic clocks keep their own, highly
accurate, time. The GPS devices and atomic clocks send their time
readings to master servers in each data center. These servers trade
readings in an effort to settle on a common time. A margin of error
still exists. However, because there are so many time readings, the
master servers become a far more reliable timekeeping service.
Google calls this timekeeping technology TrueTime.
With help from TrueTime, Spanner provides Google with a
competitive advantage in its diverse markets. TrueTime is an
underlying technology for AdWords and Gmail as well as more
than 2,000 other Google services, including Google Photos and the
Google Play store. Google can now manage online transactions at
an unprecedented scale. Furthermore, thanks to Spanner’s extreme
form of data replication, Google now keeps its services operational
with unprecedented consistency.
Google hopes to convince customers that Spanner provides an
easier way of running a global business and of replicating their
data across multiple regions, thereby guarding against outages.
The problem is that few businesses are truly global on the same
scale as Google. However, Google is betting that Spanner will give
customers the freedom to expand as time goes on. One such
customer is JDA (www.jda.com), a company that helps
businesses oversee their supply chains. JDA is testing Spanner
because the volume and velocity of the firm’s data are increasing
exponentially.
Spanner could also be useful in the financial markets by enabling
large-scale banks to more efficiently track and synchronize trades
occurring around the world. Traditionally, many banks have been
cautious about managing trades in the cloud for security and
privacy reasons. However, some banks are now investigating
Spanner.
Google is offering Spanner technology to all customers as a cloud
computing service. Google believes that Spanner can create a
competitive advantage in its battle with Microsoft and Amazon for
supremacy in the cloud computing marketplace.
Sources: Compiled from A. Penkava, “Google Cloud Spanner: The Good,
the Bad, and the Ugly,” Lightspeed, March 21, 2018; A. Cobley, “Get Tooled
up before Grappling with Google’s Spanner Database,” The Register, March
12, 2018; T. Baer, “Making the Jump to Google Cloud Spanner,” ZDNet,
February 22, 2018; A. Brust, “Google’s Cloud Spanner: How Does It Stack
Up?” ZDNet, July 7, 2017; F. Lardinois, “Google’s Globally Distributed
Cloud Spanner Database Service Is Now Generally Available,” TechCrunch,
May 16, 2017; S. Yegulalp, “Google’s Cloud Spanner Melds Transactional
Consistency, NoSQL Scale,” InfoWorld, May 4, 2017; A. Cobley, “Google
Spanner in the NewSQL Works?” The Register, March 21, 2017; D. Borsos,
“Google Cloud Spanner: Our First Impressions,” OpenCredo, March 7,
2017; F. Lardinois, “Google Launches Cloud Spanner, Its New Globally
Distributed Relational Database Service,” TechCrunch, February 14, 2017;
C. Metz, “Spanner, the Google Database that Mastered Time, Is Now Open
to Everyone,” Wired, February 14, 2017; B. Darrow, “Google Spanner
Database Surfaces at Last,” Fortune, February 14, 2017; C. Metz, “Google
Unites Worldwide Data Centers with GPS and Atomic Clocks,” Wired UK,
September 20, 2012; and www.google.com, accessed June 27, 2019.

Questions
1. Discuss the advantages of Google Cloud Spanner to
organizations.
2. If your company does not operate globally, would Google
Cloud Spanner still be valuable? Why or why not? Support
your answer.

Putting Big Data to Use


Modern organizations must manage Big Data and gain value from it.
They can employ several strategies to achieve this objective.

Making Big Data Available.


Making Big Data available for relevant stakeholders can help
organizations gain value. For example, consider open data in the
public sector. Open data are accessible public data that individuals and
organizations can use to create new businesses and solve complex
problems. In particular, government agencies gather vast amounts of
data, some of which are Big Data. Making that data available can
provide economic benefits. In fact, an Open Data 500 study at the
GovLab at New York University discovered 500 examples of U.S.-
based companies whose business models depend on analyzing open
government data.

Enabling Organizations to Conduct Experiments.


Big Data allows organizations to improve performance by conducting
controlled experiments. For example, Amazon (and many other
companies such as Google and LinkedIn) constantly experiments by
offering slightly different looks on its website. These experiments are
called A/B experiments because each experiment compares two
versions, A and B. Here is an example of an A/B experiment at
Etsy.com, an online marketplace for vintage and handmade
products.
When Etsy analysts noticed that one of its Web pages attracted
customer attention but failed to maintain it, they looked more closely
at the page and discovered that it had few “calls to action.” (A call to
action is an item, such as a button, on a Web page that enables a
customer to do something.) On this particular Etsy page, customers
could leave, buy, search, or click on two additional product images.
The analysts decided to show more product images on the page.
Consequently, one group of visitors to the page saw a strip across the
top of the page that displayed additional product images. Another
group saw only the two original product images. On the page with
additional images, customers viewed more products and, significantly,
bought more products. The results of this experiment revealed
valuable information to Etsy.
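The arithmetic behind such an experiment is simple, as the sketch below shows; the visitor and purchase counts are invented, and a real analysis would also check whether the difference is statistically significant.

```python
# Hypothetical A/B results: group A saw the original page, group B saw the
# strip of additional product images. All counts are invented for illustration.
results = {
    "A_original":     {"visitors": 10_000, "purchases": 310},
    "B_extra_images": {"visitors": 10_000, "purchases": 365},
}

# Conversion rate = purchases / visitors for each group.
rates = {group: s["purchases"] / s["visitors"] for group, s in results.items()}
for group, rate in rates.items():
    print(f"{group}: conversion rate = {rate:.2%}")

# Relative lift of the B variant over the A variant.
lift = rates["B_extra_images"] / rates["A_original"] - 1
print(f"Relative lift of B over A: {lift:.1%}")
```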

Microsegmentation of Customers.
Segmentation of a company’s customers means dividing them into
groups that share one or more characteristics. Microsegmentation
simply means dividing customers up into very small groups, or even
down to an individual customer.
For example, Paytronix Systems (www.paytronix.com)
provides loyalty and rewards program software for thousands of
different restaurants. Paytronix gathers restaurant guest data from a
variety of sources beyond loyalty and gift programs, including social
media. Paytronix then analyzes this Big Data to help its restaurant
clients microsegment their guests. Restaurant managers are now able
to more precisely customize their loyalty and gift programs. Since they
have taken these steps, they have noted improved profitability and
customer satisfaction in their restaurants.

Creating New Business Models.


Companies are able to use Big Data to create new business models. For
example, a commercial transportation company operated a substantial
fleet of large, long-haul trucks. The company recently placed sensors
on all of its trucks. These sensors wirelessly communicated sizeable
amounts of information to the company, a process called telematics.
The sensors collected data on vehicle usage—including acceleration,
braking, cornering, and so on—in addition to driver performance and
vehicle maintenance.
By analyzing this Big Data, the company was able to improve the
condition of its trucks through near-real-time analysis that proactively
suggested preventive maintenance. The company was also able to
improve the driving skills of its operators by analyzing their driving
styles.
The transportation company then made its Big Data available to its
insurance carrier. Using this data, the insurance carrier was able to
perform a more precise risk analysis of driver behavior and the
condition of the trucks. The carrier then offered the transportation
company a new pricing model that lowered its premiums by 10 percent
due to safety improvements enabled by analysis of the Big Data.

Organizations Can Analyze More Data.


In some cases, organizations can even process all of the data relating
to a particular phenomenon, so they do not have to rely as much on
sampling. Random sampling works well, but it is not as effective as
analyzing an entire dataset. Random sampling also has some basic
weaknesses. To begin with, its accuracy depends on ensuring
randomness when collecting the sample data. However, achieving
such randomness is problematic. Systematic biases in the process of
data collection can cause the results to be highly inaccurate. For
example, consider political polling using landline phones. This sample
tends to exclude people who use only cell phones. This bias can
seriously skew the results because cell phone users are typically
younger and more liberal than people who rely primarily on landline
phones.
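A small, hypothetical simulation illustrates the danger: when cell-only (and, in this invented population, younger) people cannot be reached, the sample average drifts away from the true population average.

```python
import random

random.seed(42)

# Invented population: 60 percent are cell-only and tend to be younger.
population = []
for _ in range(100_000):
    if random.random() < 0.6:
        population.append({"age": random.gauss(35, 10), "cell_only": True})
    else:
        population.append({"age": random.gauss(55, 12), "cell_only": False})

true_mean = sum(p["age"] for p in population) / len(population)

# A landline-only poll can reach only people who are not cell-only.
landline_frame = [p for p in population if not p["cell_only"]]
sample = random.sample(landline_frame, 1_000)
sample_mean = sum(p["age"] for p in sample) / len(sample)

print(f"True mean age:            {true_mean:.1f}")
print(f"Landline-sample mean age: {sample_mean:.1f}")  # systematically too high
```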

Big Data Used in the Functional Areas of the Organization
In this section, we provide examples of how Big Data is valuable to
various functional areas in the firm.

Human Resources.
Employee benefits, particularly healthcare, represent a major business
expense. Consequently, some companies have turned to Big Data to
better manage these benefits. Caesars Entertainment
(www.caesars.com), for example, analyzes health-insurance claim
data for its 65,000 employees and their covered family members.
Managers can track thousands of variables that indicate how
employees use medical services, such as the number of their
emergency room visits and whether employees choose a generic or
brand name drug.
Consider the following scenario: Data revealed that too many
employees with medical emergencies were being treated at hospital
emergency rooms rather than at less expensive urgent-care facilities.
The company launched a campaign to remind employees of the high
cost of emergency room visits, and they provided a list of alternative
facilities. Subsequently, 10,000 emergencies shifted to less expensive
alternatives, for a total savings of $4.5 million.
Big Data is also having an impact on hiring. An example is Catalyte
(www.catalyte.io), a technology outsourcing company that hires
teams for programming jobs. Traditional recruiting is typically too
slow, and hiring managers often subjectively choose candidates who
are not the best fit for the job. Catalyte addresses this problem by
requiring candidates to fill out an online assessment. It then uses the
assessment to collect thousands of data points about each candidate.
In fact, the company collects more data based on how candidates
answer than on what they answer.
For example, the assessment might give a problem requiring calculus
to an applicant who is not expected to know the subject. How the
candidate responds—laboring over an answer, answering quickly, and
then returning later, or skipping the problem entirely—provides
insight into how that candidate might deal with challenges that he or
she will encounter on the job. That is, someone who labors over a
difficult question might be effective in an assignment that requires a
methodical approach to problem solving, whereas an applicant who
takes a more aggressive approach might perform better in a different
job setting.
The benefit of this Big Data approach is that it recognizes that people
bring different skills to the table and there is no one-size-fits-all
person for any job. Analyzing millions of data points can reveal which
attributes candidates bring to specific situations.
As one measure of success, employee turnover at Catalyte averages
about 15 percent per year, compared with more than 30 percent for its
U.S. competitors and more than 20 percent for similar companies
overseas.

Product Development.
Big Data can help capture customer preferences and put that
information to work in designing new products. For example, Ford
Motor Company (www.ford.com) was considering a “three blink”
turn indicator that had been available on its European cars for years.
Unlike the turn signals on its U.S. vehicles, this indicator flashes three
times at the driver’s touch and then automatically shuts off.
Ford decided that conducting a full-scale market research test on this
blinker would be too costly and time consuming. Instead, it examined
auto-enthusiast websites and owner forums to discover what drivers
were saying about turn indicators. Using text-mining algorithms,
researchers culled more than 10,000 mentions and then summarized
the most relevant comments.
The results? Ford introduced the three-blink indicator on the Ford
Fiesta it released in 2010, and by 2013 it was available on most Ford
products. Although some Ford owners complained online that they
had trouble getting used to the new turn indicator, many others
defended it. Ford managers note that the use of text-mining
algorithms was critical in this effort because they provided the
company with a complete picture that would not have been available
using traditional market research.

Operations.
For years, companies have been using information technology to make
their operations more efficient. Consider United Parcel Service (UPS).
The company has long relied on data to improve its operations.
Specifically, it uses sensors in its delivery vehicles that can, among
other things, capture each truck’s speed and location, the number of
times it is placed in reverse, and whether the driver’s seat belt is
buckled. This data is uploaded at the end of each day to a UPS data
center, where it is analyzed overnight. By combining GPS information
and data from sensors installed on more than 46,000 vehicles, UPS
reduced fuel consumption by 8.4 million gallons, and it cut 85 million
miles off its routes.

Marketing.
Marketing managers have long used data to better understand their
customers and to target their marketing efforts more directly. Today,
Big Data enables marketers to craft much more personalized
messages.
The United Kingdom’s InterContinental Hotels Group (IHG;
www.ihg.com) has gathered details about the members of its
Priority Club rewards program, such as income levels and whether
members prefer family-style or business-traveler accommodations.
The company then consolidated all this information with information
obtained from social media into a single data warehouse. Using its
data warehouse and analytics software, the hotelier launched a new
marketing campaign. Where previous marketing campaigns
generated, on average, between 7 and 15 customized marketing
messages, the new campaign generated more than 1,500. IHG rolled
out these messages in stages to an initial core of 12 customer groups,
each of which was defined by 4,000 attributes. One group, for
example, tended to stay on weekends, redeem reward points for gift
cards, and register through IHG marketing partners. Using this
information, IHG sent these customers a marketing message that
alerted them to local weekend events.
The campaign proved to be highly successful. It generated a 35 percent
higher rate of customer conversions, or acceptances, than previous,
similar campaigns.

Government Operations.
Consider the United Kingdom. According to the INRIX Traffic
Scorecard, although the United States has the worst traffic congestion
on average, London topped the world list for metropolitan areas. In
London, drivers wasted an average of 101 hours per year in gridlock.
Congestion is bad for business. The INRIX study estimated that the
cost to the U.K. economy would be £307 billion ($400 billion)
between 2013 and 2030.
Congestion is also harmful to urban resilience, negatively affecting
both environmental and social sustainability, in terms of emissions,
global warming, air quality, and public health. As for the livability of a
modern city, congestion is an important component of the urban
transport user experience (UX).
Calculating levels of UX satisfaction at any given time involves solving
a complex equation with a range of key variables and factors: total
number of transport assets (road and rail capacity, plus parking
spaces), users (vehicles, pedestrians), incidents (roadwork, accidents,
breakdowns), plus expectations (anticipated journey times and
passenger comfort).
The growing availability of Big Data sources within London—for
example, traffic cameras and sensors on cars and roadways—can help
to create a new era of smart transport. Analyzing this Big Data offers
new ways for traffic analysts in London to “sense the city” and enhance
transport via real-time estimation of traffic patterns and rapid
deployment of traffic management strategies.

Before you go on …
1. Define Big Data.
2. Describe the characteristics of Big Data.
3. Describe how companies can use Big Data to gain a
competitive advantage.

5.4 Data Warehouses and Data Marts


Today, the most successful companies are those that can respond
quickly and flexibly to market changes and opportunities. The key to
such a response is how analysts and managers effectively and
efficiently use data and information. The challenge is to provide users
with access to corporate data so they can analyze the data to make
better decisions. Let’s consider an example. If the manager of a local
bookstore wanted to know the profit margin on used books at her
store, then she could obtain that information from her database using
SQL or query-by-example (QBE). QBE is a method of creating
database queries in which the user fills in an example of the desired
result, such as sample values or a selected string of text, and the
system builds the query from that example. However, if she needed to
know the trend in the profit margins on used books over the past 10
years, then she would have to construct a very complicated SQL or
QBE query.
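As a rough sketch of the difference between the two requests, consider the following Python example using the built-in sqlite3 module. The table, columns, and figures are hypothetical, and a real transactional schema would spread these facts across many more tables, making the trend query slower and far more involved.

```python
import sqlite3

# Hypothetical transactional table for the bookstore; schema and rows invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, category TEXT, "
             "price REAL, cost REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [("2017-03-01", "used", 8.00, 5.00),
                  ("2018-07-15", "used", 6.50, 4.00),
                  ("2018-07-15", "new", 24.00, 18.00)])

# Question 1: the current profit margin on used books is one simple aggregate.
margin = conn.execute(
    "SELECT (SUM(price) - SUM(cost)) / SUM(price) "
    "FROM sales WHERE category = 'used'").fetchone()[0]
print(f"Used-book margin: {margin:.1%}")

# Question 2: a 10-year trend forces the transactional database to scan and
# group every detail row by year, the kind of workload a warehouse is built for.
trend = conn.execute(
    "SELECT strftime('%Y', sale_date) AS yr, "
    "       (SUM(price) - SUM(cost)) / SUM(price) "
    "FROM sales WHERE category = 'used' GROUP BY yr ORDER BY yr")
for year, yearly_margin in trend:
    print(year, f"{yearly_margin:.1%}")
```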


This example illustrates several reasons why organizations are building data warehouses and data marts. First, the bookstore's
databases contain the necessary information to answer the manager’s
query, but this information is not organized in a way that makes it easy
for her to find what she needs. Therefore, complicated queries might
take a long time to answer, and they also might degrade the
performance of the databases. Second, transactional databases are
designed to be updated. The update process requires extra processing.
Data warehouses and data marts are read-only. Therefore, the extra
processing is eliminated because data already contained in the data
warehouse are not updated. Third, transactional databases are
designed to access a single record at a time. In contrast, data
warehouses are designed to access large groups of related records.
As a result of these problems, companies are using a variety of tools
with data warehouses and data marts to make it easier and faster for
users to access, analyze, and query data. You will learn about these
tools in Chapter 12 on Business Analytics.
Describing Data Warehouses and Data Marts
In general, data warehouses and data marts support business analytics
applications. As you will see in Chapter 12, business analytics
encompasses a broad category of applications, technologies, and
processes for gathering, storing, accessing, and analyzing data to help
business users make better decisions. A data warehouse is a
repository of historical data that are organized by subject to support
decision makers within an organization.
Because data warehouses are so expensive, they are used primarily by
large companies. A data mart is a low-cost, scaled-down version of a
data warehouse that is designed for the end-user needs in a strategic
business unit (SBU) or an individual department. Data marts can be
implemented more quickly than data warehouses, often in fewer than
90 days. Furthermore, they support local rather than central control
by conferring power on the user group. Typically, groups that need a
single or a few business analytics applications require only a data mart
rather than a data warehouse.
The basic characteristics of data warehouses and data marts include
the following:
Organized by business dimension or subject: Data are organized
by subject—for example, by customer, vendor, product, price
level, and region. This arrangement differs from transactional
systems, where data are organized by business process such as
order entry, inventory control, and accounts receivable.
Use online analytical processing: Typically, organizational
databases are oriented toward handling transactions. That is,
databases use online transaction processing (OLTP), where
business transactions are processed online as soon as they occur.
The objectives are speed and efficiency, which are critical to a
successful Internet-based business operation. In contrast, data
warehouses and data marts, which are designed to support
decision makers but not OLTP, use online analytical processing
(OLAP), which involves the analysis of accumulated data by end
users. We consider OLAP in greater detail in Chapter 12.
Integrated: Data are collected from multiple systems and are then
integrated around subjects. For example, customer data may be
extracted from internal (and external) systems and then
integrated around a customer identifier, thereby creating a
comprehensive view of the customer.
Time variant: Data warehouses and data marts maintain
historical data; that is, data that include time as a variable. Unlike
transactional systems, which maintain only recent data (such as
for the last day, week, or month), a warehouse or mart may store
years of data. Organizations use historical data to detect
deviations, trends, and long-term relationships.
Nonvolatile: Data warehouses and data marts are nonvolatile—
that is, users cannot change or update the data. Therefore, the
warehouse or mart reflects history, which, as we just saw, is
critical for identifying and analyzing trends. Warehouses and
marts are updated, but through IT-controlled load processes
rather than by users.
Multidimensional: Typically, the data warehouse or mart uses a
multidimensional data structure. Recall that relational databases
store data in two-dimensional tables. In contrast, data
warehouses and marts store data in more than two dimensions.
For this reason, the data are said to be stored in a
multidimensional structure. A common representation for
this multidimensional structure is the data cube.
The data in data warehouses and marts are organized by business
dimensions, which are subjects such as product, geographic area, and
time period that represent the edges of the data cube. If you look
ahead to Figure 5.6 for an example of a data cube, you see that the
product dimension is composed of nuts, screws, bolts, and washers;
the geographic area dimension is composed of East, West, and
Central; and the time period dimension is composed of 2016, 2017,
and 2018. Users can view and analyze data from the perspective of
these business dimensions. This analysis is intuitive because the
dimensions are presented in business terms that users can easily
understand.
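The idea can be approximated with a pivot operation. The sketch below, which assumes the pandas library is installed, rolls flat relational-style rows up along the product and region dimensions, echoing Figures 5.5 and 5.6; the sales figures themselves are invented.

```python
import pandas as pd

# Flat, relational-style sales rows (the sales figures are invented).
rows = pd.DataFrame({
    "product": ["nuts", "screws", "bolts", "washers", "nuts", "bolts"],
    "region":  ["East", "West", "Central", "East", "West", "East"],
    "year":    [2016, 2016, 2017, 2017, 2018, 2018],
    "sales":   [50, 60, 45, 70, 55, 80],
})

# The same data reorganized along the product and region dimensions
# (summed across all three years).
cube = rows.pivot_table(index="product", columns="region", values="sales",
                        aggfunc="sum", fill_value=0)
print(cube)

# Slicing along the time dimension: the 2017 "layer" of the cube.
print(rows[rows["year"] == 2017]
      .pivot_table(index="product", columns="region", values="sales",
                   aggfunc="sum", fill_value=0))
```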
A Generic Data Warehouse Environment
The environment for data warehouses and marts includes the
following:
Source systems that provide data to the warehouse or mart
Data-integration technology and processes that prepare the data
for use
Different architectures for storing data in an organization’s data
warehouse or data marts
Different tools and applications for the variety of users. (You will
learn about these tools and applications in Chapter 12.)
Metadata (data about the data in a repository), data quality, and
governance processes that ensure that the warehouse or mart
meets its purposes
Figure 5.4 depicts a generic data warehouse or data mart
environment. Let’s drill down into the component parts.

FIGURE 5.4 Data warehouse framework.


Source Systems.
There is typically some “organizational pain point”—that is, a business
need—that motivates a firm to develop its business intelligence
capabilities. Working backward, this pain leads to information
requirements, BI applications, and requirements for source system
data. These data requirements can range from a single source system,
as in the case of a data mart, to hundreds of source systems, as in the
case of an enterprisewide data warehouse.
Modern organizations can select from a variety of source systems,
including operational/transactional systems, enterprise resource
planning (ERP) systems, website data, third-party data (e.g., customer
demographic data), and more. The trend is to include more types of
data (e.g., sensing data from RFID tags). These source systems often
use different software packages (e.g., IBM, Oracle), and they store data
in different formats (e.g., relational, hierarchical).
A common source for the data in data warehouses is the company’s
operational databases, which can be relational databases. To
differentiate between relational databases and multidimensional data
warehouses and marts, imagine your company manufactures four
products—nuts, screws, bolts, and washers—and has sold them in
three territories—East, West, and Central—for the previous three years
—2016, 2017, and 2018. In a relational database, these sales data
would resemble Figure 5.5(a) through (c). In a multidimensional
database, in contrast, these data would be represented by a three-
dimensional matrix (or data cube), as depicted in Figure 5.6. This
matrix represents sales dimensioned by products, regions, and year.
Notice that Figure 5.5(a) presents only sales for 2016. Sales for 2017
and 2018 are presented in Figure 5.5(b) and (c), respectively. Figure
5.7(a) through (c) illustrates the equivalence between these relational
and multidimensional databases.
FIGURE 5.5 Relational databases.

FIGURE 5.6 Data cube.


FIGURE 5.7 Equivalence between relational and
multidimensional databases.
Unfortunately, many source systems that have been in use for years
contain “bad data”—for example, missing or incorrect data—and they
are poorly documented. As a result, data-profiling software should be
used at the beginning of a warehousing project to better understand
the data. Among other things, this software can provide statistics on
missing data, identify possible primary and foreign keys, and reveal
how derived values—for example, column 3 = column 1 + column 2—
are calculated. Subject area database specialists such as marketing and
human resources personnel can also assist in understanding and
accessing the data in source systems.
Organizations need to address other source systems issues as well. For
example, many organizations maintain multiple systems that contain
some of the same data. These enterprises need to select the best
system as the source system. Organizations must also decide how
granular, or detailed, the data should be. For example, does the
organization need daily sales figures or data for individual
transactions? The conventional wisdom is that it is best to store data at
a highly granular level because someone will likely request those data
at some point.

Data Integration.
In addition to storing data in their source systems, organizations need
to extract the data, transform them, and then load them into a data
mart or warehouse. This process is often called ETL, although the term
data integration is increasingly being used to reflect the growing
number of ways that source system data can be handled. For example,
in some cases, data are extracted, loaded into a mart or warehouse,
and then transformed (i.e., ELT rather than ETL).
Data extraction can be performed either by handwritten code such as
SQL queries or by commercial data-integration software. Most
companies employ commercial software. This software makes it
relatively easy to (1) specify the tables and attributes in the source
systems that are to be used; (2) map and schedule the movement of
the data to the target, such as a data mart or warehouse; (3) make the
required transformations; and, ultimately, (4) load the data.
After the data are extracted, they are transformed to make them more
useful. For example, data from different systems may be integrated
around a common key, such as a customer identification number.
Organizations adopt this approach to create a 360-degree view of all of
their interactions with their customers. As an example of this process,
consider a bank. Customers can engage in a variety of interactions:
visiting a branch, banking online, using an ATM, obtaining a car loan,
and more. The systems for these touch points—defined as the
numerous ways that organizations interact with customers, such as e-
mail, the Web, direct contact, and the telephone—are typically
independent of one another. To obtain a holistic picture of how
customers are using the bank, the bank must integrate the data from
the various source systems into a data mart or warehouse.
Other kinds of transformations also take place. For example, format
changes to the data may be required, such as using male and female to
denote gender, as opposed to 0 and 1 or M and F. Aggregations may be
performed, say on sales figures, so that queries can use the summaries
rather than recalculating them each time. Data-cleansing software
may be used to clean up the data; for example, eliminating duplicate
records for the same customer.
Finally, data are loaded into the warehouse or mart during a specified
period known as the “load window.” This window is becoming smaller
as companies seek to store ever-fresher data in their warehouses. For
this reason, many companies have moved to real-time data
warehousing, where data are moved using data-integration processes
from source systems to the data warehouse or mart almost instantly.
For example, within 15 minutes of a purchase at Walmart, the details
of the sale have been loaded into a warehouse and are available for
analysis.
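The following minimal sketch, written in plain Python with hypothetical source records, walks through the same extract, transform (recode gender, remove duplicates), and load (aggregate) steps that commercial data-integration software performs declaratively and at far larger scale.

```python
# Minimal ETL sketch; source records and field names are hypothetical.

# Extract: records pulled from two source systems with different conventions.
source_a = [{"cust_id": 101, "gender": "M", "amount": 120.0}]
source_b = [{"customer": 101, "gender": "1", "amount": 80.0},
            {"customer": 101, "gender": "1", "amount": 80.0}]  # duplicate row

def transform(record):
    """Standardize the key name and recode gender codes to 'male'/'female'."""
    return {
        "customer_id": record.get("cust_id", record.get("customer")),
        "gender": "male" if record["gender"] in ("M", "1") else "female",
        "amount": float(record["amount"]),
    }

cleaned = [transform(r) for r in source_a + source_b]

# Data cleansing: drop exact duplicate records.
deduped = [dict(items) for items in {tuple(sorted(r.items())) for r in cleaned}]

# Load: aggregate by customer so that queries can use the summary directly.
warehouse = {}
for r in deduped:
    warehouse[r["customer_id"]] = warehouse.get(r["customer_id"], 0) + r["amount"]
print(warehouse)  # {101: 200.0}
```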

Storing the Data.


Organizations can choose from a variety of architectures to store
decision-support data. The most common architecture is one central
enterprise data warehouse, without data marts. Most organizations
use this approach because the data stored in the warehouse are
accessed by all users, and they represent the single version of the
truth.
Another architecture is independent data marts. These marts store
data for a single application or a few applications, such as marketing
and finance. Organizations that employ this architecture give only
limited thought to how the data might be used for other applications
or by other functional areas in the organization. Clearly this is a very
application-centric approach to storing data.
The independent data mart architecture is not particularly effective.
Although it may meet a specific organizational need, it does not reflect
an enterprisewide approach to data management. Instead, the various
organizational units create independent data marts. Not only are these
marts expensive to build and maintain but they also often contain
inconsistent data. For example, they may have inconsistent data
definitions such as: What is a customer? Is a particular individual a
potential or a current customer? They might also use different source
systems, which can have different data for the same item, such as a
customer address (if the customer had moved). Although independent
data marts are an organizational reality, larger companies have
increasingly moved to data warehouses.
Still another data warehouse architecture is the hub and spoke. This
architecture contains a central data warehouse that stores the data
plus multiple dependent data marts that source their data from the
central repository. Because the marts obtain their data from the
central repository, the data in these marts still comprise the single
version of the truth for decision-support purposes.
The dependent data marts store the data in a format that is
appropriate for how the data will be used and for providing faster
response times to queries and applications. As you have learned, users
can view and analyze data from the perspective of business
dimensions and measures. This analysis is intuitive because the
dimensions are presented in business terms that users can easily
understand.
Metadata.
It is important to maintain data about the data, known as metadata, in
the data warehouse. Both the IT personnel who operate and manage
the data warehouse and the users who access the data require
metadata. IT personnel need information about data sources;
database, table, and column names; refresh schedules; and data-usage
measures. Users’ needs include data definitions, report and query
tools, report distribution information, and contact information for the
help desk.

Data Quality.
The quality of the data in the warehouse must meet users’ needs. If it
does not, then users will not trust the data and ultimately will not use
it. Most organizations find that the quality of the data in source
systems is poor and must be improved before the data can be used in
the data warehouse. Some of the data can be improved with data-
cleansing software. The better, long-term solution, however, is to
improve the quality at the source system level. This approach requires
the business owners of the data to assume responsibility for making
any necessary changes to implement this solution.
To illustrate this point, consider the case of a large hotel chain that
wanted to conduct targeted marketing promotions using zip code data
it collected from its guests when they checked in. When the company
analyzed the zip code data, they discovered that many of the zip codes
were 99999. How did this error occur? The answer is that the clerks
were not asking customers for their zip codes, but they needed to enter
something to complete the registration process. A short-term solution
to this problem was to conduct the marketing campaign using city and
state data instead of zip codes. The long-term solution was to make
certain the clerks entered the actual zip codes. The latter solution
required the hotel managers to assume responsibility for making
certain their clerks entered the correct data.

Governance.
To ensure that BI is meeting their needs, organizations must
implement governance to plan and control their BI activities.
Governance requires that people, committees, and processes be in
place. Companies that are effective in BI governance often create a
senior-level committee composed of vice presidents and directors who
(1) ensure that the business strategies and BI strategies are in
alignment, (2) prioritize projects, and (3) allocate resources. These
companies also establish a middle management–level committee that
oversees the various projects in the BI portfolio to ensure that these
projects are being completed in accordance with the company’s
objectives. Finally, lower-level operational committees perform tasks
such as creating data definitions and identifying and solving data
problems. All of these committees rely on the collaboration and
contributions of business users and IT personnel.

Users.
Once the data are loaded in a data mart or warehouse, they can be
accessed. At this point, the organization begins to obtain business
value from BI; all of the prior stages constitute creating BI
infrastructure.
There are many potential BI users, including IT developers; frontline
workers; analysts; information workers; managers and executives; and
suppliers, customers, and regulators. Some of these users are
information producers whose primary role is to create information for
other users. IT developers and analysts typically fall into this category.
Other users—including managers and executives—are information
consumers, because they use information created by others.
Companies have reported hundreds of successful data-warehousing
applications. You can read client success stories and case studies at the
websites of vendors such as NCR Corp. (www.ncr.com) and Oracle
(www.oracle.com). For a more detailed discussion, visit the Data
Warehouse Institute (www.tdwi.org). The benefits of data
warehousing include the following:
End users can access needed data quickly and easily through Web
browsers because these data are located in one place.
End users can conduct extensive analysis with data in ways that
were not previously possible.
End users can obtain a consolidated view of organizational data.
These benefits can improve business knowledge, provide competitive
advantage, enhance customer service and satisfaction, facilitate
decision making, and streamline business processes.
Despite their many benefits, data warehouses have some limitations.
IT’s About Business 5.3 points out these limitations and considers an
emerging solution; namely, data lakes.

IT’s About Business 5.3


Data Lakes

Most large organizations have an enterprise data warehouse (EDW) which contains data from other enterprise systems such as
customer relationship management (CRM), inventory, and sales
transaction systems. Analysts and business users examine the data
in EDWs to make business decisions.
Despite their benefits to organizations, EDWs do have problems.
Specifically, they employ a schema-on-write architecture which is
the foundation for the underlying extract, transform, and load
(ETL) process required to enter data into the EDW. With schema-
on-write, enterprises must design the data model before they load
any data. That is, organizations must first define the schema, then
write (load) the data, and finally read the data. To carry out this
process, they need to know ahead of time how they plan to use the
data.
A database schema defines the structure of both the database and
the data contained in the database. For example, in the case of
relational databases, the schema specifies the tables and fields of
the database. A database schema also describes the content and
structure of the physical data stored, which is sometimes called
metadata. Metadata can include information about data types,
relationships, content restrictions, access controls, and many other
types of information.
Because EDWs contain only data that conform to the prespecified
enterprise data model, they are relatively inflexible and can answer
only a limited number of questions. It is therefore difficult for
business analysts and data scientists who rely on EDWs to ask ad
hoc questions of the data. Instead, they have to form hypotheses in
advance and then create the data structures to test those
hypotheses.
It is also difficult for EDWs to manage new sources of data, such as
streaming data from sensors (see the Internet of Things in Chapter
8) and social media data such as blog postings, ratings,
recommendations, product reviews, tweets, photographs, and
video clips.
EDWs have been the primary mechanism in many organizations
for performing analytics, reporting, and operations. However, they
are too rigid to be effective with Big Data, with large data volumes,
with a broad variety of data, and with high data velocity. As a result
of these problems, organizations have begun to realize that EDWs
cannot meet all their business needs. As an alternative, they are
beginning to deploy data lakes.
A data lake is a central repository that stores all of an
organization’s data, regardless of their source or format. Data lakes
receive data in any format, both structured and unstructured. Also,
the data do not have to be consistent. For example, organizations
may have the same type of information in different data formats,
depending on where the data originate.
Data lakes are typically, though not always, built using Apache
Hadoop (www.hadoop.apache.org). Organizations can then
employ a variety of storage and processing tools to extract value
quickly and to inform key business decisions.
In contrast to the schema-on-write architecture utilized by EDWs,
data lakes use schema-on-read architectures. That is, they do not
transform the data before the data are entered into the data lake as
they would for an EDW. The structure of the data is not known
when the data are fed into the data lake. Rather, it is discovered
only when the data are read. With schema-on-read, users do not
model the data until they actually use the data. This process is
more flexible, and it makes it easier for users to discover new data
and to enter new data sources into the data lake.
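The contrast can be sketched in a few lines of Python; the records and field names are invented. With schema-on-write the table structure must exist before any row is loaded, whereas with schema-on-read the raw records are stored as they arrive and a structure is imposed only when someone reads them.

```python
import json
import sqlite3

# Schema-on-write (EDW style): the structure must be defined before loading.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE reviews (product TEXT, rating INTEGER)")
edw.execute("INSERT INTO reviews VALUES ('bolts', 4)")
# A record carrying an unexpected field (say, a photo URL) cannot be loaded
# until the table is redesigned.

# Schema-on-read (data lake style): store raw records exactly as they arrive.
lake = [
    json.dumps({"product": "bolts", "rating": 4}),
    json.dumps({"product": "washers", "tweet": "love these!",
                "photo": "img_1.jpg"}),
]

# A structure is imposed only at read time, by whatever question is asked.
ratings = [json.loads(record)["rating"] for record in lake
           if "rating" in json.loads(record)]
print(ratings)  # [4]
```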
Data lakes provide many benefits for organizations:
Organizations can derive value from unlimited types of data.
Organizations do not need to have all of the answers in
advance.
Organizations have no limits on how they can query the data.
Organizations do not create silos. Instead, data lakes provide a
single, unified view of data across the organization.
To load data into a data lake, organizations should take the
following steps:
Define the incoming data from a business perspective.
Document the context, origin, and frequency of the incoming
data.
Classify the security level (public, internal, sensitive,
restricted) of the incoming data.
Document the creation, usage, privacy, regulatory, and
encryption business rules that apply to the incoming data.
Identify the owner (sponsor) of the incoming data.
Identify the data steward(s) charged with monitoring the
health of the specific datasets.
After following these steps, organizations load all of the data into a
giant table. Each piece of data—whether a customer’s name, a
photograph, or a Facebook post—is placed in an individual cell. It
does not matter where in the data lake that individual cell is
located, where the data came from, or its format, because all of the
data are connected through metadata tags. Organizations can add
or change these tags as requirements evolve. Further, they can
assign multiple tags to the same piece of data. Because the schema
for storing the data does not need to be defined in advance,
expensive and time-consuming data modeling is not needed.
Organizations can also protect sensitive information. As data are
loaded into the data lake, each cell is tagged according to how
visible it should be to different users in the organization. That is,
organizations can specify who has access to the data in each cell,
and under what circumstances, as the data are loaded.
For example, a retail operation might make cells containing
customers’ names and contact data available to sales and customer
service, but it might make the cells containing more sensitive
personally identifiable information or financial data available only
to the finance department. In that way, when users run queries on
the data, their access rights restrict which data they can access.
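A toy sketch of this idea follows; the roles, tags, and records are invented. Visibility metadata attached to each cell at load time restricts what a query returns for a given user.

```python
# Toy illustration of cell-level tagging and visibility; all values invented.
lake = [
    {"value": "Jane Doe", "tags": {"type": "contact"},
     "visible_to": {"sales", "service", "finance"}},
    {"value": "4111-XXXX-XXXX-1234", "tags": {"type": "payment"},
     "visible_to": {"finance"}},
    {"value": "jane@example.com", "tags": {"type": "contact"},
     "visible_to": {"sales", "service", "finance"}},
]

def query(role, tag_type):
    """Return only the cells that this role is allowed to see for a tag."""
    return [cell["value"] for cell in lake
            if cell["tags"]["type"] == tag_type and role in cell["visible_to"]]

print(query("sales", "contact"))  # contact cells are visible to sales
print(query("sales", "payment"))  # [] because payment cells are finance-only
```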
There are many examples of data lakes in practice. Let’s consider
EMC (www.dellemc.com), an industry leader in storage and
analytics technologies. EMC promotes data-driven business
transformation for its customers by helping them store, manage,
protect, and analyze their most valuable asset—data. After EMC
acquired more than 80 companies over the past decade, the
company’s IT department ended up controlling multiple data silos.
Furthermore, EMC did not have an effective way to deliver on-
demand analytics to business units. Although data reporting
helped EMC with descriptive analytics (see Chapter 12), the
company could not exploit the potential of predictive analytics to
create targeted marketing and lead generation campaigns.
To overcome these limitations, EMC consolidated multiple data
silos into a data lake, which contained all of the firm’s structured
and unstructured data. Examples of data in EMC’s data lake are
customer information (such as past purchases), contact
demographics, interests, marketing history, data from social media
and the Web, and sensor data. EMC also implemented data
governance in its data lake so that its business groups could share
data and collaborate on initiatives involving analytics.
Using its data lake, EMC was able to microsegment customers
for more relevant and targeted marketing communications.
Removing the data silos greatly enhanced EMC marketing
managers’ understanding of the firm’s customers and their
purchasing behaviors.
The data lake also enables employees to make near-real-time
decisions about which data to analyze. Significantly, the time
required to complete some queries has fallen from four hours per
quarter to fewer than one minute per year. The reason for the
decrease is that at the beginning of the query EMC data scientists
do not have to commit to which data and analytic technique they
will use. For example, data scientists might take an initial look at
more than 1,000 variables, some of which are probably redundant.
The analytics software will select the most powerful variables for
prediction, resulting in the most efficient and accurate predictive
model.
Because the data lake contains such wide-ranging and diverse data,
EMC’s predictive models are increasingly accurate. In fact, EMC
claims it can correctly predict what a customer is going to purchase
and when 80 percent of the time.
Sources: Compiled from T. King, “Three Key Data Lake Trends to Stay on
Top of This Year,” Solutions Review, May 11, 2018; T. Olavsrud, “6 Data
Analytics Trends that Will Dominate 2018,” CIO, March 15, 2018; P. Tyaqi
and H. Demirkan, “Data Lakes: The Biggest Big Data Challenges,” Analytics
Magazine, September/October, 2017; M. Hagstroem, M. Roggendorf, T.
Saleh, and J. Sharma, “A Smarter Way to Jump into Data Lakes,” McKinsey
and Company, August, 2017; P. Barth, “The New Paradigm for Big Data
Governance,” CIO, May 11, 2017; N. Mikhail, “Why Big Data Kills
Businesses,” Fortune, February 28, 2017; “Architecting Data Lakes,”
Zaloni, February 21, 2017; D. Kim, “Successful Data Lakes: A Growing
Trend,” The Data Warehousing Institute, February 16, 2017; D. Woods,
“Data Lakes and Frosted Flakes—Should You Buy Both off the Shelf?”
Forbes, January 24, 2017; S. Carey, “Big Data and Business Intelligence
Trends 2017: Machine Learning, Data Lakes, and Hadoop vs Spark,”
Computerworld UK, December 28, 2016; D. Woods, “Why Data Lakes Are
Evil,” Forbes, August 26, 2016; J. Davis, “How TD Bank Is Transforming Its
Data Infrastructure,” InformationWeek, May 25, 2016; L. Hester,
“Maximizing Data Value with a Data Lake,” Data Science Central, April 20,
2016; “Extracting Insight and Value from a Lake of Data,” Intel Case Study,
2016; and S. Gittlen, “Data Lakes: A Better Way to Analyze Customer Data,”
Computerworld, February 25, 2016.
Questions
1. Discuss the limitations of enterprise data warehouses.
2. Describe the benefits of a data lake to organizations.

Before you go on …
1. Differentiate between data warehouses and data marts.
2. Describe the characteristics of a data warehouse.
3. What are three possible architectures for data warehouses and
data marts in an organization?

5.5 Knowledge Management


As we have noted throughout this text, data and information are vital
organizational assets. Knowledge is a vital asset as well. Successful
managers have always valued and used intellectual assets. These
efforts were not systematic, however, and they did not ensure that
knowledge was shared and dispersed in a way that benefited the
overall organization. Moreover, industry analysts estimate that most
of a company’s knowledge assets are not housed in relational
databases. Instead, they are dispersed in e-mail, word processing
documents, spreadsheets, presentations on individual computers, and
in people’s heads. This arrangement makes it extremely difficult for
companies to access and integrate this knowledge. The result
frequently is less-effective decision making.

Concepts and Definitions
Knowledge management (KM) is a process that helps
organizations manipulate important knowledge that comprises part of
the organization’s memory, usually in an unstructured format. For an
organization to be successful, knowledge, as a form of capital, must
exist in a format that can be exchanged among persons. It must also be
able to grow.

Knowledge.
In the information technology context, knowledge is distinct from data
and information. As you learned in Chapter 1, data are a collection of
facts, measurements, and statistics; information is organized or
processed data that are timely and accurate. Knowledge is information
that is contextual, relevant, and useful. Simply put, knowledge is
information in action. Intellectual capital (or intellectual assets)
is another term for knowledge.
To illustrate, a bulletin listing all of the courses offered by your
university during one semester would be considered data. When you
register, you process the data from the bulletin to create your schedule
for the semester. Your schedule would be considered information.
Awareness of your work schedule, your major, your desired social
schedule, and characteristics of different faculty members could be
construed as knowledge, because it can affect the way you build your
schedule. You see that this awareness is contextual and relevant (to
developing an optimal schedule of classes) as well as useful (it can lead
to changes in your schedule). The implication is that knowledge has
strong experiential and reflective elements that distinguish it from
information in a given context. Unlike information, knowledge can be
used to solve a problem.
Numerous theories and models classify different types of knowledge.
In the next section, we will focus on the distinction between explicit
knowledge and tacit knowledge.

Explicit and Tacit Knowledge.


Explicit knowledge deals with more objective, rational, and
technical knowledge. In an organization, explicit knowledge consists of
the policies, procedural guides, reports, products, strategies, goals,
core competencies, and IT infrastructure of the enterprise. In other
words, explicit knowledge is the knowledge that has been codified
(documented) in a form that can be distributed to others or
transformed into a process or a strategy. A description of how to
process a job application that is documented in a firm’s human
resources policy manual is an example of explicit knowledge.
In contrast, tacit knowledge is the cumulative store of subjective or
experiential learning. In an organization, tacit knowledge consists of
an organization’s experiences, insights, expertise, know-how, trade
secrets, skill sets, understanding, and learning. It also includes the
organizational culture, which reflects the past and present experiences
of the organization’s people and processes, as well as the
organization’s prevailing values. Tacit knowledge is generally
imprecise and costly to transfer. It is also highly personal. Finally,
because it is unstructured, it is difficult to formalize or codify, in
contrast to explicit knowledge. A salesperson who has worked with
particular customers over time and has come to know their needs
quite well would possess extensive tacit knowledge. This knowledge is
typically not recorded. In fact, it might be difficult for the salesperson
to put into writing, even if he or she were willing to share it.

Knowledge Management Systems


The goal of knowledge management is to help an organization make
the most productive use of the knowledge it has accumulated.
Historically, management information systems have focused on
capturing, storing, managing, and reporting explicit knowledge.
Organizations now realize they need to integrate explicit and tacit
knowledge into formal information systems. Knowledge
management systems (KMSs) refer to the use of modern
information technologies—the Internet, intranets, extranets, and
databases—to systematize, enhance, and expedite knowledge
management both within one firm and among multiple firms. KMSs
are intended to help an organization cope with turnover, rapid change,
and downsizing by making the expertise of the organization’s human
capital widely accessible.
Organizations can realize many benefits with KMSs. Most important,
they make best practices—the most effective and efficient ways to
accomplish business processes—readily available to a wide range of
employees. Enhanced access to best-practice knowledge improves
overall organizational performance. For example, account managers
can now make available their tacit knowledge about how best to
manage large accounts. The organization can then use this knowledge
when it trains new account managers. Other benefits include
enhanced customer service, more efficient product development, and
improved employee morale and retention.
At the same time, however, implementing effective KMSs presents
several challenges. First, employees must be willing to share their
personal tacit knowledge. To encourage this behavior, organizations
must create a knowledge management culture that rewards employees
who add their expertise to the knowledge base. Second, the
organization must continually maintain and upgrade its knowledge
base. Specifically, it must incorporate new knowledge and delete old,
outdated knowledge. Finally, companies must be willing to invest in
the resources needed to carry out these operations.

The KMS Cycle


A functioning KMS follows a cycle that consists of six steps (see
Figure 5.8). The reason the system is cyclical is that knowledge is
dynamically refined over time. The knowledge in an effective KMS is
never finalized because the environment changes over time and
knowledge must be updated to reflect these changes.
FIGURE 5.8 The knowledge management system cycle.
The cycle works as follows:
1. Create knowledge: Knowledge is created as people determine
new ways of doing things or develop know-how. Sometimes
external knowledge is brought in.
2. Capture knowledge: New knowledge must be identified as
valuable and be presented in a reasonable way.
3. Refine knowledge: New knowledge must be placed in context so
that it is actionable. This is where tacit qualities (human insights)
must be captured along with explicit facts.
4. Store knowledge: Useful knowledge must then be stored in a
reasonable format in a knowledge repository so that other people
in the organization can access it.
5. Manage knowledge: Like a library, the knowledge must be kept
current. Therefore, it must be reviewed regularly to verify that it is
relevant and accurate.
6. Disseminate knowledge: Knowledge must be made available in a
useful format to anyone in the organization who needs it,
anywhere and any time.
Before you go on …
1. What is knowledge management?
2. What is the difference between tacit knowledge and explicit
knowledge?
3. Describe the knowledge management system cycle.

5.6 Appendix: Fundamentals of Relational Database Operations
There are many operations possible with relational databases. In this
section, we discuss three of these operations: query languages,
normalization, and joins.


As you have seen in this chapter, a relational database is a collection of
interrelated 2D tables, consisting of rows and columns. Each row
represents a record, and each column (or field) represents an attribute
(or characteristic) of that record. Every record in the database must
contain at least one field that uniquely identifies that record so that it
can be retrieved, updated, and sorted. This identifier field, or group of
fields, is called the primary key. In some cases, locating a particular
record requires the use of secondary keys. A secondary key is another
field that has some identifying information, but typically does not
uniquely identify the record. A foreign key is a field (or group of fields)
in one table that matches the primary key value in a row of another
table. A foreign key is used to establish and enforce a link between two
tables.
These related tables can be joined when they contain common
columns. The uniqueness of the primary key tells the DBMS which
records are joined with others in related tables. This feature allows
users great flexibility in the variety of queries they can make. Despite
these features, however, the relational database model has some
disadvantages. Because large-scale databases can be composed of
many interrelated tables, the overall design can be complex, leading to
slow search and access times.
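To make these ideas concrete, the following is a minimal SQL sketch of how a primary key, a foreign key, and a secondary key might be declared. The table and column names (DEPARTMENT, EMPLOYEE, Dept_ID, and so on) are hypothetical illustrations, not tables taken from this chapter’s figures.

CREATE TABLE DEPARTMENT (
    Dept_ID   INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each department
    Dept_Name VARCHAR(50)
);
CREATE TABLE EMPLOYEE (
    Employee_ID INTEGER PRIMARY KEY,                      -- primary key of EMPLOYEE
    Last_Name   VARCHAR(50),
    Dept_ID     INTEGER REFERENCES DEPARTMENT(Dept_ID)    -- foreign key: links the employee to a department
);
-- A secondary key identifies records without guaranteeing uniqueness; most DBMSs
-- support it with a non-unique index to speed up searches on that field.
CREATE INDEX Employee_LastName_Idx ON EMPLOYEE (Last_Name);

Because Dept_ID in EMPLOYEE matches the primary key of DEPARTMENT, the two tables can later be joined on that common column.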

Query Languages
The most commonly performed database operation is searching for
information. Structured query language (SQL) is the most
popular query language used for interacting with a database. SQL
allows people to perform complicated searches by using relatively
simple statements or key words. Typical key words are SELECT (to
choose a desired attribute), FROM (to specify the table or tables to be
used), and WHERE (to specify conditions to apply in the query).
To understand how SQL works, imagine that a university wants to
know the names of students who will graduate cum laude (but not
magna or summa cum laude) in May 2018. (Refer to Figure 5.3 in this
chapter.) The university IT staff would query the student relational
database with an SQL statement such as the following:
SELECT Student_Name
FROM Student_Database
WHERE Grade_Point_Average >= 3.40 AND
Grade_Point_Average < 3.60;
The SQL query would return John Jones and Juan Rodriguez.
Another way to find information in a database is to use query by
example (QBE). In QBE, the user fills out a grid or template—also
known as a form—to construct a sample or a description of the data
desired. Users can construct a query quickly and easily by using drag-
and-drop features in a DBMS such as Microsoft Access. Conducting
queries in this manner is simpler than keying in SQL commands.

Entity–Relationship Modeling
Designers plan and create databases through the process of entity–
relationship (ER) modeling, using an entity–relationship
(ER) diagram. There are many approaches to ER diagramming; you
will see one particular approach here. The good news is that once you
are familiar with one version of ER diagramming, you will be able to
adapt easily to any other version.
ER diagrams consist of entities, attributes, and relationships. To
properly identify entities, attributes, and relationships, database
designers first identify the business rules for the particular data model.
Business rules are precise descriptions of policies, procedures, or
principles in any organization that stores and uses data to generate
information. Business rules are derived from a description of an
organization’s operations, and help to create and enforce business
processes in that organization. Keep in mind that you determine these
business rules, not the MIS department.
Entities are pictured in rectangles, and relationships are described on
the line between two entities. The attributes for each entity are listed,
and the primary key is underlined. The data dictionary provides
information on each attribute, such as its name, if it is a key, part of a
key, or a non-key attribute; the type of data expected (alphanumeric,
numeric, dates, etc.); and valid values. Data dictionaries can also
provide information on why the attribute is needed in the database;
which business functions, applications, forms, and reports use the
attribute; and how often the attribute should be updated.
ER modeling is valuable because it allows database designers to
communicate with users throughout the organization to ensure that all
entities and the relationships among the entities are represented. This
process underscores the importance of taking all users into account
when designing organizational databases. Notice that all entities and
relationships in our example are labeled in terms that users can
understand.
Relationships illustrate an association between entities. The degree
of a relationship indicates the number of entities associated with a
relationship. A unary relationship exists when an association is
maintained within a single entity. A binary relationship exists
when two entities are associated. A ternary relationship exists
when three entities are associated. In this chapter, we discuss only
binary relationships because they are the most common. Entity
relationships may be classified as one-to-one, one-to-many, or many-
to-many. The term connectivity describes the relationship
classification.
Connectivity and cardinality are established by the business rules of a
relationship. Cardinality refers to the maximum number of times an
instance of one entity can be associated with an instance in the related
entity. Cardinality can be mandatory single, optional single,
mandatory many, or optional many. Figure 5.9 displays the symbols
for these four cardinalities.

FIGURE 5.9 Cardinality symbols.


Let’s look at an example from a university. An entity is a person, place,
or thing that can be identified in the users’ work environment. For
example, consider student registration at a university. Students
register for courses, and they also register their cars for parking
permits. In this example, STUDENT, PARKING PERMIT, CLASS, and
PROFESSOR are entities. Recall that an instance of an entity
represents a particular student, parking permit, class, or professor.
Therefore, a particular STUDENT (James Smythe, 8023445) is an
instance of the STUDENT entity; a particular parking permit (91778)
is an instance of the PARKING PERMIT entity; a particular class
(76890) is an instance of the CLASS entity; and a particular professor
(Margaret Wilson, 390567) is an instance of the PROFESSOR entity.
Entity instances have identifiers, or primary keys, which are
attributes (attributes and identifiers are synonymous) that are unique
to that entity instance. For example, STUDENT instances can be
identified with Student Identification Number; PARKING PERMIT
instances can be identified with Permit Number; CLASS instances can
be identified with Class Number; and PROFESSOR instances can be
identified with Professor Identification Number.
Entities have attributes, or properties, that describe the entity’s
characteristics. In our example, examples of attributes for STUDENT
are Student Name and Student Address. Examples of attributes for
PARKING PERMIT are Student Identification Number and Car Type.
Examples of attributes for CLASS are Class Name, Class Time, and
Class Place. Examples of attributes for PROFESSOR are Professor
Name and Professor Department. (Note that each course at this
university has one professor—no team teaching.)
Why is Student Identification Number an attribute of both the
STUDENT and PARKING PERMIT entity classes? That is, why do we
need the PARKING PERMIT entity class? If you consider all of the
interlinked university systems, the PARKING PERMIT entity class is
needed for other applications, such as fee payments, parking tickets,
and external links to the state’s Department of Motor Vehicles.
Let’s consider the three types of binary relationships in our example.
In a one-to-one (1:1) relationship, a single-entity instance of one type
is related to a single-entity instance of another type. In our university
example, STUDENT–PARKING PERMIT is a 1:1 relationship. The
business rule at this university represented by this relationship is:
Students may register only one car at this university. Of course,
students do not have to register a car at all. That is, a student can have
only one parking permit but does not need to have one.
Note that the relationship line on the PARKING PERMIT side shows a
cardinality of optional single. A student can have, but does not have to
have, a parking permit. On the STUDENT side of the relationship, only
one parking permit can be assigned to one student, resulting in a
cardinality of mandatory single. See Figure 5.10.

FIGURE 5.10 One-to-one relationship.


The second type of relationship, one-to-many (1:M), is represented by
the CLASS–PROFESSOR relationship in Figure 5.11. The business
rule at this university represented by this relationship is the following:
At this university, there is no team teaching. Therefore, each class
must have only one professor. On the other hand, professors may
teach more than one class. Note that the relationship line on the
PROFESSOR side shows a cardinality of mandatory single. In
contrast, the relationship line on the CLASS side shows a cardinality of
optional many.

FIGURE 5.11 One-to-many relationship.
The third type of relationship, many-to-many (M:M), is represented
by the STUDENT–CLASS relationship. Most database management
systems do not support many-to-many relationships. Therefore, we
use junction (or bridge) tables so that we have two one-to-many
relationships. The business rule at this university represented by this
relationship is: Students can register for one or more classes, and each
class can have one or more students (see Figure 5.12). In this
example, we create the REGISTRATION table as our junction table.
Note that Student ID and Class ID are foreign keys in the
REGISTRATION table.

FIGURE 5.12 Many-to-many relationship.


Let’s examine the following relationships:
The relationship line on the STUDENT side of the STUDENT–
REGISTRATION relationship shows a cardinality of optional
single.
The relationship line on the REGISTRATION side of the
STUDENT–REGISTRATION relationship shows a cardinality of
optional many.
The relationship line on the CLASS side of the CLASS–
REGISTRATION relationship shows a cardinality of optional
single.
The relationship line on the REGISTRATION side of the CLASS–
REGISTRATION relationship shows a cardinality of optional
many.
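As a rough illustration only, the three binary relationships in the university example might be implemented with SQL table definitions such as the following. The column names and data types are simplifying assumptions rather than the chapter’s actual design.

CREATE TABLE STUDENT (
    Student_ID   INTEGER PRIMARY KEY,
    Student_Name VARCHAR(60)
);
CREATE TABLE PROFESSOR (
    Professor_ID   INTEGER PRIMARY KEY,
    Professor_Name VARCHAR(60)
);
-- 1:1 STUDENT–PARKING PERMIT: every permit must belong to a student (NOT NULL),
-- and UNIQUE prevents one student from holding two permits; a student with no
-- permit simply has no row here (optional single on the permit side).
CREATE TABLE PARKING_PERMIT (
    Permit_Number INTEGER PRIMARY KEY,
    Student_ID    INTEGER NOT NULL UNIQUE REFERENCES STUDENT(Student_ID)
);
-- 1:M PROFESSOR–CLASS: each class must have exactly one professor (NOT NULL),
-- while the same Professor_ID may appear in many CLASS rows.
CREATE TABLE CLASS (
    Class_Number INTEGER PRIMARY KEY,
    Class_Name   VARCHAR(60),
    Professor_ID INTEGER NOT NULL REFERENCES PROFESSOR(Professor_ID)
);
-- M:M STUDENT–CLASS: implemented with the REGISTRATION junction table, whose
-- composite primary key consists of two foreign keys.
CREATE TABLE REGISTRATION (
    Student_ID   INTEGER REFERENCES STUDENT(Student_ID),
    Class_Number INTEGER REFERENCES CLASS(Class_Number),
    PRIMARY KEY (Student_ID, Class_Number)
);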

Normalization and Joins


To use a relational database management system efficiently and
effectively, the data must be analyzed to eliminate redundant data
elements. Normalization is a method for analyzing and reducing a
relational database to its most streamlined form to ensure minimum
redundancy, maximum data integrity, and optimal processing
performance. Data normalization is a methodology for organizing
attributes into tables so that redundancy among the non-key attributes
is eliminated. The result of the data normalization process is a
properly structured relational database.
Data normalization requires a list of all the attributes that must be
incorporated into the database and a list of all of the defining
associations, or functional dependencies, among the attributes.
Functional dependencies are a means of expressing that the value
of one particular attribute is associated with a specific single value of
another attribute. For example, for a Student Number 05345 at a
university, there is exactly one Student Name, John C. Jones,
associated with it. That is, Student Number is referred to as the
determinant because its value determines the value of the other
attribute. We can also say that Student Name is functionally
dependent on Student Number.
As an example of normalization, consider a pizza shop. This shop takes
orders from customers on a form. Figure 5.13 shows a table of
nonnormalized data gathered by the pizza shop. This table has two
records, one for each order being placed. Because there are several
pizzas on each order, the order number and customer information
appear in multiple rows. Several attributes of each record have null
values. A null value is an attribute with no data in it. For example,
Order Number has four null values. Therefore, this table is not in first
normal form.
FIGURE 5.13 Raw data gathered from orders at the pizza
shop.
In our example, ORDER, CUSTOMER, and PIZZA are entities. The
first step in normalization is to determine the functional dependencies
among the attributes. The functional dependencies in our example are
shown in Figure 5.14.

FIGURE 5.14 Functional dependencies in pizza shop example.
In the normalization process, we will proceed from nonnormalized
data, to first normal form, to second normal form, and then to third
normal form. (There are additional normal forms, but they are beyond
the scope of this book.)
Figure 5.15 demonstrates the data in first normal form. The
attributes under consideration are listed in one table and primary keys
have been established. Our primary keys are Order Number, Customer
ID, and Pizza Code. In first normal form, each ORDER has to repeat
the order number, order date, customer first name, customer last
name, customer address, and customer zip code. This data file
contains repeating groups and describes multiple entities. That is, this
relation contains redundant data, lacks data integrity, and, as a flat
file, would be difficult to use in the various applications that the pizza
shop might need.

FIGURE 5.15 First normal form for data from pizza shop.
Consider the table in Figure 5.15, and notice the very first column
(labeled Order Number). This column contains multiple entries for
each order—three rows for Order Number 1116 and three rows for
Order Number 1117. These multiple rows for an order are called
repeating groups. The table in Figure 5.15 also contains multiple
entities: ORDER, CUSTOMER, and PIZZA. Therefore, we move on to
second normal form.
To produce second normal form, we break the table in Figure 5.15 into
smaller tables to eliminate some of its data redundancy. Second
normal form does not allow partial functional dependencies. That is,
in a table in second normal form, every non-key attribute must be
functionally dependent on the entire primary key of that table. Figure
5.16 shows the data from the pizza shop in second normal form.
FIGURE 5.16 Second normal form for data from pizza
shop.
If you examine Figure 5.16, you will see that second normal form has
not eliminated all the data redundancy. For example, each Order
Number is duplicated three times, as are all customer data. In third
normal form, non-key attributes are not allowed to define other non-
key attributes. That is, third normal form does not allow transitive
dependencies in which one non-key attribute is functionally
dependent on another. In our example, customer information depends
both on Customer ID and Order Number. Figure 5.17 shows the data
from the pizza shop in third normal form. Third normal form structure
has the following important points:
It is completely free of data redundancy.
All foreign keys appear where needed to link related tables.
FIGURE 5.17 Third normal form for data from pizza shop.
Let’s look at the primary and foreign keys for the tables in third
normal form (a brief SQL sketch of these tables follows the list):
The ORDER relation: The primary key is Order Number and the
foreign key is Customer ID.
The CUSTOMER relation: The primary key is Customer ID.
The PIZZA relation: The primary key is Pizza Code.
The ORDER–PIZZA relation: The primary key is a composite key,
consisting of two foreign keys, Order Number and Pizza Code.
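The following sketch shows how these four relations could be declared in SQL. The data types are assumptions, and because ORDER is a reserved word in SQL, the ORDER relation is named ORDERS here.

CREATE TABLE CUSTOMER (
    Customer_ID      INTEGER PRIMARY KEY,
    Customer_FName   VARCHAR(40),
    Customer_LName   VARCHAR(40),
    Customer_Address VARCHAR(80),
    Zip_Code         VARCHAR(10)
);
CREATE TABLE PIZZA (
    Pizza_Code INTEGER PRIMARY KEY,
    Pizza_Name VARCHAR(40),
    Price      DECIMAL(6,2)
);
CREATE TABLE ORDERS (                                        -- the ORDER relation
    Order_Number INTEGER PRIMARY KEY,
    Order_Date   DATE,
    Total_Price  DECIMAL(8,2),
    Customer_ID  INTEGER REFERENCES CUSTOMER(Customer_ID)    -- foreign key
);
CREATE TABLE ORDER_PIZZA (                                   -- the ORDER–PIZZA relation
    Order_Number INTEGER REFERENCES ORDERS(Order_Number),
    Pizza_Code   INTEGER REFERENCES PIZZA(Pizza_Code),
    Quantity     INTEGER,
    PRIMARY KEY (Order_Number, Pizza_Code)                   -- composite key of two foreign keys
);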
Now consider an order at the pizza shop. The tables in third normal
form can produce the order in the following manner by using the join
operation (see Figure 5.18). The join operation combines records
from two or more tables in a database to obtain information that is
located in different tables. In our example, the join operation
combines records from the four normalized tables to produce an
ORDER. Here is how the join operation works:
The ORDER relation provides the Order Number (the primary
key), Order Date, and Total Price.
The primary key of the ORDER relation (Order Number) provides
a link to the ORDER–PIZZA relation (the link numbered 1 in
Figure 5.18).
The ORDER–PIZZA relation supplies the Quantity to ORDER.
The primary key of the ORDER–PIZZA relation is a composite
key that consists of Order Number and Pizza Code. Therefore, the
Pizza Code component of the primary key provides a link to the
PIZZA relation (the link numbered 2 in Figure 5.18).
The PIZZA relation supplies the Pizza Name and Price to ORDER.
The Customer ID in ORDER (a foreign key) provides a link to the
CUSTOMER relation (the link numbered 3 in Figure 5.18).
The CUSTOMER relation supplies the Customer FName,
Customer LName, Customer Address, and Zip Code to ORDER.

FIGURE 5.18 The join process with the tables of third normal form to produce an order.
At the end of this join process, we have a complete ORDER.
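Using the hypothetical table and column names from the sketch above, the three links in Figure 5.18 correspond roughly to the following SQL join, which reassembles one complete order from the four normalized tables.

SELECT o.Order_Number, o.Order_Date, o.Total_Price,
       op.Quantity,
       p.Pizza_Name, p.Price,
       c.Customer_FName, c.Customer_LName, c.Customer_Address, c.Zip_Code
FROM ORDERS o
JOIN ORDER_PIZZA op ON op.Order_Number = o.Order_Number   -- link 1
JOIN PIZZA p        ON p.Pizza_Code    = op.Pizza_Code    -- link 2
JOIN CUSTOMER c     ON c.Customer_ID   = o.Customer_ID    -- link 3
WHERE o.Order_Number = 1116;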
Normalization is beneficial when maintaining databases over a period
of time. Consider, for example, a change in the price of a pizza. If the
pizza shop increases the price of the Meat Feast from $12.00 to $12.50,
the change requires only one easy step in Figure 5.18: the Price field in
the PIZZA relation is changed to $12.50, and any ORDER produced by
the join is automatically updated with the current price, as the brief
SQL sketch below illustrates.
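This sketch again uses the hypothetical PIZZA table from the earlier example; in practice the row would more likely be located by Pizza Code than by name.

UPDATE PIZZA
SET Price = 12.50
WHERE Pizza_Name = 'Meat Feast';   -- one change in one place; no order rows are touched

Every subsequent join that produces an ORDER then picks up the new price automatically.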

Before you go on …
1. What is structured query language?
2. What is query by example?
3. What is an entity? An attribute? A relationship?
4. Describe one-to-one, one-to-many, and many-to-many
relationships.
5. What is the purpose of normalization?
6. Why do we need the join operation?

What’s in IT for me?


For the Accounting Major

The accounting function is intimately concerned with keeping
track of the transactions and internal controls of an organization.
Modern databases enable accountants to perform these functions
more effectively. Databases help accountants manage the flood of
data in today’s organizations so that they can keep their firms in
compliance with the standards imposed by Sarbanes–Oxley.
Accountants also play a role in cost justifying the creation of a
knowledge base and then auditing its cost-effectiveness. Also, if
you work for a large CPA company that provides management
services or sells knowledge, you will most likely use some of your
company’s best practices that are stored in a knowledge base.
For the Finance Major
Financial managers make extensive use of computerized databases
that are external to the organization, such as CompuStat or Dow
Jones, to obtain financial data on organizations in their industry.
They can use these data to determine if their organization meets
industry benchmarks in return on investment, cash management,
and other financial ratios.
Financial managers, who produce the organization’s financial
status reports, are also closely involved with Sarbanes–Oxley.
Databases help these managers comply with the law’s standards.

For the Marketing Major


Databases help marketing managers access data from their
organization’s marketing transactions, such as customer
purchases, to plan targeted marketing campaigns and to evaluate
the success of previous campaigns. Knowledge about customers
can make the difference between success and failure. In many
databases and knowledge bases, the vast majority of information
and knowledge concerns customers, products, sales, and
marketing. Marketing managers regularly use an organization’s
knowledge base, and they often participate in its creation.

For the Production/Operations Management Major


Production/operations personnel access organizational data to
determine optimum inventory levels for parts in a production
process. Past production data enable production/operations
management (POM) personnel to determine the optimum
configuration for assembly lines. Firms also collect quality data
that inform them not only about the quality of finished products
but also about quality issues with incoming raw materials,
production irregularities, shipping and logistics, and after-sale use
and maintenance of the product.
Knowledge management is extremely important for running
complex operations. The accumulated knowledge regarding
scheduling, logistics, maintenance, and other functions is very
valuable. Innovative ideas are necessary for improving operations
and can be supported by knowledge management.

For the Human Resources Management Major


Organizations keep extensive data on employees, including gender,
age, race, current and past job descriptions, and performance
evaluations. HR personnel access these data to provide reports to
government agencies regarding compliance with federal equal
opportunity guidelines. HR managers also use these data to
evaluate hiring practices, evaluate salary structures, and manage
any discrimination grievances or lawsuits brought against the firm.
Databases help HR managers provide assistance to all employees
as companies turn over more and more decisions about healthcare
and retirement planning to the employees themselves. The
employees can use the databases for help in selecting the optimal
mix among these critical choices.
HR managers also need to use a knowledge base frequently to find
out how past cases were handled. Consistency in how employees
are treated is not only important but it also protects the company
against legal actions. Training for building, maintaining, and using
the knowledge system is also sometimes the responsibility of the
HR department. Finally, the HR department might be responsible
for compensating employees who contribute their knowledge to the
knowledge base.

For the MIS Major


The MIS function manages the organization’s data as well as the
databases. MIS database administrators standardize data names
by using the data dictionary. This process ensures that all users
understand which data are in the database. Database personnel
also help users access needed data and generate reports with query
tools.

What’s in IT for me? (Appendix: Section 5.6)


For All Business Majors

All business majors will have to manage data in their professional
work. One way to manage data is through the use of databases and
database management systems. First, it is likely that you will need
to obtain information from your organization’s databases. You will
probably use structured query language to obtain this information.
Second, as your organization plans and designs its databases, it
will most likely use entity–relationship diagrams. You will provide
much of the input to these ER diagrams. For example, you will
describe the entities that you use in your work, the attributes of
those entities, and the relationships among them. You will also
help database designers as they normalize database tables, by
describing how the normalized tables relate to each other (e.g.,
through the use of primary and foreign keys). Finally, you will help
database designers as they plan their join operations to give you
the information that you need when that information is stored in
multiple tables.

Summary
5.1 Discuss ways that common challenges in managing data can be
addressed using data governance.
The following are three common challenges in managing data:
Data are scattered throughout organizations and are collected by
many individuals using various methods and devices. These data
are frequently stored in numerous servers and locations and in
different computing systems, databases, formats, and human and
computer languages.
Data come from multiple sources.
Information systems that support particular business processes
impose unique requirements on data, which results in repetition
and conflicts across an organization.
One strategy for implementing data governance is master data
management. Master data management provides companies with the
ability to store, maintain, exchange, and synchronize a consistent,
accurate, and timely “single version of the truth” for the company’s
core master data. Master data management manages data gathered
from across an organization, manages data from multiple sources, and
manages data across business processes within an organization.
5.2 Discuss the advantages and disadvantages of relational databases.
Relational databases enable people to compare information quickly by
row or column. Users also can easily retrieve items by finding the
point of intersection of a particular row and column. However, large-
scale relational databases can be composed of numerous interrelated
tables, making the overall design complex, with slow search and access
times.
5.3 Define Big Data and its basic characteristics.
Big Data is composed of high-volume, high-velocity, and high-variety
information assets that require new forms of processing to enhance
decision making, lead to insights, and optimize business processes. Big
Data has three distinct characteristics that distinguish it from
traditional data: volume, velocity, and variety.
Volume: Big Data consists of vast quantities of data.
Velocity: Big Data flows into an organization at incredible speeds.
Variety: Big Data includes diverse data in differing formats.
5.4 Explain the elements necessary to successfully implement and
maintain data warehouses.
To successfully implement and maintain a data warehouse, an
organization must do the following:
Link source systems that provide data to the warehouse or mart.
Prepare the necessary data for the data warehouse using data
integration technology and processes.
Decide on an appropriate architecture for storing data in the data
warehouse or data mart.
Select the tools and applications for the variety of organizational
users.
Establish appropriate metadata, data quality, and governance
processes to ensure that the data warehouse or mart meets its
purposes.
5.5 Describe the benefits and challenges of implementing knowledge
management systems in organizations.
Organizations can realize many benefits with KMSs, including the
following:
Best practices readily available to a wide range of employees
Improved customer service
More efficient product development
Improved employee morale and retention
Challenges to implementing KMSs include the following:
Employees must be willing to share their personal tacit
knowledge.
Organizations must create a knowledge management culture that
rewards employees who add their expertise to the knowledge
base.
The knowledge base must be continually maintained and updated.
Companies must be willing to invest in the resources needed to
carry out these operations.
5.6 Understand the processes of querying a relational database,
entity-relationship modeling, and normalization and joins.
The most commonly performed database operation is requesting
information. Structured query language is the most popular query
language used for this operation. SQL allows people to perform
complicated searches by using relatively simple statements or key
words. Typical key words are SELECT (to specify a desired attribute),
FROM (to specify the table to be used), and WHERE (to specify
conditions to apply in the query).
Another way to find information in a database is to use query by
example. In QBE, the user fills out a grid or template—also known as a
form—to construct a sample or a description of the data desired. Users
can construct a query quickly and easily by using drag-and-drop
features in a DBMS such as Microsoft Access. Conducting queries in
this manner is simpler than keying in SQL commands.
Designers plan and create databases through the process of entity–
relationship modeling, using an entity–relationship diagram. ER
diagrams consist of entities, attributes, and relationships. Entities are
pictured in rectangles, and relationships are described on the lines
between entities. The attributes for each entity are listed, and the
primary key is underlined.
ER modeling is valuable because it allows database designers to
communicate with users throughout the organization to ensure that all
entities and the relationships among the entities are represented. This
process underscores the importance of taking all users into account
when designing organizational databases. Notice that all entities and
relationships in our example are labeled in terms that users can
understand.
Normalization is a method for analyzing and reducing a relational
database to its most streamlined form to ensure minimum
redundancy, maximum data integrity, and optimal processing
performance. When data are normalized, attributes in each table
depend only on the primary key.
The join operation combines records from two or more tables in a
database to produce information that is located in different tables.
Chapter Glossary
attribute Each characteristic or quality of a particular entity.
best practices The most effective and efficient ways to accomplish
business processes.
Big Data A collection of data so large and complex that it is difficult
to manage using traditional database management systems.
binary relationship A relationship that exists when two entities are
associated.
bit A binary digit—that is, a 0 or a 1.
business rules Precise descriptions of policies, procedures, or
principles in any organization that stores and uses data to generate
information.
byte A group of eight bits that represents a single character.
clickstream data Data collected about user behavior and browsing
patterns by monitoring users’ activities when they visit a website.
connectivity Describes the classification of a relationship: one-to-
one, one-to-many, or many-to-many.
database management system (DBMS) The software program
(or group of programs) that provides access to a database.
data dictionary A collection of definitions of data elements; data
characteristics that use the data elements; and the individuals,
business functions, applications, and reports that use these data
elements.
data file (also table) A collection of logically related records.
data governance An approach to managing information across an
entire organization.
data lake A central repository that stores all of an organization’s data,
regardless of their source or format.
data mart A low-cost, scaled-down version of a data warehouse that
is designed for the end-user needs in a strategic business unit (SBU) or
a department.
data model A diagram that represents entities in the database and
their relationships.
data warehouse A repository of historical data that are organized by
subject to support decision makers in the organization.
entity Any person, place, thing, or event of interest to a user.
entity–relationship (ER) diagram Document that shows data
entities and attributes and relationships among them.
entity–relationship (ER) modeling The process of designing a
database by organizing data entities to be used and identifying the
relationships among them.
explicit knowledge The more objective, rational, and technical
types of knowledge.
field A characteristic of interest that describes an entity.
foreign key A field (or group of fields) in one table that uniquely
identifies a row (or record) of another table.
instance Each row in a relational table, which is a specific, unique
representation of the entity.
intellectual capital (or intellectual assets) Other terms for
knowledge.
join operation A database operation that combines records from two
or more tables in a database.
knowledge management (KM) A process that helps organizations
identify, select, organize, disseminate, transfer, and apply information
and expertise that are part of the organization’s memory and that
typically reside within the organization in an unstructured manner.
knowledge management systems (KMSs) Information
technologies used to systematize, enhance, and expedite intra- and
interfirm knowledge management.
master data A set of core data, such as customer, product, employee,
vendor, geographic location, and so on, that spans an enterprise’s
information systems.
master data management A process that provides companies with
the ability to store, maintain, exchange, and synchronize a consistent,
accurate, and timely “single version of the truth” for the company’s
core master data.
multidimensional structure Storage of data in more than two
dimensions; a common representation is the data cube.
normalization A method for analyzing and reducing a relational
database to its most streamlined form to ensure minimum
redundancy, maximum data integrity, and optimal processing
performance.
primary key A field (or attribute) of a record that uniquely identifies
that record so that it can be retrieved, updated, and sorted.
query by example (QBE) To obtain information from a relational
database, a user fills out a grid or template—also known as a form—to
construct a sample or a description of the data desired.
record A grouping of logically related fields.
relational database model Data model based on the simple
concept of tables in order to capitalize on characteristics of rows and
columns of data.
relationships Operators that illustrate an association between two
entities.
secondary key A field that has some identifying information, but
typically does not uniquely identify a record with complete accuracy.
structured data Highly organized data in fixed fields in a data
repository such as a relational database that must be defined in terms
of field name and type (e.g., alphanumeric, numeric, and currency).
structured query language (SQL) The most popular query
language for requesting information from a relational database.
table A grouping of logically related records.
tacit knowledge The cumulative store of subjective or experiential
learning, which is highly personal and hard to formalize.
ternary relationship A relationship that exists when three entities
are associated.
transactional data Data generated and captured by operational
systems that describe the business’s activities, or transactions.
unary relationship A relationship that exists when an association is
maintained within a single entity.
unstructured data Data that do not reside in a traditional relational
database.

Discussion Questions
1. Is Big Data really a problem on its own, or are the use, control, and
security of the data the true problems? Provide specific examples to
support your answer.
2. What are the implications of having incorrect data points in your
Big Data? What are the implications of incorrect or duplicated
customer data? How valuable are decisions that are based on faulty
information derived from incorrect data?
3. Explain the difficulties involved in managing data.
4. What are the problems associated with poor-quality data?
5. What is master data management? What does it have to do with
high-quality data?
6. Explain why master data management is so important in companies
that have multiple data sources.
7. Describe the advantages and disadvantages of relational databases.
8. Explain why it is important to capture and manage knowledge.
9. Compare and contrast tacit knowledge and explicit knowledge.
10. Draw the entity–relationship diagram for a company that has
departments and employees. In this company, a department must
have at least one employee, and company employees may work in only
one department.
11. Draw the entity–relationship diagram for library patrons and the
process of checking out books.
12. You are working at a doctor’s office. You gather data on the
following entities: PATIENT, PHYSICIAN, PATIENT DIAGNOSIS,
and TREATMENT. Develop a table for the entity, PATIENT VISIT.
Decide on the primary keys and/or foreign keys that you want to use
for each entity.

Problem-Solving Activities
1. Access various employment websites (e.g., www.monster.com
and www.dice.com) and find several job descriptions for a database
administrator. Are the job descriptions similar? What are the salaries
offered in these positions?
2. Access the websites of several real estate companies. Find the sites
that do the following: take you through a step-by-step process for
buying a home, provide virtual reality tours of homes in your price
range (say, $200,000 to $250,000) and location, provide mortgage
and interest rate calculators, and offer financing for your home. Do the
sites require that you register to access their services? Can you request
that an e-mail be sent to you when properties in which you might be
interested become available? How do the processes outlined influence
your likelihood of selecting a particular company for your real estate
purchase?
3. It is possible to find many websites that provide demographic
information. Access several of these sites and see what they offer. Do
the sites differ in the types of demographic information they offer? If
so, how? Do the sites require a fee for the information they offer?
Would demographic information be useful to you if you wanted to
start a new business? If so, how and why?
4. Search the Web for uses of Big Data in homeland security.
Specifically, read about the spying by the U.S. National Security
Agency (NSA). What role did technology and Big Data play in this
questionable practice?
5. Search the web for the article “Why Big Data and Privacy Are Often
at Odds.” What points does this article present concerning the delicate
balance between shared data and customer privacy?
6. Access the websites of IBM (www.ibm.com) and Oracle
(www.oracle.com), and trace the capabilities of their latest data
management products, including Web connections.
7. Enter the website of the Gartner Group (www.gartner.com).
Examine the company’s research studies pertaining to data
management. Prepare a report on the state of the art of data
management.
8. Diagram a knowledge management system cycle for a fictional
company that sells customized T-shirts to students.
Chapter Closing Case
Data Enhances Fans’ Experience of the Tour de
France

Many sports, such as football, cricket, tennis, and baseball, provide
huge amounts of data to fans in addition to the video feeds
themselves. These data increase fans’ enjoyment and improve their
understanding of various sporting events. However, the data can
become overwhelming, and some fans still prefer to simply watch
sporting events.
In 1903, the French sports newspaper L’Auto established the Tour
de France as a marketing tool to increase sales. Although the Tour
de France is world famous, it is also a challenge to watch.
Television viewers can see the racers from numerous camera
angles, but it is difficult to understand the subtleties and tactics of
elite professional bicyclists. In addition, the race lasts for three
weeks. Compounding these issues, the Tour’s followers are more
digitally engaged on social media than ever before, and they want a
live and compelling experience during the Tour. This improved
experience requires data.
Today, the Tour is deploying technology to capture data and
perform analytics in near real time to help engage a new
generation of digital fans. The 2018 Tour de France provided more
data to fans than ever before, explaining and analyzing the
performance of the racers in detail for the duration of the race.
Unlike other sports, which typically occur in a single venue, the
Tour de France presents a unique set of challenges for using
technology to capture and distribute data about the race and its
cyclists. Information technology services firm Dimension Data
(Dimension; www.dimensiondata.com) has provided and
managed the technology behind the Tour since the 2015 race.
Dimension has to gather and transmit data as the cyclists travel
over difficult terrain such as up and down mountain roads in the
Pyrenees and the Alps, where wireless signals can be intermittent
and weak. The manner in which Dimension handles these
conditions provides an excellent example of how to deal with high-
volume, real-time data from Internet of Things (IoT; see Chapter
8) sensors in often harsh conditions.
Each of the 176 riders who began the 2018 race had a custom, 100-
gram (3.5-ounce) sensor on his bicycle. The sensor contained a
global positioning system (GPS) chip, a radio frequency
identification (RFID) chip, and a rechargeable battery with enough
power to last for the longest of the Tour’s 21 stages. Each sensor
transmitted its GPS data every second, producing a total of more
than 3 billion data points during the course of the race, a vast
increase in data from the 128 million data points generated in the
2016 Tour de France. Dimension also collected data from other
sources such as weather services and road gradients (steepness of
climbs).
The data were entered into new machine learning algorithms (see
Technology Guide 4), which predicted breakaway speeds, time
gaps among riders, and even the winner of each stage. Dimension
achieved approximately a 75 percent success rate on all its
predictions for the 2018 Tour.
The bicycles’ sensors transmitted data over a mesh network (see
Chapter 8) via antennas placed on race cars that followed the
cyclists, up to television helicopters overhead. The helicopters then
sent the data to an aircraft at higher altitude, which in turn
transmitted the data to the television trucks at the finish line of
each stage of the race. There, the data were gathered by
Dimension’s “Big Data truck,” which was also located at the finish
line of each stage. Dimension notes that the transmission distances
and latency (lag time) associated with satellites are too slow for the
Tour.
Dimension’s cloud computing service (see Technology Guide 3),
based in data centers in London and Amsterdam, collected data
from each stage of the race. At the data centers, Dimension
produced the near-real-time information that was transmitted to
broadcasters, social media, and the race app. The entire process
from bike to viewer took a total of two seconds.
Dimension integrated and analyzed these data. As a result, race
organizers, teams, broadcasters, commentators, television viewers,
and fans using the Tour de France mobile app had access to in-
depth statistics on the progress of the race and their favorite riders.
As recently as 2014, the only way for riders to obtain real-time
information (e.g., road and weather conditions ahead, accidents
that may have occurred, positions of other riders and teams)
during the Tour de France was from a chalkboard that was held up
by race officials who sat as passengers on motorcycles driving just
ahead of the cyclists. Today, the riders wear earpiece radios so
their teams can relay real-time data (often from Dimension) to
them while they cycle. In that way, riders do not have to take their
eyes off the road—and other riders—to look for chalkboards
containing information for them.
Fans have enjoyed the insights that they have gained. For example,
Dimension displayed an image of a high-speed crash during the
2015 Tour that contained data on the speed and sudden
deceleration of the bikes involved. The data suggested that some
riders had somehow accelerated after the moment of impact.
Further analysis revealed that the bikes themselves, with their
sensors still attached, had greater forward momentum after their
riders were thrown off, because the weight on each bike had
decreased. Some bikes involved in that crash seemingly accelerated
to race speeds soon after the crash. In fact, they had been picked
up by team cars following the peloton (the main group) of cyclists
and were driven away.
In the 2018 Tour, data suggested that some riders achieved top
speeds of over 60 miles per hour. Some fans were doubtful until
one rider tweeted a photo of his speedometer showing that his
speed peaked at more than 62 miles per hour along one downhill
section of the race.
However, technology can be wrong. For example, GPS data can be
corrupted by signal noise during transmission. As a result, during
the 2015 Tour, the GPS data indicated that one cyclist was in
Kenya. Today, all data are checked for these types of errors before
they are processed.
Today, the technology is opening up the Tour to new and old fans.
For example, in 2014, video clips distributed by race organizers
attracted 6 million views. By the 2018 race, that number had
grown to more than 70 million.
Eurosport (www.eurosport.com) is a European television
sports network that is owned and operated by Discovery
Communications. The network reported that average viewer
numbers for live coverage of the 2018 Tour de France increased by
15 percent over the 2017 Tour.
Dimension is developing new services for the 2019 Tour. First is a
new predictive cybersecurity protection model because the race is
subject to a huge number of online attacks. For instance,
Dimension was able to block almost two million suspicious access
attempts during the 2017 Tour. Second, Dimension is developing
an augmented-reality app that would allow viewers to experience,
for example, an Alpine climb through a projection on a mobile app.
Third, Dimension is developing dedicated chatbots that could
provide real-time information on any specific rider at any point of
the race.

Sources: Compiled from “Technology, Data, and Connected Cycling
Teams at the Tour de France: What Can Businesses Learn from the World’s
Biggest Race?” TechRadar Pro, July 20, 2018; M. Moore, “Tour de France
2018: Why This Year’s Race Will Be the Smartest Yet,” TechRadar, July 6,
2018; “How Big Data, IoT, and the Cloud Are Transforming the Tour de
France Fan Experience,” NTT Innovation Institute, August 10, 2017; M.
Smith, “How Sophisticated High-Tech Analytics Transformed the Tour de
France,” t3.com, July 31, 2017; M. Phillips, “19 Weird Tech Secrets of the
2017 Tour de France,” Bicycling, July 21, 2017; B. Glick, “Data Takes to the
Road – The Technology Behind the Tour de France,” Computer Weekly,
July 20, 2017; G. Scott, “Six Tech Trends from the 2017 Tour de France,”
Road Cycling UK, July 18, 2017; R. Guinness, “Bike Development Is Vital
but Technology Alone Won’t Win Tour de France,” ESPN, July 8, 2017;
