Data Science Unit 1 Notes
UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing – Basic
Statistical descriptions of Data
INTRODUCTION
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains.
Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data science and big data are used almost everywhere in both commercial and
noncommercial settings.
Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products. Many companies use
data science to offer customers a better user experience, as well as to cross-sell, up-sell, and
personalize their offerings. A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched to the person browsing the
internet.
Financial institutions use data science to predict stock markets, determine the risk of
lending money, and learn how to attract new clients for their services.
Governmental organizations are also aware of data’s value. Data science can help extract
useful information and knowledge from large volumes of data in order to improve government
decision making.
Nongovernmental organizations (NGOs) are also no strangers to using data. They use
it to raise money and defend their causes. The World Wildlife Fund (WWF), for instance,
employs data scientists to increase the effectiveness of their fundraising efforts. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources to
collect data and employ data scientists. DataKind is one such data scientist group that devotes its
time to the benefit of mankind.
Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOC) produces a lot of data, which
allows universities to study how this type of learning can complement traditional classes.
MOOCs are an invaluable asset if you want to become a data scientist and big data professional,
so definitely look at a few of the better-known ones: Coursera, Udacity, and edX. The big data
and data science landscape changes quickly, and MOOCs allow you to stay up to date by
following courses from top universities. If you aren’t acquainted with them yet, take time to do
so now; you’ll come to love them as we have.
FACETS OF DATA
In data science and big data you’ll come across many different types of data, and each of them
tends to require different tools and techniques. The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Let’s explore all these interesting data types.
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record.
As such, it’s often easy to store structured data in tables within databases or Excel files (figure
1.1). SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases. You may also come across structured data that might give you a hard time
storing it in a traditional relational database. Hierarchical data such as a family tree is one such
example.
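A minimal sketch of querying structured data, using Python's built-in sqlite3 module; the table
name, columns, and rows below are invented for the example:

import sqlite3

# Build a small in-memory database with one structured table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Chennai"), ("Bob", "Delhi"), ("Carol", "Chennai")],
)

# SQL is the standard way to query data that sits in fixed fields within records.
for (name,) in conn.execute("SELECT name FROM customers WHERE city = ?", ("Chennai",)):
    print(name)

conn.close()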
The world isn’t made up of structured data, though; it’s imposed upon it by humans and
machines. More often, data comes unstructured.
DATA SCIENCE PROCESS: OVERVIEW
1 The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.
2 The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3 Now that you have the raw data, it’s time to prepare it. This includes transforming the data
from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to data visualization
and modeling.
4 The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5 Finally, we get to the most exciting part: model building (often simply referred to as “data
modeling”). It is now that you attempt to gain the insights or make the predictions
stated in your project charter. Now is the time to bring out the heavy guns, but remember
research has taught us that often (but not always) a combination of simple models tends to
outperform one complicated model. If you’ve done this phase right, you’re almost done.
6 The last step of the data science process is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects require you to perform
the business process over and over again, so automating the project will save time.
In reality you won’t progress in a linear way from step 1 to step 6. Often you’ll regress and
iterate between the different phases.
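A compressed sketch of how these six phases might look in code, assuming pandas and
scikit-learn are installed and a hypothetical file sales.csv with columns ad_spend and revenue; a
real project would be far more elaborate at every step:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: research goal (from the project charter): predict revenue from ad spend.
# Step 2: retrieve the raw data (the file name is hypothetical).
raw = pd.read_csv("sales.csv")

# Step 3: data preparation - a very simple cleaning step: drop rows with missing values.
clean = raw.dropna(subset=["ad_spend", "revenue"])

# Step 4: data exploration - descriptive statistics and correlations.
print(clean.describe())
print(clean[["ad_spend", "revenue"]].corr())

# Step 5: model building - fit a simple regression model on a training split.
X_train, X_test, y_train, y_test = train_test_split(
    clean[["ad_spend"]], clean["revenue"], test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Step 6: present findings - here, just report the model's score on held-out data.
print("R^2 on held-out data:", model.score(X_test, y_test))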
OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
In a distribution plot, data without outliers looks compact, whereas possible outliers on the
upper side appear as isolated points far above the bulk of the data when a normal distribution is
expected. The normal distribution, or Gaussian distribution, is the most common distribution in
natural sciences.
Distribution plots are helpful in detecting outliers and helping you understand the variable.
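A minimal sketch of flagging possible outliers with the boxplot (interquartile-range) rule, using
NumPy on made-up observations:

import numpy as np

# Made-up observations; the last value follows a different process than the rest.
values = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 12.6])

# The easiest first check: the minimum and maximum of the variable.
print("minimum:", values.min(), "maximum:", values.max())

# Boxplot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print("possible outliers:", outliers)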
APPENDING TABLES
Appending or stacking tables is effectively adding observations from one table to another table.
One table contains the observations from the month January and the second table contains
observations from the month February. The result of appending these tables is a larger one with
the observations from January as well as February. The equivalent operation in set theory would
be the union, and this is also the command in SQL, the common language of relational databases.
Other set operators are also used in data science, such as set difference and intersection.
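Appending two monthly tables can be sketched with pandas; the column names and rows are
made up, and pd.concat here mirrors SQL's UNION ALL (UNION additionally removes
duplicate rows):

import pandas as pd

# Hypothetical observation tables for January and February.
january = pd.DataFrame({"customer": ["A", "B"], "amount": [120, 80]})
february = pd.DataFrame({"customer": ["C", "D"], "amount": [95, 60]})

# Appending (stacking): add the February observations below the January ones.
both_months = pd.concat([january, february], ignore_index=True)
print(both_months)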
Transforming data
Certain models require their data to be in a certain shape. Now that you’ve cleansed and
integrated the data, this is the next task you’ll perform: transforming your data so it takes a
suitable form for data modeling.
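Two common transformations, sketched with pandas and NumPy on made-up data: a log
transform for a heavily skewed numeric variable, and dummy (one-hot) columns for a
categorical variable:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 250000],
                   "region": ["north", "south", "north"]})

# Log-transform the skewed numeric variable so many models handle it better.
df["log_income"] = np.log(df["income"])

# Turn the categorical variable into indicator (dummy) columns.
df = pd.get_dummies(df, columns=["region"])
print(df)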
(Figures: examples of a bar chart, a line chart, and a distribution chart.)
STEP 5: BUILD THE MODELS
Using machine learning and statistical techniques to achieve your project goal.
With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions, classifying objects, or gaining an understanding of the
system that you’re modeling. The techniques you’ll use now are borrowed from the field of
machine learning, data mining, and/or statistics.
Building a model is an iterative process. The way you build your model depends on whether you
go with classic statistics or the somewhat more recent machine learning school, and the type of
technique you want to use. Either way, most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison.
Model and variable selection
You’ll need to select the variables you want to include in your model and a modeling technique.
Your findings from the exploratory analysis should already give a fair idea of what variables will
help you construct a good model. Many modeling techniques are available, and choosing the
right model for a problem requires judgment on your part. You’ll need to consider model
performance and whether your project meets all the requirements to use your model, as well as
other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain? When the thinking is done, it’s time for action.
Model execution
Once you’ve chosen a model you’ll need to implement it in code.
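As a sketch of model execution, the example below fits an ordinary least squares model with the
statsmodels package on made-up data; the summary output doubles as a first diagnostic for the
model comparison step:

import numpy as np
import statsmodels.api as sm

# Made-up data: one predictor and a noisy linear response.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 3 + rng.normal(0, 1, size=100)

# Add an intercept term and fit the model.
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Coefficients, p-values, and R-squared help diagnose and compare candidate models.
print(results.summary())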
STEP 6: PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF
THEM
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. This is an exciting part; all your hours of hard work have paid
off and you can explain what you found to the stakeholders.
Presenting your results to the stakeholders and industrializing your analysis process for repetitive
reuse and integration with other tools.
WHAT IS DATA MINING?
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer; a more appropriate name would have been knowledge mining, which
emphasizes mining knowledge from large amounts of data. It is the
computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for further use. The key properties of data mining are
■ Automatic discovery of patterns
■ Prediction of likely outcomes
■ Creation of actionable information
■ Focus on large datasets and databases
o Automated prediction of trends and behaviors. Data mining automates the process of
finding predictive information in large databases. Questions that traditionally required
extensive hands-on analysis can now be answered directly from the data — quickly. A
typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on
investment in future mailings. Other predictive problems include forecasting bankruptcy
and other forms of default, and identifying segments of a population likely to respond
similarly to given events.
o Classification – the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam" (a minimal sketch follows this list).
o Regression – attempts to find a function which models the data with the least error.
o Summarization – providing a more compact representation of the data set, including
visualization and report generation.
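A minimal sketch of the classification task, using a scikit-learn decision tree on an invented toy
"spam" dataset (the features and labels are made up for illustration):

from sklearn.tree import DecisionTreeClassifier

# Made-up features per e-mail: [number of links, count of the word "free"].
X = [[0, 0], [1, 0], [7, 4], [9, 6], [0, 1], [8, 5]]
y = ["legitimate", "legitimate", "spam", "spam", "legitimate", "spam"]

# Generalize the known structure from labelled examples, then apply it to new data.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[6, 3]]))  # a link-heavy message will likely be labelled "spam"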
A typical data mining system may have the following major components.
Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may
also be included. Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.
User interface:
This module communicates between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
DATA WAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Warehouse Design Process: A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both. In terms of architecture, a data warehouse often
adopts a three-tier structure:
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g.,
to merge similar data from different sources into a unified format), as well as load and refresh
functions to update the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and allows
client programs to generate SQL code to be executed at a server. Examples of gateways include
ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database)
by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse
and its contents.
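A rough sketch of the extract step through an ODBC gateway using the pyodbc package; the
data source name, credentials, and table are all hypothetical, and production warehouses rely on
dedicated ETL tools rather than ad hoc scripts:

import pyodbc

# Connect through an ODBC gateway; "SalesDB" is a hypothetical data source name.
conn = pyodbc.connect("DSN=SalesDB;UID=etl_user;PWD=secret")

# Extract: the client sends SQL that the gateway executes on the source server.
cursor = conn.cursor()
cursor.execute("SELECT order_id, amount, order_date FROM orders WHERE order_date >= ?",
               "2024-01-01")
rows = cursor.fetchall()

# A simple transform before loading: round amounts to two decimals (illustrative only).
cleaned = [(row.order_id, round(row.amount, 2), row.order_date) for row in rows]
conn.close()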
Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP server is an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.
A MOLAP server is a special-purpose server that directly implements multidimensional data
and operations.
Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
BASIC STATISTICAL DESCRIPTION OF DATA
Descriptive Statistics
Statistics provides tools—tables, graphs, and summary measures such as averages—for
organizing and summarizing collections of actual observations. This area is known as
descriptive statistics.
Inferential Statistics
Statistics also provides tools—a variety of tests and estimates—for generalizing beyond
collections of actual observations. This more advanced area is known as inferential statistics.
For example: an assertion about the relationship between job satisfaction and overall happiness.
MODE
The mode reflects the value of the most frequently occurring score—the value in a set of
numbers that appears most often.
Ex: Determining the mode of the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60,
65, 63. (Here, 63 occurs the most.)
MEDIAN
The median reflects the middle value when observations are ordered from least to
most.
The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
MEAN
The mean is the most common average, one you have doubtless calculated many times.
The mean is found by adding all scores and then dividing by the number of scores.
That is,
mean = (sum of all scores) / (number of scores)
It’s usually more efficient to substitute symbols for words in statistical formulas,
including the word formula given above for the mean. When symbols are used, X̄ designates the
sample mean, and the formula becomes
X̄ = ΣX / n
and reads: “X-bar equals the sum of the variable X divided by the sample size n.” [Note that the
uppercase Greek letter sigma (Σ) is read as “the sum of,” not as “sigma.”]
The formula for the population mean differs from that for the sample mean only because
of a change in some symbols. In statistics, Greek symbols usually describe population
characteristics, such as the population mean, while English letters usually describe sample
characteristics, such as the sample mean. The population mean is represented by μ (pronounced
“mu”), the lowercase Greek letter m for mean, and the formula becomes
μ = ΣX / N
where the uppercase letter N refers to the population size. Otherwise, the calculations are the
same as those for the sample mean.
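All three averages can be computed with Python's built-in statistics module, shown here on the
retirement-age data from the mode example:

import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

print("mode:", statistics.mode(ages))      # most frequently occurring value -> 63
print("median:", statistics.median(ages))  # middle value of the ordered list -> 63
print("mean:", statistics.mean(ages))      # sum of the values divided by their count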
Sample question: All 57 residents in a nursing home were surveyed to see how many meals a
day they eat.
1 meal (2 people)
2 meals (7 people)
3 meals (28 people)
4 meals (12 people)
5 meals (8 people)
What is population mean for the number of meals eaten per day?
= [(1 × 2) + (2 × 7) + (3 × 28) + (4 × 12) + (5 × 8)] / 57
= (2 + 14 + 84 + 48 + 40) / 57
= 188 / 57
≈ 3.3
The population mean is approximately 3.3 meals per day.
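The same arithmetic as a short Python check of the worked example (values taken from the
table above):

# Meals per day and the number of residents reporting each value.
meals = [1, 2, 3, 4, 5]
residents = [2, 7, 28, 12, 8]

total_meals = sum(m * r for m, r in zip(meals, residents))  # 188
population_size = sum(residents)                            # N = 57
print(round(total_meals / population_size, 2))              # approximately 3.3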
WHICH AVERAGE
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
Positively skewed distribution
In a positively skewed distribution, the mean is greater than the median and the tail extends in
the positive direction (to the right).
Negatively skewed distribution
In a negatively skewed distribution, the mean is less than the median and the tail extends in the
negative direction (to the left).
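A quick numerical check of the mean/median relationship under skew, using a right-skewed
(lognormal) sample generated with NumPy; the exact numbers vary with the random seed:

import numpy as np

rng = np.random.default_rng(1)

# Right-skewed (positively skewed) sample: a long tail toward large values.
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10000)
print("positively skewed -> mean > median:",
      right_skewed.mean(), np.median(right_skewed))

# Mirroring the sample gives a left-skewed (negatively skewed) distribution.
left_skewed = -right_skewed
print("negatively skewed -> mean < median:",
      left_skewed.mean(), np.median(left_skewed))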