

1.1 Data Science

• Data is measurable units of information gathered or captured from the activity of people, places and things.

• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.

• Data science combines math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.

• Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future from historical data. Instead of only knowing how many products were sold in the previous quarter, data science helps in forecasting future product sales and revenue more accurately.

• Data science is devoted to the extraction of clean information from raw data to form
actionable insights. Data science practitioners apply machine learning algorithms to
numbers, text, images, video, audio and more to produce artificial intelligence systems to
perform tasks that ordinarily require
human intelligence.

• The data science field is growing rapidly and revolutionizing many industries. It has incalculable benefits in business, research and our everyday lives.

• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data and they often use advanced algorithms and implement machine learning models to help businesses and organizations make accurate assessments and predictions.

• Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.

• Life cycle of data science :

1. Capture : Data acquisition, data entry, signal reception and data extraction.

2. Maintain : Data warehousing, data cleansing, data staging, data processing and data architecture.

3. Process : Data mining, clustering and classification, data modeling and data summarization.

4. Analyze : Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.

5. Communicate : Data reporting, data visualization, business intelligence and decision making.


1.1.1 Big Data


• Big data can be defined as very large volumes of data, available from various sources in varying degrees of complexity, generated at different speeds (i.e. velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf solutions.

• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

1.1.2 Characteristics of Big Data

• Characteristics of big data are volume, velocity and variety. They are often referred to as the three V's.

1. Volume : Volumes of data are larger than conventional relational database infrastructure can cope with, consisting of terabytes or petabytes of data.

2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. It is being created in or near real-time.

3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured.

These three dimensions are also called the three V's of Big Data.

Volume : 1. Records  2. Pictures  3. Videos  4. Terabytes

Velocity : 1. Batch  2. Stream  3. Real-time processing

Variety : 1. Structured  2. Semi-structured  3. Unstructured

• Two other characteristics of big data are veracity and value.

a) Veracity :

• Veracity refers to source reliability, information credibility and content validity.

• Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative ? Every good manager knows that there are inherent discrepancies in all the data collected.


• Spatial veracity : For vector data (imagery based on points, lines and polygons), quality varies. It depends on whether the points have been GPS determined, determined from unknown origins or entered manually. Also, resolution and projection issues can alter veracity.

• For geo-coded points, there may be errors in the address tables and in the point location algorithms associated with addresses.

• For raster data (imagery based on pixels), veracity depends on the accuracy of remote sensing instruments in satellites or aerial devices and on timeliness.

b) Value :

• It represents the business value to be derived from big data.

• The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, the user is just performing a technological task for technology's sake.

• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.

• Exploration of data trends can include spatial proximities and relationships.

• Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.

1.1.3 Difference between Data Science and Big Data

Sr. No. 1
Data Science : It is a field of scientific analysis of data in order to solve analytically complex problems, including the significant and necessary activity of cleansing and preparing data.
Big Data : Big data is the storing and processing of large volumes of structured and unstructured data that is not possible with traditional applications.

Sr. No. 2
Data Science : It is used in biotech, energy, gaming and insurance.
Big Data : It is used in retail, education, healthcare and social media.

Sr. No. 3
Data Science : Goals are data classification, anomaly detection, prediction, scoring and ranking.
Big Data : Goals are to provide better customer service, identify new revenue opportunities, effective marketing etc.

Sr. No. 4
Data Science : Tools mainly used in Data Science include SAS, R, Python, etc.
Big Data : Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.



1.1.4 Comparison between Cloud Computing and Big Data

Sr. No. 1
Cloud Computing : It provides resources on demand.
Big Data : It provides a way to handle huge volumes of data and generate insights.

Sr. No. 2
Cloud Computing : It refers to internet services, from SaaS and PaaS to IaaS.
Big Data : It refers to data, which can be structured, semi-structured or unstructured.

Sr. No. 3
Cloud Computing : Cloud is used to store data and information on remote servers.
Big Data : It is used to describe huge volumes of data and information.

Sr. No. 4
Cloud Computing : Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation.
Big Data : Big data is a highly scalable, robust and cost-effective ecosystem.

Sr. No. 5
Cloud Computing : Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM.
Big Data : Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.

Sr. No. 6
Cloud Computing : The main focus of cloud computing is to provide computer resources and services with the help of a network connection.
Big Data : The main focus of big data is solving problems when a huge amount of data is being generated and processed.

1.1.5 Benefits and Uses of Data Science

Data science examples and applications :

a) Anomaly detection : Fraud, disease and crime

b) Classification : Background checks; an email server classifying emails as "important"

c) Forecasting : Sales, revenue and customer retention

d) Pattern detection : Weather patterns, financial market patterns

e) Recognition : Facial, voice and text

f) Recommendation : Based on learned preferences, recommendation engines can refer users to movies, restaurants and books

g) Regression : Predicting food delivery times, predicting home prices based on amenities

h) Optimization : Scheduling ride-share pickups and package deliveries


1.1.6 Benefits and Uses of Big Data

Benefits of Big Data :

1. Improved customer service

2. Businesses can utilize outside intelligence while taking decisions

3. Reducing maintenance costs

4. Early identification of risk to the product/services, if any

5. Re-develop our products : Big data can also help us understand how others perceive our products, so that we can adapt them, or our marketing, if need be.

6. Better operational efficiency

Some of the examples of big data are :

1. Social media : Social media is one of the biggest contributors to the flood of data we have today. Facebook generates around 500 terabytes of data every day in the form of content generated by the users, like status messages, photo and video uploads, messages, comments etc.

2. Stock exchange : Data generated by stock exchanges is also in terabytes per day. Most of this data is the trade data of users and companies.

3. Aviation industry : A single jet engine can generate around 10 terabytes of data during a 30-minute flight.

4. Survey data : Online or offline surveys conducted on various topics typically have hundreds or thousands of responses and need to be processed for analysis and visualization by creating clusters of the population and their associated responses.

5. Compliance data : Many organizations like healthcare, hospitals, life sciences, finance etc. have to file compliance reports.

1.2 Facets of Data

• Very large amounts of data are generated in big data and data science. These data are of various types, and the main categories of data are as follows :

a) Structured

b) Unstructured

c) Natural language

d) Machine-generated

e) Graph-based

f) Audio, video and images

g) Streaming


1.2.1 Structured Data

• Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data.

• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data or records is a database where specific information is stored based on a methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
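A small sketch in Python can make the row-and-column idea concrete. The pandas library and the column names used here (emp_id, name, salary) are illustrative assumptions, not something prescribed by this section.

import pandas as pd

# Structured data : every record has the same named columns, like an Excel table.
employees = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "salary": [65000, 72000, 58000],
})

# Because the structure is known, the data can be searched by column and data type.
print(employees[employees["salary"] > 60000])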

1.2.2 Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure.

• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data.

• Even today, in most organizations more than 80 % of the data is in unstructured form. This carries lots of information, but extracting information from these various sources is a very big challenge.

Characteristics of unstructured data :

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restrictions or sequences for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

1.2.3 Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and sentences, then apply meaning and understanding to that information. This helps machines to understand language as humans do.



• Natural language processing is the driving force behind machine intelligence in many modern real-world applications. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprised of several layers of text analysis.

1.2.4 Machine-Generated Data


• Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not recognized to be machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the increase of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.

1.2.5 Graph-based or Network Data

• Graphs are data structures used to describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes, called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing a very flexible way of thinking about and using it.
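As a minimal sketch of the node-and-edge idea (assuming the third-party networkx library is installed; the entity names are hypothetical):

import networkx as nx

g = nx.Graph()
g.add_edge("Alice", "Bob")         # a relationship between two entities (nodes)
g.add_edge("Bob", "BookStore")     # nodes may be of any object type
g.add_edge("Alice", "BookStore")

# Traverse relationships directly, instead of joining tables.
print(list(g.neighbors("Bob")))    # ['Alice', 'BookStore']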


• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships to process financial and purchase transactions in near-real time. With fast graph queries, we are able to detect that, for example, a potential purchaser is using the same email address and credit card as included in a known fraud case.

• Graph databases can also help users easily detect relationship patterns such as multiple people associated with a personal email address, or multiple people sharing the same IP address but residing at different physical addresses.

• Graph databases are a good choice for recommendation applications. With graph databases, we can store in a graph relationships between information categories such as customer interests, friends and purchase history. We can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have similar purchase history.

• Graph theory is probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as the nodes and links (for example, influencers and followers).

• Influencers on a social network have been identified as users that have an impact on the activities or opinions of other users, by way of followership or influence on decisions made by other users on the network, as shown in Fig. 1.2.1.

Fig. 1.2.1 : Influencer and followers

Fig. 1.2.2 : Graph on 5 vertices


• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is because it is capable of by-passing the building of an actual visual representation of the data to run directly on data matrices.

1.2.6 Audio, Image and Video


• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

• The terms audio and video commonly refer to the time-based media storage format for sound/music and moving picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
sources of
is one of the most important
remark that multimedia data
. It is important to
and indexing of multimedia
the integration. transfornmation
information and knowledge:
data management and
analysis, Mny challengeshave
i
challenges
bring significant
data
naure of Data Science and
big data. multidisciplinar
to be addressed including

heterogeneity.
in multimediadata.
address these challenges
. Data Science is playing an important role
to

forms of media, such as text.


image. video,
Multimedia data usually contains various

even pulse waveforms, which


come from multiple sources
geographic coordinates and
and data mining
covering big data. machine leaning
Data Science can be a key instrument
data.
solutions to store, handle and analyze such heterogeneous

1.2.7 Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of kilobytes).


• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers.

1.2.8 Difference between Structured and Unstructured Data

Sr. No. 1 - Representation
Structured data : It is in discrete form, i.e. stored in row and column format.
Unstructured data : It is data that does not follow a specified format.

Sr. No. 2 - Meta data
Structured data : Syntax.
Unstructured data : Semantics.

Sr. No. 3 - Storage
Structured data : Database management system.
Unstructured data : Unmanaged file structure.

Sr. No. 4 - Standards
Structured data : SQL, ADO.NET, ODBC.
Unstructured data : Open XML, SMTP, SMS.

Sr. No. 5 - Integration tool
Structured data : ETL.
Unstructured data : Batch processing or manual data entry.

Sr. No. 6 - Characteristics
Structured data : With a structured document, certain information always appears in the same location on the page.
Unstructured data : In an unstructured document, information can appear in unexpected places on the document.

Sr. No. 7 - Used by organizations
Structured data : Low volume operations.
Unstructured data : High volume operations.

1.3 Data Science Process

• Data science process consists of six stages :

1. Discovery or setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation


• Fig. 1.3.1 shows data science design process.



Defining research goals → Retrieving data → Data preparation → Exploratory data analysis → Build the model → Presenting findings and building applications

Fig. 1.3.1 : Data science design process


• Step 1 : Discovery or defining research goal

This step involves acquiring data from all the identified internal and external sources, which helps to answer the business question.

• Step 2 : Retrieving data

It is a collection of data which is required for the project. This is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it. It also entails determining what each of the data points means in terms of the company. If we are given a data set from a client, for example, we shall need to know what each column and row represents.

• Step 3 : Data preparation

Data can have many inconsistencies like missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition data before modeling. Clean data gives better predictions.

• Step 4 : Data exploration

Data exploration is related to a deeper understanding of the data. Try to understand how variables interact with each other, the distribution of the data and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called exploratory data analysis.



• Step 5 : Data modeling

In this step, the actual model building process starts. Here, the data scientist distributes datasets for training and testing. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.

• Step 6 : Presentation and automation

Deliver the final baselined model with reports, code and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing. In this stage, the key findings are communicated to all stakeholders. This helps to decide if the project results are a success or a failure based on the inputs from the model.

1.4 Defining Research Goals


• To understand the project, three concepts must be understood : what, why and how.

a) What is the expectation of the company or organization ?

b) Why does the company's higher authority define such research value ?

c) How is it part of a bigger strategic picture ?

• The goal of the first phase is to answer these three questions.

• In this phase, the data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project.

1. Learning the business domain :

• Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines.

• Data scientists have deep knowledge of the methods, techniques and ways for applying heuristics to a variety of business and conceptual problems.

2. Resources :

• As part of the discovery phase, the team needs to assess the resources available to support the project. In this context, resources include technology, tools, systems, data and people.

3. Frame the problem :

• Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the key stakeholders.

• Each team member may hear slightly different things related to the needs and the problem, and may have somewhat different ideas of possible solutions.

4. Identifying key stakeholders :

• The team can identify the success criteria, key risks and stakeholders, which should include anyone who will benefit from the project or will be significantly impacted by the project.

• When interviewing stakeholders, learn about the domain area and any relevant history from similar analytics projects.
from similar analytics

5. Interviewing the analytics sponsor :

• The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.

• At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome.

• In these cases, the team must use its knowledge and expertise to identify the true underlying problem and an appropriate solution.

• When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project sponsor, who tends to be the one funding the project or providing the high-level requirements.

• This person understands the problem and usually has an idea of a potential working solution.

6. Developing initial hypotheses :

• This step involves forming ideas that the team can test with data. Generally, it is best to come up with a few primary hypotheses to test and then be creative about developing several more.

• These initial hypotheses form the basis of the analytical tests the team will use in later phases and serve as the foundation for the eventual findings.

7. Identifying potential data sources :

• Consider the volume, type and time span of the data needed to test the hypotheses.

• Ensure that the team can access more than simply aggregated data. In most cases, the team will need the raw data to avoid introducing bias into the downstream analysis.


1.5 Retrieving Data

• Retrieving required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process. Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.

• Most of the high quality data is freely available for public and commercial use. Data can be stored in various formats : in text file format and as tables in a database. Data may be internal or external.

1. Start working on internal data, i.e. data stored within the company

• The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data that's readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses and data lakes maintained by a team of IT professionals.

• A data repository is also known as a data library or data archive. This is a general term to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and reporting.

• A data repository can be used to describe several ways to collect and store data :

a) A data warehouse is a large data repository that aggregates data, usually from multiple sources or segments of a business, without the data being necessarily related.

b) A data lake is a large data repository that stores unstructured data that is classified and tagged with metadata.

c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user needs and easier to use.

d) Metadata repositories store data about data and databases. The metadata explains where the data is sourced from, how it was captured and what it represents.

e) Data cubes are lists of data with three or more dimensions stored as a table.

Advantages of data repositories :

i. Data is preserved and archived.

ii. Data isolation allows for easier and faster data reporting.

iii. Database administrators have an easier time tracking problems.

iv. There is value in storing and analyzing data.


Disadvantages of data repositories :

i. Growing data sets could slow down systems.

ii. A system crash could affect all the data.

iii. Unauthorized users can access all sensitive data more easily than if it was distributed across several locations.

2. Do not be afraid to shop around

• If required data is not available within the company, take the help of other companies which provide such types of databases. For example, Nielsen and GFK provide data for the retail industry. Data scientists can also take the help of Twitter, LinkedIn and Facebook.

• Government organizations share their data for free with the world. This data can be of excellent quality; it depends on the institution that creates and manages it. The information they share covers a broad range of topics such as the number of accidents or the amount of drug abuse in a certain region and its demographics.

3. Perform data quality checks to avoid problems later

• Allocate or spend some time on data correction and data cleaning. Collecting suitable, error-free data is key to the success of a data science project.

• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.

• Data scientists must investigate the data during the import, data preparation and exploratory phases. The difference is in the goal and the depth of the investigation.

• In the data retrieval process, verify whether the data is of the right data type and the same as in the source document.

• In the data preparation process, more elaborate checks are performed. Check whether any shortcut method is used. For example, check time and date formats.

• During the exploratory phase, the data scientist's focus shifts to what he/she can learn from the data. Now data scientists assume the data to be clean and look at statistical properties such as distributions, correlations and outliers.



1.6 Data Preparation

• Data preparation means data cleansing, integrating and transforming data.

1.6.1 Data Cleaning

• Data is cleansed through processes such as filling in missing values, smoothing the noisy data or resolving the inconsistencies in the data.

Data cleaning tasks are as follows :

1. Data acquisition and metadata  2. Fill in missing values

3. Unified date format  4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data  6. Correct inconsistent data

• Data cleaning is a first step in data pre-processing techniques, which is used to find missing values, smooth noisy data, recognize outliers and correct inconsistencies.

• Missing values : Such dirty data will affect the mining procedure and lead to unreliable and poor output. Therefore it is important to run some data cleaning routines. For example, suppose that the average salary of staff is ₹ 65,000/-; use this value to replace the missing value for salary.

• Data entry errors : Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain. But data collected by machines or computers isn't free from errors either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure. Examples of errors originating from machines are transmission errors or bugs in the extract, transform and load (ETL) phase.

• Whitespace errors : Whitespaces tend to be hard to detect but cause errors like other redundant characters would. To remove the spaces present at the start and end of a string, we can use the strip() function on the string in Python.

• Fixing capital letter mismatches : Capital letter mismatches are a common problem. Most programming languages make a distinction between "Chennai" and "chennai".

• Python provides string conversion functions to convert a string to lowercase or uppercase, using lower() and upper().


• The lower() function in Python converts the input string to lowercase. The upper() function in Python converts the input string to uppercase, as in the short sketch below.
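The example value below is made up; the sketch only uses the built-in string methods mentioned above.

city = "  Chennai  "

print(city.strip())           # 'Chennai' - removes whitespace at the start and end
print(city.strip().lower())   # 'chennai' - fixes capital letter mismatches
print(city.strip().upper())   # 'CHENNAI'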

1.6.2 Outliers

• Outlier detection is the process of detecting, and subsequently excluding, outliers from a given set of data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.

• Fig. 1.6.1 shows outlier detection. Here O1 and O2 seem to be outliers from the rest.

Fig. 1.6.1 : Outliers detection

• An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution.

• Outlier analysis and detection has various applications in numerous fields such as fraud detection, credit card fraud, discovering computer intrusions and criminal behaviours, medical and public health outlier detection and industrial damage detection.

• The general idea of these applications is to find data which deviates from the normal behaviour of the data set, as in the sketch below.
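A minimal sketch of one common outlier rule (the 1.5 × IQR rule). The section above does not prescribe this exact rule, so treat it as an illustrative assumption; it assumes NumPy is available.

import numpy as np

data = np.array([12, 14, 15, 16, 14, 13, 15, 95])    # 95 deviates drastically from the rest
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the whisker limits as outliers.
outliers = data[(data < lower) | (data > upper)]
print(outliers)                                       # [95]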

1.6.3 Dealing with Missing Value

• These dirty data will affect the mining procedure and lead to unreliable and poor output. Therefore it is important to run some data cleaning routines.


How to handle noisy data in data mining ?


• Following methods are used for handling noisy data :
1. Ignore the tuple : Usually done when the class label is missing. This method is not
good unless the tuple contains several attributes
with missing values.
2. Fill in the missing value manually : It is time-consuming and not suitable for a large data set with many missing values.

3. Use a global constant to fill in the missing value : Replace all missing attribute values by the same constant.

4. Use the attribute mean to fill in the missing value : For example, suppose that the average salary of staff is ₹ 65,000/-; use this value to replace the missing value for salary (see the sketch after this list).

5. Use the attribute mean for all samples belonging to the same class as the given tuple.

6. Use the most probable value to fill in the missing value.
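A minimal sketch of method 4 above (replacing a missing salary with the attribute mean), assuming pandas; the column names are illustrative.

import pandas as pd

staff = pd.DataFrame({"name": ["A", "B", "C"], "salary": [60000, None, 70000]})

# Fill the missing salary with the mean of the observed salaries (65000.0).
staff["salary"] = staff["salary"].fillna(staff["salary"].mean())
print(staff)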

1.6.4 Correct Errors as Early as Possible


• If error is not corrected in early stage of project, then it create problem in latter stages. Most
of the time, we spend on finding and correcting error. Retrieving data is a difficult task and
organizations spend millions of dollars on in the hope of making
it better decisions. The
data collection process is errorprone and in a big organization it involves many steps and
teams.

• Data should be cleansed when acquiredfor many reasons:


a) Not everyone spots the data anomalies. Decision-makers may make costly mistakes on
information based on incorrect data from applications that fail to correct for the faulty
data.

b) Iferrors are not corrected early on in the process, the cleansing will have to be done for

every project that uses that data.

c)Data erTOrs may pointto a businessprocess that isn't working as designed.

d) Data erors may point to defective equipment, such as broken transmission lines and
defectivesensors.

e) Data errors can point to bugs in software or in the integration of software that may be

critical to thecompany.


1.6.5 Combining Data from Different Data Sources

1. Joining tables

• Joining tables allows the user to combine the information of one observation found in one table with the information that we find in another table. The focus is on enriching a single observation.

• A primary key is a value that cannot be duplicated within a table. This means that one value can only be seen once within the primary key column. That same key can exist as a foreign key in another table, which creates the relationship. A foreign key can have duplicate instances within a table.

• Fig. 1.6.2 shows joining two tables on the CountryID and CountryName keys.

Sales table :
Date         CountryID   Units
10/10/2021   1001        100
21/10/2021   3001        50
31/10/2021   4001        75
01/10/2021   3001        90

Country table :
CountryID   CountryName
1001        India
3001        USA
4001        Spain

Joined table :
Date         CountryID   Units   CountryName
10/10/2021   1001        100     India
21/10/2021   3001        50      USA
31/10/2021   4001        75      Spain
01/10/2021   3001        90      USA
Fig. 1.6.2: Joining two tables
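A minimal sketch of the join in Fig. 1.6.2, assuming pandas; CountryID acts as the key that relates the two tables and the values simply mirror the figure.

import pandas as pd

sales = pd.DataFrame({"Date": ["10/10/2021", "21/10/2021", "31/10/2021"],
                      "CountryID": [1001, 3001, 4001],
                      "Units": [100, 50, 75]})
countries = pd.DataFrame({"CountryID": [1001, 3001, 4001],
                          "CountryName": ["India", "USA", "Spain"]})

# Enrich each sales observation with the matching country name.
joined = sales.merge(countries, on="CountryID", how="left")
print(joined)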

2. Appending tables

• Appending tables is also called stacking tables. It effectively adds observations from one table to another table. Fig. 1.6.3 shows appending tables.


[Table 1 : observations with x3 = 3; Table 2 : observations with x3 = 33; Table 3 : all observations of Table 1 followed by all observations of Table 2]

Fig. 1.6.3 : Appending tables

• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is a larger one with the observations from Table 1 as well as Table 2. The equivalent operation in set theory would be the union, and this is also the command in SQL, the common language of relational databases. Other set operators are also used in data science, such as set difference and intersection.
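A minimal sketch of appending (stacking) the two tables, assuming pandas; it corresponds to a UNION of the two sets of observations.

import pandas as pd

table1 = pd.DataFrame({"x1": [1, 2, 3], "x3": [3, 3, 3]})
table2 = pd.DataFrame({"x1": [11, 12, 13], "x3": [33, 33, 33]})

# Stack the observations of table2 under those of table1.
table3 = pd.concat([table1, table2], ignore_index=True)
print(table3)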

Using views to simulate data joins and appends

• Duplication of data is avoided by using a view instead of an append. An appended table requires more space for storage. If the table size is in terabytes of data, then it becomes problematic to duplicate the data. For this reason, the concept of a view was invented.
combined virually into
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales table instead of duplicating the data.

[Physical tables : Jan Sales, Feb Sales, ..., Dec Sales, each with obs and Date columns; Virtual table (view) : Yearly sales combining all months]
Fig. 1.6.4 : View

1.6.6 Transforming Data


• In data transformation, the data are transformed or consolidated into forms appropriate for mining. Relationships between an input variable and an output variable aren't always linear.

• Reducing the number of variables : Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.

• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data

scientists use special methods to reduce the number of variables but retain the maximum

amount of data.

Euclidean distance :

Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of two points.

Euclidean distance = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
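A minimal sketch of this formula, using only the Python standard library:

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared differences between corresponding coordinates.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # 5.0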

Turning variables into dummies :

Variables can be turned into dummy variables. Dummy variables can only take two values : true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.

Customer   Sales   Date      Gender
1          100     Jan-21    M
3          20      Dec-20    F
2          400     May-22    F
1          500     Jan-22    M
10         45      Aug-21    M
7          300     Dec-21
           250     July-22   F

After turning Gender into dummy variables :

Customer   Sales   Date      Male   Female
1          100     Jan-21    1      0
1          500     Jan-22    1      0
2          400     May-22    0      1
3          20      Dec-20    0      1
7          300     Dec-21
           250     July-22   0      1
10         45      Aug-21    1      0
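A minimal sketch of the transformation shown above, assuming pandas; only a few of the rows are reproduced for brevity.

import pandas as pd

customers = pd.DataFrame({"Customer": [1, 2, 3],
                          "Sales": [100, 400, 20],
                          "Gender": ["M", "F", "F"]})

# Turn the categorical Gender column into 0/1 dummy columns.
dummies = pd.get_dummies(customers, columns=["Gender"], dtype=int)
print(dummies)   # columns : Customer, Sales, Gender_F, Gender_M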

1.7 Exploratory Data Analysis

• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of data.

• EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers users need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis or check assumptions.

• EDA is an approach / philosophy for data analysis that employs a variety of techniques to :

1. Maximize insight into a data set;

2. Uncover underlying structure;

3. Extract important variables;

4. Detect outliers and anomalies;

5. Test underlying assumptions;

6. Develop parsimonious models; and

7. Determine optimal factor settings.

With EDA, the following functions are performed :

1. Describe the user's data

2. Closely explore data distributions

3. Understand the relations between variables

4. Notice unusual or unexpected situations

5. Place the data into groups

6. Notice unexpected patterns within groups

7. Take note of group differences.

• Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.

• Exploratory data analysis is majorly performed using the following methods :

1. Univariate analysis : Provides summary statistics for each field in the raw data set (or) a summary of only one variable. Examples : CDF, PDF, box plot.

2. Bivariate analysis : It is performed to find the relationship between each variable in the dataset and the target variable of interest (or) using two variables and finding the relationship between them. Examples : box plot, violin plot.

3. Multivariate analysis : It is performed to understand interactions between different fields in the dataset (or) finding interactions between more than two variables.

• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness through displaying the data quartiles or percentiles and averages.



[Fig. 1.7.1 : Box plot showing the minimum, lower quartile, median, upper quartile and maximum, with the whiskers, the box and the interquartile range (IQR)]
1. Minimum score : The lowest score, excluding outliers.

2. Lower quartile : 25 % of scores fall below the lower quartile value.

3. Median : The median marks the mid-point of the data and is shown by the line that divides the box into two parts.

4. Upper quartile : 75 % of the scores fall below the upper quartile value.

5. Maximum score : The highest score, excluding outliers.

6. Whiskers : The upper and lower whiskers represent scores outside the middle 50 %.

7. The interquartile range : This is the box showing the middle 50 % of scores.

• Box plots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.

[Fig. 1.7.2 : Box plot of score for Method 1 to Method 4 (x-axis : Teaching Method, y-axis : Score)]
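A minimal sketch of a grouped box plot like Fig. 1.7.2, assuming matplotlib; the score values are invented purely for illustration.

import matplotlib.pyplot as plt

scores = [
    [12, 15, 18, 20, 22],   # Method 1
    [18, 21, 25, 27, 30],   # Method 2
    [10, 13, 15, 17, 19],   # Method 3
    [22, 26, 30, 33, 38],   # Method 4
]

plt.boxplot(scores)
plt.xticks([1, 2, 3, 4], ["Method 1", "Method 2", "Method 3", "Method 4"])
plt.xlabel("Teaching Method")
plt.ylabel("Score")
plt.title("Boxplot of score")
plt.show()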


1.8 Build the Models


• To build the model, the data should be clean and the content should be understood properly. The components of model building are as follows :

a) Selection of model and variables

b) Execution of the model

c) Model diagnostics and model comparison

• Building a model is an iterative process. Most models consist of the following main steps :

1. Selection of a modeling technique and variables to enter in the model

2. Execution of the model

3. Diagnosis and model comparison

1.8.1 Model and Variable Selection

• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors :

1. Must the model be moved to a production environment and, if so, would it be easy to implement ?

2. How difficult is the maintenance on the model : how long will it remain relevant if left untouched ?

3. Does the model need to be easy to explain ?

1.8.2 Model Execution

• Various programming languages are used for implementing the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages use several of the most popular techniques.

• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. Following are the remarks on output :

a) Model fit : R-squared or adjusted R-squared is used.

b) Predictor variables have a coefficient : For a linear model this is easy to interpret.

c) Predictor significance : Coefficients are great, but sometimes not enough evidence exists to show that the influence is there.

• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best methods, sketched below.
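A minimal sketch of a k-nearest neighbors classifier, assuming scikit-learn; the tiny dataset is illustrative only.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]     # feature vectors
y = [0, 0, 1, 1]                         # class labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[2, 1], [8, 9]]))   # expected output : [0 1]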

• Following commercial tools are used :

1. SAS Enterprise Miner : This tool allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.

2. SPSS Modeler : It offers methods to explore and analyze data through a GUI.

3. Matlab : Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.

4. Alpine Miner : This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.

• Open source tools :

1. R and PL/R : PL/R is a procedural language for PostgreSQL with the functionality of R.

2. Octave : A free software programming language for computational modeling that has some of the functionality of Matlab.

3. WEKA : It is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.

4. Python : A programming language that provides toolkits for machine learning and analysis.

5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.

1.8.3 Model Diagnostics and Model Comparison


• Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps the user pick the best-performing model.

• In the holdout method, the data is split into two different datasets labeled as a training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This technique is called the hold-out validation technique.

• Suppose we have a database with house prices as the dependent variable and two independent variables showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a model that can predict house prices accurately.

• To 'train' our model or see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of those 10 rows that we excluded and measure how well our predictions were.
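A minimal sketch of this holdout idea, assuming scikit-learn and NumPy; the 30-row dataset is synthetic, not the actual house-price data described above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(30, 2)                                  # square footage and rooms (scaled)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * np.random.rand(30)   # synthetic house price

# Hold out 10 of the 30 rows for testing, fit on the remaining 20.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))                         # R-squared on the held-out rows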



• As a rule of thumb, experts suggest randomly sampling 80 % of the data into the training set and 20 % into the test set.

• The holdout method has two basic drawbacks :

1. It requires an extra dataset.

2. It is a single train-and-test experiment; the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.
1.9 Presenting Findings and Building Applications

• The team delivers final reports, briefings, code and technical documents.

• In addition, the team may run a pilot project to implement the models in a production environment.

• The last stage of the data science process is where the user's soft skills will be most useful : presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

1.10 Two Marks Questions with Answers
Q.1 What is data science ?
Ans. :
• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data.

• At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.

• Data science uses advanced analytical theory and various methods such as time series analysis for predicting the future.

Q.2 Define structured data.

Ans. : Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure.

Q.3 What is a data set ?

Ans. : A data set is a collection of related records or information. The information may be on some entity or some subject area.
Q.4 What is unstructured data ?

Ans. : Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure.

Q.5 What is machine-generated data ?

Ans. : Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not recognized to be machine-generated.

Q.6 Define streaming data.

Ans. : Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of kilobytes).
Q.7 List the stages of the data science process.

Ans. : Stages of the data science process are as follows :

• Discovery or setting the research goal

• Retrieving data

• Data preparation

• Data exploration

• Data modeling

• Presentation and automation.

Q.8 What are the advantages of data repositories ?

Ans. : Advantages are as follows :

• Data is preserved and archived.

• Data isolation allows for easier and faster data reporting.

• Database administrators have an easier time tracking problems.

• There is value in storing and analyzing data.

Q.9 What is data cleaning ?

Ans. : Data cleaning means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.
Q.10 What is outlier detection ?
Ans. : Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.

Q.11 Explain exploratory data analysis.

Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of data. EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

Q.12 What is data cleaning ?

Ans. : Data cleaning means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.

Q.13 List the stages of the data science process.

Ans. : The data science process consists of six stages :

1. Discovery or setting the research goal  2. Retrieving data

3. Data preparation  4. Data exploration

5. Data modeling  6. Presentation and automation.

Q.14 What is a data repository ?

Ans. : A data repository is also known as a data library or data archive. This is a general term to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and reporting.

Q.15 List the data cleaning tasks.

Ans. : Data cleaning tasks are as follows :

1. Data acquisition and metadata  2. Fill in missing values

3. Unified date format  4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data  6. Correct inconsistent data.

Q.16 What is Euclidean distance ?

Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of two points.
