

1.1 Data Science

• Data is measurable units of information gathered or captured from the activity of people, places and things.

• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.

• Data science combines math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.

• Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future from historical data. Instead of only knowing how many products were sold in the previous quarter, data science helps in forecasting future product sales and revenue more accurately.

• Data science is devoted to the extraction of clean information from raw data to form
actionable insights. Data science practitioners apply machine learning algorithms to
numbers, text, images, video, audio and more to produce artificial intelligence systems to
perform tasks that ordinarily require
human intelligence.

• The data science field is growing rapidly and revolutionizing many industries. It has incalculable benefits in business, research and our everyday lives.

• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data and they often use advanced algorithms and implement machine learning models to help businesses and organizations make accurate assessments and predictions.

• Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.

• Life cycle of data science :

1. Capture : Data acquisition, data entry, signal reception and data extraction.

2. Maintain : Data warehousing, data cleansing, data staging, data processing and data architecture.

3. Process : Data mining, clustering and classification, data modeling and data summarization.

4. Analyze : Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.

5. Communicate : Data reporting, data visualization, business intelligence and decision making.


1.1.1 Big Data


• Big data can be defined as very large volumes of data, available from various sources in varying degrees of complexity, generated at different speeds (i.e. velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf solutions.

• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

1.1.2 Characteristics of Big Data

• Characteristics of big data are volume, velocity and variety. They are often referred to as the three V's.

1. Volume : Volumes of data are larger than conventional relational database infrastructure can cope with, consisting of terabytes or petabytes of data.

2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. It is being created in or near real-time.

3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured.

These three dimensions are also called the three V's of Big Data.

Volume : 1. Records  2. Pictures  3. Videos  4. Terabytes

Velocity : 1. Batch  2. Stream  3. Real-time processing

Variety : 1. Structured  2. Semi-structured  3. Unstructured

• Two other characteristics of big data are veracity and value.

a) Veracity :

• Veracity refers to source reliability, information credibility and content validity.

• Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative ? Every good manager knows that there are inherent discrepancies in all the data collected.


• Spatial veracity : For vector data (imagery based on points, lines and polygons), quality varies. It depends on whether the points have been GPS determined, determined from unknown origins or entered manually. Also, resolution and projection issues can alter veracity.

• For geo-coded points, there may be errors in the address tables and in the point location algorithms associated with addresses.

• For raster data (imagery based on pixels), veracity depends on the accuracy of remote sensing instruments in satellites or aerial devices and on timeliness.

b) Value :

• It represents the business value to be derived from big data.

• The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, the user is just performing a technological task for technology's sake.

• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.

• Exploration of data trends can include spatial proximities and relationships.

• Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.

1.1.3 Difference between Data Science and Big Data

Sr. No. 1
Data Science : It is a field of scientific analysis of data in order to solve analytically complex problems, including the significant and necessary activity of cleansing and preparing data.
Big Data : Big data is the storing and processing of large volumes of structured and unstructured data that is not possible with traditional applications.

Sr. No. 2
Data Science : It is used in biotech, energy, gaming and insurance.
Big Data : It is used in retail, education, healthcare and social media.

Sr. No. 3
Data Science : Goals are data classification, anomaly detection, prediction, scoring and ranking.
Big Data : Goals are to provide better customer service, identify new revenue opportunities, effective marketing etc.

Sr. No. 4
Data Science : Tools mainly used in Data Science include SAS, R, Python, etc.
Big Data : Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.



1.1.4 Comparison between Cloud Computing and Big Data

Sr. No. 1
Cloud Computing : It provides resources on demand.
Big Data : It provides a way to handle huge volumes of data and generate insights.

Sr. No. 2
Cloud Computing : It refers to internet services, from SaaS and PaaS to IaaS.
Big Data : It refers to data, which can be structured, semi-structured or unstructured.

Sr. No. 3
Cloud Computing : Cloud is used to store data and information on remote servers.
Big Data : It is used to describe huge volumes of data and information.

Sr. No. 4
Cloud Computing : Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation.
Big Data : Big data is a highly scalable, robust and cost-effective ecosystem.

Sr. No. 5
Cloud Computing : Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM.
Big Data : Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.

Sr. No. 6
Cloud Computing : The main focus of cloud computing is to provide computer resources and services with the help of a network connection.
Big Data : The main focus of big data is solving problems when a huge amount of data is being generated and processed.

1.1.5 Benefits and Uses of Data Science

Data science examples and applications :

a) Anomaly detection : Fraud, disease and crime

b) Classification : Background checks; an email server classifying emails as "important"

c) Forecasting : Sales, revenue and customer retention

d) Pattern detection : Weather patterns, financial market patterns

e) Recognition : Facial, voice and text

f) Recommendation : Based on learned preferences, recommendation engines can refer users to movies, restaurants and books

g) Regression : Predicting food delivery times, predicting home prices based on amenities

h) Optimization : Scheduling ride-share pickups and package deliveries


1.1.6 Benefits and Uses of Big Data

Benefits of Big Data :

1. Improved customer service

2. Businesses can utilize outside intelligence while taking decisions

3. Reducing maintenance costs

4. Early identification of risk to the product/services, if any

5. Re-develop our products : Big data can also help us understand how others perceive our products, so that we can adapt them, or our marketing, if need be.

6. Better operational efficiency

Some of the examples of big data are :

1. Social media : Social media is one of the biggest contributors to the flood of data we have today. Facebook generates around 500 terabytes of data every day in the form of content generated by the users, like status messages, photo and video uploads, messages, comments etc.

2. Stock exchange : Data generated by stock exchanges is also in terabytes per day. Most of this data is the trade data of users and companies.

3. Aviation industry : A single jet engine can generate around 10 terabytes of data during a 30-minute flight.

4. Survey data : Online or offline surveys conducted on various topics typically have hundreds or thousands of responses and need to be processed for analysis and visualization by creating clusters of the population and their associated responses.

5. Compliance data : Many organizations like healthcare, hospitals, life sciences, finance etc. have to file compliance reports.

1.2 Facets of Data

• Very large amounts of data are generated in big data and data science. These data are of various types, and the main categories of data are as follows :

a) Structured

b) Unstructured

c) Natural language

d) Machine-generated

e) Graph-based

f) Audio, video and images

g) Streaming


1.2.1 Structured Data

• Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data.

• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data or records is a database where specific information is stored based on a methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
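A small sketch in Python can make the row-and-column idea concrete. The pandas library and the column names used here (emp_id, name, salary) are illustrative assumptions, not something prescribed by this section.

import pandas as pd

# Structured data : every record has the same named columns, like an Excel table.
employees = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "salary": [65000, 72000, 58000],
})

# Because the structure is known, the data can be searched by column and data type.
print(employees[employees["salary"] > 60000])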

1.2.2 Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure.

• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data.

• Even today, in most organizations more than 80 % of the data is in unstructured form. This carries lots of information, but extracting information from these various sources is a very big challenge.

Characteristics of unstructured data :

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restrictions or sequences for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

1.2.3 Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and sentences, then apply meaning and understanding to that information. This helps machines to understand language as humans do.



• Natural language processing is the driving force behind machine intelligence in many modern real-world applications. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprised of several layers of text analysis.

1.2.4 Machine-Generated Data


• Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not recognized to be machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the increase of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.

1.2.5 Graph-based or Network Data

• Graphs are data structures used to describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes, called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing a very flexible way of thinking about and using it.
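As a minimal sketch of the node-and-edge idea (assuming the third-party networkx library is installed; the entity names are hypothetical):

import networkx as nx

g = nx.Graph()
g.add_edge("Alice", "Bob")         # a relationship between two entities (nodes)
g.add_edge("Bob", "BookStore")     # nodes may be of any object type
g.add_edge("Alice", "BookStore")

# Traverse relationships directly, instead of joining tables.
print(list(g.neighbors("Bob")))    # ['Alice', 'BookStore']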


• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships to process financial and purchase transactions in near-real time. With fast graph queries, we are able to detect that, for example, a potential purchaser is using the same email address and credit card as included in a known fraud case.

• Graph databases can also help users easily detect relationship patterns such as multiple people associated with a personal email address, or multiple people sharing the same IP address but residing at different physical addresses.

• Graph databases are a good choice for recommendation applications. With graph databases, we can store in a graph relationships between information categories such as customer interests, friends and purchase history. We can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have similar purchase history.

• Graph theory is probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as the nodes and links (for example, influencers and followers).

• Influencers on a social network have been identified as users that have an impact on the activities or opinions of other users, by way of followership or influence on decisions made by other users on the network, as shown in Fig. 1.2.1.

Fig. 1.2.1 : Influencer and followers

Fig. 1.2.2 : Graph on 5 vertices


• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is because it is capable of by-passing the building of an actual visual representation of the data to run directly on data matrices.

1.2.6 Audio, Image and Video


• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

• The terms audio and video commonly refer to the time-based media storage format for sound/music and moving picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
sources of
is one of the most important
remark that multimedia data
. It is important to
and indexing of multimedia
the integration. transfornmation
information and knowledge:
data management and
analysis, Mny challengeshave
i
challenges
bring significant
data
naure of Data Science and
big data. multidisciplinar
to be addressed including

heterogeneity.
in multimediadata.
address these challenges
. Data Science is playing an important role
to

forms of media, such as text.


image. video,
Multimedia data usually contains various

even pulse waveforms, which


come from multiple sources
geographic coordinates and
and data mining
covering big data. machine leaning
Data Science can be a key instrument
data.
solutions to store, handle and analyze such heterogeneous

1.2.7 Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of kilobytes).


• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers.

1.2.8 Difference between Structured and Unstructured Data

Sr. No. 1 - Representation
Structured data : It is in discrete form, i.e. stored in row and column format.
Unstructured data : It is data that does not follow a specified format.

Sr. No. 2 - Meta data
Structured data : Syntax.
Unstructured data : Semantics.

Sr. No. 3 - Storage
Structured data : Database management system.
Unstructured data : Unmanaged file structure.

Sr. No. 4 - Standards
Structured data : SQL, ADO.NET, ODBC.
Unstructured data : Open XML, SMTP, SMS.

Sr. No. 5 - Integration tool
Structured data : ETL.
Unstructured data : Batch processing or manual data entry.

Sr. No. 6 - Characteristics
Structured data : With a structured document, certain information always appears in the same location on the page.
Unstructured data : In an unstructured document, information can appear in unexpected places on the document.

Sr. No. 7 - Used by organizations
Structured data : Low volume operations.
Unstructured data : High volume operations.

1.3 Data Science Process

• Data science process consists of six stages :

1. Discovery or setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation


• Fig. 1.3.1 shows data science design process.



Defining research goals → Retrieving data → Data preparation → Exploratory data analysis → Build the model → Presenting findings and building applications

Fig. 1.3.1 : Data science design process


• Step 1 : Discovery or defining research goal

This step involves acquiring data from all the identified internal and external sources, which helps to answer the business question.

• Step 2 : Retrieving data

It is a collection of data which is required for the project. This is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it. It also entails determining what each of the data points means in terms of the company. If we are given a data set from a client, for example, we shall need to know what each column and row represents.

• Step 3 : Data preparation

Data can have many inconsistencies like missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition data before modeling. Clean data gives better predictions.

• Step 4 : Data exploration

Data exploration is related to a deeper understanding of the data. Try to understand how variables interact with each other, the distribution of the data and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called exploratory data analysis.



• Step 5 : Data modeling

In this step, the actual model building process starts. Here, the data scientist distributes datasets for training and testing. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.

• Step 6 : Presentation and automation

Deliver the final baselined model with reports, code and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing. In this stage, the key findings are communicated to all stakeholders. This helps to decide if the project results are a success or a failure based on the inputs from the model.

1.4 Defining Research Goals


• To understand the project, three concepts must be understood : what, why and how.

a) What is the expectation of the company or organization ?

b) Why does the company's higher authority define such research value ?

c) How is it part of a bigger strategic picture ?

• The goal of the first phase is to answer these three questions.

• In this phase, the data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project.

1. Learning the business domain :

• Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines.

• Data scientists have deep knowledge of the methods, techniques and ways for applying heuristics to a variety of business and conceptual problems.

2. Resources :

• As part of the discovery phase, the team needs to assess the resources available to support the project. In this context, resources include technology, tools, systems, data and people.

3. Frame the problem :

• Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the key stakeholders.

• Each team member may hear slightly different things related to the needs and the problem, and may have somewhat different ideas of possible solutions.

4. Identifying key stakeholders :

• The team can identify the success criteria, key risks and stakeholders, which should include anyone who will benefit from the project or will be significantly impacted by the project.

• When interviewing stakeholders, learn about the domain area and any relevant history from similar analytics projects.
from similar analytics

5. Interviewing the analytics sponsor :

• The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.

• At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome.

• In these cases, the team must use its knowledge and expertise to identify the true underlying problem and an appropriate solution.

• When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project sponsor, who tends to be the one funding the project or providing the high-level requirements.

• This person understands the problem and usually has an idea of a potential working solution.

6. Developing initial hypotheses :

• This step involves forming ideas that the team can test with data. Generally, it is best to come up with a few primary hypotheses to test and then be creative about developing several more.

• These initial hypotheses form the basis of the analytical tests the team will use in later phases and serve as the foundation for the eventual findings.

7. Identifying potential data sources :

• Consider the volume, type and time span of the data needed to test the hypotheses.

• Ensure that the team can access more than simply aggregated data. In most cases, the team will need the raw data to avoid introducing bias into the downstream analysis.


1.5 Retrieving Data

• Retrieving required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process. Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.

• Most of the high quality data is freely available for public and commercial use. Data can be stored in various formats : in text file format and as tables in a database. Data may be internal or external.

1. Start working on internal data, i.e. data stored within the company

• The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data that's readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses and data lakes maintained by a team of IT professionals.

• A data repository is also known as a data library or data archive. This is a general term to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and reporting.

• A data repository can be used to describe several ways to collect and store data :

a) A data warehouse is a large data repository that aggregates data, usually from multiple sources or segments of a business, without the data being necessarily related.

b) A data lake is a large data repository that stores unstructured data that is classified and tagged with metadata.

c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user needs and easier to use.

d) Metadata repositories store data about data and databases. The metadata explains where the data is sourced from, how it was captured and what it represents.

e) Data cubes are lists of data with three or more dimensions stored as a table.

Advantages of data repositories :

i. Data is preserved and archived.

ii. Data isolation allows for easier and faster data reporting.

iii. Database administrators have an easier time tracking problems.

iv. There is value in storing and analyzing data.


Disadvantages of data repositories :

i. Growing data sets could slow down systems.

ii. A system crash could affect all the data.

iii. Unauthorized users can access all sensitive data more easily than if it was distributed across several locations.

2. Do not be afraid to shop around

• If required data is not available within the company, take the help of other companies which provide such types of databases. For example, Nielsen and GFK provide data for the retail industry. Data scientists can also take the help of Twitter, LinkedIn and Facebook.

• Government organizations share their data for free with the world. This data can be of excellent quality; it depends on the institution that creates and manages it. The information they share covers a broad range of topics such as the number of accidents or the amount of drug abuse in a certain region and its demographics.

3. Perform data quality checks to avoid problems later

• Allocate or spend some time on data correction and data cleaning. Collecting suitable, error-free data is key to the success of a data science project.

• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.

• Data scientists must investigate the data during the import, data preparation and exploratory phases. The difference is in the goal and the depth of the investigation.

• In the data retrieval process, verify whether the data is of the right data type and the same as in the source document.

• In the data preparation process, more elaborate checks are performed. Check whether any shortcut method is used. For example, check time and date formats.

• During the exploratory phase, the data scientist's focus shifts to what he/she can learn from the data. Now data scientists assume the data to be clean and look at statistical properties such as distributions, correlations and outliers.



1.6 Data Preparation

• Data preparation means data cleansing, integrating and transforming data.

1.6.1 Data Cleaning

• Data is cleansed through processes such as filling in missing values, smoothing the noisy data or resolving the inconsistencies in the data.

Data cleaning tasks are as follows :

1. Data acquisition and metadata  2. Fill in missing values

3. Unified date format  4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data  6. Correct inconsistent data

• Data cleaning is a first step in data pre-processing techniques, which is used to find missing values, smooth noisy data, recognize outliers and correct inconsistencies.

• Missing values : Such dirty data will affect the mining procedure and lead to unreliable and poor output. Therefore it is important to run some data cleaning routines. For example, suppose that the average salary of staff is ₹ 65,000/-; use this value to replace the missing value for salary.

• Data entry errors : Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain. But data collected by machines or computers isn't free from errors either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure. Examples of errors originating from machines are transmission errors or bugs in the extract, transform and load (ETL) phase.

• Whitespace errors : Whitespaces tend to be hard to detect but cause errors like other redundant characters would. To remove the spaces present at the start and end of a string, we can use the strip() function on the string in Python.

• Fixing capital letter mismatches : Capital letter mismatches are a common problem. Most programming languages make a distinction between "Chennai" and "chennai".

• Python provides string conversion functions to convert a string to lowercase or uppercase, using lower() and upper().


• The lower() function in Python converts the input string to lowercase. The upper() function in Python converts the input string to uppercase, as in the short sketch below.
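The example value below is made up; the sketch only uses the built-in string methods mentioned above.

city = "  Chennai  "

print(city.strip())           # 'Chennai' - removes whitespace at the start and end
print(city.strip().lower())   # 'chennai' - fixes capital letter mismatches
print(city.strip().upper())   # 'CHENNAI'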

1.6.2 Outliers

• Outlier detection is the process of detecting, and subsequently excluding, outliers from a given set of data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.

• Fig. 1.6.1 shows outlier detection. Here O1 and O2 seem to be outliers from the rest.

Fig. 1.6.1 : Outliers detection

• An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution.

• Outlier analysis and detection has various applications in numerous fields such as fraud detection, credit card fraud, discovering computer intrusions and criminal behaviours, medical and public health outlier detection and industrial damage detection.

• The general idea of these applications is to find data which deviates from the normal behaviour of the data set, as in the sketch below.
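A minimal sketch of one common outlier rule (the 1.5 × IQR rule). The section above does not prescribe this exact rule, so treat it as an illustrative assumption; it assumes NumPy is available.

import numpy as np

data = np.array([12, 14, 15, 16, 14, 13, 15, 95])    # 95 deviates drastically from the rest
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the whisker limits as outliers.
outliers = data[(data < lower) | (data > upper)]
print(outliers)                                       # [95]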

1.6.3 Dealing with Missing Value

• These dirty data will affect the mining procedure and lead to unreliable and poor output. Therefore it is important to run some data cleaning routines.


How to handle noisy data in data mining ?


• Following methods are used for handling noisy data :
1. Ignore the tuple : Usually done when the class label is missing. This method is not
good unless the tuple contains several attributes
with missing values.
2. Fill in the missing value manually : It is time-consuming and not suitable for a large data set with many missing values.

3. Use a global constant to fill in the missing value : Replace all missing attribute values by the same constant.

4. Use the attribute mean to fill in the missing value : For example, suppose that the average salary of staff is ₹ 65,000/-; use this value to replace the missing value for salary (see the sketch after this list).

5. Use the attribute mean for all samples belonging to the same class as the given tuple.

6. Use the most probable value to fill in the missing value.
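A minimal sketch of method 4 above (replacing a missing salary with the attribute mean), assuming pandas; the column names are illustrative.

import pandas as pd

staff = pd.DataFrame({"name": ["A", "B", "C"], "salary": [60000, None, 70000]})

# Fill the missing salary with the mean of the observed salaries (65000.0).
staff["salary"] = staff["salary"].fillna(staff["salary"].mean())
print(staff)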

1.6.4 Correct Errors as Early as Possible


• If error is not corrected in early stage of project, then it create problem in latter stages. Most
of the time, we spend on finding and correcting error. Retrieving data is a difficult task and
organizations spend millions of dollars on in the hope of making
it better decisions. The
data collection process is errorprone and in a big organization it involves many steps and
teams.

• Data should be cleansed when acquiredfor many reasons:


a) Not everyone spots the data anomalies. Decision-makers may make costly mistakes on
information based on incorrect data from applications that fail to correct for the faulty
data.

b) Iferrors are not corrected early on in the process, the cleansing will have to be done for

every project that uses that data.

c)Data erTOrs may pointto a businessprocess that isn't working as designed.

d) Data erors may point to defective equipment, such as broken transmission lines and
defectivesensors.

e) Data errors can point to bugs in software or in the integration of software that may be

critical to thecompany.


1.6.5 Combining Data from Different Data Sources

1. Joining tables

• Joining tables allows the user to combine the information of one observation found in one table with the information that we find in another table. The focus is on enriching a single observation.

• A primary key is a value that cannot be duplicated within a table. This means that one value can only be seen once within the primary key column. That same key can exist as a foreign key in another table, which creates the relationship. A foreign key can have duplicate instances within a table.

• Fig. 1.6.2 shows joining two tables on the CountryID and CountryName keys.

Sales table :
Date         CountryID   Units
10/10/2021   1001        100
21/10/2021   3001        50
31/10/2021   4001        75
01/10/2021   3001        90

Country table :
CountryID   CountryName
1001        India
3001        USA
4001        Spain

Joined table :
Date         CountryID   Units   CountryName
10/10/2021   1001        100     India
21/10/2021   3001        50      USA
31/10/2021   4001        75      Spain
01/10/2021   3001        90      USA
Fig. 1.6.2: Joining two tables
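A minimal sketch of the join in Fig. 1.6.2, assuming pandas; CountryID acts as the key that relates the two tables and the values simply mirror the figure.

import pandas as pd

sales = pd.DataFrame({"Date": ["10/10/2021", "21/10/2021", "31/10/2021"],
                      "CountryID": [1001, 3001, 4001],
                      "Units": [100, 50, 75]})
countries = pd.DataFrame({"CountryID": [1001, 3001, 4001],
                          "CountryName": ["India", "USA", "Spain"]})

# Enrich each sales observation with the matching country name.
joined = sales.merge(countries, on="CountryID", how="left")
print(joined)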

2. Appending tables

• Appending tables is also called stacking tables. It effectively adds observations from one table to another table. Fig. 1.6.3 shows appending tables.


[Table 1 : observations with x3 = 3; Table 2 : observations with x3 = 33; Table 3 : all observations of Table 1 followed by all observations of Table 2]

Fig. 1.6.3 : Appending tables

• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is a larger one with the observations from Table 1 as well as Table 2. The equivalent operation in set theory would be the union, and this is also the command in SQL, the common language of relational databases. Other set operators are also used in data science, such as set difference and intersection.
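A minimal sketch of appending (stacking) the two tables, assuming pandas; it corresponds to a UNION of the two sets of observations.

import pandas as pd

table1 = pd.DataFrame({"x1": [1, 2, 3], "x3": [3, 3, 3]})
table2 = pd.DataFrame({"x1": [11, 12, 13], "x3": [33, 33, 33]})

# Stack the observations of table2 under those of table1.
table3 = pd.concat([table1, table2], ignore_index=True)
print(table3)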

Using views to simulate data joins and appends

• Duplication of data is avoided by using a view instead of an append. An appended table requires more space for storage. If the table size is in terabytes of data, then it becomes problematic to duplicate the data. For this reason, the concept of a view was invented.
combined virually into
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales table instead of duplicating the data.

[Physical tables : Jan Sales, Feb Sales, ..., Dec Sales, each with obs and Date columns; Virtual table (view) : Yearly sales combining all months]
Fig. 1.6.4 : View

1.6.6 Transforming Data


• In data transformation, the data are transformed or consolidated into forms appropriate for mining. Relationships between an input variable and an output variable aren't always linear.

• Reducing the number of variables : Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.

• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data

scientists use special methods to reduce the number of variables but retain the maximum

amount of data.

Euclidean distance :

Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of two points.

Euclidean distance = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
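A minimal sketch of this formula, using only the Python standard library:

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared differences between corresponding coordinates.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # 5.0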

Turning variables into dummies :

Variables can be turned into dummy variables. Dummy variables can only take two values : true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.

Customer   Sales   Date      Gender
1          100     Jan-21    M
3          20      Dec-20    F
2          400     May-22    F
1          500     Jan-22    M
10         45      Aug-21    M
7          300     Dec-21
           250     July-22   F

After turning Gender into dummy variables :

Customer   Sales   Date      Male   Female
1          100     Jan-21    1      0
1          500     Jan-22    1      0
2          400     May-22    0      1
3          20      Dec-20    0      1
7          300     Dec-21
           250     July-22   0      1
10         45      Aug-21    1      0
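A minimal sketch of the transformation shown above, assuming pandas; only a few of the rows are reproduced for brevity.

import pandas as pd

customers = pd.DataFrame({"Customer": [1, 2, 3],
                          "Sales": [100, 400, 20],
                          "Gender": ["M", "F", "F"]})

# Turn the categorical Gender column into 0/1 dummy columns.
dummies = pd.get_dummies(customers, columns=["Gender"], dtype=int)
print(dummies)   # columns : Customer, Sales, Gender_F, Gender_M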

1.7 Exploratory Data Analysis

• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of data.

• EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers users need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis or check assumptions.

• EDA is an approach / philosophy for data analysis that employs a variety of techniques to :

1. Maximize insight into a data set;

2. Uncover underlying structure;

3. Extract important variables;

4. Detect outliers and anomalies;

5. Test underlying assumptions;

6. Develop parsimonious models; and

7. Determine optimal factor settings.

With EDA, the following functions are performed :

1. Describe the user's data

2. Closely explore data distributions

3. Understand the relations between variables

4. Notice unusual or unexpected situations

5. Place the data into groups

6. Notice unexpected patterns within groups

7. Take note of group differences.

• Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.

• Exploratory data analysis is majorly performed using the following methods :

1. Univariate analysis : Provides summary statistics for each field in the raw data set (or) a summary of only one variable. Examples : CDF, PDF, box plot.

2. Bivariate analysis : It is performed to find the relationship between each variable in the dataset and the target variable of interest (or) using two variables and finding the relationship between them. Examples : box plot, violin plot.

3. Multivariate analysis : It is performed to understand interactions between different fields in the dataset (or) finding interactions between more than two variables.

• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness through displaying the data quartiles or percentiles and averages.



[Fig. 1.7.1 : Box plot showing the minimum, lower quartile, median, upper quartile and maximum, with the whiskers, the box and the interquartile range (IQR)]
1. Minimum score : The lowest score, excluding outliers.

2. Lower quartile : 25 % of scores fall below the lower quartile value.

3. Median : The median marks the mid-point of the data and is shown by the line that divides the box into two parts.

4. Upper quartile : 75 % of the scores fall below the upper quartile value.

5. Maximum score : The highest score, excluding outliers.

6. Whiskers : The upper and lower whiskers represent scores outside the middle 50 %.

7. The interquartile range : This is the box showing the middle 50 % of scores.

• Box plots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.

[Fig. 1.7.2 : Box plot of score for Method 1 to Method 4 (x-axis : Teaching Method, y-axis : Score)]
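A minimal sketch of a grouped box plot like Fig. 1.7.2, assuming matplotlib; the score values are invented purely for illustration.

import matplotlib.pyplot as plt

scores = [
    [12, 15, 18, 20, 22],   # Method 1
    [18, 21, 25, 27, 30],   # Method 2
    [10, 13, 15, 17, 19],   # Method 3
    [22, 26, 30, 33, 38],   # Method 4
]

plt.boxplot(scores)
plt.xticks([1, 2, 3, 4], ["Method 1", "Method 2", "Method 3", "Method 4"])
plt.xlabel("Teaching Method")
plt.ylabel("Score")
plt.title("Boxplot of score")
plt.show()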


1.8 Build the Models


• To build the model, the data should be clean and the content should be understood properly. The components of model building are as follows :

a) Selection of model and variables

b) Execution of the model

c) Model diagnostics and model comparison

• Building a model is an iterative process. Most models consist of the following main steps :

1. Selection of a modeling technique and variables to enter in the model

2. Execution of the model

3. Diagnosis and model comparison

1.8.1 Model and Variable Selection

• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors :

1. Must the model be moved to a production environment and, if so, would it be easy to implement ?

2. How difficult is the maintenance on the model : how long will it remain relevant if left untouched ?

3. Does the model need to be easy to explain ?

1.8.2 Model Execution

• Various programming languages are used for implementing the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages use several of the most popular techniques.

• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. Following are the remarks on output :

a) Model fit : R-squared or adjusted R-squared is used.

b) Predictor variables have a coefficient : For a linear model this is easy to interpret.

c) Predictor significance : Coefficients are great, but sometimes not enough evidence exists to show that the influence is there.

• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best methods, sketched below.
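A minimal sketch of a k-nearest neighbors classifier, assuming scikit-learn; the tiny dataset is illustrative only.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]     # feature vectors
y = [0, 0, 1, 1]                         # class labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[2, 1], [8, 9]]))   # expected output : [0 1]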

• Following commercial tools are used :

1. SAS Enterprise Miner : This tool allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.

2. SPSS Modeler : It offers methods to explore and analyze data through a GUI.

3. Matlab : Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.

4. Alpine Miner : This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.

• Open source tools :

1. R and PL/R : PL/R is a procedural language for PostgreSQL with the functionality of R.

2. Octave : A free software programming language for computational modeling that has some of the functionality of Matlab.

3. WEKA : It is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.

4. Python : A programming language that provides toolkits for machine learning and analysis.

5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.

1.8.3 Model Diagnostics and Model Comparison


• Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps the user pick the best-performing model.

• In the holdout method, the data is split into two different datasets labeled as a training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This technique is called the hold-out validation technique.

• Suppose we have a database with house prices as the dependent variable and two independent variables showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a model that can predict house prices accurately.

• To 'train' our model or see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of those 10 rows that we excluded and measure how well our predictions were.
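A minimal sketch of this holdout idea, assuming scikit-learn and NumPy; the 30-row dataset is synthetic, not the actual house-price data described above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(30, 2)                                  # square footage and rooms (scaled)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * np.random.rand(30)   # synthetic house price

# Hold out 10 of the 30 rows for testing, fit on the remaining 20.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))                         # R-squared on the held-out rows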



• As a rule of thumb, experts suggest randomly sampling 80 % of the data into the training set and 20 % into the test set.

• The holdout method has two basic drawbacks :

1. It requires an extra dataset.

2. It is a single train-and-test experiment; the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.
1.9 Presenting Findings and Building Applications

• The team delivers final reports, briefings, code and technical documents.

• In addition, the team may run a pilot project to implement the models in a production environment.

• The last stage of the data science process is where the user's soft skills will be most useful : presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

1.10 Two Marks Questions with Answers
Q.1 What is data science ?
Ans. :
• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data.

• At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.

• Data science uses advanced analytical theory and various methods such as time series analysis for predicting the future.

Q.2 Define structured data.

Ans. : Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure.

Q.3 What is a data set ?

Ans. : A data set is a collection of related records or information. The information may be on some entity or some subject area.
Q.4 What is unstructured data ?

Ans. : Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure.

Q.5 What is machine-generated data ?

Ans. : Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not recognized to be machine-generated.

Q.6 Define streaming data.

Ans. : Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of kilobytes).
Q.7 List the stages of the data science process.

Ans. : Stages of the data science process are as follows :

• Discovery or setting the research goal

• Retrieving data

• Data preparation

• Data exploration

• Data modeling

• Presentation and automation.

Q.8 What are the advantages of data repositories ?

Ans. : Advantages are as follows :

• Data is preserved and archived.

• Data isolation allows for easier and faster data reporting.

• Database administrators have an easier time tracking problems.

• There is value in storing and analyzing data.

Q.9 What is data cleaning ?

Ans. : Data cleaning means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.
Q.10 What is outlier detection ?
Ans. : Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.

Q.11 Explain exploratory data analysis.

Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of data. EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

Q.12 What is data cleaning ?

Ans. : Data cleaning means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.

Q.13 List the stages of the data science process.

Ans. : The data science process consists of six stages :

1. Discovery or setting the research goal  2. Retrieving data

3. Data preparation  4. Data exploration

5. Data modeling  6. Presentation and automation.

Q.14 What is a data repository ?

Ans. : A data repository is also known as a data library or data archive. This is a general term to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and reporting.

Q.15 List the data cleaning tasks.

Ans. : Data cleaning tasks are as follows :

1. Data acquisition and metadata  2. Fill in missing values

3. Unified date format  4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data  6. Correct inconsistent data.

Q.16 What is Euclidean distance ?

Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of two points.
