• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.
• Data science combines math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.
• Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future from historical data. Instead of only knowing how many products were sold in the previous quarter, data science helps in forecasting future product sales and revenue more accurately.
• Data science is devoted to the extraction of clean information from raw data to form
actionable insights. Data science practitioners apply machine learning algorithms to
numbers, text, images, video, audio and more to produce artificial intelligence systems to
perform tasks that ordinarily require
human intelligence.
The data science field is growing rapidly and revolutionizing so many industries. It has
incalculable benefits in business, research and our everyday lives.
• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data, and they often use advanced algorithms and implement machine learning models to help businesses and organizations make accurate assessments and predictions.
• Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.
1. Capture : Data acquisition, data entry, signal reception and data extraction.
2. Maintain : Data warehousing, data cleansing, data staging, data processing and data architecture.
3. Process : Data mining, clustering and classification, data modeling and data summarization.
4. Analyze : Data reporting, data visualization, business intelligence and decision making.
• Big data refers to collections of data with varying degrees of complexity, generated at different speeds (i.e. velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.
• Characteristics of big data are volume, velocity and variety. They are often referred to as the three V's.
1. Volume : Volumes of data are larger than what conventional relational database infrastructures can cope with.
2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data.
3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured.
These three dimensions are also called the three V's of big data.
(Figure : Classification of big data by structure (structured, semi-structured), content (records, pictures), processing mode (batch, stream) and size (terabytes).)
a) Veracity :
• Spatial data quality and veracity varies. For vector data (imagery based on points, lines and polygons), it depends on whether the points have been GPS determined, determined from unknown origins or entered manually. Also, resolution and projection issues alter veracity.
• For geo-coded points, there may be errors in the address tables and in the point location algorithms associated with addresses.
• For raster data (imagery based on pixels), veracity depends on the accuracy of the instruments in satellites or aerial devices and on timeliness.
b) Value : For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
d) Pattern detection : Weather patterns, financial market patterns and so on.
e) Recognition : Facial, voice and text recognition.
f) Recommendation : Based on learned preferences, recommendation engines can match a user to movies, restaurants and books.
5. Re-develop our products : Big data can also help us understand how customers perceive our products, so that we can re-develop them.
6. Early identification of risk : Big data can help in the early identification of risk to the product or services, if any.
Examples of big data :
1. Social media : Social media sites hold information and views posted by millions of people, in the form of messages, pictures and videos; Facebook alone generates hundreds of terabytes of new data every day.
3. Aviation industry : A single jet engine can generate around 10 terabytes of data during a 30-minute flight.
4. Survey data : Online or offline surveys conducted on various topics, which typically have hundreds or thousands of responses and need to be processed for analysis and visualization by creating clusters of the population and their associated responses.
5. Compliance data : Many organizations, such as healthcare providers, hospitals, life sciences and finance companies, generate large amounts of compliance and regulatory data.
• Very large amounts of data are generated in big data and data science. These data come in various forms :
a) Structured
b) Unstructured
c) Natural language
d) Machine-generated
e) Graph-based
f) Audio, video and images
g) Streaming
1.2.1 Structured Data
• Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data or records is a database where specific information is stored in rows and columns.
• Structured data is also searchable by data type within content, and it can be easily understood and queried by machines.
1.2.2 Unstructured Data
• Unstructured data is data that does not follow a specified format; rows and columns are not used to organize it.
• Unstructured data can be in the form of text (documents, email messages, customer feedback and so on).
• Even today, in most organizations more than 80 % of the data is in unstructured form. This carries lots of information, but extracting information from these various sources is a big challenge.
1.2.3 Natural Language
• Natural language processing breaks human language down into sentences, then applies meaning and understanding to that information. This helps machines understand the meaning of text.
• Natural language processing is the driving force behind machine intelligence in many modern real-world applications. The natural language processing community has had considerable success in recent years.
• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation.
1.2.4 Machine-generated Data
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as many consumer-oriented systems. Examples of machine data include event logs and telemetry.
• Machine data can be either structured or unstructured.
• In recent years, the amount of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
1.2.5 Graph-based or Network Data
• Graphs are data structures used to describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem. By connecting nodes with edges, we will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships. Our data is stored just like we might sketch ideas on a whiteboard, without restricting it to a predefined model, allowing a very flexible way of thinking about and using it.
• Fraud detection : With a graph database we can check, for example, whether a purchaser is using the same email address and credit card as included in a known fraud case.
• Recommendations : A graph database can store a user's interests, friends and purchase history. We can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have similar purchase history.
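To make the node/edge idea concrete, here is a minimal Python sketch (it is not tied to any particular graph database product, and the users, products and purchases in it are hypothetical examples):

# Graph-style data held as plain adjacency structures.
# All users, products and purchases below are hypothetical.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}
purchases = {
    "alice": {"running shoes"},
    "bob": {"running shoes", "water bottle"},
    "carol": {"trail map", "water bottle"},
}

def recommend(user):
    # Recommend products bought by people the user follows but not yet bought by the user.
    already_bought = purchases.get(user, set())
    candidates = set()
    for friend in follows.get(user, set()):
        candidates |= purchases.get(friend, set())
    return sorted(candidates - already_bought)

print(recommend("alice"))   # ['trail map', 'water bottle']

A real graph database would express the same query by traversing FOLLOWS and BOUGHT relationships instead of Python dictionaries.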
• Graph-based data is a natural way to represent the social network concept. The approach is applied to social network analysis in order to identify influencers and their followers.
Fig. 1.2.1 : Influencer and followers
1.2.6 Audio, Video and Images
• Audio, video and image data pose specific challenges to data scientists, such as heterogeneity in multimedia data. Data science is playing an important role to address these challenges.
1.2.7 Streaming Data
• Streaming data includes a wide variety of data such as log files generated by customers using mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers.
• The data science process consists of the following steps :
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
Retrieving data :
• It is the collection of data which is required for the project. This is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it.
• This also entails determining what each of the data points means in terms of the company. If we have been given a data set from a client, for example, we shall need to know what each column and row represents.
Data preparation :
• Data can have many inconsistencies like missing values, blank columns and an incorrect data format, which need to be cleaned. We need to process, explore and condition data before modeling. To achieve this, use descriptive statistics, visual techniques and simple modeling.
• In this step, the actual model building process starts. Here, the data scientist distributes datasets for training and testing. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.
• The model is deployed into a real-time production environment after thorough testing. In this stage, the key findings are communicated to all stakeholders. This helps to decide if the project results are a success or a failure based on the inputs from the model.
• In this phase, the data science team must learn and investigate the problem, develop context and understanding and learn about the data sources needed and available for the project.
1. Learning the business domain :
• Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines.
• Data scientists have deep knowledge of the methods, techniques and ways for applying heuristics to a variety of business and conceptual problems.
2. Resources :
• As part of the discovery phase, the team needs to assess the resources available to support the project. In this context, resources include technology, tools, systems, data and people.
3. Frame the problem :
• Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the key stakeholders.
4. Identifying key stakeholders :
• The key stakeholders include anyone who will benefit from the project or will be significantly impacted by it. When interviewing stakeholders, the team can learn about the domain area and any relevant history from similar analytics projects.
5. Interviewing the analytics sponsor :
• The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.
• At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the true underlying problem and appropriate solution.
• When interviewing the main stakeholders, the team needs to take time to thoroughly interview the sponsor, who tends to be the one funding the project or providing the high-level requirements for the solution.
6. Developing initial hypotheses :
• This step involves forming ideas that the team can test with data. It is best to come up with a few primary hypotheses to test and then be creative about developing several more.
• Retrieving required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process. Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.
• Most of the high quality data is freely available for public and commercial use. Data can be stored in various formats, from text files to tables in a database. Data may be internal or external.
• A data repository is also known as a data library or data archive. This is a general term used to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, that is, several databases that collect, manage and store data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data :
a) Data warehouse is a large data repository that aggregates data, usually from multiple sources or segments of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and tagged with metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data source is located and how it is structured.
e) Data cubes are lists of data with three or more dimensions stored as a table.
Advantages of data repositories :
i. There is value to storing and analyzing data.
ii. Data isolation allows for easier and faster data reporting, even across several locations.
3. Perform data quality checks to avoid problems later :
• Most of the errors encountered during the data gathering phase are easy to spot, but being careless can mean spending many hours solving data issues that could have been prevented during import.
• Data scientists must investigate the data during the import, data preparation and exploratory phases. The difference is in the goal and the depth of the investigation.
• In the data retrieval process, verify whether the data is of the right data type and is the same as in the source document.
• With the data preparation process, more elaborate checks are performed, for example checking whether any shortcuts were taken during data entry.
• During the exploratory phase, the data scientist's focus shifts to what he or she can learn from the data. Now the data scientist assumes the data to be clean and looks at its statistical properties, such as distributions, correlations and outliers.
• Missing value : Dirty data will affect the mining procedure and lead to unreliable and poor output; therefore some data cleaning routines are important. For example, suppose that the average salary of staff is ₹ 65,000. This value can be used to replace a missing value for salary.
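As a small pandas sketch of this replacement strategy (the staff table and its values are made up for illustration, chosen so that the mean salary is 65,000):

import pandas as pd

# Hypothetical staff records with one missing salary.
staff = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "salary": [60000.0, 70000.0, None, 65000.0],
})

mean_salary = staff["salary"].mean()                    # 65000.0 for this data
staff["salary"] = staff["salary"].fillna(mean_salary)   # fill the missing value with the mean
print(staff)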
• Data entry errors : Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain. But data collected by machines or computers isn't free from errors either. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure. Examples of errors originating from machines are transmission errors or bugs in the extract, transform and load (ETL) phase.
• Whitespace error : Whitespaces tend to be hard to detect but cause errors like other redundant characters would. To remove the spaces present at the start and end of a string, we can use the strip() function on the string in Python.
• Fixing capital letter mismatches : Capital letter mismatches are a common problem; most programming languages, for example, distinguish between "Brazil" and "brazil".
• Python provides string conversion functions to convert a string to lowercase or uppercase using lower() and upper().
• The lower() function in Python converts the input string to lowercase. The upper() function in Python converts the input string to uppercase.
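A short illustration of these cleaning steps using the built-in strip(), lower() and upper() methods; the sample values are hypothetical:

# Remove leading/trailing whitespace and normalise case so that
# values such as "  Brazil ", "brazil" and "BRAZIL  " compare as equal.
raw_values = ["  Brazil ", "brazil", "BRAZIL  "]

cleaned = [value.strip().lower() for value in raw_values]
print(cleaned)              # ['brazil', 'brazil', 'brazil']
print("brazil".upper())     # 'BRAZIL'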
1.6.2 Outliers
• Outlier detection is the process of detecting outliers and subsequently excluding them from a given set of data.
• Fig. 1.6.1 shows outlier detection. Here O1 and O2 seem to be outliers from the rest.
An outlier may be defined as a piece of data or observation that deviates drastically from
the given norm or average of the data set. An outlier may be caused simply by chance, but
it may also indicate measurement error or that the given data set has a heavy-tailed
distribution.
• Outlier analysis and detection has various applications in numerous fields, such as fraud detection.
• The general idea of the application is to find out data which deviates from the normal behaviour of the data set.
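The text does not fix a particular detection rule; one common sketch is the interquartile range (IQR) rule shown below on made-up observations, where two values clearly deviate from the rest:

import numpy as np

# Hypothetical observations; 98 and -40 deviate drastically from the norm.
data = np.array([12, 14, 13, 15, 14, 13, 98, 12, 15, -40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5 * IQR fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # [ 98 -40]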
• Dirty data will affect the mining procedure and lead to unreliable and poor output. Methods for handling missing values include :
1. Ignore the tuple.
2. Fill in the missing value manually.
3. Use a global constant to fill in the missing value.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.
b) If errors are not corrected early on in the process, the cleansing will have to be repeated for every project that uses that data.
d) Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
e) Data errors can point to bugs in software or in the integration of software that may be critical to the company.
1.6.5 Combining Data from Different Data Sources
1. Joining tables
• Joining tables allows us to combine the information of one observation found in one table with the information found in another table, enriching a single observation.
2. Appending tables
• Appending tables is also called stacking tables. It effectively adds observations from one table to another table. Fig. 1.6.3 shows the appending of tables; a small pandas sketch of the same operation follows the tables below.
Table 1
x1 x2 x3
1  a  3
2  b  3
3  c  3
4  d  3
5  e  3

Table 2
x1 x2 x3
11 k  33
12 l  33
13 m  33
14 n  33
15 o  33

Table 3 (Table 2 appended to Table 1)
x1 x2 x3
1  a  3
2  b  3
3  c  3
4  d  3
5  e  3
11 k  33
12 l  33
13 m  33
14 n  33
15 o  33

Fig. 1.6.3 : Appending tables
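As a rough pandas sketch of the appending operation in Fig. 1.6.3 (the text itself gives no code; the column and row values follow the tables above):

import pandas as pd

table1 = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                       "x2": ["a", "b", "c", "d", "e"],
                       "x3": [3, 3, 3, 3, 3]})
table2 = pd.DataFrame({"x1": [11, 12, 13, 14, 15],
                       "x2": ["k", "l", "m", "n", "o"],
                       "x3": [33, 33, 33, 33, 33]})

# Stacking table2 below table1 gives table3, like a UNION ALL in SQL.
table3 = pd.concat([table1, table2], ignore_index=True)
print(table3)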
• Appending tables is done with the UNION command in SQL, the common language of relational databases. Other set operators, such as INTERSECT and EXCEPT, also exist.
• Duplication of data can be avoided by using a view instead of physically appending the tables. The appended table requires more space for storage, and if the table size is in terabytes of data, it becomes problematic to duplicate the data. For this reason, the concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales table instead of duplicating the data.
Fig. 1.6.4 : The monthly tables (Jan Sales, Feb Sales, ..., Dec Sales) are physical tables, while the Yearly sales table is a virtual table (view) that combines them
• Transforming the data : Sometimes a transformation, such as taking the logarithm of an input variable, makes the relationship between the variables approximately linear.
• Reducing the number of variables : Having too many variables in the model makes the model difficult to handle.
• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data
scientists use special methods to reduce the number of variables but retain the maximum
amount of data.
• Euclidean distance : The Euclidean distance between two points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) is the square root of the sum of the squared differences between their coordinates :
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
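A minimal Python sketch of this calculation (the two sample points are arbitrary):

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared differences between corresponding coordinates.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2, 3), (4, 6, 3)))   # 5.0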
(Table of sample records : amounts 250, 500, 400, 20, 300 and 45 with dates such as July-22, Jan-22, May-22, Dec-20, Dec-21 and Aug-21.)
• Exploratory Data Analysis (EDA) is used to gain a deeper understanding of data. EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers users need, making it easier for data scientists to discover patterns, spot anomalies, test hypotheses and check assumptions.
• EDA can be performed using a single variable of the dataset and the target variable of interest, (or) using two variables and finding the relationship between them, (or) finding interactions between more than two variables in the dataset.
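A quick pandas sketch of summarising the main characteristics of a data set; the two variables and their values are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "score": [25, 30, 28, 35, 22, 40, 31],
    "hours": [2, 4, 3, 6, 1, 8, 5],
})

print(df.describe())   # count, mean, std, min, quartiles and max for each variable
print(df.corr())       # pairwise correlation between the two variables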
• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness by displaying the data quartiles and averages.
Fig. 1.7.1 : Box plot
1. Minimum score : The lowest score, excluding outliers.
2. Lower quartile : 25 % of scores fall below the lower quartile value.
3. Median : The median marks the mid-point of the data and is shown by the line that divides the box into two parts.
4. Upper quartile : 75 % of the scores fall below the upper quartile value.
5. Maximum score : The highest score, excluding outliers.
Fig. 1.7.2 : Box plots of scores for four teaching methods (Method 1 to Method 4)
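A matplotlib sketch that produces a chart in the spirit of Fig. 1.7.2; the score samples for the four teaching methods are made up:

import matplotlib.pyplot as plt

scores = {
    "Method 1": [18, 22, 25, 27, 30],
    "Method 2": [15, 19, 21, 24, 28],
    "Method 3": [20, 26, 29, 33, 38],
    "Method 4": [12, 16, 18, 21, 25],
}

plt.boxplot(list(scores.values()), labels=list(scores.keys()))   # one box per teaching method
plt.xlabel("Teaching Method")
plt.ylabel("Score")
plt.title("Boxplot of score")
plt.show()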
• Building a model is an iterative process. Most models consist of the following main steps :
a) Selection of a modeling technique and variables to enter in the model
b) Execution of the model
c) Model diagnostics and model comparison
1.8.1 Model and Variable Selection
• While selecting the model and variables, consider questions such as :
1. Must the model be moved to a production environment and, if so, would it be easy to implement ?
2. How difficult will maintenance be : how long will the model remain relevant if left untouched ?
3. Does the model need to be easy to explain ?
1.8.2 Model Execution
• For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages implement several of the most popular modeling techniques.
• The model output provides information such as :
a) Model fit : For this, the R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient : For a linear model this is easy to interpret.
c) Predictor significance : Coefficients are great, but sometimes not enough evidence exists to show that the influence is there.
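A small StatsModels sketch of model execution on synthetic data, showing where the coefficients and their significance come from (the data-generating numbers are arbitrary):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                        # two predictor variables
y = 3.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=100)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # execute a linear model

print(model.params)           # b) coefficients of the predictors
print(model.pvalues)          # c) significance (p-values) of each predictor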
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best known classification methods.
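A tiny Scikit-learn sketch of k-nearest neighbors classification; the training points and labels are made-up toy data:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["small", "small", "small", "large", "large", "large"]

knn = KNeighborsClassifier(n_neighbors=3)   # classify by the 3 nearest training points
knn.fit(X_train, y_train)

print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))   # ['small' 'large']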
• … provides a variety of data analytics algorithms and data exploration.
4. Alpine Miner : This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
Open Source tools :
• Suppose we have a database with house prices as the dependent variable and two independent variables showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows. The whole idea is that we build a model that can predict the house price from those two variables.
• To 'train' our model and see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of the 10 rows that we excluded.
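A hedged Scikit-learn sketch of that idea; the 30 houses are generated synthetically here, since the original data set is not given:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30-row house data set (square footage, rooms -> price).
rng = np.random.default_rng(42)
sqft = rng.uniform(500, 3000, size=30)
rooms = rng.integers(1, 6, size=30)
price = 50 * sqft + 10000 * rooms + rng.normal(0, 5000, size=30)
X = np.column_stack([sqft, rooms])

# Fit on 20 randomly chosen rows, hold the remaining 10 back for prediction.
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=10, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))        # predicted prices for the 10 held-out houses
print(model.score(X_test, y_test))  # R^2 on the held-out rows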
• Data science uses advanced analytical theory and various methods such as time series analysis to extract knowledge from data that can be used to make sound business decisions and predictions.
Ans. : Structured data is arranged in row and column format. It helps applications to retrieve and process data easily. Structured data is identifiable because it is organized in a structure.
• Discovery or setting the research goal
• Retrieving data
• Data preparation
• Data exploration
• Data modeling
• Presentation and automation
8. What are the advantages of data repositories ?
Ans. : Advantages are as follows :
• There is value to storing and analyzing data.
• Data isolation allows for easier and faster data reporting.
1. Discovery or setting the research goal.
2. Retrieving data.
3. Data preparation.
4. Data exploration.
5. Data modeling.
6. Presentation and automation.
• Euclidean distance is calculated as the square root of the sum of the squared differences between the corresponding points.