Types of Digital Data


Course objective
After going through this course, participants will be able to:

• Understand what data science is.
• Appreciate why data science is gaining importance in today’s business
world.
• Comprehend where data science can be applied in different scenarios
across industry domains.
• Understand the major components of the data science stack.
• Learn how a data science project is implemented step by step for a
given business use case.

Types of Digital Data:



Methodical alignment of data points to...

What is methodical alignment of data?

To extract some observations from the data.

To do critical analysis of data.

To find patterns from the data.

All the given options. (correct)


Importance of methodical alignment of data

Questions that we can answer using data…



In various business scenarios, we can ask some pointed questions about the data. From the given
options, please select the correct ones.

What is the quality of the data? (correct)

Are there any missing values in the given data? (correct)

Is the data represented in “Arial” or “Times New Roman” font?

Is the size of the data sufficient to apply data science? (correct)

Is the data arranged in rows or columns?

What is Data science:



Data Science – Defined

• Data Science is an interdisciplinary field about processes and systems
to extract knowledge or insights from data in various forms, either
structured or unstructured, which is a continuation of some of the data
analysis fields such as statistics, data mining, and analytics. (Ref.
Wikipedia)
• Data Science is the empirical synthesis of actionable knowledge from
raw data through the complete data lifecycle process. (Ref. NIST - The
National Institute of Standards and Technology)
• Data Science, a cross-functional discipline, is about scientific
exploration of data to extract meaning or insight, and the construction
of software systems to utilize such insights in a business or social
context. (Ref. Hortonworks)

Components of Data Science



Probability and Statistics in Data Science


Probability is the study of random events. Most people have an intuitive
understanding of degrees of probability, which is why you can use words like
“probably” and “unlikely” without special training, but we will talk about how
we can make quantitative claims about those degrees.

Statistics is the discipline of using data samples to support claims about
populations. Most statistical analysis is based on probability, which is why
these pieces are usually presented together.

Computation is a tool that is well-suited to quantitative analysis, and
computers are commonly used to process statistics. Also, computational
experiments are useful for exploring concepts in probability and statistics.
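
As an illustration of such a computational experiment, the sketch below estimates the probability that two fair dice sum to 7 and compares it with the exact value 6/36. The function name, the number of rolls, and the fixed seed are assumptions chosen for this example only.

```python
import random

def estimate_probability_of_seven(num_rolls, seed=0):
    """Estimate P(sum of two fair dice == 7) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            hits += 1
    return hits / num_rolls

exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7
estimate = estimate_probability_of_seven(100_000)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

With more rolls the simulated value tends to move closer to the exact one, which is a quantitative version of the intuitive "degrees of probability" described above.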

• Linear algebra is used in almost every branch of science, where many
practical problems are solved using linear models.
• Most complex scientific problems are converted into problems of
vectors and matrices and then solved with linear models.
• In the world of data (especially big data), linear algebra is very
handy for processing huge chunks of data to accomplish many practical
transformations such as graphical transformations, face morphing,
object detection and tracking, audio and image compression, edge
detection, blurring, and signal processing.
• In data science, an appropriate statistical computing technique may be
used to solve a given business problem.
• While working on the data, these algorithms may use either iterative
methods or linear algebra techniques for computation.
• Linear algebra works as a computational engine for most data science
problems because of its performance advantages over iterative
methods.
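
To make the contrast between the two computation styles concrete, here is a minimal sketch that fits a straight line to a few points in two ways: by solving the 2 x 2 normal equations directly (a linear algebra technique) and by gradient descent (an iterative method). The data points, learning rate, and step count are made up for illustration, and the sketch makes no claim about relative speed.

```python
def fit_normal_equations(xs, ys):
    """Solve the 2x2 normal equations (X^T X) b = X^T y in closed form."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

def fit_gradient_descent(xs, ys, learning_rate=0.02, steps=5000):
    """Minimise the mean squared error by repeated small updates."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Illustrative data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
closed = fit_normal_equations(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print("normal equations:", closed)
print("gradient descent:", iterative)
```

Both routes arrive at (approximately) the same intercept and slope; the direct solve needs one pass over the data, while the iterative route needs many.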

Problem statement
• We need to transmit a message over the network: “PREPARE to
NEGOTIATE”.
• Message transmission requires encryption at the transmitter and
decryption at the receiver end.
• To encrypt and decrypt, we need to use a confidential piece of
information, usually referred to as a key.
• The prime objective is to ensure confidentiality and privacy of data
during transmission.

Solution
• Step 1: The message is encoded by assigning a number to each
letter in the message. Thus, the message becomes:

• Step 2: The message is split into a sequence of 3 x 1 vectors:

• Step 3: A 3 x 3 encoding matrix is used to encrypt the message
vectors:

• Step 4: At the receiving end, the message is decrypted by multiplying
the encrypted matrix by the inverse of the encoding matrix. The inverse
of the encoding matrix is:

• Step 5: After multiplication, the original enumerated matrix is
obtained. The original message can now be decoded from this matrix.
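
The five steps can be sketched in code as follows. The numbering scheme (space = 0, A = 1, ..., Z = 26), the padding with a trailing space, and the particular encoding matrix are assumptions made for this example, because the slide's own numbers and matrices are not reproduced in the text; any invertible 3 x 3 matrix would serve as the key.

```python
from fractions import Fraction

# Assumed numbering: space -> 0, A -> 1, ..., Z -> 26.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Hypothetical encoding matrix (the key); its determinant is 1.
ENCODING_MATRIX = [[-3, -3, -4],
                   [0, 1, 1],
                   [4, 3, 4]]

def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix by a 3x1 vector."""
    return [sum(row[j] * vector[j] for j in range(3)) for row in matrix]

def inverse_3x3(m):
    """Invert a 3x3 matrix via cofactors, using exact fractions."""
    def cofactor(i, j):
        rows = [r for r in range(3) if r != i]
        cols = [c for c in range(3) if c != j]
        minor = (m[rows[0]][cols[0]] * m[rows[1]][cols[1]]
                 - m[rows[0]][cols[1]] * m[rows[1]][cols[0]])
        return (-1) ** (i + j) * minor
    det = sum(m[0][j] * cofactor(0, j) for j in range(3))
    if det == 0:
        raise ValueError("encoding matrix must be invertible")
    # Inverse = transposed cofactor matrix divided by the determinant.
    return [[Fraction(cofactor(j, i), det) for j in range(3)] for i in range(3)]

def encrypt(message):
    # Step 1: letters to numbers. Step 2: split into 3x1 vectors.
    numbers = [ALPHABET.index(ch) for ch in message.upper()]
    numbers += [0] * (-len(numbers) % 3)  # pad with spaces to a multiple of 3
    blocks = [numbers[i:i + 3] for i in range(0, len(numbers), 3)]
    # Step 3: multiply each vector by the encoding matrix.
    return [mat_vec(ENCODING_MATRIX, block) for block in blocks]

def decrypt(cipher_blocks):
    # Step 4: multiply by the inverse. Step 5: numbers back to letters.
    inverse = inverse_3x3(ENCODING_MATRIX)
    numbers = [int(v) for block in cipher_blocks for v in mat_vec(inverse, block)]
    return "".join(ALPHABET[n] for n in numbers).rstrip()

cipher = encrypt("PREPARE to NEGOTIATE")
plain = decrypt(cipher)
print(cipher)
print(plain)
```

Only someone who holds the encoding matrix (or its inverse) can undo the multiplication, which is why the matrix plays the role of the key in this example.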
Problem statement
• Currents I1, I2 and I3 need to be determined for the following electrical
network:



Solution
• Step 1: The equations for the currents are written based on Kirchhoff’s laws.

• Step 2: These equations are converted into matrix form.

• Step 3: The matrix equation is solved to get the values of the currents.
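
Since the network diagram is not reproduced here, the sketch below uses a hypothetical set of Kirchhoff equations to show the three steps: I1 - I2 + I3 = 0 (junction rule), 4 I1 + I2 = 8 and I2 + 4 I3 = 16 (loop rule). The coefficients and the helper name solve_linear_system are assumptions for illustration only.

```python
from fractions import Fraction

def solve_linear_system(coefficients, constants):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(constants)
    # Build the augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(coefficients, constants)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Step 2: the hypothetical equations in matrix form A * [I1, I2, I3] = b.
A = [[1, -1, 1],
     [4, 1, 0],
     [0, 1, 4]]
b = [0, 8, 16]

# Step 3: solve for the currents.
i1, i2, i3 = solve_linear_system(A, b)
print(f"I1 = {i1} A, I2 = {i2} A, I3 = {i3} A")
```

Substituting the result back into the three equations is a quick way to check the solution.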

Problem statement
Five visitors of a social networking site are linked with each other as
depicted by the directed graph G below:

How can we use these relationships to extract more information about them
and predict their proposed activities?
Solution
Step 1: These relationships can be converted into a relationship chart in
which “1” indicates related and “0” indicates unrelated:

Step 2: From the chart created in the previous step, the adjacency matrix
for the directed graph is:

Step 3: The adjacency matrix may be used as a data structure for
representing graphs in computer programs for manipulation.
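
A minimal sketch of these steps, assuming a hypothetical set of links among five visitors A to E (the original graph G is not reproduced in the text). It builds the adjacency matrix, then derives two simple pieces of information from it: how many links each visitor has, and who can be reached in exactly two steps.

```python
visitors = ["A", "B", "C", "D", "E"]

# Hypothetical directed links; (u, v) means u is linked to v.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]

index = {name: i for i, name in enumerate(visitors)}
n = len(visitors)

# Steps 1 and 2: 1 means "related", 0 means "unrelated".
adjacency = [[0] * n for _ in range(n)]
for source, target in edges:
    adjacency[index[source]][index[target]] = 1

# Row sums count outgoing links; column sums count incoming links.
out_degree = {name: sum(adjacency[index[name]]) for name in visitors}
in_degree = {name: sum(row[index[name]] for row in adjacency) for name in visitors}

# Entry (i, j) of the squared matrix counts the two-step paths from i to j.
two_step = [[sum(adjacency[i][k] * adjacency[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

for name, row in zip(visitors, adjacency):
    print(name, row)
print("out-degree:", out_degree)
print("in-degree:", in_degree)
print("two-step paths from A:", two_step[index["A"]])
```

Two-step paths are one simple way to suggest connections a visitor does not have yet, for example "friends of friends".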

• Linear algebra simplifies scientific computing because many complex
problems can be expressed as linear equations with the help of vectors
and matrices.
• Vectors can be looked at as one-dimensional matrices. Linear
algebra mainly deals with the representation of data in the form of
matrices.
• Linear algebra helps represent large sets of data in the form of
matrices, which helps us visualize the data better.
• All the operations performed on matrices are batch processes. This
means that, even with millions of data examples, we don't need to
process each example individually. Usually, an algorithm or a design
technique is applied to the entire data set at once, or to successive
batches, without focusing on the individual data examples.

Machine Learning - defined

• In 1959, Arthur Samuel defined machine learning as a “field of study that
gives computers the ability to learn without being explicitly programmed.”
• It is a branch of artificial intelligence in which a computer generates rules
underlying or based on raw data that has been fed into it. (Ref.
www.dictionary.com)
• Machine Learning is the field of scientific study that concentrates on induction
algorithms and on other algorithms that can be said to “learn”. (Ref. Stanford
glossary of terms)
• A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks in T,
as measured by P, improves with experience E. (Ref. Tom M. Mitchell)

Types of Learning

Supervised machine learning model: Learning phase

A machine is taught to identify various fruits by building a model with the help of images.

Supervised machine learning model: Testing phase

A new set of images is given to this model as test data so that it can classify different fruits.
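
The two phases can be sketched with a very small classifier. Instead of images, this example assumes each fruit is already described by two made-up numeric features (weight in grams and length in centimetres), and it uses a 1-nearest-neighbour rule as a stand-in for whatever model the slides depict.

```python
import math

# Learning phase: hypothetical labelled examples, (weight g, length cm) -> fruit.
training_data = [
    ((150, 7.0), "apple"),
    ((170, 7.2), "apple"),
    ((120, 18.0), "banana"),
    ((130, 20.0), "banana"),
    ((8, 2.0), "cherry"),
    ((10, 2.2), "cherry"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda item: math.dist(item[0], features))
    return nearest[1]

# Testing phase: new examples the model has not seen, with their true labels.
test_data = [((160, 7.5), "apple"), ((125, 19.0), "banana"), ((9, 2.1), "cherry")]

predictions = [predict(features) for features, _ in test_data]
correct = sum(p == label for p, (_, label) in zip(predictions, test_data))
accuracy = correct / len(test_data)
print(predictions, f"accuracy = {accuracy:.0%}")
```

The features are deliberately left unscaled to keep the sketch short; in practice, features on very different scales are usually rescaled before distances are compared.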

Supervised machine learning model: Types

Unsupervised machine learning model

There is a basket filled with fresh fruits. The machine’s task is to
group fruits of the same type based on their colors.

But here, unlike supervised learning, the machine is given no labels or
other prior knowledge. So how will it arrange the same type of fruits based
on their colors?

Clustering
The machine has identified four clusters of fruits: red, green, yellow, and orange.
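
One way such clusters can be found is k-means clustering, sketched below. The RGB values are made up, the number of clusters (4) is given to the algorithm, and the farthest-point rule for choosing starting centers is just one simple deterministic option; none of this is taken from the slides.

```python
import math

# Hypothetical RGB colors measured from twelve fruits (no labels attached).
fruit_colors = [
    (220, 20, 30), (30, 180, 40), (240, 230, 40), (250, 140, 20),
    (200, 30, 40), (40, 200, 50), (250, 220, 30), (240, 150, 30),
    (230, 10, 20), (20, 170, 30), (235, 240, 50), (255, 130, 10),
]

def farthest_point_centers(points, k):
    """Start with the first point, then keep adding the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iterations=100):
    centers = farthest_point_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

labels, centers = k_means(fruit_colors, k=4)
for color, label in zip(fruit_colors, labels):
    print(color, "-> cluster", label)
```

The algorithm only outputs cluster numbers; naming a cluster "red" or "green" is an interpretation added afterwards by a person looking at the center colors.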

Semi-supervised machine learning

Reinforcement learning


Some famous machine learning algorithms

Integrating the blocks of Data Science

• To solve a given business problem, the various blocks of the Data Science
stack are tightly coupled with each other.
• Core algorithms need to be written in some programming language for
implementation.
• Most algorithms use the basic concepts of linear algebra.
• Statistical computations need to be done on the given data.
• Available data in structured, unstructured and semi-structured form
needs to be managed through various data management systems.
• Computer Science provides us with the necessary programming languages,
database management systems, statistical analysis and machine learning tools.

Tools and packages for Data Science

Data Science Project Life cycle

*Ref. Practical Data Science with R by Nina Zumel and John Mount, Manning, Shelter Island

A business scenario to implement Data Science

• Country Bank of India feels that it is losing too much money to bad loans
and wants to reduce its losses.
• Data Science should be able to help the bank reduce its losses from bad
loans, say by X%.

Step 1: Define the goal

The first task in a data science project is to define a measurable and
quantifiable goal. At this stage, let us learn all that we can about the context
of our project:

• Why does the business organization want the project in the first place? What
do they lack, and what do they need?
• What are they doing to solve the problem now, and why isn’t that good
enough?
• What resources will we need? What kind of data is available? Is domain
expertise available within the team? What computational resources are
available or required?
• How does the business organization plan to deploy the derived results? What
constraints have to be met for successful deployment?

Bad loan use-case: Define the goal



Step 2: Collect and manage data

This step encompasses identifying the data you need, exploring it, and
conditioning it to be suitable for analysis. This stage is often the most
time-consuming step in the process. This step helps find answers to these
questions:

• What data is available to us?
• Will it help us solve the problem?
• Is there enough data to carry out the analysis?
• Is the data quality good enough?
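
A first, rough answer to the data quality question is often the share of missing values per column. The sketch below assumes a handful of made-up loan records in which None marks a missing value; the field names are hypothetical.

```python
# Hypothetical loan records; None marks a missing value.
loan_records = [
    {"loan_id": 1, "income": 52000, "loan_amount": 15000, "defaulted": False},
    {"loan_id": 2, "income": None, "loan_amount": 8000, "defaulted": True},
    {"loan_id": 3, "income": 31000, "loan_amount": None, "defaulted": False},
    {"loan_id": 4, "income": None, "loan_amount": 12000, "defaulted": None},
]

def missing_value_report(records):
    """Return, per column, the fraction of records whose value is missing."""
    columns = records[0].keys()
    return {
        column: sum(record.get(column) is None for record in records) / len(records)
        for column in columns
    }

report = missing_value_report(loan_records)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

A column with a large share of missing values may need to be repaired, collected again, or left out of the analysis.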

Bad loan use-case: Collect and manage data

Step 3: Build a model

In this step, we use statistics and machine learning to extract useful insights
from the data in order to achieve our goals. The most common data science
modeling tasks are as follows:

• Classification: Deciding if something belongs to one category or another
• Scoring: Predicting or estimating a numeric value, such as a price or
probability
• Ranking: Learning to order items by preferences
• Clustering: Grouping similar items based on certain parameters
• Finding relations: Finding correlations or potential causes of effects seen in
the data
• Characterization: Very general plotting and report generation from data

Bad loan use-case: Build a model

Step 4: Model evaluation

Once we have a model, we need to determine whether it meets our goals by
asking the following questions:

• Is the model accurate enough for our needs? Does it generalize well?
• Does it perform better than “the obvious guess”? Better than the estimate we
currently use?

If the answer to any of these questions is no, it’s time to go back to the
previous step, or to re-examine whether the selected data supports the goal we
are trying to achieve.
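
The comparison with "the obvious guess" can be sketched as follows. The outcomes and predictions are made up, and the baseline used here (always predicting the most common outcome) is one common reading of that phrase, not necessarily the one the slides have in mind.

```python
from collections import Counter

# Hypothetical held-out outcomes (True = loan went bad) and a model's predictions.
actual    = [False, False, True, False, True, False, False, True, False, False]
predicted = [False, False, True, False, False, False, False, True, True, False]

def accuracy(truth, guesses):
    """Fraction of cases where the guess matches the true outcome."""
    return sum(t == g for t, g in zip(truth, guesses)) / len(truth)

# "The obvious guess": always predict the most common outcome.
majority_class = Counter(actual).most_common(1)[0][0]
baseline = [majority_class] * len(actual)

model_accuracy = accuracy(actual, predicted)
baseline_accuracy = accuracy(actual, baseline)
print(f"model: {model_accuracy:.0%}, obvious guess: {baseline_accuracy:.0%}")
```

When bad loans are rare, the obvious guess already scores well on accuracy, so a model is only useful if it clearly beats that number (or does better on a measure that weights missed bad loans more heavily).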

Bad loan use-case: Model evaluation

Step 5: Present results and document

• Once we have a model that meets our success criteria, we’ll present our
results to our project sponsor and other stakeholders.
• We must also document the model for those in the organization who are
responsible for using, running, and maintaining the model once it is
deployed.
• Different audiences require different kinds of information. For example, a
business-oriented audience may want to understand the impact of our
findings in terms of business metrics.

Bad loan use-case: Present results and document

Step 6: Deploy model

• Finally, the model is put into operation.
• In many organizations this means the data scientist no longer has primary
responsibility for the day-to-day operation of the model.
• But we should ensure that the model will run smoothly and won’t make
disastrous, unsupervised decisions.
• We also want to make sure that the model can be updated as its
environment changes.
• In many situations, the model will initially be deployed in a small pilot
program. The pilot might bring out issues that we didn’t anticipate, and we
may have to adjust the model.

Bad loan use-case: Deploy model



Characteristics of a successful Data Science project

To execute a successful data science project, we need to make sure we
have:

• a clear, verifiable and quantifiable goal
• realistic expectations set for all stakeholders
• an unbiased data set with enough examples to represent the entire
population
• the right model for the business scenario
• a correctly presented and deployed model that meets the predefined goal

Top 10 use-cases of Data Science



Global challenges that data science may solve

So, let us gear up and Data Science the world around!

References

• http://www.wolfram.com/
• https://en.wikipedia.org/wiki/Data_science
• https://en.wikipedia.org/wiki/Statistics
• https://en.wikipedia.org/wiki/Probability
• http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
• https://en.wikipedia.org/wiki/Machine_learning
• https://www.r-project.org/
• https://cran.r-project.org/
• http://www.nist.gov/
• http://hortonworks.com/
• www.techtarget.com
• http://www.businessdictionary.com/
• “Think Stats - Probability and Statistics for Programmers” by Allen B. Downey,
Green Tea Press, Needham, Massachusetts
• http://robotics.stanford.edu/~ronnyk/glossary.html
• Reinforcement Learning - An Introduction, Richard S. Sutton and Andrew G.
Barto, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London,
England
• Practical Data Science with R by Nina Zumel and John Mount, Manning,
Shelter Island
