MACHINE

LEARNING
MODULE – 1
Course Code BCS602
TOPICS
• Introduction
• Need for Machine Learning
• Machine Learning Explained
• Machine Learning in Relation to other Fields
• Types of Machine Learning
• Challenges of Machine Learning
• Machine Learning Process
• Machine Learning Applications
• Understanding Data – 1: Introduction
• Big Data Analysis Framework
• Descriptive Statistics
• Univariate Data Analysis and Visualization
Machine learning (ML) allows computers to learn and make
decisions without being explicitly programmed. It involves
feeding data into algorithms to identify patterns and make
predictions on new data.

https://medium.com/enjoy-algorithm/introduction-to-machine-learning-74393e6b7b9d
Need for Machine Learning
Machine learning has become popular for three main reasons:
1. High volume of available data: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that the data approximately doubles every year.
2. Reduced cost of storage: Storage and hardware costs have dropped, so it is now easier to capture, process, store, distribute, and transmit digital information.
3. Availability of complex algorithms: Especially with the advent of deep learning, many powerful algorithms are now available for machine learning.
Understanding the Knowledge Pyramid
The Knowledge Pyramid explains how raw data is transformed into useful
knowledge and intelligence. It has five levels:
• Data (Raw Facts)
• Basic facts and numbers stored in different formats like databases or spreadsheets.
• Example: A store collects data on daily sales transactions.
• Information (Processed Data)
• Data that has been analyzed to find patterns or useful details.
• Example: Identifying the best-selling product from sales data.
• Knowledge (Condensed Information)
• Insights gained from information, which help in decision-making.
• Example: Noticing seasonal trends in sales data, like increased sales during holidays.
• Intelligence (Applied Knowledge)
• Using knowledge to take actions or make strategic decisions.
• Example: A business using sales trends to decide on stock levels or marketing strategies.
• Wisdom (Final Stage – Human Expertise)
• The ability to make the best decisions based on intelligence, experience, and judgment.
• Example: A business leader using experience and insights to create long-term strategies.
Machine Learning Explained

What is Machine Learning?


Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that allows
computers to learn from data and make predictions or decisions without
being explicitly programmed.

Arthur Samuel, one of the pioneers of AI, defined ML as:

"Machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed."
How is Machine Learning Different from Traditional Programming?
Traditional Programming:
• A programmer writes a set of rules for the computer to follow.
• Example: A program for spam filtering checks emails for specific keywords
like "lottery" or "free money" and marks them as spam.
• Problem: This method cannot adapt to new types of spam emails that do
not contain predefined keywords.

Machine Learning Approach:


• Instead of manually creating rules, ML systems learn from large datasets
and improve their accuracy over time.
• Example: Gmail's spam filter uses ML algorithms to analyze thousands of
emails labeled as spam and learns their patterns. Over time, it detects new types
of spam emails automatically, even if they contain different words.
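To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available; the tiny email dataset and the keyword rule are invented for illustration:

```python
# Contrast a fixed keyword rule with a learned classifier.
# The tiny email dataset below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win big lottery prize now",      # spam
    "free money claim your reward",   # spam
    "meeting agenda for tomorrow",    # not spam
    "project report attached",        # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Traditional programming: a fixed, hand-written keyword rule.
def rule_based_is_spam(text):
    return any(word in text for word in ("lottery", "free money"))

# Machine learning: learn word patterns from the labelled examples.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

new_email = "claim your prize reward today"   # contains no predefined keyword
print(rule_based_is_spam(new_email))          # False -- the fixed rule misses it
print(model.predict(vectorizer.transform([new_email])))  # likely [1]: learned words generalize
```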
Why is Machine Learning Needed?
Earlier, AI systems relied on expert-created rules to solve problems. For
example:
• MYCIN (1970s): A medical expert system that diagnosed bacterial infections
using a set of predefined rules created by doctors.
• Problem: It couldn’t handle new diseases because the rules were fixed and
couldn’t adapt to new data.
ML solves this problem by learning from data and continuously improving
its predictions.
Key Advantages of ML:
• Adapts to new data – ML systems improve over time without needing new
programming.
• Automates complex tasks – Used in facial recognition, self-driving cars, and
fraud detection.
• Handles massive data – ML processes large datasets quickly, making it ideal
for big data analysis.
Real-World Examples of Machine Learning
1. Self-Driving Cars
• Uses cameras, sensors, and ML models to analyze the road.
• Recognizes traffic signs, other vehicles, and pedestrians.
• Predicts the best actions (brake, accelerate, turn).
• Example: Tesla’s Autopilot uses ML to learn driving behavior from millions of
miles of data.
2. Voice Assistants (Alexa, Siri, Google Assistant)
• Uses Natural Language Processing (NLP) to understand and respond to human
speech.
• Learns from millions of conversations to improve over time.
• Example: If you say, “Play my favorite song,” Alexa learns from past
choices and picks a song you frequently listen to.
3. Recommendation Systems (Netflix, YouTube, Amazon, Spotify)
• How it works:
• ML models analyze past user behavior to suggest personalized content.
• Recognizes patterns like “people who watched X also liked Y.”
• Example:
• Netflix recommends shows based on your watch history.
• Amazon suggests products similar to your previous purchases.
4. Fraud Detection (Banks & Online Payments)
• How it works:
• Analyzes spending patterns and detects unusual transactions.
• If an ML model notices sudden large purchases from another country, it flags the
transaction as potential fraud.
• Example:
• PayPal and credit card companies use ML to prevent unauthorized transactions.
How Do ML Models Learn?
Machine learning models extract patterns from raw data and use them to
make predictions or decisions. The learning process involves several steps:
1. Collecting Data
• Just like humans gain experience by observing and learning from past
situations, computers collect large datasets to learn from.
• Example: If we want to create a spam filter, we collect thousands of emails
labeled as spam or not spam.
2. Forming Abstract Concepts
• Machines analyze the data to identify meaningful patterns. This is called
abstraction.
• Example: A spam filter might learn that emails containing words like "lottery"
or "win big" are often spam.
3. Generalization
• The model applies its learned patterns to new, unseen data.
• Example: If a new email contains phrases similar to known spam emails, the
model predicts it as spam, even if it has slightly different wording.

4. Heuristics (Rules of Thumb)


• Machines, like humans, use educated guesses (heuristics) to make quick
decisions.
• Example: A self-driving car might learn that if an object appears in front
suddenly, it should brake.
• However, heuristics can fail sometimes, so continuous learning and
corrections are needed.
Tom Mitchell's Definition of Machine Learning
Tom Mitchell, a well-known AI researcher, defined machine learning as:
"A computer program is said to learn from experience (E) with respect to some task
(T) and some performance measure (P) if its performance on T, measured by P,
improves with experience E."
Breaking It Down:
• E (Experience): The data the model learns from (e.g., thousands of images for
object detection).
• T (Task): What the model is learning to do (e.g., recognizing objects in an image).
• P (Performance Measure): How well the model is doing (e.g., accuracy, precision,
recall).
Example of Tom Mitchell’s Definition:
• Task (T): Detecting cats in images.
• Experience (E): Training data of labeled cat images.
• Performance Measure (P): Accuracy of detecting cats correctly.
• If the accuracy improves as more images are analyzed, the model is "learning."
Machine Learning in Relation to other Fields
Machine Learning and Artificial Intelligence
Relationship Between AI,
Machine Learning, and Deep
Learning

Artificial Intelligence (AI)


• The broadest field, aiming
to create machines that
can perform tasks that
typically require human
intelligence.
• AI includes rule-based
systems, expert systems,
search algorithms, and
learning-based
approaches.
Machine Learning (ML)
(Subfield of AI)
• A specific approach within
AI where systems learn
from data rather than being
explicitly programmed.
• ML models use statistical
algorithms to recognize
patterns and make
decisions.
• Includes supervised
learning, unsupervised
learning, and
reinforcement learning.
• Deep Learning (DL) (Subfield of Machine Learning)
• A specialized area of ML that uses neural networks, which are inspired by the way human
brains process information.
• Neural networks consist of multiple layers (hence the term "deep") that enable them to
recognize highly complex patterns.
• Deep learning powers speech recognition, image classification, natural language
processing, and self-driving cars.
Machine Learning, Data Science, Data Mining, and Data
Analytics
1️⃣ Data Science:
Data Science is the broadest field that involves collecting, processing,
analyzing, and interpreting data to derive insights.
It includes multiple subfields, such as big data, machine learning, data mining,
and data analytics.
• Data Collection & Storage: Gathering data from various sources like social media,
websites, and sensors.
• Data Processing & Cleaning: Ensuring that raw data is structured and useful for
analysis.
• Analysis & Visualization: Using statistical tools and ML to uncover patterns and
trends.
Real-World Example:
• Netflix’s Recommendation System:
• Netflix collects data on what users watch.
• Machine learning algorithms analyze this data to recommend new shows based on user
preferences.
2️⃣ Big Data: Handling Large-Scale Data
Big Data refers to massive volumes of data generated by various digital
sources, including social media, IoT devices, and e-commerce transactions.
Characteristics (3Vs of Big Data):
• Volume: Gigantic amounts of data from sources like Facebook, YouTube, and
Twitter.
• Variety: Data in different formats (text, images, audio, video, etc.).
• Velocity: The speed at which data is created and processed in real time.
Real-World Example:
• Google Search Predictions:
• Google collects massive amounts of search data daily.
• Big data analytics helps in predicting the most relevant search results for users.
3️⃣ Data Mining: Extracting Hidden Patterns

Data mining is the process of discovering hidden patterns in


large datasets.
It is often considered a part of machine learning, but its primary
focus is on exploration and extraction of patterns rather than
prediction.

Real-World Example:
• Amazon’s Customer Insights:
• Amazon uses data mining to analyze customer purchase history.
• This helps in identifying shopping trends and recommending products
accordingly.
4️⃣ Data Analytics: Extracting Actionable Insights

Data analytics involves analyzing raw data to find actionable insights that
can be used for decision-making.

Types of Data Analytics:


• Descriptive Analytics: Summarizes historical data (e.g., company sales
reports).
• Predictive Analytics: Uses ML to forecast future trends (e.g., stock market
prediction).
• Prescriptive Analytics: Suggests actions based on data (e.g.,
recommending inventory restocking for businesses).
5️⃣ Pattern Recognition: Understanding Patterns in Data

Pattern recognition is an engineering-focused application of machine


learning that identifies patterns in data for classification.
Real-World Example:
• Facial Recognition in Smartphones:
• Apple’s Face ID scans a user's face, recognizing unique features to
unlock the phone.
How It Differs from Machine Learning:
• Pattern Recognition: Focuses only on identifying and classifying patterns
in images, text, or signals.
• Machine Learning: Involves learning from data and making decisions based
on it.
Machine Learning and Statistics
Statistics is a branch of mathematics that has a solid theoretical foundation
regarding statistical learning. Like machine learning (ML), it can learn from data. But
the difference between statistics and ML is that statistical methods look for
regularity in data called patterns. Initially, statistics sets a hypothesis and performs
experiments to verify and validate the hypothesis in order to find relationships among
data.

How Machine Learning and Statistics Work Together


Some people argue that Machine Learning is a modern evolution of Statistics
because ML relies on statistical principles.
In fact, many ML algorithms (like Linear Regression, Logistic Regression,
Bayesian models) are based on statistical methods.
Example of Overlap:
• Credit Scoring in Banks:
• Statistics helps analyze past credit history and identify risk factors.
• Machine Learning predicts whether a person will default on a loan.
TYPES OF MACHINE LEARNING
Labelled and Unlabelled Data
Data is a raw fact. Normally, data is represented in the form of a table. Data
also can be referred to as a data point, sample, or an example. Each row of the
table represents a data point. Features are attributes or characteristics of an
object. Normally, the columns of the table are attributes. Out of all attributes,
one attribute is important and is called a label. Label is the feature that we aim
to predict. Thus, there are two types of data – labelled and unlabelled.

Labelled Data: To illustrate labelled data, let us take one example dataset
called the Iris flower dataset, or Fisher's Iris dataset. The dataset has 150
samples of Iris (50 per class), with four attributes: length and width of sepals
and petals. The target variable is called class. There are three classes – Iris
setosa, Iris virginica, and Iris versicolor.
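As a quick illustration, the Iris dataset ships with scikit-learn, so the labelled structure described above can be inspected directly (a minimal sketch, assuming scikit-learn is installed):

```python
# Inspect the Iris dataset: four feature columns plus a class label per row.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)            # four attributes: sepal/petal length and width
print(iris.target_names)             # three classes: setosa, versicolor, virginica
print(iris.data.shape)               # (150, 4): 150 labelled samples, 50 per class
print(iris.data[0], iris.target[0])  # first sample's features and its label (0 = setosa)
```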
Supervised Learning: Learning with Answers
In supervised learning, the model learns from labelled data (data with correct
answers).
The goal is to predict labels for new, unseen data.
Example:
Imagine a teacher giving students a math problem and the correct answer. The
students learn from these examples and solve similar problems on their own.
Types of Supervised Learning:
Regression – Predicts continuous values (e.g., house prices, temperature).
Classification – Predicts categories (e.g., spam vs. non-spam emails, dog vs. cat
images).
Real-World Use Cases:
Spam Detection – Classifies emails as spam or not.
Disease Prediction – Predicts if a patient has a disease based on symptoms.
Stock Market Prediction – Predicts stock prices.
CLASSIFICATION
What is Classification?
Classification is a supervised learning technique used to predict labels
(categories) for new data. It works by learning from labelled data and then
using that knowledge to classify new, unseen data.
Example:
Imagine you have a set of pictures of cats and dogs, where each image is
labelled as either "cat" or "dog." A classification model learns from these
images and can later identify whether a new, unknown image is a cat or a
dog.
How Does Classification Work?
The classification process happens in two stages:
1️⃣ Training Stage
• The model learns from a dataset where each data point is labelled.
• Example: A dataset of animals where each image has a label ("cat" or "dog").
• The model builds a classification structure to understand patterns.
2️⃣ Testing Stage
• The trained model is given new, unseen data.
• It predicts the correct label based on what it learned.
• Example: If given an unknown animal image, the model classifies it as a "cat"
or "dog."
Example with the Iris Dataset:
• Suppose the Iris dataset contains flower details like petal length, petal width,
etc.
• If we provide new flower data (6.3, 2.9, 5.6, 1.8, ?), the classification model
predicts the flower type (see the sketch after the algorithm list below).
Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
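A minimal sketch of the two-stage process, assuming scikit-learn: a Decision Tree is trained on the labelled Iris data, then classifies the new flower (6.3, 2.9, 5.6, 1.8) from the earlier example:

```python
# Training stage + testing stage for a Decision Tree classifier on Iris.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Training stage: learn a classification structure from the labelled samples.
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Testing stage: predict the label of a new, unseen flower.
new_flower = [[6.3, 2.9, 5.6, 1.8]]
print(iris.target_names[clf.predict(new_flower)[0]])  # expected: 'virginica'
```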
Regression Models
What is Regression?
Regression is a supervised learning technique used to predict
continuous values (numbers). Unlike classification (which
predicts categories like "cat" or "dog"), regression predicts
numerical values like sales, temperature, or house prices.
Example:
How can we predict future sales of a product based on previous weeks' sales?
We can use regression to find a pattern in the data and make accurate
predictions.
How Does Regression Work?
A regression model tries to find a mathematical relationship between input
(x) and output (y). It fits a line or curve to the data points.
For example, in Figure 1.8, the relationship between weeks (x) and product
sales (y) is represented by the equation:
y=0.66x+0.54
x = input variable (weeks)
y = output variable (product sales)
0.66 = slope (rate of increase in sales per week)
0.54 = intercept (starting value when x = 0)
Prediction Example:
If we want to predict sales in the 8th week, we substitute x = 8 in the
equation:
y=(0.66×8)+0.54=5.82
So, the model predicts that in the 8th week, sales will be around 5.82 units.
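As a sketch of how such a line can be fitted, the snippet below uses NumPy's least-squares polynomial fit; the weekly sales values are invented to roughly match the equation above:

```python
# Fit a straight line y = slope * x + intercept to weekly sales data.
import numpy as np

weeks = np.array([1, 2, 3, 4, 5, 6, 7])
sales = np.array([1.2, 1.8, 2.6, 3.2, 3.8, 4.5, 5.2])  # invented weekly sales

slope, intercept = np.polyfit(weeks, sales, deg=1)  # least-squares straight line
print(round(slope, 2), round(intercept, 2))         # close to 0.66 and 0.54

print(round(slope * 8 + intercept, 2))  # prediction for the 8th week, near 5.8
```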
Regression vs Classification

Feature           | Regression                                   | Classification
Output Type       | Continuous values (e.g., price, temperature) | Discrete labels (e.g., "cat" or "dog")
Example           | Predicting house prices                      | Identifying spam emails
Algorithm Output  | A number (e.g., 50.6°C, $500)                | A label (e.g., "Spam" or "Not Spam")
Unsupervised Learning
The second kind of learning is by self-instruction. As the name
suggests, there is no supervisor or teacher component; the process
of self-instruction is based on the concept of trial and error.

Here, the program is supplied with objects, but no labels are
defined. The algorithm itself observes the examples and recognizes
patterns based on the principle of grouping: similar objects are
placed in the same group.

Cluster analysis and dimensionality reduction algorithms are
examples of unsupervised algorithms.
Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group
objects into disjoint clusters or groups based on their attributes. All the data
objects of a partition are similar in some aspect and vary significantly from the
data objects in the other partitions.

Some examples of clustering processes are segmentation of a region of
interest in an image and detection of abnormal growth in a medical image.

An example of a clustering scheme is shown in Figure 1.9, where the clustering
algorithm takes a set of dog and cat images and groups them into two clusters:
dogs and cats. It can be observed that the samples within a cluster are similar,
while samples differ radically across clusters.
Some of the key clustering algorithms are:
• k-means algorithm
• Hierarchical algorithm
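A minimal k-means sketch, assuming scikit-learn; the 2-D points are invented and form two loose groups:

```python
# Group unlabelled 2-D points into two clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],      # one loose group
                   [8, 8], [9, 10], [10, 9]])   # another loose group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two learned cluster centres
```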

Dimensionality Reduction
Dimensionality reduction algorithms are examples of unsupervised
algorithms. They take higher-dimensional data as input and output the
data in a lower dimension by taking advantage of the variance in the
data. The task is to reduce the dataset to fewer features without
losing generality.
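As a sketch, PCA (one common dimensionality reduction technique, available in scikit-learn) can project the four Iris features down to two components while keeping most of the variance:

```python
# Reduce the four Iris features to two principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(iris.data.shape, "->", reduced.shape)  # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)         # variance kept by each component
```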
Differences between Supervised and Unsupervised Learning

Semi-supervised Learning
There are circumstances where the dataset has a huge collection of
unlabelled data and some labelled data. Labelling is a costly
process and difficult for humans to perform. Semi-supervised
algorithms use the unlabelled data by assigning each sample a
pseudo-label; the labelled and pseudo-labelled datasets can then be
combined.
Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning
where an agent learns by interacting with an environment
to achieve a goal.
Agent: The learner (it could be a robot, program, or even
a human).
Environment: The world where the agent operates.
Actions: The choices an agent can make.
Rewards & Punishments: The feedback an agent
receives for its actions.
The goal of RL is to maximize rewards over time!
How Does It Work?
Just like how humans learn from trial and error, RL agents learn by taking
actions and getting feedback (positive rewards or negative punishments).
Example: Learning to Play a Grid Game
In the Grid Game (Figure 1.10):
Goal: Reach the target tile.
Danger: Avoid stepping on danger tiles.
Block: Some paths are blocked.
Actions: Move left, right, up, or down.
The agent starts from the bottom-left tile and tries different moves.
If it steps into danger, it gets a negative reward (punishment).
If it moves closer to the goal, it gets a positive reward.
Over time, the agent learns the best path to the goal through experience.
https://www.geeksforgeeks.org/what-is-reinforcement-learning/
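A compact, hypothetical sketch of tabular Q-learning on a small grid follows; the grid layout and reward values are invented and only mirror the idea of Figure 1.10, not its exact tiles:

```python
# Tabular Q-learning on a tiny invented grid: reach GOAL, avoid DANGER.
import random

random.seed(0)
ROWS, COLS = 3, 4
GOAL, DANGER = (0, 3), (1, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

# Q-table: expected future reward for each (state, action) pair.
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in range(4)}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), ROWS - 1),
           min(max(state[1] + dc, 0), COLS - 1))
    if nxt == GOAL:
        return nxt, 10.0, True    # positive reward: reached the target tile
    if nxt == DANGER:
        return nxt, -10.0, True   # punishment: stepped on a danger tile
    return nxt, -0.1, False       # small cost per move favours short paths

for episode in range(500):
    state, done = (2, 0), False   # start from the bottom-left tile
    while not done:
        if random.random() < epsilon:
            action = random.randrange(4)                         # explore
        else:
            action = max(range(4), key=lambda a: Q[(state, a)])  # exploit
        nxt, reward, done = step(state, action)
        best_next = max(Q[(nxt, a)] for a in range(4))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

print(max(range(4), key=lambda a: Q[((2, 0), a)]))  # learned best first move (action index)
```

Over the episodes, the Q-values settle so that the greedy action at each tile points along a short, safe path to the goal.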
CHALLENGES OF MACHINE LEARNING
Computers are better than humans at tasks like computation. For
example, while calculating the square root of a large number, an average
human may struggle, but a computer can display the result in seconds.
Computers can play games like chess and Go, and even beat professional
players.

However, humans are better than computers in many aspects, such as
recognition, although deep learning systems now challenge human beings in
this aspect as well: machines can recognize human faces in a second. Still,
there are tasks where humans are better, as machine learning systems require
quality data for model construction. The quality of a learning system depends
on the quality of its data.
Some of the challenges are listed below
1. Problems – Machine learning can deal with ‘well-posed’ problems,
where specifications are complete and available. Computers cannot solve
‘ill-posed’ problems. Consider one simple example (shown in Table 1.3).

Can a model for this data be multiplication, that is, y = x1 × x2? It is true! But it
is equally true that y may be y = x1 ÷ x2, or y = x1^x2 (for instance, if x2 = 1 in every
row, all three functions produce the same y). So there are three functions that fit the
data, which means the problem is ill-posed. To solve this problem, one needs more
examples to check the model. Puzzles and games that do not have sufficient specification
can become ill-posed problems, and scientific computation has many ill-posed problems.
2. Huge Data
• Machine learning needs a lot of data to learn properly.
• The data should be high quality—it should not have missing values or incorrect
information.
• Example: If you are teaching a computer to recognize dogs, but half of your images are
missing labels, the model will not learn properly.
3. High Computation Power
• More data means more processing power is needed.
• Special hardware like GPUs (Graphics Processing Units) and TPUs (Tensor Processing
Units) help speed up calculations.
• Example: Training a self-driving car model requires processing millions of images, which
needs powerful computers.
4. Complexity of Algorithms
• There are many different machine learning algorithms, and choosing the right one is
difficult.
• Data scientists must compare, test, and tune different algorithms to get the best results.
• Example: If you are building a recommendation system (like Netflix or YouTube), you must
test multiple algorithms to see which one gives the best movie recommendations.
5. Bias/Variance Problem
• Bias: The model is too simple and does not learn well (underfitting).
• Variance: The model is too complex and learns too much from the training
data but fails on new data (overfitting).
• Example:
• Underfitting: A model that always predicts "the temperature is 20°C" no matter the
weather. It does not learn well.
• Overfitting: A student memorizing answers instead of understanding concepts—does
well in practice but fails in real exams.
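A small illustration of the bias/variance trade-off, using NumPy to fit polynomials of increasing degree to a handful of noisy points (the data is invented):

```python
# Underfitting vs. overfitting: polynomial fits of increasing degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)  # noisy samples of a sine curve

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(train_error, 4))

# Degree 1 underfits (high bias); degree 9 drives the training error toward
# zero but wiggles wildly between the points (high variance, overfitting);
# degree 3 is a reasonable middle ground.
```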
MACHINE LEARNING PROCESS
The emerging process model for data mining solutions in business
organizations is CRISP-DM. Since machine learning is similar to data mining
except in its aim, this process can be used for machine learning as well.
CRISP-DM stands for CRoss Industry Standard Process for Data Mining. The
process involves six steps, shown in Figure 1.11.
1. Understanding the Business
• Before analyzing data, you need to understand the problem the business is
facing.
• You also define what kind of solution is needed.
• Example: A retail store wants to predict which customers are likely to return and
shop again.
2. Understanding the Data
• You collect all the data and study its characteristics.
• You look for patterns and form a hypothesis about what might be happening.
• Example: The store collects data on customer purchases and looks for patterns
(e.g., do customers who buy shoes also buy socks?).
3. Preparing the Data
• Raw data often has missing values or errors, so it must be cleaned.
• Missing or incorrect data can lead to wrong predictions.
• Example: If half the customer purchase history is missing, the prediction model
will be unreliable.
4. Modeling
• You apply data mining algorithms to the cleaned data to build a model that
finds patterns.
• Example: A model might predict which customers will return based on their
past shopping habits.
5. Evaluating the Model
• You test the model to see how well it performs.
• The model should make accurate predictions to be useful.
• Example: If the model predicts that a customer will return, but they don’t,
then the model needs improvement.
6. Deploying the Model
• Once the model is working well, it is used in the real world to improve
decision-making.
• Example: The store uses the model to send discount coupons to customers
likely to return.
MACHINE LEARNING APPLICATIONS
• Machine learning is used everywhere today, making daily tasks easier. Here
are some common applications:

1. Sentiment Analysis (Understanding Emotions in Text)


• Machine learning helps analyze text and determine if it expresses
happiness, sadness, anger, or positivity/negativity.
• Example:
• When you read movie or product reviews, the system can automatically assign a 5-
star or 1-star rating based on the words used.
• Emojis can be used to show emotions detected from text.
2. Recommendation Systems (Personalized Suggestions):
• These systems suggest things you might like based on your past behavior.
• Examples:
• Amazon recommends products similar to what you have bought.
• Netflix & YouTube suggest shows or videos you may enjoy.
• Spotify recommends songs based on your listening history.
3. Voice Assistants (AI-Powered Helpers)
• Voice assistants listen to your speech and perform tasks based on commands.
• Examples:
• Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana can answer questions, set
reminders, and even control smart home devices.
• You say "Hey Siri, what's the weather?", and it gives you the forecast.
4. Navigation & Ride-Hailing Apps (Finding the Best Routes)
• Apps like Google Maps and Uber use machine learning to:
• Find the shortest route to your destination.
• Predict traffic conditions to suggest faster paths.
• Estimate how long your ride will take.
• Example: Uber calculates the fastest route and even adjusts prices based on
demand (surge pricing).
Understanding Data
What is Data?
• Data is any fact or piece of information that can be stored and used.
• In computers, data is stored in bits (0s and 1s) and can represent:
• Numbers (e.g., 100, 45.6)
• Text (e.g., "Hello")
• Images (e.g., a photo)
• Audio & Video (e.g., songs, movies)
How is Data Measured?
• Data is stored in bytes, and its size increases as follows:
• 1 Byte = 8 Bits (smallest unit)
• 1 Kilobyte (KB) ≈ 1,000 Bytes
• 1 Megabyte (MB) ≈ 1,000 KB
• 1 Gigabyte (GB) ≈ 1,000 MB
• 1 Terabyte (TB) ≈ 1,000 GB
• 1 Exabyte (EB) ≈ 1,000,000 TB (huge amount of data!)
Where is Data Stored?
• Data can be stored in:
• Flat files (like simple text files)
• Databases (structured collections of data)
• Data warehouses (large-scale storage for analysis)
Types of Data
1. Operational Data → Used for daily business activities
   • Example: Sales transactions, customer orders, employee attendance
2. Non-Operational Data → Used for analysis and decision-making
   • Example: Reports analyzing last year’s sales trends
Why is Data Important?
• Raw data alone is meaningless.
• It becomes useful only when processed into information
Elements of Big Data
• Data whose volume is small enough to be stored and processed by a
small-scale computer is called 'small data'.

Characteristics of Big Data:

1️⃣ Volume → Size of Data


Since there is a reduction in the cost of storage devices, there has been
tremendous growth in data. Small, traditional data is measured in gigabytes
(GB) and terabytes (TB), but Big Data is measured in petabytes (PB) and
exabytes (EB). One exabyte is 1 million terabytes.
2️⃣ Velocity → Speed of Data Generation
Velocity denotes the speed at which data arrives and the resulting growth in
data volume. The availability of IoT devices and Internet connectivity ensures
that data arrives at a faster rate. Velocity helps in understanding the relative
growth of big data and its accessibility by users, systems, and applications.
3️⃣ Variety → Different Types of Data
• Form: Data can be text, images, audio, video, graphs, maps, etc.
• Function: Includes data from transactions, conversations, archives, etc.
• Source: Data can come from public data, social media, and multimodal
sources
4️⃣ Veracity → Trustworthiness & Accuracy of Data
• Data may contain errors, biases, or inconsistencies.
• Example: Misinformation on social media reduces data reliability.
5️⃣ Validity → Relevance & Accuracy of Data for Decision-Making
• Ensures that data is useful and meaningful for a specific purpose.
• Example: Medical diagnosis data should be highly accurate to avoid errors.

6️⃣ Value → Importance of Data for Decision-Making


• Data must provide useful insights to be valuable.
• Example: Analyzing customer purchase history helps in targeted marketing.

Precision, Bias & Accuracy in Big Data


• Precision → How close repeated measurements are to each other.
• Bias → Systematic errors due to faulty algorithms or incorrect assumptions.
• Accuracy → How close data is to the true value.
Types of Big Data

1️⃣ Structured Data – Well-organized data stored in tables.


2️⃣ Unstructured Data – Data without a fixed format, like images and videos.
3️⃣ Semi-Structured Data – A mix of structured and unstructured data, like
JSON or XML files.

1️⃣ Structured Data


• Data stored in tables (like in databases).
• Can be easily searched and retrieved using tools like SQL.
Examples in Machine Learning:
Record Data – A dataset where each row is an object (like a customer) and each
column is a feature (like age, income).
Data Matrix – A table where all values are numbers, allowing mathematical
operations.
Graph Data – Data that shows relationships, like webpages linked together.
Ordered Data – Data with a specific order, such as time-based sales records.
Types of Ordered Data:
Temporal Data – Data based on time (e.g., customer shopping trends during holidays).
Sequence Data – Data without timestamps but arranged in order (e.g., DNA sequences: ATGC).
Spatial Data – Data related to locations (e.g., maps and GPS data).

2️⃣ Unstructured Data (No Fixed Format)


• Includes: Text, images, videos, audio, and blog posts.
• Difficult to store in tables because it doesn’t follow a structured format.
• Example: A YouTube video contains audio, text (captions), and images all
combined.
• Fun Fact: 80% of all data is unstructured!

3️⃣ Semi-Structured Data (Partly Organized)


• Not fully structured but has some organization.
• Example: JSON, XML, and RSS feeds.
• Common in: Web APIs, emails, and configuration files.
Data Storage and Representation
Once data is collected, it needs to be stored properly so that it can be used
for analysis. The goal of data storage is to make data easy to access and
process. Different types of storage methods are used, ranging from simple
files to large databases.
1️⃣ Flat Files (Simple Text-Based Storage)
• The simplest and cheapest way to store data.
• Stores data in plain text format.
• Best for small datasets, but not suitable for large datasets because small changes
can affect results.
• Common Flat File Formats:
CSV Files (Comma-Separated Values)
• Data is stored in a table format, with values separated by commas (,)
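A minimal sketch of reading such a CSV file with Python's standard csv module; sales.csv and its column names are hypothetical:

```python
# Read a comma-separated flat file row by row.
import csv

# sales.csv is assumed to have a header row: product,units,price
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):     # each row becomes a dict keyed by the header
        print(row["product"], int(row["units"]), float(row["price"]))
```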
2️⃣ Database Systems (More Organized Storage)
• A structured way to store data for efficient retrieval and management.
• It consists of:
Database files – Contain the actual data and metadata (information about
data).
DBMS (Database Management System) – A software that helps store, retrieve,
and manage data efficiently.
Example: Relational Databases (SQL-based databases like MySQL,
PostgreSQL, and Oracle)
• Data is stored in tables (like an Excel sheet).
• Each table has columns (attributes) and rows (records/tuples).
• SQL (Structured Query Language) is used to manage and retrieve data.
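As a small sketch of a relational table and an SQL query, the snippet below uses Python's built-in sqlite3 module; the customers table and its rows are invented:

```python
# Create a relational table, insert rows, and query it with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Asha", 29), (2, "Ravi", 35)])

for row in conn.execute("SELECT name, age FROM customers WHERE age > 30"):
    print(row)  # ('Ravi', 35)
conn.close()
```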
Different Types of Databases
1. Transactional Database
Stores transactional records, where each record represents a transaction.
Includes details like time stamp, identifier, and a set of items linked to other tables.
Used for associational analysis (finding patterns and relationships in transactions).
Example:
• Banking Systems: Records of deposits, withdrawals, and payments.
• E-commerce: Stores customer purchase history (e.g., Amazon transactions).

2. Time-Series Database
Stores time-related data (data collected over time).
Data is organized based on timestamps (hourly, daily, weekly, etc.).
Helps track trends and patterns over time.
Example:
• Stock Market Data – Records daily price changes of stocks.
• Weather Monitoring – Tracks temperature, humidity, and rainfall over time.
• Website Traffic Logs – Stores number of website visitors per hour/day.
3. Spatial Database
Stores location-based (spatial) data in two formats:
• Raster Format → Uses bitmaps or pixel maps (e.g., satellite images).
• Vector Format → Stores geometric shapes like points, lines, and polygons (e.g.,
maps).
Example:
• Google Maps & GPS Navigation – Stores geographic locations, routes, and places.
• Weather Maps – Uses raster images to display temperature, pressure, etc.

4. World Wide Web (WWW) Databases


Stores and processes web-based data from websites, search engines, and
online platforms.
Used in data mining to extract useful patterns from web pages.
Example:
• Google Search Index – Stores and organizes billions of web pages.
• E-commerce Recommendations – Amazon mines web data to recommend
products
5. Data Stream Database
Dynamic & real-time data that continuously flows in and out of the system.
Data comes in high volume and at a fast speed.
Example:
• Live Stock Market Prices – Real-time data streaming for stock trading.
• Social Media Feeds – Twitter live feed updates.
• IoT Sensor Data – Continuous readings from smart devices (e.g.,
temperature sensors).
6. RSS Feeds (Really Simple Syndication)
Used to share instant updates across services.
Helps users subscribe to news, blogs, or podcasts.
Example:
• News Websites – RSS feeds for real-time news updates.
• Podcast Subscriptions – Apps like Spotify use RSS for new episode alerts.

7. JSON Database (JavaScript Object Notation)


Lightweight data format for storing and exchanging data.
Widely used in machine learning and APIs.
Example:
• Web Applications – Stores and transmits data between a client and a server.
• AI & Machine Learning Models – Stores model configurations and results.
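A short sketch of JSON parsing and serialization with Python's standard json module; the record below is invented for illustration:

```python
# Parse JSON text into a Python dict and serialize it back.
import json

record = '{"model": "spam-filter", "accuracy": 0.97, "features": ["word_counts"]}'
config = json.loads(record)          # JSON text -> Python dict
print(config["model"], config["accuracy"])

config["accuracy"] = 0.98
print(json.dumps(config, indent=2))  # Python dict -> JSON text
```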
Big Data Analytics and Types of Analytics
The goal of Big Data Analytics is to help organizations make better
decisions. It takes raw data, processes it, and extracts useful
insights.

Difference Between Data Analysis & Data Analytics


Data Analysis → Focuses on examining historical data to extract
useful information.
Data Analytics → Includes data collection, preprocessing,
analysis, and predictions for future decisions.
Data Analysis is a part of Data Analytics, but Data Analytics
has a broader scope, including predictive and prescriptive
analysis.
4 Types of Data Analytics
1️⃣ Descriptive Analytics ("What Happened?")
Summarizes past data to find patterns.
Uses statistics & visualization (charts, graphs, reports).
Example: A company finds that sales increased by 20% last quarter.
2️⃣ Diagnostic Analytics ("Why Did It Happen?")
Explains the cause of past events.
Finds relationships between different factors.
Example: A store analyzes why sales dropped in one region—maybe due to
low stock or bad weather.
3️⃣ Predictive Analytics ("What Will Happen?")
Uses machine learning & statistical models to forecast future trends.
Finds patterns in historical data to predict outcomes.
Example: An e-commerce website predicts which products will be in demand
next month.
4️⃣ Prescriptive Analytics ("What Should We Do?")
Gives recommendations based on predictive insights.
Helps organizations take actions to optimize decisions.
Example: A company optimizes ad spending based on predicted customer
behavior.
Big Data Analysis Framework
• Big Data analysis follows a structured framework to handle large volumes of
data efficiently. This framework is divided into four layers, each with a
specific function.
4-Layer Big Data Framework
• 1️⃣ Data Connection Layer
Collects and imports raw data from various sources.
Uses ETL (Extract, Transform, Load) to process data.
• Example: Data from social media, sensors, and websites is converted
into a structured format.
• 2️⃣ Data Management Layer
Stores, organizes, and manages data for fast access.
Allows parallel processing (running multiple queries at the same time).
Uses data warehouses or on-demand data retrieval methods.
• Example: A bank stores millions of transactions and retrieves them
instantly when needed.
3️⃣ Data Analytics Layer
Performs statistical analysis and machine learning.
Builds and validates ML models for predictions and decision-making.
• Example: A company predicts customer behavior based on past purchase history.
4️⃣ Presentation Layer
Displays results using dashboards, graphs, and reports.
Helps businesses interpret insights for decision-making.
• Example: Google Analytics shows website traffic trends using visual reports.
Big Data Processing Cycle
To process and analyze data efficiently, Big Data follows these steps:
Data Collection – Gathering raw data from various sources.
Data Preprocessing – Cleaning and preparing data for analysis.
Applying Machine Learning – Using ML algorithms for predictions
and insights.
Interpreting & Visualizing Results – Presenting data insights using
charts and reports.
Data Collection
Data collection is the first and most important step in Big Data analytics.
High-quality data leads to better analysis and predictions. However,
collecting "good data" is challenging.

What is Good Data?


Timeliness – Data should be up-to-date and not outdated.
Relevance – Data should be useful for machine learning or data mining.
Understandability – Data should be clear, well-organized, and free from
bias.
Example:
If a company wants to predict customer buying behavior, it should use
recent sales data, not outdated data from years ago.
Types of Data Sources
Data comes from various sources. It can be categorized into three main types:
1️⃣ Open/Public Data Sources (Freely Available Data)
No copyright restrictions – Anyone can use it.
Common sources:
• Government Data (e.g., Census data, weather reports).
• Scientific Data (e.g., Genomic & biological data).
• Healthcare Data (e.g., Patient records, medical research).
Example: Researchers use COVID-19 patient records to analyze the virus's
spread.
2️⃣ Social Media Data (User-Generated Data)
Collected from platforms like Twitter, Facebook, Instagram, YouTube.
Generates huge amounts of data daily.
Useful for trend analysis, customer sentiment, and marketing strategies.
Example:
• A company analyzes Twitter comments to see if customers like a new product.
3️⃣ Multimodal Data (Multiple Data Types)
Includes text, images, videos, and audio together.
Found in image databases, websites, and archives.
Data is often heterogeneous (different formats).
Example:
• YouTube videos contain video, audio, comments, and subtitles—all
different data types.
Data Preprocessing
In the real world, raw data is often messy or "dirty", meaning it contains
errors, missing values, and inconsistencies. Before using this data for
machine learning or data mining, it must be cleaned and organized to
improve accuracy.
Steps in Data Preprocessing
1️⃣ Data Cleaning – Detecting and fixing errors, missing values, and
duplicates.
2️⃣ Data Wrangling – Formatting the data to be usable for machine
learning.
3️⃣ Noise Removal – Filtering out random distortions (e.g., removing
unwanted characters).
4️⃣ Outlier Detection & Handling – Identifying and fixing extreme values.
Example:
• If a person’s age is recorded as 200, the system can remove or correct it.
• If salary values are missing, they can be filled with an average salary value.
Missing Data Analysis
When dealing with real-world data, missing values are a common
problem. If missing data is not handled properly, it can lead to
inaccurate machine learning models and biased results.

What is Missing Data Analysis?


Missing Data Analysis is the first step in data cleaning.
The goal is to fill in missing values, remove noise and outliers,
and correct inconsistencies to improve data quality.
Why is it important?
Prevents biased predictions in machine learning.
Helps in avoiding model overfitting.
Removing Noisy & Outlier Data
Noise is random errors or distortions in data.
Outliers are values that differ significantly from the rest of the data.
Example of Outliers:
• A dataset contains ages of employees: [25, 30, 35, 120, 40].
• The value 120 is an outlier and likely an error.

Techniques to Remove Noise & Outliers:


Binning – Group data into bins (small groups) and smooth values.
Smoothing by Mean – Replace each bin’s values with their average.
Smoothing by Median – Replace values with the middle value of the bin.
Smoothing by Boundaries – Replace extreme values with the nearest
boundary (max/min).
Data Integration & Data Transformation
When working with large datasets, data often comes from multiple sources
(e.g., databases, spreadsheets, APIs). To make it useful for machine learning
or data mining, we need to:
Merge the data (Data Integration)
Standardize the data (Data Transformation)

What is Data Integration?


Combines data from multiple sources into a single dataset.
Helps in removing duplicate and redundant data.
Makes it easier for data mining and analysis.
Example:
• A company collects customer data from website sign-ups, store
purchases, and mobile apps.
• Data integration merges these sources to create a single customer profile.
Challenges in Data Integration:
Duplicate Records – The same customer may appear multiple times.
Inconsistent Formats – One system may store "Date of Birth" as DD/MM/YYYY,
while another uses MM-DD-YYYY.
What is Data Transformation?
Once data is integrated, it must be transformed into a suitable format for
machine learning.
Converts raw data into a usable format.
Removes inconsistencies and scales values properly.
Normalization (Scaling Data)
Why Normalize?
• Different features in a dataset may have different scales.
• Example: Salary (in thousands) vs. Age (in years) → Salary values will
dominate.
• Normalization brings all values into the same range (e.g., 0 to 1).
Z-Score Normalization
Ex1: Consider the mark list V = {10, 20, 30} and convert the marks to z-scores.
The z-score is z = (x − μ) / σ, where μ is the mean and σ is the standard
deviation. Here μ = 20 and, using the sample standard deviation, σ = 10, so the
z-scores are −1, 0, and 1.
What is Standard Deviation?
• Standard deviation (σ) is a measure of how spread out the numbers
in a dataset are. It tells us how much the values in a dataset deviate
from the mean.
• If the standard deviation is small, the data points are closely
packed around the mean (less variation).
If the standard deviation is large, the data points are more
spread out from the mean (high variation).
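A short sketch of the z-score computation for the mark list above, using NumPy; ddof=1 selects the sample standard deviation, which is 10 here:

```python
# Z-score normalization of the mark list V = {10, 20, 30}.
import numpy as np

v = np.array([10.0, 20.0, 30.0])
z = (v - v.mean()) / v.std(ddof=1)  # z = (x - mean) / sample std
print(z)                            # [-1.  0.  1.]
```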
Descriptive Statistics
Descriptive statistics is a way to summarize and describe a dataset using
numbers and visualizations. It helps us understand the data before
applying machine learning algorithms.

What is Descriptive Statistics?


Summarizes large datasets into meaningful insights.
Helps in understanding patterns and trends in data.
Does NOT make predictions—it only describes what the data looks like.
Example:
• A company has customer sales data for a year.
• Descriptive statistics can show:
• Average sales per month
• Most popular product
• Highest and lowest sales values
What is Exploratory Data Analysis (EDA)?
EDA is the first step in data analysis, where we examine data to:
Detect patterns & trends
Identify missing values & outliers
Choose the best machine learning techniques
Example:
• A bank analyzing customer transactions might use EDA to find fraudulent
activities by checking for unusual spending patterns.
1. Categorical (Qualitative) Data
• These are descriptive data that represent categories or labels. They do not have
numerical meaning.
Categorical data is divided into:
a) Nominal Data
• These are just labels or names without any order.
• You cannot compare or perform mathematical operations like addition or
subtraction.
Examples:
• Colors: Red, Blue, Green
• Gender: Male, Female, Other
• Blood Type: A, B, AB, O
• Nationality: Indian, American, Canadian
Example of what you CANNOT do:
• Saying "Red is greater than Blue" makes no sense!
• Taking an average of Blood Types (A + B) is meaningless.
b) Ordinal Data
• This data has a natural order or ranking, but the differences between values
are not meaningful.
• We cannot do arithmetic operations (addition, subtraction) on them.
Examples:
• Movie Ratings: 1 star , 2 stars , 3 stars
• Education Level: High School, Bachelor's, Master's, PhD
• Customer Satisfaction: Poor, Average, Good, Excellent
• Spiciness of Food: Mild , Medium, Hot
Example of what you CANNOT do:
• "Good − Poor = Average" does not make sense!
• The difference between "Medium" and "Hot" spiciness is not measurable in
numbers.
2. Numerical (Quantitative) Data
• These are measurable numbers that represent quantities.
Numerical data is divided into:
a) Interval Data
• The difference between values is meaningful, but there is no true zero.
• Zero does NOT mean "nothing."
• Only addition (+) and subtraction (-) are meaningful.
Examples:
• Temperature (Celsius, Fahrenheit): 30°C, 40°C, 50°C (But 0°C does NOT
mean "no temperature"!)
• Dates & Time: Year 2000, 2010, 2020 (Years are measured, but 0 AD does not
mean "no time.")
Example of what you CANNOT do:
• Saying "40°C is twice as hot as 20°C" is wrong because temperature does
not have a true zero.
b) Ratio Data
• The difference AND ratio between values are meaningful.
• Zero means "nothing."
• We can perform all arithmetic operations ( +, -, ×, ÷ ).
Examples:
• Height: 0 cm means no height. A person with 180 cm is twice as tall as
a person with 90 cm.
• Weight: 0 kg means no weight. A 100 kg object is twice as heavy as a
50 kg object.
• Salary: $0 means no salary. A person earning $4000 earns twice as
much as someone earning $2000.
• Speed: 0 km/h means no movement. A car moving at 80 km/h is twice
as fast as one at 40 km/h.
Univariate Data Analysis & Visualization
Univariate analysis is the simplest type of data analysis because it
examines only one variable at a time. It does not look at relationships
between variables; instead, it describes the data and finds patterns.
Example:
• Analyzing students’ test scores (only one variable: marks).
• Checking monthly sales revenue (only one variable: sales).

Components of Univariate Data Analysis


Frequency Distributions – Shows how often each value appears in the
dataset.
Central Tendency Measures – Mean, Median, and Mode.
Dispersion (Variation) – Range, Variance, and Standard Deviation.
Shape of Data – Skewness and Kurtosis.
Data Visualization
Central Tendency
Why Do We Need Central Tendency?
We can't remember all data points, so we summarize
data using central tendency.
Helps find a single representative value for a dataset.
Useful for comparison and decision-making in data
analysis.
The three main measures of central tendency are:
1️⃣ Mean (Average)
2️⃣ Median (Middle Value)
3️⃣ Mode (Most Frequent Value)
Dispersion
What is Dispersion?
Dispersion measures how spread out data is around
the central tendency (mean, median, or mode).
If the data points are close together, dispersion is low;
if they are far apart, dispersion is high.
It helps us understand variability in a dataset.
Example:
• Dataset 1: [18, 19, 20, 21, 22] → Low dispersion (values are
close together).
• Dataset 2: [5, 10, 20, 35, 50] → High dispersion (values are
spread out).
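A small sketch computing central tendency and dispersion for the two example datasets above, using Python's standard statistics module:

```python
# Mean, median, and sample standard deviation for the two example datasets.
import statistics as st

low_spread = [18, 19, 20, 21, 22]
high_spread = [5, 10, 20, 35, 50]

for data in (low_spread, high_spread):
    print(st.mean(data), st.median(data), round(st.stdev(data), 2))

# Both medians are 20, but the sample standard deviation jumps from about
# 1.58 to about 18.51, which captures the difference in dispersion.
```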
SHAPE
SKEWNESS
Skewness measures the direction and degree of asymmetry in a dataset. Ideally, a perfectly symmetrical
dataset has zero skewness, meaning it follows a normal distribution. However, real-world data often have
some skewness.
Types of Skewness (Figure 2.8)
1. Positive Skewness (Right-Skewed Distribution)
   • The tail is longer on the right side.
   • The dataset has higher values (outliers) that pull the mean to the right.
   • Mean > Median > Mode
   • Example: Income distribution (a few people earn very high salaries).
2. Negative Skewness (Left-Skewed Distribution)
   • The tail is longer on the left side.
   • The dataset has lower values (outliers) that pull the mean to the left.
   • Mean < Median < Mode
   • Example: Age at retirement (most people retire at a similar age, but some retire much earlier).
Impact of Skewness on Data Analysis
• If a dataset is skewed, it has a higher chance of outliers, which can affect
statistical calculations and machine learning models.
• Symmetrical Data (No skewness) → Mean = Median = Mode
• Positively Skewed Data → Mean is greater than the median.
• Negatively Skewed Data → Median is greater than the mean.
KURTOSIS
Kurtosis measures the peak and tail heaviness of a distribution compared to
a normal distribution. It helps to understand whether data has extreme outliers
or follows a more uniform distribution.
Types of Kurtosis
1. Leptokurtic (High Kurtosis)
   • Very peaked distribution with heavy tails (many extreme values).
   • More outliers are present.
   • Example: Financial stock returns with occasional extreme gains or losses.
2. Mesokurtic (Normal Kurtosis)
   • The same kurtosis as a normal distribution.
   • No extreme peaks or heavy tails.
   • Example: Heights of adults in a population.
3. Platykurtic (Low Kurtosis)
   • Flat-topped distribution with light tails (fewer extreme values).
   • Fewer outliers.
   • Example: Uniform distribution of test scores.
Special Univariate Plots
• Stem-and-Leaf Plot
• A Stem-and-Leaf Plot is a simple way to visualize numerical data and
understand its distribution.
• Each number is split into two parts:
• Stem: Represents the leading digits (tens place).
• Leaf: Represents the last digit (ones place).
• Example:
• The number 45 is divided into stem = 4 and leaf = 5.
Why Use It?
• It helps to see how data is spread out.
• It shows how frequently certain values appear.
Q-Q (Quantile-Quantile) Plot
• A Q-Q Plot is a graph that helps check if a dataset follows a normal
distribution (bell-shaped curve).
• It compares:
• Actual Data Quantiles (values from the dataset).
• Theoretical Normal Quantiles (expected values if the data were normally
distributed).
• If the data follows a normal distribution:
• The points will fall along a straight diagonal line (45° line).
• If not:
• The points will deviate from the line, indicating a different distribution.
The normal Q-Q plot of marks x={13 11 2 3 4 8 9}
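A minimal sketch of drawing such a normal Q-Q plot for these marks, assuming SciPy and Matplotlib are available:

```python
# Normal Q-Q plot for the marks: points near the reference line suggest normality.
import matplotlib.pyplot as plt
from scipy import stats

marks = [13, 11, 2, 3, 4, 8, 9]
stats.probplot(marks, dist="norm", plot=plt)  # sample quantiles vs. normal quantiles
plt.show()
```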
