UNIT-I
Because of its learning and decision-making abilities, machine learning is often referred to as AI, though,
in reality, it is a subdivision of AI.
Until the late 1970s, it was a part of AI’s evolution. Then, it branched off to evolve on its own.
Machine learning has become a very important response tool for cloud computing and e-commerce, and
is being used in a variety of cutting-edge technologies.
Below is a brief history of machine learning and its role in data management.
Machine learning is a necessary aspect of modern business and research for many organizations today.
It uses algorithms and neural network models to assist computer systems in progressively improving
their performance.
Machine learning algorithms automatically build a mathematical model using sample data – also known as “training data” – to make decisions.
Machine learning (ML) has evolved from simple algorithms and rule-based systems to complex models
and large datasets.
1943
Walter Pitts and Warren McCulloch published the first mathematical model of a neural network.
1949
Donald Hebb published The Organization of Behavior, which presented theories on how behavior relates
to neural networks and brain activity.
1950
Alan Turing created the Turing Test to determine if a computer has real intelligence.
1956
Arthur Samuel created a checkers-playing program that learned from experience and went on to play at championship level; he coined the term "machine learning" in 1959.
1967
Cover and Hart published an article on the Nearest Neighbor Algorithm, which automatically identifies
patterns within large datasets.
2002
The release of Torch, an open-source software library, made it easier for researchers and developers to
build machine learning models.
Rise of GPUs
Graphics Processing Units (GPUs) excel at parallel processing, making them ideal for training complex
neural networks.
A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps to build a better model that predicts the output more accurately.
Machine learning (ML) is a subset of artificial intelligence (AI) that allows computers to learn and improve
from data without being explicitly programmed. ML systems use algorithms to analyze large data sets,
identify patterns, and make predictions and decisions.
Machine learning accesses vast amounts of data (both structured and unstructured) and learns from it to
predict the future. It learns from the data by using multiple algorithms and techniques. Below is a
diagram that shows how a machine learns from data.
Fig. How Machine Learning works with past data
Common real-world applications of this approach include:
Fraud detection
Product recommendations
Dynamic pricing
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
These paradigms differ in the tasks they can solve and in how the data is presented to the computer.
1. Supervised Learning:
Supervised learning is a type of machine learning method in which we provide sample labeled data to the
machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the datasets and learn about each of them; once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher.
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
I. Classification -- Classification algorithms are used when the output variable is categorical, i.e., it falls into classes such as Yes-No, Male-Female, True-False, etc. Popular classification algorithms include Logistic Regression, Decision Trees, Random Forest, and Support Vector Machines.
II. Regression -- Regression algorithms are used when there is a relationship between the input variable and the output variable and the output is a continuous value, as in weather forecasting, market-trend prediction, etc. Popular regression algorithms include Linear Regression, Polynomial Regression, and Regression Trees.
-> With the help of supervised learning, the model can predict the output on the basis of prior experience.
-> In supervised learning, we can have an exact idea about the classes of objects.
-> The supervised learning model helps us to solve various real-world problems such as fraud detection and spam filtering.
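As a minimal sketch of this supervised workflow (assuming scikit-learn is installed; the tiny labeled dataset below is made up for illustration), training and prediction look like this:

# Supervised-learning sketch: train on labeled data, then predict on new data.
# Assumes scikit-learn; the small dataset below is made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: features (hours studied, classes attended) -> label (Pass=1 / Fail=0)
X_train = [[2, 3], [1, 1], [8, 9], [7, 8], [3, 2], [9, 10]]
y_train = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the mapping from inputs to labels

# Test the trained model on a new, unseen sample
print(model.predict([[6, 7]]))       # expected output: [1] (Pass)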
2. Unsupervised Learning:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabelled dataset and are allowed to act on that data without any supervision.”
Unsupervised learning is a learning method in which a machine learns without any supervision.
(or)
The training is provided to the machine with the set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group of objects
with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights
from the huge amount of data.
Clustering -- Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
Association -- An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database.
It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
K-means clustering
Hierarchical clustering
Anomaly detection
Neural Networks
Apriori algorithm
-> Unsupervised learning is used for more complex tasks as compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.
-> Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled data.
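As a minimal sketch of clustering on unlabeled data (assuming scikit-learn; the 2-D points are made up), K-means can be used like this:

# Unsupervised-learning sketch: K-means groups unlabeled points into clusters.
# Assumes scikit-learn; the 2-D points below are illustrative only.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],         # no labels are provided
     [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # the algorithm discovers the two groups itself

print(labels)                        # e.g. [1 1 1 0 0 0] (cluster ids, not predefined classes)
print(kmeans.cluster_centers_)       # the discovered group centres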
3. Reinforcement Learning:
The goal of reinforcement learning is to train an agent to complete a task within an uncertain
environment. The agent receives observations and a reward from the environment and sends actions to
the environment. The reward measures how successful an action is with respect to completing the task goal. Typical applications include:
Automated robots
Image processing
Industrial automation
Resource management
Healthcare
Autonomous vehicles
Gaming
Recommendation systems
Rote Learning:
Rote learning is a memorization technique based on repetition. The method rests on the premise that
the recall of repeated material becomes faster the more one repeats it. Some of the alternatives to rote
learning include meaningful learning, associative learning, spaced repetition and active learning.
Rote learning is the process of memorizing specific new items as they are encountered.
The meaning of rote in ‘rote learning’ itself means learning by repetition. The process of repeating
something over and over engages the short-term memory and allows us to quickly remember basic things
like facts, dates, names, multiplication tables, etc. It differs from other forms of learning in that it doesn’t
require the learner to carefully think about something, and is rather dependent on the act of repetition
itself.
Rote learning is widely used in the mastery of foundational knowledge. Examples of school topics where
rote learning is frequently used include phonics in reading, the periodic table in chemistry, multiplication
tables in mathematics, anatomy in medicine, cases or statutes in law, basic formulae in any science, etc.
Rote learning is also used to describe a simple learning pattern used in machine learning, although it does
not involve repetition, unlike the usual meaning of rote learning. The machine is programmed to keep a
history of calculations and compare new input against its history of inputs and outputs, retrieving the
stored output if present. This pattern requires that the machine can be modeled as a pure function — always producing the same output for the same input — and can be sketched as follows:
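A minimal sketch of this store-and-retrieve pattern in plain Python (the compute function below is a made-up stand-in for any deterministic calculation):

# Rote-learning sketch: store every (input -> output) pair and reuse it when the
# same input appears again. Works only if compute() is a pure function.
history = {}

def compute(x):
    # stands in for any expensive, deterministic calculation (illustrative only)
    return x * x

def rote_learner(x):
    if x in history:                 # seen before: retrieve the stored output
        return history[x]
    result = compute(x)              # new input: calculate once ...
    history[x] = result              # ... and memorize the result
    return result

print(rote_learner(4))               # computed: 16
print(rote_learner(4))               # retrieved from history: 16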
Rote learning is considered useful for a variety of reasons. Here are a few
With rote, one can remember just about anything over time and repetition.
Rote learning allows one to recall information wholly, and even to retain it for life.
Rote learning makes it easier to score for people who find it difficult to understand or master the material they read.
On the other hand, there are a few drawbacks of rote learning to be aware of as well:
There is no connection between new and old information with rote learning.
Learning by Induction:
Inductive learning is a machine learning technique that uses a labeled dataset to train a model to
generalize and make predictions about new data:
Inductive learning involves a two-phase process: training and testing. During the training phase,
the machine learning model learns from a labeled dataset, where the input data is paired with their
corresponding outputs.
1. Training
The model analyzes a labeled dataset to learn patterns and build a generalized representation of the
data.
2. Testing
The trained model is tested on a separate dataset to evaluate its performance. The model's ability to
generalize and make accurate predictions on new data is assessed.
3. Refining
The model's hypotheses are refined based on feedback from the evaluation step. The model is updated
or revised to improve its performance and generalize better to new instances.
Inductive learning is also known as concept learning. It is a way for AI systems to derive generalized rules from specific observations.
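A minimal sketch of the training and testing phases (assuming scikit-learn and its built-in Iris dataset; the hyperparameter values are illustrative):

# Inductive-learning sketch: learn a general rule from labeled examples (training),
# then check how well it generalizes to unseen examples (testing).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Phase 1: training on labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Phase 2: testing generalization on data the model has never seen
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))

# Phase 3: refining, e.g. trying a different hyperparameter and re-evaluating
model = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print("refined accuracy:", accuracy_score(y_test, model.predict(X_test)))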
Applications:
Disease diagnosis,
Face recognition,
Autonomous driving.
Reinforcement Learning:
Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions that achieve the best possible results.
It mimics the trial-and-error learning process that humans use to achieve their goals.
(or)
Reinforcement learning (RL) is a machine learning technique that teaches software to make
decisions by using a reward-and-punishment system.
RL mimics the way humans learn through trial and error, where actions that lead to a
desired outcome are reinforced, while actions that don't are ignored.
In simple words, we can say that the output depends on the state of the current input, and the next input depends on the output of the previous input.
Typical application areas include:
Healthcare
Robotics
Marketing
Gaming
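The reward-driven trial-and-error loop described above can be sketched with tabular Q-learning (plain Python; the 5-state corridor environment and all parameter values are made up for illustration):

# Reinforcement-learning sketch: tabular Q-learning on a tiny 5-state corridor.
# The agent gets reward +1 only when it reaches the rightmost state; actions that
# lead toward that state get reinforced over the episodes.
import random

n_states, actions = 5, [0, 1]            # action 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != n_states - 1:         # episode ends at the goal state
        if random.random() < epsilon:    # explore ...
            action = random.choice(actions)
        else:                            # ... or exploit what was learned so far
            action = 0 if Q[state][0] >= Q[state][1] else 1
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: reinforce actions in proportion to reward + future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # the "move right" values dominate, i.e. the rewarded behaviour was learned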
Types of Data:
Incorrect identification of data types leads to incorrect modeling which in turn leads to an
incorrect solution.
It refers to the set of observations or measurements that can be used to train a machine-
learning model.
The quality and quantity of data available for training and testing play a significant role in
determining the performance of a machine-learning model.
Data can be in various forms such as numerical, categorical, or time-series data, and can
come from various sources such as databases, spreadsheets, or APIs.
Here I will be discussing different types of data types with suitable examples.
1.Quantitative
2.Qualitative
1. Quantitative Data Type:
This data type consists of numerical values, i.e., anything that is measured in numbers. It is further divided into discrete and continuous data.
Discrete data: the values can be counted, and if expressed in decimal format they have no proper meaning.
E.g.: number of cars you have, number of marbles in a container, students in a class, etc.
Continuous data: numerical measures that can take any value within a certain range. If expressed in decimal format the values have true meaning. They cannot be counted but are measured, and the number of possible values is infinite.
2. Qualitative Data Type:
These are the data types that cannot be expressed in numbers. They describe categories or groups and are hence known as the categorical data type.
This type of data is recorded as either numbers or words and is usually expressed in tabular format. It can take numerical values, but mathematical operations cannot be performed on those values.
E.g.: Sunny=1, Cloudy=2, Windy=3, or binary data like 0 or 1, Good or Bad, etc.
Data that does not have a proper format is known as unstructured data. This comprises textual data, sounds, images, videos, etc.
Besides this, there are also other types, referred to as data measures (levels of measurement):
Nominal
Ordinal
Interval
Ratio
Nominal: This is used to express names or labels that have no order and cannot be measured numerically.
Ordinal: This is also a categorical data type like nominal data, but it has some natural ordering associated with it.
Interval: This is numeric data with a proper order in which the difference between values is meaningful, but there is no absolute zero; zero does not mean complete absence of the quantity. This is the local scale.
E.g., temperature measured in degrees Celsius, time of day, SAT score, credit score, pH, etc.
Ratio: This quantitative data type is the same as the interval data type but has an absolute zero. Here zero means complete absence and the scale starts from zero. This is the global scale.
E.g., temperature in Kelvin, height, weight, income, etc.
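A small pandas sketch (the column names and values are made up) showing how these measures can be represented and how an ordinal column keeps its ordering:

# Data-measures sketch: nominal, ordinal, interval and ratio columns in pandas.
# The column names and values are made up purely for illustration.
import pandas as pd

df = pd.DataFrame({
    "city":         ["Delhi", "Pune", "Delhi"],     # nominal: labels, no order
    "satisfaction": ["low", "high", "medium"],      # ordinal: ordered categories
    "temp_celsius": [31.5, 27.0, 29.2],             # interval: no true zero
    "income":       [42000, 58000, 36500],          # ratio: true zero, ratios meaningful
})

# Encode the ordinal column with its natural ordering preserved
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["low", "medium", "high"], ordered=True)
print(df.dtypes)
print(df["satisfaction"].cat.codes.tolist())        # e.g. [0, 2, 1]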
Matching:
Matching is a process that uses machine learning algorithms to compare data sets and
identify matches between records.
The goal of data matching is to identify and compare data to find the data points that refer
to the same entity.
Data matching can help identify duplicate records, detect patterns and irregularities, and
improve the accuracy of searches
Matching algorithms can be used to pair users with products, services, or information, such
as recommending products on e-commerce platforms or matching job seekers with
opportunities
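A minimal record-matching sketch using only the Python standard library (the names and the similarity threshold are made up) that flags likely duplicate records by string similarity:

# Data-matching sketch: compare records by string similarity and flag likely
# duplicates. Uses only the standard library; names and threshold are illustrative.
from difflib import SequenceMatcher

records_a = ["John Smith", "Priya Sharma", "A. Kumar"]
records_b = ["Jon Smith", "Priya Sharma", "Rahul Verma"]

def similarity(x, y):
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

for a in records_a:
    for b in records_b:
        score = similarity(a, b)
        if score > 0.8:                      # threshold chosen for illustration
            print(f"possible match: {a!r} ~ {b!r} (score={score:.2f})")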
Stages of Machine Learning (ML Lifecycle):
To train a machine with specific data, we have to follow predefined steps, and this whole process is known as the machine learning lifecycle.
The goal of the 7 Stages framework is to break down all necessary tasks in Machine Learning
and organize them in a logical way.
1.Problem Definition
2.Data Collection
3.Data Preparation
4.Data Visualization
5.ML Modeling
6.Feature Engineering
7.Model Deployment
These 7 stages are the key steps in our framework. We have categorized them additionally
into groups to get a better understanding of the larger picture.
i. Business Value
ii. Proof of Concept (POC)
iii. Production
Phase 1 — Business Value
It is absolutely crucial to adopt a business mindset when thinking about a problem that should be solved with Machine Learning — defining customer benefits and creating business impact is the top priority.
Domain expertise and knowledge are also essential, as the true power of data can only be harnessed if the domain is well known and understood.
Phase 2 — Proof of Concept (POC)
From Data Collection to Feature Engineering, five stages of our ML framework are included here.
This phase is the core of any POC: it tests an idea in terms of its feasibility and value to the business.
Questions around performance and evaluation metrics are also answered in this phase.
Phase 3 — Production
In the third phase, one is taking the ML model and scaling it.
The goal is to integrate Machine Learning into a business process solving a problem with a
superior solution compared to, for example, traditional programming.
The process of taking a trained ML model and making its predictions available to users or
other systems is known as model deployment.
1. Problem Definition
The first stage in the DDS Machine Learning Framework is to define and understand the
problem that someone is going to solve.
Start by analyzing the goals and the why behind a particular problem statement.
Understand the power of data and how one can use it to make a change and drive results.
A few possible questions arise here, such as: What is the business problem? Why does the problem need to be solved? Is a traditional solution available to solve the problem? If the problem is probabilistic in nature, does the available data allow us to model it? What is a measurable business goal?
2. Data Collection
Once the goal is clearly defined, one has to start getting the data that is needed from
various available data sources.
A few possible questions arise here, such as: What data do I need for my project? Where is that data available? How can I obtain it? What is the most efficient way to store and access all of it?
There are many different ways to collect data that is used for Machine Learning. For
example, focus groups, interviews, surveys, and internal usage & user data.
Public data can be another source and is usually free; examples include data published by research and trade associations, banks, publicly traded corporations, and others.
If data isn’t publicly available, one could also use web scraping to get it (however, there are
some legal restrictions).
3. Data Preparation
Data Preparation can take up to 70% and sometimes even 90% of the overall project time.
But what is the purpose of this stage?
Well, the type and quality of data that is used in a Machine Learning model affects the
output considerably.
In Data Preparation one explores, pre-processes, conditions, and transforms data prior to
modeling and analysis.
Data Filtering
Data Formatting
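A small pandas sketch of typical preparation steps (the "sales" table below is made up), covering cleaning, filtering, and formatting:

# Data-preparation sketch: cleaning, filtering and formatting with pandas.
# The small "sales" table below is made up for illustration.
import pandas as pd

raw = pd.DataFrame({
    "date":   ["2023-01-05", "2023-01-06", None, "2023-01-08"],
    "amount": ["120.5", "98.0", "45.2", "-10"],
    "region": ["north", "North", "south", "north"],
})

df = raw.dropna(subset=["date"]).copy()           # cleaning: drop records with a missing date
df["date"] = pd.to_datetime(df["date"])           # formatting: string -> datetime
df["amount"] = df["amount"].astype(float)         # formatting: string -> number
df["region"] = df["region"].str.lower()           # make category values consistent
df = df[df["amount"] > 0]                         # filtering: drop invalid rows

print(df)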
4. Data Visualization
Visualization is an incredibly helpful tool to identify patterns and trends in data, which leads
to clearer understanding and reveals important insights.
Data Visualization also enables faster decision making through graphical illustration. Commonly used chart types include:
Area Chart
Bar Chart
Box-and-whisker Plots
Bubble Cloud
Heat Map
Histogram
Network Diagram
Word Cloud
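As a minimal matplotlib sketch (the values are randomly generated or made up), two of the chart types listed above can be produced like this:

# Data-visualization sketch: a histogram and a bar chart with matplotlib.
# The numbers below are made up for illustration.
import random
import matplotlib.pyplot as plt

ages = [random.gauss(35, 10) for _ in range(500)]      # e.g. customer ages
sales = {"North": 120, "South": 95, "East": 140, "West": 80}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)                                # histogram: distribution of one variable
ax1.set_title("Histogram of ages")
ax2.bar(list(sales.keys()), list(sales.values()))      # bar chart: comparison across categories
ax2.set_title("Sales by region")
plt.tight_layout()
plt.show()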
5. ML Modeling
Finally, this is where ‘the magic happens’. Machine Learning is finding patterns in data, and
one can perform either supervised or unsupervised learning.
In this stage of the process one has to apply mathematical, computer science, and business
knowledge to train a Machine Learning algorithm that will make predictions based on the
provided data
6. Feature Engineering
Machine Learning algorithms learn recurring patterns from data. Carefully engineered
features are a robust representation of those patterns.
It is a collection of methods for identifying an optimal set of inputs to the Machine Learning
algorithm. Feature Engineering is extremely important because well-engineered features
make learning possible with simple models.
7. Model Deployment
The last stage is about putting a Machine Learning model into a production environment to
make data-driven decisions in a more automated way.
Robustness, compatibility, and scalability are important factors that should be tested and
evaluated before deploying a model.
There are various deployment options, such as Platform as a Service (PaaS) or Infrastructure as a Service (IaaS).
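One common building block of deployment is persisting the trained model so that a serving process can load it later; a minimal sketch with scikit-learn and joblib (both assumed to be installed, the file name is illustrative):

# Deployment sketch: save a trained model so a production service can load it
# and serve predictions. Assumes scikit-learn and joblib; file name is illustrative.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

joblib.dump(model, "model_v1.joblib")          # training side: export the artifact

loaded = joblib.load("model_v1.joblib")        # serving side: load and predict
print(loaded.predict(X[:3]))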
Data Acquisition:
In Machine learning Data acquisition is the process of gathering and preparing data from
various sources to train a machine learning model.
It's the first step in the machine learning process and is critical for the effectiveness of the
model.
In hardware data-acquisition systems, a key component is the analog-to-digital converter (ADC), which transforms an analog signal into data that the processor can understand.
Data acquisition in Machine Learning refers to collecting, gathering, and preparing data from various sources to build and train a model. Common data sources include:
Databases: Extracting data from structured databases such as SQL or NoSQL databases.
Files: Gathering data from CSV files, Excel spreadsheets, text files, and more.
APIs: Retrieving data from Application Programming Interfaces (APIs) provided by various
online platforms.
Web Scraping: Extracting data from websites by parsing their HTML content.
Sensors and IoT Devices: Collecting data from sensors and Internet of Things (IoT) devices.
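A small sketch of pulling data from two of these sources, a CSV file and a web API, using pandas and requests (the file name "sales.csv" and the URL are placeholders, not real endpoints):

# Data-acquisition sketch: pull training data from a file and from an API.
# "sales.csv" and the URL below are placeholders used only for illustration.
import pandas as pd
import requests

# 1) Files: load a local CSV into a DataFrame
df_file = pd.read_csv("sales.csv")

# 2) APIs: request JSON records over HTTP and turn them into a table
response = requests.get("https://example.com/api/records", timeout=10)
response.raise_for_status()
df_api = pd.DataFrame(response.json())

print(df_file.shape, df_api.shape)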
Feature Engineering:
Feature engineering is a machine learning technique that transforms raw data into features that machine learning models can use for training and prediction. It can also be described as leveraging data to create new variables that are not present in the original training set.
Importance: The accuracy of a machine learning model depends on the quality of the data used for training, so feature engineering is a crucial preprocessing technique.
It involves selecting relevant information from raw data and transforming it into a format
that can be easily understood by a model.
The goal is to improve model accuracy by providing more meaningful and relevant
information.
Feature engineering is the process of transforming raw data into features that are suitable
for machine learning models.
In other words, it is the process of selecting, extracting, and transforming the most relevant
features from the available data to build more accurate and efficient machine learning
models
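A small pandas sketch (the transaction table is made up) in which new, more informative features are derived from raw columns:

# Feature-engineering sketch: derive new features from raw columns.
# The tiny transaction table is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-01 09:15", "2023-03-01 23:40", "2023-03-02 12:05"]),
    "amount":    [250.0, 4999.0, 120.0],
    "customer_avg_amount": [300.0, 310.0, 115.0],
})

# New features that expose patterns the raw columns hide
df["hour"] = df["timestamp"].dt.hour                           # time-of-day feature
df["is_night"] = (df["hour"] >= 22) | (df["hour"] <= 5)        # boolean flag
df["amount_ratio"] = df["amount"] / df["customer_avg_amount"]  # relative spend

print(df[["hour", "is_night", "amount_ratio"]])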
Data Representation:
The word data refers to information about people, things, events, or ideas. It can be a title, an integer, or any other value. After collecting data, the investigator has to condense it in tabular form to study its salient features. Such an arrangement is known as the presentation of data.
The raw data can be placed in different orders: it can be presented in ascending order, descending order, or alphabetical order.
Example: Let the marks obtained by 10 students of class V in a class test, out of
50 according to their roll numbers, be:
The data in the given form is known as raw data. It can then be arranged in serial (ascending) order, as illustrated below.
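A tiny Python sketch of this arrangement (the marks below are hypothetical, since the original example values are not reproduced here):

# Data-representation sketch: arranging raw data in serial (ascending) order.
# The marks are hypothetical, standing in for the original example values.
raw_marks = [39, 25, 47, 12, 25, 41, 33, 19, 47, 30]    # raw data, listed by roll number

serial_order = sorted(raw_marks)                         # ascending order
print("raw data:    ", raw_marks)
print("serial order:", serial_order)
print("descending:  ", sorted(raw_marks, reverse=True))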
Model Selection:
In machine learning, the process of selecting the top model or algorithm from a list of
potential models to address a certain issue is referred to as model selection.
It entails assessing and contrasting various models according to how well they function and
choosing the one that reaches the highest level of accuracy or prediction power.
Because different models have varied levels of complexity, underlying assumptions, and
capabilities, model selection is a crucial stage in the machine-learning pipeline.
Finding a model that fits the training set of data well and generalizes well to new data is the
objective.
While a model that is too complex may overfit the data and be unable to generalize, a
model that is too simple could underfit the data and do poorly in terms of prediction.
Problem formulation: Clearly express the issue at hand, including the kind of predictions or
task that you'd like the model to carry out (for example, classification, regression, or
clustering).
Candidate model selection: Pick a group of models that are appropriate for the issue at
hand. These models can include straightforward methods like decision trees or linear
regression as well as more sophisticated ones like deep neural networks, random forests, or
support vector machines.
Performance evaluation: Establish metrics for measuring how well each model performs. Common metrics include accuracy, precision, recall, F1-score, mean squared error, and the area under the receiver operating characteristic curve (AUC-ROC).
The type of problem and the particular requirements will determine which metrics are used.
Training and evaluation: Each candidate model should be trained using a subset of the
available data (the training set), and its performance should be assessed using a different
subset (the validation set or via cross-validation). The established evaluation measures are
used to gauge the model's effectiveness.
Model comparison: Evaluate the performance of various models and determine which one
performs best on the validation set. Take into account elements like data handling
capabilities, interpretability, computational difficulty, and accuracy.
Final model selection: After the models have been analysed and fine-tuned, pick the model
that performs the best. Then, this model can be used to make predictions based on fresh,
unforeseen data.
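A minimal sketch of the candidate-comparison step (assuming scikit-learn and one of its built-in datasets; the candidate list is illustrative), using cross-validation scores to pick the best model:

# Model-selection sketch: compare candidate models with cross-validation and
# keep the one with the best validation score. Assumes scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree":       DecisionTreeClassifier(random_state=0),
    "random_forest":       RandomForestClassifier(random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)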
Model Evaluation:
The metrics selection for the analysis varies depending on the data, algorithm, and use case.
The importance of ML model evaluation is that it ensures that production models’ performance is:
Reliable: The productionized model(s) behaves as expected. The behavioural profile of the
model is an in-depth review of how the model maps inputs to outputs—overall and with
respect to specific data slices—as defined by feature contribution, counterfactual analysis,
and fairness tests.
So we must also use some techniques to determine the predictive power of the model.
Evaluation is always good in any field, and in the case of machine learning it is best practice.
To evaluate the performance of a classification model, there are metrics as mentioned below:
Classification Accuracy
Logarithmic loss
F1 score
Precision
Recall
Confusion Matrix
Classification Accuracy:
This is calculated as the ratio of correct predictions to the total number of input Samples.
Accuracy = No. of correct predictions / Total number of input samples
For example, suppose we have 90% samples of class A and 10% samples of class B in our training set. Then our model can reach 90% training accuracy simply by predicting every sample as class A. If we test the same model on a test set with 60% class A and 40% class B samples, the accuracy falls to 60%.
Log Loss
Logarithmic loss penalizes confident but wrong predictions. For multi-class classification it is defined as:
Log Loss = -(1/N) * Σ (i=1..N) Σ (j=1..M) y_ij * log(p_ij)
where,
N : no. of samples,
M : no. of classes,
y_ij : 1 if sample i belongs to class j, otherwise 0,
p_ij : the predicted probability that sample i belongs to class j.
The AUC-ROC curve, or Area Under the Receiver Operating Characteristic curve, is a
graphical representation of the performance of a binary classification model at various
classification thresholds. It is commonly used in machine learning to assess the ability of a
model to distinguish between two classes, typically the positive class (e.g., presence of a
disease) and the negative class (e.g., absence of a disease).
AUC stands for the Area Under the Curve, and the AUC curve represents the area under the
ROC curve.
It measures the overall performance of the binary classification model. It represents the
probability with which our model can distinguish between the two classes present in our
target.
The ROC curve plots the True Positive Rate, TPR = TP / (TP + FN), against the False Positive Rate, FPR = FP / (FP + TN), at various thresholds, where TP, FP, TN, and FN are as defined for the confusion matrix below.
F1 Score:
The F1-score is a measure of a model’s performance that combines precision and recall. It is defined as the harmonic mean of precision and recall, where the best value is 1 and the worst value is 0:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Precision:
Precision is calculated as the number of true positive predictions divided by the total number of true positive and false positive predictions.
Recall:
Recall is calculated as the number of true positive predictions divided by the total number of true positive and false negative predictions.
Confusion Matrix:
It is a means of displaying the number of accurate and inaccurate instances based on the
model’s predictions. It is often used to measure the performance of classification models,
which aim to predict a categorical label for each input instance.
The matrix displays the number of instances produced by the model on the test data.
True Positive (TP): The model correctly predicted a positive outcome (the actual outcome
was positive).
True Negative (TN): The model correctly predicted a negative outcome (the actual outcome
was negative).
False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome
was negative). Also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome (the actual
outcome was positive). Also known as a Type II error.
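A minimal sketch computing these metrics with scikit-learn (the true and predicted labels below are made up for illustration):

# Model-evaluation sketch: accuracy, precision, recall, F1 and the confusion
# matrix computed from illustrative true/predicted labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]    # actual outcomes (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]    # model predictions (made up)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))     # rows = actual, columns = predicted
# For binary labels 0/1 the layout is:
# [[TN FP]
#  [FN TP]]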
Model Prediction:
Predictive modelling is a process used in data science to create a mathematical model that
predicts an outcome based on input data.
It involves using statistical algorithms and machine learning techniques to analyze historical
data and make predictions about future or unknown events.
In predictive modelling, the goal is to build a model that can accurately predict the target
variable (the outcome we want to predict) based on one or more input variables (features).
The model is trained on a dataset that includes both the input variables and the known
outcome, allowing it to learn the relationships between the input variables and the target
variable.
Once the model is trained, it can be used to make predictions on new data where the target
variable is unknown.
The accuracy of the predictions can be evaluated using various metrics, such as accuracy,
precision, recall, and F1 score, depending on the nature of the problem.
Predictive modelling is used in a wide range of applications, including sales forecasting, risk
assessment, fraud detection, and healthcare.
It can help businesses make informed decisions, optimize processes, and improve outcomes
based on data-driven insights.
Decision Making:
It helps businesses and organizations make informed decisions by providing insights into
future trends and outcomes based on historical data.
Risk Management:
It helps in assessing and managing risks by predicting potential outcomes and allowing
organizations to take proactive measures.
Resource Optimization:
It helps in optimizing resources such as time, money, and manpower by providing forecasts
and insights that can be used to allocate resources more efficiently.
Customer Insights:
Competitive Advantage:
Cost Reduction:
By predicting future outcomes, organizations can reduce costs associated with errors,
inefficiencies, and unnecessary expenditures.
Finance
Risk Assessment:
Predictive modeling helps banks and financial institutions assess the creditworthiness of
individuals and businesses, making lending decisions more informed and reducing the risk of
defaults.
Fraud Detection:
By analysing patterns in transactions and account activity, predictive modeling can detect
fraudulent activities and prevent financial losses.
Healthcare
Disease Prediction:
Predictive modeling can help healthcare professionals predict the likelihood of diseases
such as diabetes, heart disease, and cancer in patients, allowing for early intervention and
personalized treatment plans.
Resource Allocation:
Hospitals and healthcare facilities can use predictive modeling to forecast patient
admissions, optimize staffing levels, and ensure the availability of resources such as beds
and medications.
Customer Segmentation:
Churn Prediction:
By analysing customer data, predictive modeling can predict which customers are likely to
churn (stop using a service or product), enabling companies to take proactive steps to retain
them.
Demand Forecasting:
Predictive modeling helps companies forecast demand for their products, ensuring that they
maintain optimal inventory levels and reduce stockouts or overstock situations.
Logistics Optimization:
By analysing historical data and external factors, predictive modeling can optimize logistics
operations, such as routing, transportation modes, and warehouse locations, to improve
efficiency and reduce costs.
Human Resources
Talent Acquisition:
Predictive modeling can help HR departments identify the best candidates for job openings
by analysing resumes, past performance, and other relevant data.
Employee Retention:
By analysing factors that contribute to employee turnover, predictive modeling can help
companies implement strategies to retain top talent and reduce turnover rates.
Search algorithms help in finding optimal solutions to specific tasks, while machine learning
algorithms enable systems to learn and adapt to new data and situations, making AI
applications more intelligent and effective.
In machine learning, search refers to the process of finding the best algorithm or model to
make the most accurate predictions or decisions based on input data.
The machine searches through many possible solutions, parameters, or models to find the
one that works best.
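The parameter search described above can be sketched with a grid search (assuming scikit-learn; the model choice and parameter grid are illustrative):

# Search-in-ML sketch: try many parameter combinations and keep the best model.
# Assumes scikit-learn; the parameter grid is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # searches all 6 combinations
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score  :", round(search.best_score_, 3))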
An effective machine learning search engine goes beyond simple keyword search: it automatically finds relevant content personalized for users.
Machine learning is a branch of artificial intelligence (AI) that allows computers to learn and
improve from data without being explicitly programmed.
The goal of machine learning is to create machines that can learn from data to improve the
accuracy of their output.
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which work
on our instructions.
But can a machine also learn from experiences or past data like a human does? So here
comes the role of Machine Learning.
Data Sets:
A dataset in machine learning is a collection of data that a computer treats as a single unit.
Datasets are used to train and test algorithms and models. They can be used to teach
machine learning algorithms to find patterns in the data.
Dataset is essentially the backbone for all operations, techniques or models used by
developers to interpret them.
Datasets involve a large amount of data points grouped into one table.
Datasets are used in almost all industries today for various reasons.
A Dataset is a set of data grouped into a collection with which developers can work to
meet their goals. In a dataset, the rows represent the number of data points and the
columns represent the features of the Dataset.
Fig. Dataset (rows are data points, columns are features)
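A tiny pandas sketch of such a dataset (the feature names and values are made up):

# Dataset sketch: rows are data points, columns are features (values made up).
import pandas as pd

dataset = pd.DataFrame({
    "height_cm": [160, 172, 181, 168],     # numerical feature
    "weight_kg": [55, 70, 82, 61],         # numerical feature
    "gender":    ["F", "M", "M", "F"],     # categorical feature
})

print(dataset.shape)    # (4, 3): 4 data points, 3 features
print(dataset.head())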
Types of Datasets
There are various types of datasets available out there. They are:
Numerical Dataset: These include numerical data points that can be analysed with mathematical equations.
Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.
Web Dataset: These include datasets created by calling APIs using HTTP requests and
populating them with values for data analysis. These are mostly stored in JSON (JavaScript
Object Notation) formats.
Time Series Dataset: These include datasets observed over a period of time, for example, changes in geographical terrain over time.
Image Dataset: It includes a dataset consisting of images. This is mostly used to differentiate
the types of diseases, heart conditions and so on.
Ordered Dataset: These datasets contain data that are ordered in ranks, for example,
customer reviews, movie ratings and so on.
Partitioned Dataset: These datasets have data points segregated into different members or
different partitions.
File-Based Datasets: These datasets are stored in files, such as .csv or .xlsx (Excel) files.
Bivariate Dataset: In this dataset, 2 classes or features are directly correlated to each other.
For example, height and weight in a dataset are directly related to each other.
Multivariate Dataset: In these types of datasets, as the name suggests, more than two classes or features are correlated with each other.
For example, attendance and assignment grades are together correlated with a student’s overall grade.