GNANAMANI COLLEGE OF TECHNOLOGY
(An Autonomous Institution)
|Affiliated to Anna University - Chennai, Approved by AICTE - New Delhi|
(Accredited by NBA & NAAC with "A" Grade)
|NH-7, A.K.SAMUTHIRAM, PACHAL (PO), NAMAKKAL – 637018|
(Regulation 2021)
Prepared By
Mr.K.VIJAYPRABAKARAN AP/CSE
OCS353 - DATA SCIENCE FUNDAMENTALS
UNIT I
INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research goals –
Retrieving data – Data preparation - Exploratory Data analysis – Build the model– presenting findings and
building applications - Data Mining - Data Warehousing – Basic Statistical descriptions of Data.
Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Commercial companies in almost every industry use data science and big data to gain insights into
their customers, processes, staff, competition, and products.
Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.
Governmental organizations are also aware of data’s value. Many governmental organizations not only
rely on internal data scientists to discover valuable information, but also share their data with the
public.
Nongovernmental organizations (NGOs) use it to raise money and defend their causes.
Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes.
2. FACETS OF DATA
In data science and big data you’ll come across many different types of data, and each of them tends to
require different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Let’s explore all these interesting data types.
2. 1 Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record. As such,
it’s often easy to store structured data in tables within databases or Excel files
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
2.2 Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email.
2.5 Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a graph.
Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
2.6 Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that they’ll increase video
capture to approximately 7 TB per game for the purpose of live, in-game analytics.
Recently a company called DeepMind succeeded at creating an algorithm that’s capable of learning
how to play video games.
This algorithm takes the video screen as input and learns to interpret everything via a complex
process of deep learning.
2.7 Streaming data
Streaming data flows into the system when an event happens, instead of being loaded into a data store in
a batch.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
3. THE DATA SCIENCE PROCESS
The data science process typically consists of six steps:
1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from
a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to data
visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It
is now that you attempt to gain the insights or make the predictions stated in your project
charter. Now is the time to bring out the heavy guns, but remember research has taught us that
often (but not always) a combination of simple models tends to outperform one complicated
model. If you’ve done this phase right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may
still need to convince the business that your findings will indeed change the business process
as expected. This is where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects require you to
perform the business process over and over again, so automating the project will save time.
5. RETRIEVING DATA
The next step in data science is to retrieve the required data. Sometimes you need to go into the field and
design a data collection process yourself, but most of the time you won’t be involved in this step.
Many companies will have already collected and stored the data for you, and what they don’t have can
often be bought from third parties.
More and more organizations are making even high-quality data freely available for public and
commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.
5.1 Start with data stored within the company (Internal data)
Most companies have a program for maintaining key data, so much of the cleaning work may
already be done. This data can be stored in official data repositories such as databases, data
marts, data warehouses, and data lakes maintained by a team of IT professionals.
Data warehouses and data marts are home to preprocessed data, data lakes contain data in its
natural or raw format.
Finding data even within your own company can sometimes be a challenge. As companies
grow, their data becomes scattered around many places. The data may be dispersed as people
change positions and leave the company.
Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need
and nothing more.
These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.
6. DATA PREPARATION
Your model needs the data in a specific format, so data transformation will always come into play. It’s a good
habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a
realistic setting, so you’ll need to take corrective actions in your program.
6.1 CLEANSING DATA
Data cleansing is a sub process of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes it originates from.
The first type is the interpretation error, such as when you take the value in your data for granted,
like saying that a person’s age is greater than 300 years.
The second type of error points to inconsistencies between data sources or against your
company’s standardized values. An example of this class of errors is putting “Female” in one table
and “F” in another when they represent the same thing: that the person is female.
(Table: overview of common errors, omitted here.)
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of
individual observations on the regression line.
6.2 Data Entry Errors
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
6.3 Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
Whitespace causes mismatches between strings such as “FR ” and “FR”, which can lead to
dropping the observations that couldn’t be matched.
If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove the leading and
trailing whitespaces. For instance, in Python you can use the strip() function to remove leading
and trailing spaces.
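A small illustration of this point (the variable name is only for illustration):
country = "FR "
print(country == "FR")          # False - the trailing space causes a mismatch
print(country.strip() == "FR")  # True - strip() removes leading and trailing whitespace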
6.4 Fixing Capital Letter Mismatches
Capital letter mismatches are common. Most programming languages make a distinction between
“Brazil” and “brazil”.
In this case you can solve the problem by applying a function that returns both strings in lowercase,
such as .lower() in Python.
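For example (a minimal sketch):
print("Brazil".lower() == "brazil".lower())   # True - both strings become "brazil"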
6.6 Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest
way to find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the
upper side when a normal distribution is expected.
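A minimal sketch of this minimum/maximum check, using hypothetical age values:
import numpy as np
ages = np.array([23, 35, 41, 29, 310, 38])   # 310 is an impossible age
print(ages.min(), ages.max())                # 23 310 - the maximum immediately looks suspicious
print(ages[ages > 120])                      # [310] - flag values above a plausible maximum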
6.7 Combining Data from Different Data Sources
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
Figure. Appending data from tables is a common operation but requires an equal
structure in the tables being appended.
6.8 Transforming Data
Certain models require their data to be in a certain shape, so you transform your data so that it takes a
suitable form for data modeling.
Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a
relationship of the form y = ae^(bx). Taking the logarithm of the data turns this into a linear relationship
(log(y) = log(a) + bx), which simplifies the estimation problem dramatically. Other times
you might want to combine two variables into a new variable.
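A minimal sketch of such a transformation with hypothetical data: after taking logarithms the exponential relationship becomes a straight line, so the parameters can be estimated with ordinary linear regression.
import numpy as np
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)            # y = a * e^(b*x) with a = 2 and b = 0.5
log_y = np.log(y)                    # log(y) = log(a) + b*x is linear in x
b, log_a = np.polyfit(x, log_y, 1)   # fit a straight line
print(b, np.exp(log_a))              # approximately 0.5 and 2.0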
Figure shows how reducing the number of variables makes it easier to understand the key values. It also
shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% +
component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of
the original variables. They’re the principal components of the underlying data structure
Figure. Turning variables into dummies is a data transformation that breaks a variable that has
multiple classes into multiple variables, each having only two possible values: 0 or 1
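A minimal sketch of this transformation using pandas (the column name and values are hypothetical):
import pandas as pd
df = pd.DataFrame({'weekday': ['Mon', 'Tue', 'Mon', 'Wed']})
dummies = pd.get_dummies(df['weekday'])   # one 0/1 (or True/False) column per distinct value
print(dummies)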
7. EXPLORATORY DATA ANALYSIS
During exploratory data analysis you take a deep dive into the data (see figure below). Information
becomes much easier to grasp when shown in a picture, therefore you mainly use graphical techniques
to gain an understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
The visualization techniques you use in this phase range from simple line graphs or histograms, as
shown in below figure , to more complex diagrams such as Sankey and network graphs.
Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight
into the data Other times the graphs can be animated or made interactive to make it easier and,
let’s admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not
limited to visualization techniques. Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis. Even building simple models can be a part of this step.
8. BUILD THE MODELS
This phase is much more focused than the exploratory analysis step, because you know what you’re looking
for and what you want the outcome to be.
Building a model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique you
want to use. Either way, most models consist of the following main steps:
Model execution
Once you’ve chosen a model you’ll need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or
Scikit-learn. These packages use several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these libraries available can speed up
the process. As you can see in the following code, it’s fairly easy to use linear regression with
StatsModels or Scikit-learn
Doing this yourself would require much more effort even for the simple techniques. The
following listing shows the execution of a linear prediction model.
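The original listing is not reproduced in the notes; a minimal sketch along the same lines, using Scikit-learn and randomly generated example data:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 1))                       # one predictor variable
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)      # roughly linear relationship plus noise

model = LinearRegression()
model.fit(X, y)                                # estimate the coefficients
print(model.coef_, model.intercept_)           # close to 3 and 0
print(model.predict(X[:5]))                    # predictions for the first five observations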
Mean square error is a simple measure: check for every prediction how far it was from the truth, square
this error, and average the squared errors over all predictions.
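A minimal sketch of this error measure with hypothetical predictions:
import numpy as np
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
mse = np.mean((y_true - y_pred) ** 2)   # average of the squared prediction errors
print(mse)                              # approximately 0.833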
Above figure compares the performance of two models to predict the order size from the price. The
first model is size = 3 * price and the second model is size = 10.
To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%),
without showing the other 20% of data to the model.
Once the model is trained, we predict the values for the other 20% of the observations based on
those for which we already know the true value, and calculate the model error with an error
measure.
Then we choose the model with the lowest error. In this example we chose model 1 because it
has the lowest total error.
Many models make strong assumptions, such as independence of the inputs, and you have to verify that
these assumptions are indeed met. This is called model diagnostics.
9. PRESENTING FINDINGS AND BUILDING APPLICATIONS
Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application
that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last
stage of the data science process is where your soft skills will be most useful, and yes, they’re
extremely important.
10. DATA MINING
Data mining is the process of discovering actionable information from large sets of data. Data mining
uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns
cannot be discovered by traditional data exploration because the relationships are too complex or
because there is too much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be
applied to specific scenarios such as forecasting, estimating risk and probability, generating
recommendations, finding sequences, and grouping.
Building a mining model is part of a larger process that includes everything from asking questions about
the data and creating a model to answer those questions, to deploying the model into a working
environment. This process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the
technologies in Microsoft SQL Server that you can use to complete each step.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and consider ways that data
can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the
metrics by which the model will be evaluated, and defining specific objectives for the data mining
project. These tasks translate into questions such as the following:
What are you looking for? What types of relationships are you trying to find?
Does the problem you are trying to solve reflect the policies or processes of the business?
Do you want to make predictions from the data mining model, or just look for interesting
patterns and associations?
Which outcome or attribute do you want to try to predict?
What kind of data do you have and what kind of information is in each column? If there are
multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or
processing to make the data usable?
How is the data distributed? Is the data seasonal? Does the data accurately represent the
processes of the business?
Preparing Data
The second step in the data mining process is to consolidate and clean the data that was
identified in the Defining the Problem step.
Data can be scattered across a company and stored in different formats, or may contain
inconsistencies such as incorrect or missing entries.
Data cleaning is not just about removing bad data or interpolating missing values, but about
finding hidden correlations in the data, identifying sources of data that are the most accurate, and
determining which columns are the most appropriate for use in analysis
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and
standard deviations, and looking at the distribution of the data. For example, you might determine by
reviewing the maximum, minimum, and mean values that the data is not representative of your
customers or business processes, and that you therefore must obtain more balanced data or review the
assumptions that are the basis for your expectations. Standard deviations and other distribution values
can provide useful information about the stability and accuracy of the results.
Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you
process it. When you process the mining structure, SQL Server Analysis Services generates aggregates
and other statistical information that can be used for analysis. This information can be used by any
mining model that is based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test how well the model
performs. Also, when you build a model, you typically create multiple models with different
configurations and test all models to see which yields the best results for your problem and your data.
Deploying and Updating Models
After the mining models exist in a production environment, you can perform many tasks, depending on
your needs. The following are some of the tasks you can perform:
Use the models to create predictions, which you can then use to make business decisions.
Create content queries to retrieve statistics, rules, or formulas from the model.
Embed data mining functionality directly into an application. You can include Analysis
Management Objects (AMO), which contains a set of objects that your application can use to
create, alter, process, and delete mining structures and mining models.
Use Integration Services to create a package in which a mining model is used to intelligently
separate incoming data into multiple tables.
Create a report that lets users directly query against an existing mining model
Update the models after review and analysis. Any update requires that you reprocess the models.
Update the models dynamically as more data comes into the organization; making constant
changes to improve the effectiveness of the solution should be part of the deployment strategy.
11. DATA WAREHOUSING
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that
supports decision making. Its main characteristics are:
Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the
overall processes of a business. Such subjects may be sales, promotion, inventory, etc
Integrated
A data warehouse is developed by integrating data from varied sources into a consistent
format. The data must be stored in the warehouse in a consistent and universally acceptable
manner in terms of naming, format, and coding. This facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only.
Previous data is not erased when current data is entered. This helps you to analyze what has
happened and when.
Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly
or implicitly. An example of time variance in Data Warehouse is exhibited in the Primary Key,
which must have an element of time like the day, week, or month.
Although a data warehouse and a traditional database share some similarities, they are not the same
thing. The main difference is that in a database, data is collected for multiple transactional purposes.
However, in a data warehouse, data is collected on an extensive scale to perform analytics. Databases
provide real-time data, while warehouses store data to be accessed for big analytical queries.
Data Warehouse Architecture
A data warehouse typically follows a three-tier architecture.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-end
tools are used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP or Relational OLAP model is an extended relational database management
system that maps multidimensional data process to standard relational process.
The MOLAP or multidimensional OLAP directly acts on multidimensional data and
operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various tools
like query tools, analysis tools, reporting tools, and data mining tools.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns
in vast volumes of data and devising innovative strategies for increased sales and profits.
3. Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or
business unit. Every department of a business has a central repository or data mart to store data. The
data from the data mart is stored in the operational data store (ODS) periodically. The ODS then sends the data to the enterprise data warehouse (EDW),
where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
Setting the research goal - Defining the what, the why, and the how of your project in a
project charter.
Retrieving data - Finding and getting access to data needed in your project. This data is either
found within the company or retrieved from a third party.
Data preparation - Checking and remediating data errors, enriching the data with data from
other data sources, and transforming it into a suitable format for your models.
Data exploration - Diving deeper into your data using descriptive statistics and visual techniques.
Data modeling - Using machine learning and statistical techniques to achieve your project goal.
Presentation and automation - Presenting your results to the stakeholders and industrializing
your analysis process for repetitive reuse and integration with other tools.
*************************************************************************************************************
OCS353 - DATA SCIENCE FUNDAMENTALS
UNIT II
DATA MANIPULATION
Python Shell - Jupyter Notebook - IPython Magic Commands - NumPy Arrays-Universal Functions – Aggregations
- Computation on Arrays - Fancy Indexing - Sorting arrays - Structured data - Data manipulation with
Pandas - Data Indexing and Selection - Handling missing data - Hierarchical indexing - Combining datasets
- Aggregation and Grouping - String operations - Working with time series - High performance.
1. PYTHON SHELL
A Python shell, also known as an interactive Python interpreter or REPL (Read-Eval-Print Loop), is a
command-line interface where you can interactively execute Python code, experiment with ideas, and
see immediate results.
In a Python shell:
1. You can enter Python statements or expressions.
2. The shell evaluates the code and displays the results.
3. You can access and manipulate variables, functions, and modules.
4. The shell provides features like auto-completion, syntax highlighting, and error handling.
Some popular Python shells include:
1. IDLE: A basic shell that comes bundled with Python.
2. IPython: An enhanced shell with features like syntax highlighting, auto-completion, and
visualization tools.
3. Jupyter Notebook: A web-based interactive environment that combines a shell with
notebook-style documentation and visualization.
4. PyCharm: An integrated development environment (IDE) that includes a Python shell.
5. Python Interpreter: The default shell that comes with Python, accessible from the command line
or terminal.
Python shells are useful for:
Quick experimentation and prototyping
Learning and exploring Python syntax and libraries
Debugging and testing code snippets
Interactive data analysis and visualization
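A short example of an interactive session in the default Python shell (the exact prompt and version banner depend on your installation):
$ python3
>>> 2 + 3
5
>>> import math
>>> math.sqrt(16)
4.0
>>> exit()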
2. JUPYTER NOTEBOOK
The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
Jupyter has support for over 40 different programming languages and Python is one of them. Python
is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.
2.1 Installation
Install Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter
Notebook, and other commonly used packages for scientific computing and data science. You can
download Anaconda’s latest Python3 version. Now, install the downloaded version of Anaconda.
Installing Jupyter Notebook using pip:
python3 -m pip install --upgrade pip
python3 -m pip install jupyter
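After installation, the notebook server is started from a terminal with:
jupyter notebook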
After the notebook is opened, you’ll see the Notebook Dashboard, which will show a list of the
notebooks, files, and subdirectories in the directory where the notebook server was started. Most of the
time, you will wish to start a notebook server in the highest level directory containing notebooks.
Often this will be your home directory.
IPython provides several "magic" commands that make it easier to perform common tasks in a
Python notebook. These commands, prefixed with % for line magics and %% for cell magics, offer a
variety of functionalities. Here are some key IPython magic commands:
IPython magic commands are special commands in IPython that start with the % symbol and provide
a wide range of functionality, including:
Here are some more IPython magic commands:
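The full tables of magic commands are not reproduced in the notes; a few commonly used ones are shown as examples (script.py is a placeholder file name):
%lsmagic                     # list all available magic commands
%timeit sum(range(1000))     # time a single statement
%run script.py               # run an external Python script
%who                         # list variables defined in the current namespace
%matplotlib inline           # show matplotlib plots inside the notebook

%%time
# cell magic: report the execution time of the whole cell
total = sum(range(1_000_000))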
4. NUMPY ARRAYS
NumPy is a powerful library for numerical computing in Python. It provides support for arrays,
which are more efficient than Python lists for numerical operations. Here are some basic and
advanced operations you can perform with NumPy arrays.
Creating NumPy Arrays
numpy.array(): Create an array from a Python list or tuple
numpy.zeros(): Create an array filled with zeros
numpy.ones(): Create an array filled with ones
numpy.random.rand(): Create an array with random values
NumPy Array Properties
shape: The number of dimensions and size of each dimension
dtype: The data type of the array elements
size: The total number of elements in the array
Indexing and Slicing
arr[index]: Access a single element
arr[start:stop:step]: Access a slice of elements
arr[start:stop]: Access a slice of elements with default step size 1
Basic Operations
arr + arr: Element-wise addition
arr - arr: Element-wise subtraction
arr * arr: Element-wise multiplication
arr / arr: Element-wise division
Advanced Operations
numpy.dot(): Compute the dot product of two arrays
numpy.cross(): Compute the cross product of two arrays
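The code for the indexing/slicing and aggregation examples whose outputs appear on this page is not shown in the notes; a minimal sketch, assuming arr = np.array([1, 2, 3, 4, 5]), that reproduces those outputs (4, [3 4 5], 15, and 3.0):
Indexing and aggregation
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr[3])        # Output: 4
print(arr[2:5])      # Output: [3 4 5]
print(np.sum(arr))   # Output: 15
print(np.mean(arr))  # Output: 3.0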
Reshaping an array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
arr = arr.reshape(2, 3)
print(arr)
Output:
[[1 2 3]
 [4 5 6]]
Array Comparison
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 4])
print(np.equal(arr1, arr2))
Output:
[ True True False]
Concatenating arrays
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output:
[1 2 3 4 5 6]
Splitting an array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
arr1, arr2 = np.split(arr, 2)
print(arr1)
print(arr2)
Output:
[1 2 3]
[4 5 6]
import numpy as np
# Creating arrays
a = np.array([1, 2, 3])
b = np.array([(1, 2, 3), (4, 5, 6)])
c = np.arange(0, 10, 2)
d = np.linspace(0, 1, 5)
e = np.zeros((2, 3))
f = np.ones((2, 3))
g = np.eye(3)
h = np.random.random((2, 3))
# Displaying arrays
print("Array a:\n", a)
print("Array b:\n", b)
print("Array c (arange):\n", c)
print("Array d (linspace):\n",
d) print("Array e (zeros):\n", e)
print("Array f (ones):\n", f)
print("Array g (identity matrix):\n", g)
print("Array h (random values):\n", h)
# Array properties
print("Shape of array b:", b.shape)
print("Size of array b:", b.size)
print("Data type of array a:", a.dtype)
# Array operations
i = np.array([1, 2, 3])
j = np.array([4, 5, 6])
print("i + j:\n", i + j)
print("i * j:\n", i * j)
# Matrix operations
k = np.array([[1, 2], [3, 4]])
l = np.array([[5, 6], [7, 8]])
print("Matrix product of k and l:\n", np.dot(k, l))
# Aggregate functions
m = np.array([1, 2, 3, 4, 5])
print("Sum of array m:", np.sum(m))
print("Mean of array m:", np.mean(m))
print("Standard deviation of array m:", np.std(m))
5. UNIVERSAL FUNCTIONS
Universal functions (ufuncs) in NumPy are functions that operate element-wise on arrays, supporting
broadcasting, type casting, and other standard features. They are essential for performing vectorized
operations, which are both more concise and more efficient than using Python loops.
Key Characteristics of Ufuncs
1. Element-wise Operations: Ufuncs apply operations element-wise, which means they operate
on each element of the input arrays independently.
2. Broadcasting: Ufuncs support broadcasting, which allows them to work with arrays of
different shapes in a flexible manner.
3. Performance: Ufuncs are implemented in C and are optimized for performance, making them
much faster than equivalent Python loops.
Common ufuncs include arithmetic, mathematical, trigonometric, statistical, and comparison functions, as illustrated in the example below.
Example Code:
import numpy as np
# Arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Arithmetic Operations
print("Addition:", np.add(a, b))
print("Subtraction:", np.subtract(a, b))
print("Multiplication:", np.multiply(a, b))
print("Division:", np.divide(a, b))
# Mathematical Functions
print("Square Root:", np.sqrt(a))
print("Exponential:", np.exp(a))
print("Logarithm:", np.log(a))
# Trigonometric Functions
angle = np.array([0, np.pi/2, np.pi])
print("Sine:", np.sin(angle))
print("Cosine:", np.cos(angle))
print("Tangent:", np.tan(angle))
# Statistical Functions
print("Mean:", np.mean(a))
print("Standard Deviation:", np.std(a))
print("Sum:", np.sum(a))
# Comparison Operators
print("Greater Than:", np.greater(a, b))
print("Less Than:", np.less(a, b))
print("Equal:", np.equal(a, b))
6. AGGREGATIONS
Aggregation in data science refers to the process of summarizing or combining multiple data points to produce
a single result or a smaller set of results. This is a fundamental concept used to simplify and analyze large
datasets, making it easier to draw insights and make decisions. Aggregation can be performed in various ways,
depending on the type of data and the analysis being conducted.
Definition: Aggregation is the process of combining multiple pieces of data to produce a summary
result.
Purpose: The primary purpose of aggregation is to simplify and summarize data, making it easier to
analyze and interpret. This helps in identifying trends, patterns, and anomalies.
6. 1 AGGREGATION TECHNIQUES
Group By: Grouping data based on one or more columns and then applying an aggregation function.
For example, grouping sales data by region and then calculating the total sales per region.
Pivot Tables: Reshaping data by turning unique values from one column into multiple columns,
providing a summarized dataset.
Rolling Aggregation: Calculating aggregates over a rolling window, such as a moving average (see the example below).
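A minimal sketch of a rolling aggregation with pandas (hypothetical daily sales figures):
import pandas as pd
sales = pd.Series([10, 12, 9, 15, 14, 18, 20])
rolling_avg = sales.rolling(window=3).mean()   # 3-period moving average
print(rolling_avg)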
Common aggregation functions:
1. Sum:
Calculates the total of the values in a dataset.
Function: SUM()
total_sales = df['sales'].sum()
2. Mean (Average):
Calculates the average value of a dataset.
Calculate the mean value of a column or group of data points.
Function: AVG()
average_age = df['age'].mean()
3. Median:
Finds the middle value in a dataset, which is less affected by outliers than the mean.
median_income = df['income'].median()
4. Mode:
Identifies the most frequently occurring value in a dataset.
most_common_category = df['category'].mode()
most_common_category = df['category'].mode()
5. Count:
Counts the number of entries in a dataset, often used to determine the number of occurrences of a
specific value.
count_of_sales = df['sales'].count()
6. Min and Max:
Finds the minimum and maximum values in a dataset.
Find the maximum or minimum value in a column or group of data points.
Functions: MAX(), MIN()
min_salary = df['salary'].min()
max_salary = df['salary'].max()
7. Standard Deviation and Variance:
Measures the spread or dispersion of the data around the mean.
Calculate the spread or dispersion of a column or group of data points.
Functions: STDDEV(), VAR()
std_dev = df['scores'].std()
variance = df['scores'].var()
8. Group By:
Aggregates data based on one or more categories. This is often used in conjunction with other
aggregation functions.
Group data by one or more columns and apply aggregations.
sales_by_region = df.groupby('region')['sales'].sum()
6.2 APPLICATIONS OF AGGREGATION
Descriptive Statistics:
Aggregation is used to describe the main features of a dataset quantitatively. For example,
summarizing the central tendency and dispersion of data.
Data Cleaning:
Aggregation can help in identifying and handling missing values, outliers, and inconsistencies in the
data.
Data Visualization:
Aggregated data is often used to create plots and charts, making it easier to visualize trends and
patterns.
Feature Engineering:
Aggregation can be used to create new features from existing data, improving the performance of
machine learning models.
Reporting:
Aggregated data is commonly used in business reports and dashboards to provide a high-level
overview of key metrics.
Example Code: Using Pandas
import pandas as pd
# Sample data
data = {
'region': ['North', 'South', 'East', 'West', 'North', 'South'],
'sales': [250, 150, 200, 300, 400, 100],
'expenses': [100, 50, 80, 120, 150, 60]
}
df = pd.DataFrame(data)
# Sum of sales
total_sales = df['sales'].sum()
# Average expenses
average_expenses = df['expenses'].mean()
# Sales by region
sales_by_region = df.groupby('region')['sales'].sum()
print(f"Total Sales: {total_sales}")
print(f"Average Expenses: {average_expenses}")
print("Sales by Region:")
print(sales_by_region)
7. COMPUTATION ON ARRAYS
Computation on arrays is a fundamental aspect of data science, enabling efficient data manipulation,
analysis, and machine learning. Arrays, especially as implemented in libraries like NumPy, provide a
powerful way to handle large datasets and perform a wide range of mathematical operations. Here,
we'll explore the essential aspects of array computations in data science.
Key Concepts
1. Array Creation and Initialization
Creating and initializing arrays is the first step in performing any computation. Arrays can be created
from lists, using functions like np.array, or from scratch using functions like np.zeros, np.ones, and
np.full.
import numpy as np
# From a list
arr = np.array([1, 2, 3, 4])
# From scratch
zeros = np.zeros((3, 3))
ones = np.ones((2, 2))
full = np.full((2, 3), 7)
2. Array Operations
NumPy supports a variety of element-wise operations, such as addition, subtraction, multiplication,
and division, as well as more complex mathematical functions like exponentiation, logarithms, and
trigonometric functions.
# Element-wise operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_arr = arr1 + arr2
diff_arr = arr1 - arr2
4. Aggregation
Aggregation functions like sum, mean, median, min, and max help summarize data.
arr = np.array([[1, 2, 3], [4, 5, 6]])
total_sum = np.sum(arr)
column_mean = np.mean(arr, axis=0)
row_max = np.max(arr, axis=1)
5. Linear Algebra
NumPy provides support for linear algebra operations, including dot products, matrix multiplication,
determinants, and inverses.
# Dot product
vec1 = np.array([1, 2])
vec2 = np.array([3, 4])
dot_product = np.dot(vec1, vec2)
# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
mat_mult = np.matmul(mat1, mat2)
7.1 BROADCASTING
Broadcasting is a powerful mechanism in NumPy (a popular library for numerical computations in
Python) that allows for element-wise operations on arrays of different shapes. When performing
arithmetic operations, NumPy automatically stretches the smaller array along the dimension with
size 1 to match the shape of the larger array. This allows for efficient computation without the need
for explicitly replicating the data.
Broadcasting Rules:
To understand how broadcasting works, it's important to know the rules that govern it:
If the arrays differ in their number of dimensions, the shape of the smaller array is padded
with ones on its left side.
If the shape of the two arrays does not match in any dimension, the array with shape equal to
1 in that dimension is stretched to match the other shape.
If in any dimension the sizes are different and neither is equal to 1, an error is raised.
Broadcasting follows a set of rules to make arrays compatible for element-wise operations:
Align Shapes: If the arrays have different numbers of dimensions, the shape of the smaller
array is padded with ones on its left side.
Shape Compatibility: Arrays are compatible for broadcasting if, in every dimension, the
dimension sizes are equal or one of the dimensions is 1.
Result Shape: The resulting shape is the maximum size along each dimension from the input
arrays.
Examples of Broadcasting
Example 1: Adding a Scalar to an Array
import numpy as np
arr = np.array([1, 2, 3])
scalar = 5
result = arr + scalar
print(result)
Output: [6 7 8]
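Example 2: Adding a One-Dimensional Array to a Two-Dimensional Array
The code for this example is not shown in the notes; a minimal sketch that reproduces the output below:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([1, 2, 3])
result = arr1 + arr2      # arr2 is broadcast across both rows of arr1
print(result)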
Output: [[2 4 6]
[5 7 9]]
Example 3: More Complex Broadcasting
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[1], [2]])
result = arr1 + arr2
print(result)
Output: [[2 3 4]
[6 7 8]]
Practical Applications
Normalizing Data
Broadcasting is useful for normalizing data, subtracting the mean, and dividing by the standard
deviation for each feature.
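The code for this normalization example is not shown in the notes; a minimal sketch that reproduces the output below:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
normalized = (data - data.mean(axis=0)) / data.std(axis=0)   # column-wise standardization
print(normalized)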
Output:
[[-1.22474487 -1.22474487 -1.22474487]
[0. 0. 0. ]
[ 1.22474487 1.22474487 1.22474487]]
Element-wise Operations
Broadcasting simplifies scaling each column of a matrix by a different factor.
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scaling_factors = np.array([0.1, 0.2, 0.3])
scaled_matrix = matrix * scaling_factors
print(scaled_matrix)
Output:
[[0.1 0.4 0.9]
[0.4 1. 1.8]
[0.7 1.6 2.7]]
8. FANCY INDEXING
Fancy indexing, also known as advanced indexing, is a technique in data science and programming
(particularly in Python with NumPy and pandas) that allows for more flexible and powerful ways to
access and manipulate data arrays or dataframes. It involves using arrays or sequences of indices to
select specific elements or slices from an array or dataframe.
Fancy indexing refers to using arrays of indices to access multiple elements of an array
simultaneously. Instead of accessing elements one by one, you can pass a list or array of indices to
obtain a subset of elements. This technique can be used for both reading from and writing to arrays.
NumPy Fancy Indexing
NumPy is a fundamental package for scientific computing with Python, providing support
for arrays and matrices.
In NumPy, fancy indexing is done by passing arrays of indices inside square brackets. Here’s an
example:
import numpy as np
# Create a NumPy array
arr = np.array([10, 20, 30, 40, 50])
# Fancy indexing with a list of indices
indices = [0, 2, 4]
subset = arr[indices]
print(subset)
Boolean Indexing
Another form of fancy indexing is boolean indexing, where you use boolean arrays to select elements:
mask = arr > 30
subset = arr[mask]
print(subset)
9. SORTING ARRAYS
Sorting means putting elements in an ordered sequence.
An ordered sequence is any sequence that has an order corresponding to its elements, such as numeric or
alphabetical, ascending or descending.
The NumPy ndarray object has a function called sort() that will sort a specified array.
Sorting in NumPy
1. Simple Sorting
numpy.sort() returns a sorted copy of the array.
numpy.ndarray.sort() sorts the array in-place.
import numpy as np
arr = np.array([3, 1, 2, 5, 4])
sorted_arr = np.sort(arr)
print(sorted_arr)
Output: [1 2 3 4 5]
arr.sort()
print(arr)
Output: [1 2 3 4 5]
2. Sorting Multi-dimensional Arrays
You can sort along a specified axis using the axis parameter.
arr_2d = np.array([[3, 1, 2], [5, 4, 6]])
sorted_arr_2d = np.sort(arr_2d, axis=0) # Sort down each column (along axis 0)
print(sorted_arr_2d)
Output: [[3 1 2]
[5 4 6]]
sorted_arr_2d = np.sort(arr_2d, axis=1) # Sort within each row (along axis 1)
print(sorted_arr_2d)
Output: [[1 2 3]
[4 5 6]]
3. Argsort for Indirect Sorting
numpy.argsort() returns the indices that would sort an array.
arr = np.array([3, 1, 2, 5, 4])
indices = np.argsort(arr)
print(indices)
Output: [1 2 0 4 3]
sorted_arr = arr[indices]
print(sorted_arr)
Output: [1 2 3 4 5]
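Sorting structured arrays by a field
The code for the output below is not shown in the notes; a minimal sketch (the field names id, name, and value are assumptions) that sorts the records by the numeric field:
import numpy as np
dtype = [('id', 'i4'), ('name', 'U10'), ('value', 'i4')]
data = np.array([(1, 'first', 200), (2, 'second', 100), (3, 'third', 150)], dtype=dtype)
print(np.sort(data, order='value'))   # sort the records by the 'value' field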
Output: [(2, 'second', 100) (3, 'third', 150) (1, 'first', 200)]
Custom Sorting
You can use numpy.lexsort() for custom sorting.
names = np.array(['Betty', 'John', 'Alice', 'Alice'])
ages = np.array([25, 34, 30, 22])
indices = np.lexsort((ages, names))
sorted_data = list(zip(names[indices], ages[indices]))
print(sorted_data)
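With the data above, np.lexsort uses the last key (names) as the primary sort key and ages to break ties, so the indices come out as [3 2 0 1]; with older NumPy versions the printed list is approximately [('Alice', 22), ('Alice', 30), ('Betty', 25), ('John', 34)] (newer versions print the same values with explicit NumPy scalar types).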
Advanced Usage
1. Nested Structures
address_dtype = [('street', 'U20'), ('city', 'U20')]
person_dtype = [('name', 'U10'), ('age', 'i4'), ('address', address_dtype)]
data = np.array([('Alice', 25, ('123 Main St', 'Springfield')), ('Bob', 30, ('456 Elm St', 'Shelbyville'))],
dtype=person_dtype)
print(data)
Output: [('Alice', 25, ('123 Main St', 'Springfield')) ('Bob', 30, ('456 Elm St', 'Shelbyville'))]
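The code that creates and updates the structured array shown below is not shown in the notes; a minimal sketch (the field names and dtypes are assumptions) that reproduces the first two outputs. The final, sorted output appears to come from np.sort(data, order='age'):
import numpy as np
dtype = [('name', 'S10'), ('age', 'i4'), ('weight', 'f4')]
data = np.array([('Alice', 25, 55.5), ('Bob', 30, 75.0), ('Cathy', 22, 60.0)], dtype=dtype)
print(data)        # the original records
data['age'] += 1   # increment a single field for every record
print(data)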
Output: [(b'Alice', 25, 55.5) (b'Bob', 30, 75. ) (b'Cathy', 22, 60. )]
Output: [(b'Alice', 26, 55.5) (b'Bob', 31, 75. ) (b'Cathy', 23, 60. )]
# Modifying a record
data[1] = ('Bob', 32, 77.5)
print(data)
Output: [(b'Alice', 26, 55.5) (b'Bob', 32, 77.5) (b'Cathy', 23, 60. )]
Output: [(b'Cathy', 23, 60. ) (b'Alice', 26, 55.5) (b'Bob', 32, 77.5)]
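The code that produces the DataFrame output below is not shown in the notes; a minimal sketch:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Cathy'],
        'Age': [25, 30, 22],
        'Score': [85.5, 92.3, 78.9]}
df = pd.DataFrame(data)
print(df)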
Output:
Name Age Score
0 Alice 25 85.5
1 Bob 30 92.3
2 Cathy 22 78.9
df = pd.DataFrame(data)
print(df)
Handling Missing Data
Depending on the context and the extent of the missing data, you can handle missing values using
various strategies:
Dropping Missing Values
Filling Missing Values
Interpolate Missing Values
Replace Missing Values Using Custom Functions
Example
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Detect missing values
print("Missing
values:")
print(df.isnull())
# Drop rows with any missing values
print("\nDataFrame after dropping rows with missing values:")
print(df.dropna())
# Fill missing values with the mean of each column print("\
nDataFrame after filling missing values with column means:")
print(df.fillna(df.mean()))
# Forward fill missing values
print("\nDataFrame after forward filling missing values:")
print(df.fillna(method='ffill'))
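The aggregation snippet below refers to a grouped object whose definition is not shown in the notes; a minimal sketch, assuming a DataFrame with Age and Score columns grouped by a hypothetical Team column:
import pandas as pd
df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Age': [25, 30, 22, 28],
    'Score': [85.5, 92.3, 78.9, 88.0]
})
grouped = df.groupby('Team')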
# Aggregating data
mean_scores = grouped['Score'].mean()
print(mean_scores)
# Multiple aggregations
aggregated = grouped.agg({
    'Age': 'mean',
    'Score': ['mean', 'max']
})
print(aggregated)
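Data Indexing and Selection
1. Selecting Data in Series
The Series used in the following examples is not defined in the notes; a minimal sketch that matches the outputs shown:
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
By Label: Use index labels to access elements.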
print(s['a'])
Output: 10
By Position: Use integer positions to access elements.
print(s[0])
Output: 10
Slicing: Use slicing to get subsets of the Series.
print(s['a':'b'])
Output: a    10
b    20
dtype: int64
2. Selecting Data in DataFrames
By Column Label: Access columns by their labels.
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
print(df['A'])
Output: 0 1
1 2
2 3
Name: A, dtype: int64
By Row Label: Use .loc to access rows by index labels.
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['x', 'y', 'z'])
print(df.loc['x'])
Output: A 1
B 4
Name: x, dtype: int64
By Position: Use .iloc for integer-based indexing.
print(df.iloc[0])
Output: A 1
B 4
Name: x, dtype: int64
Selecting Specific Rows and Columns
# Select specific rows and columns
print(df.loc[['x', 'y'], ['A']])
Output:
   A
x  1
y  2
Boolean Indexing
Filtering with Conditions: Use boolean conditions to filter rows.
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
# Filter rows where column 'A' is greater than 15
filtered_df = df[df['A'] > 15]
print(filtered_df)
Output: A B
1 20 50
2 30 60
MultiIndex DataFrames
Creating and Selecting with MultiIndex: Use MultiIndex for hierarchical indexing.
arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('letter', 'number'))
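The code that builds a DataFrame on this MultiIndex and selects one outer level is not shown in the notes; a minimal sketch (the column name value is an assumption) that reproduces the output below:
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df.loc['A', 'value'])   # select all rows whose 'letter' level is 'A'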
Output:
number
1    10
2    20
Name: value, dtype: int64
Advanced Indexing Techniques
Using .query(): Filter rows with a query string.
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
result = df.query('A > 15')
print(result)
Output:
A B
1 20 50
2 30 60
Using .apply() with Indexing: Apply functions to DataFrame columns
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
def add_ten(x):
    return x + 10
df['A_plus_10'] = df['A'].apply(add_ten)
print(df)
Output:
A B A_plus_10
0 10 40 20
1 20 50 30
2 30 60 40
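String Operations
1. Case Conversion
The DataFrame used in the string-operation examples and the uppercase conversion that produces the output below are not shown in the notes; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'text': ['apple', 'BANANA', 'Cherry']})
# Convert to uppercase
df['text_upper'] = df['text'].str.upper()
print(df)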
Output:
text text_upper
0 apple APPLE
1 BANANA BANANA
2 Cherry CHERRY
# Convert to lowercase
df['text_lower'] = df['text'].str.lower()
print(df)
Output:
text text_upper text_lower
0 apple APPLE apple
1 BANANA BANANA banana
2 Cherry CHERRY cherry
Title Case
df['text_title'] = df['text'].str.title()
print(df)
Output:
text text_upper text_lower text_title
0 apple APPLE apple Apple
1 BANANA BANANA banana Banana
2 Cherry CHERRY cherry Cherry
2. String Methods
Length of Strings
df['text_length'] = df['text'].str.len()
print(df)
Output:
text text_upper text_lower text_title text_length
0 apple APPLE apple Apple 5
1 BANANA BANANA banana Banana 6
2 Cherry CHERRY cherry Cherry 6
String Containment
df['contains_e'] = df['text'].str.contains('e')
print(df)
Output:
text text_upper text_lower text_title text_length contains_e
0 apple APPLE apple Apple 5 True
1 BANANA BANANA banana Banana 6 False
2 Cherry CHERRY cherry Cherry 6 True
3. String Replacement
Replace Substrings
df['text_replaced'] = df['text'].str.replace('e', 'x', regex=False)
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_replaced
0 apple APPLE apple Apple 5 True applx
1 BANANA BANANA banana Banana 6 False BANANA
2 Cherry CHERRY cherry Cherry 6 True Chxrry
Join Strings
df['text_joined'] = df['text'].str.cat(sep='-')
print(df)
Output:
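The steps that create the text_split and extracted columns seen in the final output are not shown in the notes; a minimal sketch (the split character and extraction pattern are assumptions chosen to match that output):
# Split each string on the letter 'e'
df['text_split'] = df['text'].str.split('e')
# Extract the first run of letters from each string
df['extracted'] = df['text'].str.extract(r'([A-Za-z]+)', expand=False)
print(df)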
String Matching
df['match'] = df['text'].str.match(r'[A-Za-z]+')
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_split text_joined extracted match
0 apple APPLE apple Apple 5 True [appl, ] apple-BANANA-Cherry apple True
1 BANANA BANANA banana Banana 6 False [BANANA] apple-BANANA-Cherry BANANA True
2 Cherry CHERRY cherry Cherry 6 True [Ch, rry] apple-BANANA-Cherry Cherry True
******************************************************************************************************
OCS353 - DATA SCIENCE FUNDAMENTALS
UNIT III
MACHINE LEARNING
The modeling process - Types of machine learning - Supervised learning - Unsupervised learning –Semi
supervised learning- Classification, regression - Clustering – Outliers and Outlier Analysis.
1. MODELING PROCESS
Each step in the modeling process is crucial for building an effective and reliable machine learning
model. Ensuring attention to detail at each stage can lead to better performance and more accurate
predictions.
There are 10 steps involved in building a better machine learning model.
1. Problem Definition
2. Data Collection
3. Data Exploration and Preprocessing
4. Feature Selection
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Tuning
9. Model Deployment
10. Model Maintenance
1. Problem Definition
Objective: Clearly define the problem you are trying to solve. This includes understanding the
business or scientific objectives.
Output : Decide whether it is a classification, regression, clustering, or another type of problem.
2. Data Collection
Data collection is a crucial step in the creation of a machine learning model, as it lays the foundation
for building accurate models. In this phase of machine learning model development, relevant data is
gathered from various sources to train the machine learning model and enable it to make accurate
predictions.
Sources: Gather data from various sources such as databases, online repositories, sensors, etc.
Quality : Ensure data quality by addressing issues like missing values, inconsistencies, and errors.
3. Data Exploration and Preprocessing
Exploration: Analyze the data to understand its structure, patterns, and anomalies.
Visualization: Use plots and graphs to visualize data distributions and relationships.
Statistics: Calculate summary statistics to get a sense of the data.
Preprocessing: Prepare the data for modeling.
Cleaning: Handle missing values, outliers, and duplicates.
Transformation: Normalize or standardize data, handle categorical variables, and create
new features if necessary.
Feature Engineering: Create new features from existing ones to improve model performance.
4. Feature Selection
Relevance: Identify and select features that are most relevant to the problem.
Techniques: Use methods like correlation analysis, mutual information, and feature importance scores.
5. Model Selection
Choose Algorithms: Choose appropriate machine learning algorithms based on the problem type
(classification, regression, clustering, etc.).
Baseline Model: Develop a simple model to establish a baseline performance.
Comparison: Compare multiple algorithms using cross-validation to find the best performing one.
6. Model Training
In this phase of building a machine learning model, we have all the necessary ingredients to train
our model effectively. This involves utilizing our prepared data to teach the model to recognize
patterns and make predictions based on the input features. During the training process, we begin
by feeding the preprocessed data into the selected machine-learning algorithm.
Training Data: Split the data into training and testing sets (and sometimes validation sets).
Training Process: Fit the chosen model to the training data, optimizing its parameters.
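A minimal sketch of the split-and-train step, using a synthetic dataset and logistic regression as placeholder choices:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)          # fit on the training data only
print(model.score(X_test, y_test))   # accuracy on the held-out test set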
7. Model Evaluation
Once you have trained your model, it’s time to assess its performance. There are various metrics
used to evaluate model performance, categorized based on the type of task: regression/numerical
or classification.
1. For regression tasks, common evaluation metrics are:
Mean Absolute Error (MAE): MAE is the average of the absolute differences between predicted
and actual values.
Mean Squared Error (MSE): MSE is the average of the squared differences between predicted
and actual values.
Root Mean Squared Error (RMSE): It is a square root of the MSE, providing a measure of the
average magnitude of error.
R-squared (R2): It is the proportion of the variance in the dependent variable that is
predictable from the independent variables.
2. For classification tasks, common evaluation metrics are:
Accuracy: Proportion of correctly classified instances out of the total instances.
Precision: Proportion of true positive predictions among all positive predictions.
Recall: Proportion of true positive predictions among all actual positive instances.
F1-score: Harmonic mean of precision and recall, providing a balanced measure of model
performance.
Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure of the model’s
ability to distinguish between classes.
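A short sketch showing how these classification metrics can be computed with scikit-learn, using small hypothetical label arrays:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# hypothetical true labels, predicted labels, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))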
7. Model Tuning
Tuning and optimizing helps our model to maximize its performance and generalization ability.
This process involves fine-tuning hyperparameters, selecting the best algorithm, and improving
features through feature engineering techniques.
Hyperparameters are parameters that are set before the training process begins and control the
behavior of the machine learning model, such as the learning rate and regularization strength; they
should be carefully adjusted.
Techniques: Use grid search, random search, or Bayesian optimization for hyperparameter tuning.
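A minimal grid-search sketch; the dataset and the parameter-grid values are assumptions chosen only for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset

# illustrative hyperparameter grid for a random forest
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)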
8. Model Deployment
Deploying the model and making predictions is the final stage in the journey of creating an ML
model. Once a model has been trained and optimized, it's time to integrate it into a production
environment where it can provide real-time predictions on new data.
During model deployment, it’s essential to ensure that the system can handle high user loads, operate
smoothly without crashes, and be easily updated.
Integration: Deploy the model into a production environment where it can make real-time
predictions.
Monitoring: Continuously monitor the model's performance to ensure it remains accurate and
reliable.
9. Model Maintenance
Updates: Periodically update the model with new data to maintain its performance.
Retraining: Retrain the model if there are significant changes in the data patterns or if the
model's performance degrades.
2. TYPES OF MACHINE LEARNING
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set
of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and
on the basis of training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into mainly four types, which
are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
2.1 SUPERVISED LEARNING
Supervised machine learning is based on supervision. It means in the supervised learning technique,
we train the machines using the "labelled" dataset, and based on the training, the machine predicts
the output. Here, the labelled data specifies that some of the inputs are already mapped to the output.
More precisely, we can say that first we train the machine with the input and corresponding output,
and then we ask the machine to predict the output using the test dataset.
The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
Example:
Let's understand supervised learning with an example. Suppose we have an input dataset of cats and
dog images. So, first, we will provide the training to the machine to understand the images, such as
the shape & size of the tail of cat and dog, Shape of eyes, colour, height (dogs are taller, cats are
smaller), etc. After completion of training, we input the picture of a cat and ask the machine to
identify the object and predict the output. Now, the machine is well trained, so it will check all the
features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it
will put it in the Cat category. This is the process of how the machine identifies the objects in
Supervised Learning.
Steps involved in Supervised Learning:
1. First, determine the type of training dataset.
2. Collect or gather the labelled training data.
3. Split the training dataset into training dataset, test dataset, and validation dataset.
4. Determine the input features of the training dataset, which should have enough knowledge so
that the model can accurately predict the output.
5. Determine the suitable algorithm for the model, such as support vector machine, decision
tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need validation sets as the
control parameters, which are the subset of training datasets.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct
output, our model is accurate.
2.1.1 TYPES OF SUPERVISED MACHINE LEARNING
Supervised machine learning can be classified into two types of problems, which are given below:
1. Classification
2. Regression
2.1.1.1 REGRESSION:
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
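A small illustrative sketch of linear regression on synthetic data (the data here is invented purely for demonstration of the idea above):
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y is roughly 3*x + 2 plus some noise
rng = np.random.RandomState(0)
x = rng.rand(50, 1) * 10
y = 3 * x.ravel() + 2 + rng.randn(50)

model = LinearRegression()
model.fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=4:", model.predict([[4]])[0])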
2.1.1.2 CLASSIFICATION:
Classification algorithms are used to solve the classification problems in which the output variable is
categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:
1. Random Forest Algorithm
2. Decision Tree Algorithm
3. Logistic Regression Algorithm
4. Support Vector Machine Algorithm
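A brief classification sketch using a support vector machine from the list above; the iris dataset is used only as an example of labelled data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)        # learn the categories from labelled examples
print(clf.predict(X_test[:5]))   # predicted class labels for new samples
print("accuracy:", clf.score(X_test, y_test))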
Advantages of Supervised learning Algorithm:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages of Supervised learning Algorithm:
1. These algorithms are not able to solve complex tasks.
2. It may predict the wrong output if the test data is different from the training data.
3. It requires lots of computational time to train the algorithm.
2.2 UNSUPERVISED LEARNING
Unsupervised learning is different from the Supervised learning technique; as its name suggests,
there is no need for supervision. It means, in unsupervised machine learning, the machine is trained
using the unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor labelled,
and the model acts on that data without any supervision.
Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning
algorithms. These algorithms find hidden patterns and data without any human intervention, i.e., we
don’t give output to our model. The training model has only input parameter values and discovers the
groups or patterns on its own.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to find the
hidden patterns in the input dataset.
Example:
Working of Unsupervised Learning:
Here, we have taken unlabeled input data, which means it is not categorized and corresponding
outputs are also not given. Now, this unlabeled input data is fed to the machine learning model in
order to train it. First, it will interpret the raw data to find the hidden patterns in the data and then
will apply suitable algorithms such as k-means clustering, hierarchical clustering, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to
the similarities and difference between the objects.
2.2.1 TYPES OF UNSUPERVISED MACHINE LEARNING
Unsupervised Learning can be further classified into two types, which are given below:
1. Clustering
2. Association
2.2.1.1 CLUSTERING
Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the data
without any prior knowledge of the data’s meaning.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
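A minimal clustering sketch with k-means on synthetic 2-D points (the data is invented for illustration only):
import numpy as np
from sklearn.cluster import KMeans

# two artificial groups of points
rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)       # cluster label for every point
print(labels[:10])
print(kmeans.cluster_centers_)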
2.2.1.2 ASSOCIATION
Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it can
generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage
mining, continuous production, etc.
For example, shopping stores use algorithms based on this technique to find the relationship between
the sale of one product with respect to another product's sales, based on customer behaviour. If a
customer buys milk, then he may also buy bread, eggs, or butter. Once trained well, such models can
be used to increase sales by planning different offers.
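A tiny sketch of the underlying idea, computing support and confidence for one rule (milk -> bread) over an invented list of transactions:
# hypothetical market-basket transactions
transactions = [
    {'milk', 'bread', 'eggs'},
    {'milk', 'bread'},
    {'milk', 'butter'},
    {'bread', 'eggs'},
    {'milk', 'bread', 'butter'},
]

n = len(transactions)
milk = sum('milk' in t for t in transactions)
milk_and_bread = sum({'milk', 'bread'} <= t for t in transactions)

support = milk_and_bread / n            # how often milk and bread appear together
confidence = milk_and_bread / milk      # how often bread is bought when milk is bought
print("support(milk, bread) =", support)
print("confidence(milk -> bread) =", confidence)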
Some common clustering algorithms include k-means clustering, hierarchical clustering, and DBSCAN.
Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.
Manifold assumption
The high-dimensional data are assumed to lie on a manifold of fewer dimensions than the input space,
created by a process that has fewer degrees of freedom and may be hard to model directly. This
assumption makes it possible to use distances and densities defined on that manifold, and it becomes
especially practical when the data are high-dimensional.
2.3.2 Applications of Semi-supervised Learning:
1. Speech Analysis
It is the most classic example of semi-supervised learning applications. Since labeling audio data is a
laborious task that requires many human resources, this problem can naturally be overcome by
applying semi-supervised learning (SSL).
2. Web content classification
It is practically impossible to label each page on the internet because doing so needs too much
human intervention. Still, this problem can be reduced through semi-supervised learning algorithms.
Further, Google also uses semi-supervised learning algorithms to rank a webpage for a given query.
3. Protein sequence classification
Since DNA strands are large, labeling them requires active human intervention, so semi-supervised
models are well suited to this field.
4. Text document classifier
As we know, it would be very infeasible to find a large amount of labeled text data, so semi-
supervised learning is an ideal model to overcome this.
3. OUTLIER
Outliers in machine learning refer to data points that are significantly different from the majority of
the data. These data points can be anomalous, noisy, or errors in measurement.
An outlier is a data point that significantly deviates from the rest of the data. It can be either much
higher or much lower than the other data points, and its presence can have a significant impact on
the results of machine learning algorithms. They can be caused by measurement or execution errors.
The analysis of outlier data is referred to as outlier analysis or outlier mining.
3.1 TYPES OF OUTLIERS
There are two main types of outliers:
1. Global outliers:
Global outliers are isolated data points that are far away from the main body of the data. They are
often easy to identify and remove.
2. Contextual outliers:
Contextual outliers are data points that are unusual in a specific context but may not be outliers in a
different context. They are often more difficult to identify and may require additional information or
domain knowledge to determine their significance.
2. Transformation:
This involves transforming the data to reduce the influence of outliers. Common methods include:
Scaling: Standardizing or normalizing the data to have a mean of zero and a standard
deviation of one.
Winsorization: Replacing outlier values with the nearest non-outlier value.
Log transformation: Applying a logarithmic transformation to compress the data and reduce
the impact of extreme values.
3. Robust Estimation:
This involves using algorithms that are less sensitive to outliers. Some examples include:
Robust regression: Algorithms like L1-regularized regression or Huber regression are less
influenced by outliers than least squares regression.
M-estimators: These algorithms estimate the model parameters based on a robust objective
function that down weights the influence of outliers.
Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are less susceptible to
the presence of outliers than K-means clustering.
4. Modeling Outliers:
This involves explicitly modeling the outliers as a separate group. This can be done by:
Adding a separate feature: Create a new feature indicating whether a data point is an outlier
or not.
Using a mixture model: Train a model that assumes the data comes from a mixture of
multiple distributions, where one distribution represents the outliers.
3.4 IMPORTANCE OF OUTLIER DETECTION IN MACHINE LEARNING
Outlier detection is important in machine learning for several reasons:
Biased models: Outliers can bias a machine learning model towards the outlier values, leading to poor
performance on the rest of the data. This can be particularly problematic for algorithms that are
sensitive to outliers, such as linear regression.
Reduced accuracy: Outliers can introduce noise into the data, making it difficult for a machine learning
model to learn the true underlying patterns. This can lead to reduced accuracy and performance.
Increased variance: Outliers can increase the variance of a machine learning model, making it more
sensitive to small changes in the data. This can make it difficult to train a stable and reliable model.
Reduced interpretability: Outliers can make it difficult to understand what a machine learning model
has learned from the data. This can make it difficult to trust the model’s predictions and can hamper
efforts to improve its performance.
3.5 TECHNIQUES FOR OUTLIER ANALYSIS:
1. Visual inspection: using plots to identify outliers
2. Statistical methods: using metrics like mean, median, and standard deviation to detect outliers
3. Machine learning algorithms: using algorithms like One-Class SVM, Local Outlier Factor (LOF),
and Isolation Forest to detect outliers
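A small statistical sketch of outlier detection using the interquartile range (IQR) rule on an invented sample:
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 95])   # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # classic 1.5*IQR fences

outliers = data[(data < lower) | (data > upper)]
print("bounds:", lower, upper)
print("outliers:", outliers)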
UNIT IV
DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn.
Axis limits can also be given in reverse order to flip the direction of an axis:
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
The plt.axis() method allows you to set the x and y limits with a single call, by passing a list
that specifies [xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
An equal aspect ratio makes one unit in x equal to one unit in y:
plt.axis('equal')
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title - plt.title()
Labels - plt.xlabel(), plt.ylabel()
Legend - plt.legend()
OUTPUT:
Line style:
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')   # solid
plt.plot(x, x + 5, linestyle='--')  # dashed
plt.plot(x, x + 6, linestyle='-.')  # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
OUTPUT:
Example
Various symbols can be used to specify the marker: ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
Shorthand specification of line style, marker symbol, and color is also allowed:
plt.plot(x, y, '-ok');
Additional arguments in plt.plot()
We can specify some other parameters related with scatter plot which makes it more attractive. They are
color, marker size, linewidth, marker face color, marker edge color, marker edge width, etc
Example
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);
3. VISUALIZING ERRORS
For any scientific measurement, accurate accounting for errors is nearly as important as, if not more
important than, accurate reporting of the number itself. For example, imagine that I am using some
astrophysical observations to estimate the Hubble constant, the local measurement of the expansion
rate of the Universe. In visualizations of data and results, showing these errors effectively can make a
plot convey much more complete information.
Types of errors
Basic Errorbars
Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-whitegrid')

x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and points, and has the same
syntax as the shorthand used in plt.plot()
In addition to these basic options, the errorbar function has many options to fine tune the
outputs. Using these additional options you can easily customize the aesthetics of your errorbar
plot.
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does
not have a built-in convenience routine for this type of application, it’s relatively easy to combine
primitives like plt.plot and plt.fill_between for a useful result.
Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a
method of fitting a very flexible nonparametric function to data with a continuous measure of the
uncertainty.
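A simplified sketch of a continuous error band using plt.fill_between; it uses a synthetic fit and a constant uncertainty rather than the full GPR example:
import numpy as np
import matplotlib.pyplot as plt

xfit = np.linspace(0, 10, 200)
yfit = np.sin(xfit)          # stand-in for a fitted model
dyfit = 0.3                  # stand-in for the model's uncertainty

plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit, color='gray', alpha=0.2)
plt.xlim(0, 10);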
4. DENSITY AND CONTOUR PLOTS
To display three-dimensional data in two dimensions using contours or color-coded regions. There are three
Matplotlib functions that can be helpful for this task:
plt.contour for contour plots,
plt.contourf for filled contour plots, and
plt.imshow for showing images.
Example
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented by dashed lines, and
positive values by solid lines.
Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
We’ll also specify that we want more lines to be drawn—20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather
than continuous, which is not always what is desired.
You could remedy this by setting the number of contours to a very high number, but this results in a
rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional
grid of data as an image.
Example Program
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# the extent should match the x and y data range
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
plt.colorbar()
5. HISTOGRAMS
Histogram is the simple plot to represent the large data set. A histogram is a graph showing
frequency distributions. It is a graph showing the number of observations within each given interval.
5.1 Parameters
plt.hist( ) is used to plot histogram. The hist() function will use an array of numbers to create a
histogram, the array is sent into the function as an argument.
bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is
plotted as a bar whose height corresponds to how many data points are in that bin. Bins are also
sometimes called "intervals", "classes", or "buckets".
normed / density - Histogram normalization scales the bar heights so that the histogram represents a
probability density rather than raw counts (the older normed parameter has been replaced by density
in newer Matplotlib versions).
x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence
of arrays which are not required to be of the same length.
histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional The type of histogram to draw.
'bar' is a traditional bar-type histogram. If multiple data are given the bars are arranged
side by side.
'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
'step' generates a lineplot that is by default unfilled.
'stepfilled' generates a lineplot that is by default filled. Default is 'bar'
align - {'left', 'mid', 'right'}, optional. Controls how the bars are aligned relative to the bin edges.
Default is 'mid'.
label - str or None, optional. Default is None
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here’s an example of
a more customized histogram.
The plt.hist docstring has more information on other customization options available. I find this
combination of histtype='stepfilled' along with some transparency alpha to be very useful when comparing
histograms of several distributions
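A sketch of such a customized histogram, in the spirit of the description above (the parameter values are chosen only for illustration):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');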
OUTPUT:
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
OUTPUT:
6. LEGENDS
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously
saw how to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the
legend in Matplotlib.
lines = plt.plot(x, y)
plt.legend(lines[:2], ['first', 'second']);

# Applying labels individually:
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np

x = np.linspace(0, 10, 1000)
# a figure with labelled lines is assumed so that the legend has entries to show
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig
7. COLOR BARS
In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot.
For continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.
We can specify the colormap using the cmap argument to the plotting function that is creating the
visualization. Broadly, we can know three different categories of colormaps:
Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps - These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
Color limits and extensions
Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance
of plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable.
We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the
top and bottom by setting the extend property.
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')   # I is a 2D image array defined earlier
plt.colorbar(extend='both')
plt.clim(-1, 1);
OUTPUT:
Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to represent discrete values. The easiest way
to do this is to use the plt.cm.get_cmap() function, and pass the name of a suitable colormap along with the
number of desired bins.
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
8. SUBPLOTS
Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single
figure.
These subplots might be insets, grids of plots, or other more complicated layouts.
We’ll explore four routines for creating subplots in Matplotlib.
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
The most basic method of creating an axes is to use the plt.axes function. As we’ve seen previously,
by default this creates a standard axes object that fills the entire figure.
plt.axes also takes an optional argument that is a list of four numbers in the figure coordinate system.
These numbers represent [left, bottom, width, height] in the figure coordinate system, which ranges
from 0 at the bottom left of the figure to 1 at the top right of the figure.
For example,
we might create an inset axes at the top-right corner of another axes by setting the x and y position to 0.65
(that is, starting at 65% of the width and 65% of the height of the figure) and the x and y extents to 0.2
(that is, the size of the axes is 20% of the width and 20% of the height of the figure).
import matplotlib.pyplot as plt
import numpy as np

ax1 = plt.axes()                        # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])  # inset axes
OUTPUT:
We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper
panel (at position 0.5) matches the top of the lower panel (at position 0.1+ 0.4).
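The code for these stacked axes is not shown in the notes; a minimal sketch reconstructed from the positions mentioned in the text:
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4], xticklabels=[], ylim=(-1.2, 1.2))   # upper panel, no tick labels
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4], ylim=(-1.2, 1.2))                   # lower panel, top touches the upper panel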
If the position of the second axes is changed so that it no longer touches the first, the two plots are
separated from each other, for example:
ax2 = fig.add_axes([0.1, 0.01, 0.8, 0.4])
OUTPUT:
The approach just described can become quite tedious when you’re creating a large grid of subplots,
especially if you’d like to hide the x- and y-axis labels on the inner plots.
For this purpose, plt.subplots() is the easier tool to use (note the s at the end of subplots).
Rather than creating a single subplot, this function creates a full grid of subplots in a single line,
returning them in a NumPy array
The arguments are the number of rows and number of columns, along with optional keywords
sharex and sharey, which allow you to specify the relationships between different axes.
Here we’ll create a 2×3 grid of subplots, where all axes in the same row share their y- axis scale,
and all axes in the same column share their x-axis scale
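A sketch of the 2×3 grid described above, with shared y-axes per row and shared x-axes per column (the panel labels are added only to make the layout visible):
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

# label each panel with its (row, column) index
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha='center')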
Note that by specifying sharex and sharey, we’ve automatically removed inner labels on the grid to
make the plot cleaner.
To go beyond a regular grid to subplots that span multiple rows and columns, plt.GridSpec() is the best tool.
The plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is
recognized by the plt.subplot() command.
For example, a gridspec for a grid of two rows and three columns with some specified width and
height space looks like this:
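A sketch of such a gridspec and of placing subplots that span multiple cells (the span choices are illustrative):
import matplotlib.pyplot as plt

grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)

plt.subplot(grid[0, 0])      # a single cell
plt.subplot(grid[0, 1:])     # spans the remaining columns of the first row
plt.subplot(grid[1, :2])     # spans the first two columns of the second row
plt.subplot(grid[1, 2]);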
OUTPUT:
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
OUTPUT:
Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the
beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The
transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a
fraction of the axes size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here
the gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected,
while the others remain stationary.
We enable three-dimensional plots by importing the mplot3d toolkit, included with the main Matplotlib
installation.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

fig = plt.figure()
ax = plt.axes(projection='3d')
With this 3D axes enabled, we can now plot a variety of three-dimensional plot types.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

ax = plt.axes(projection='3d')

# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')

# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the
page.
ax.set_zlabel('z')
plt.show()
Sometimes the default viewing angle is not optimal, in which case we can use the view_init method to set
the elevation and azimuthal angles.
ax.view_init(60, 35)
fig
Two other types of three-dimensional plots that work on gridded data are wireframes and surface plots.
These take a grid of values and project it onto the specified threedimensional surface, and can
make the resulting three-dimensional forms quite easy to visualize.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

fig = plt.figure()
ax = plt.axes(projection='3d')
# X, Y, Z are assumed to be gridded data, e.g. produced with np.meshgrid as in the contour examples
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon.
Adding a colormap to the filled polygons can aid perception of the topology of the surface
being visualized
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

ax = plt.axes(projection='3d')
# X, Y, Z are the same gridded data as in the wireframe example above
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
For some applications, the evenly sampled grids required by the preceding routines are overly
restrictive and inconvenient.
In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)   # f is the two-dimensional surface function defined in the earlier examples
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
Basemap provides a Matplotlib axes that understands spherical coordinates and allows us to easily
over-plot data on the map.
We'll use an etopo image (which shows topographical features both on land and under the ocean)
as the map background. The following program displays a particular area of the map with latitude
and longitude lines.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='r')
The Basemap package implements several dozen such projections, all referenced by a short format code. Here
we’ll briefly demonstrate some of the more common ones.
Cylindrical projections
Pseudo-cylindrical projections
Perspective projections
Conic projections
Other cylindrical projections are the Mercator (projection='merc') and the cylindrical
equal-area (projection='cea') projections.
The additional arguments to Basemap for this view specify the latitude (lat) and longitude
(lon) of the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in
units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None, llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Perspective projections are constructed using a particular choice of perspective point, similar to if
you photographed the Earth from a particular point in space (a point which, for some projections,
technically lies within the Earth!).
One common example is the orthographic projection (projection='ortho'), which shows one side of
the globe as seen from a viewer at a very long distance.
Thus, it can show only half the globe at a time.
Other perspective-based projections include the gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
These are often the most useful for showing small portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=0)
draw_map(m);
Political boundaries
drawcountries() - Draw country boundaries
drawstates() - Draw US state boundaries
drawcounties() - Draw US county boundaries
Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels() - Draw lines of constant latitude
drawmeridians() - Draw lines of constant longitude
drawmapscale() - Draw a linear scale on the map
Whole-globe images
bluemarble() - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo() - Draw an etopo relief image onto the map
warpimage() - Project a user-provided image onto the map
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very
useful for exploring correlations between multidimensional data, when you’d like to plot all pairs of values
against each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', size=2.5);   # newer Seaborn versions use height=2.5 instead of size
Faceted histograms
Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid
makes this extremely simple.
We’ll take a look at some data that shows the amount that restaurant staff receive in tips based
on various indicator data
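A sketch of such a faceted histogram on Seaborn's built-in tips dataset; the derived tip_pct column is added here only for illustration:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

# one histogram per (sex, time) combination
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.arange(0, 40, 3));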
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution
of a parameter within bins defined by any other parameter.
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between
different datasets, along with the associated marginal distributions.
Bar plots
Time series can be plotted with sns.factorplot (renamed sns.catplot in newer Seaborn versions).
UNIT V
HANDLING LARGE DATA
Problems - techniques for handling large volumes of data - programming tips for dealing with large data sets-
Case studies: Predicting malicious URLs, Building a recommender system - Tools and techniques needed -
Research question - Data preparation - Model building – Presentation and automation.
Handling large volumes of data requires a combination of techniques to efficiently process, store, and
analyze the data.
Some common techniques include:
1. Distributed computing:
Using frameworks like Apache Hadoop and Apache Spark to distribute data processing tasks across
multiple nodes in a cluster, allowing for parallel processing of large datasets.
2. Data compression:
Compressing data before storage or transmission to reduce the amount of space required and
improve processing speed.
3. Data partitioning:
Dividing large datasets into smaller, more manageable partitions based on certain criteria (e.g.,
range, hash value) to improve processing efficiency.
4. Data deduplication:
Identifying and eliminating duplicate data to reduce storage requirements and improve data
processing efficiency.
5. Database sharding:
Partitioning a database into smaller, more manageable parts called shards, which can be distributed
across multiple servers for improved scalability and performance.
6. Stream processing:
Processing data in real-time as it is generated, allowing for immediate analysis and decision-making
7. In-memory computing:
Storing data in memory instead of on disk to improve processing speed, particularly for frequently
accessed data
8. Parallel processing:
Using multiple processors or cores to simultaneously execute data processing tasks, improving
processing speed for large datasets.
9. Data indexing:
Creating indexes on data fields to enable faster data retrieval, especially for queries involving large
datasets.
10. Data aggregation:
Combining multiple data points into a single, summarized value to reduce the overall volume of data
while retaining important information.
These techniques can be used individually or in combination to handle large volumes of data
effectively and efficiently.
When dealing with large datasets in programming, it's important to use efficient techniques to
manage memory, optimize processing speed, and avoid common pitfalls. Here are some
programming tips for dealing with large datasets:
2. Lazy loading:
Use lazy loading techniques to load data into memory only when it is needed, rather than loading the
entire dataset at once. This can help reduce memory usage and improve performance
3. Batch processing:
Process data in batches rather than all at once, especially for operations like data transformation or
analysis. This can help avoid memory issues and improve processing speed.
6. Parallel processing:
Use parallel processing techniques, such as multithreading or multiprocessing,to process the data
concurrently and take advantage of multi-core process.
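A short sketch of batch (chunked) processing with pandas, assuming a hypothetical large file 'large_data.csv' with a numeric 'value' column (both names are invented for illustration):
import pandas as pd

total, count = 0.0, 0
# read the file in 100,000-row chunks instead of loading it all at once
for chunk in pd.read_csv('large_data.csv', chunksize=100_000):
    total += chunk['value'].sum()
    count += len(chunk)

print("mean value:", total / count)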
3. CASE STUDIES
Building a recommender system involves predicting the "rating" or "preference" that a user would
give to an item. These systems are widely used in e-commerce, social media, and content streaming
platforms to personalize recommendations for users. Here are two case studies that demonstrate
how recommender systems can be built
Dealing with large datasets requires a combination of tools and techniques to manage, process, and
analyze the data efficiently. Here are some key tools and techniques:
2. Data Storage:
Use of distributed file systems like Hadoop Distributed File System (HDFS), cloud storage services
like Amazon S3, or NoSQL databases like Apache Cassandra or MongoDB for storing large volumes of
data.
3. Data Processing:
Techniques such as MapReduce, Spark RDDs, and Spark DataFrames for parallel processing of data
across distributed computing clusters.
4. Data Streaming:
Tools like Apache Kafka or Apache Flink for processing real-time streaming data.
5. Data Compression:
Techniques like gzip, Snappy, or Parquet for compressing data to reduce storage requirements and
improve processing speed.
6. Data Partitioning:
Divide large datasets into smaller, more manageable partitions based on certain criteria to improve
processing efficiency.
7. Distributed Computing:
Use of cloud computing platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), or
Microsoft Azure for scalable and cost-effective processing of large datasets
8. Data Indexing:
Create indexes on data fields to enable faster data retrieval, especially for queries involving large
datasets.
9. Machine Learning:
Use of machine learning algorithms and libraries (e.g., scikit-learn, TensorFlow) for analyzing and
deriving insights from large datasets.
1. Data Cleaning:
Remove or correct any errors or inconsistencies in the data, such as missing values, duplicate
records, or outliers.
2. Data Integration:
Combine data from multiple sources into a single dataset, ensuring that the data is consistent and can
be analyzed together.
3. Data Transformation:
Convert the data into a format that is suitable for analysis, such as converting categorical variables
into numerical ones or normalizing numerical variables.
4. Data Reduction:
Reduce the size of the dataset by removing unnecessary features or aggregating data to a higher level
of granularity.
5. Data Sampling:
If the dataset is too large to analyze in its entirety, use sampling techniques to extract a
representative subset of the data for analysis.
6. Feature Engineering:
Create new features from existing ones to improve the performance of machine learning models or
better capture the underlying patterns in the data.
7. Data Splitting:
Split the dataset into training, validation, and test sets to evaluate the performance of machine
learning models and avoid overfitting
8. Data Visualization:
Visualize the data to explore its characteristics and identify any patterns or trends that may be present.
9. Data Security:
Ensure that the data is secure and protected from unauthorized access or loss, especially when
dealing with sensitive information.
When building models for large datasets, it's important to consider scalability, efficiency, and
performance. Here are some key techniques and considerations for model building with large data:
2. Feature Selection:
Choose relevant features and reduce the dimensionality of the dataset to improve model
performance and reduce computation time.
3. Model Selection:
Use models that are scalable and efficient for large datasets, such as gradient boosting machines,
random forests, or deep learning models.
4. Batch Processing:
If real-time processing is not necessary, consider batch processing techniques to handle large
volumes of data in scheduled intervals.
5. Sampling:
Use sampling techniques to create smaller subsets of the data for model building and validation,
especially if the entire dataset cannot fit into memory
6. Incremental Learning:
Implement models that can be updated incrementally as new data becomes available, instead of
retraining the entire model from scratch.
7. Feature Engineering:
Create new features or transform existing features to better represent the underlying patterns in the
data and improve model performance.
8. Model Evaluation:
Use appropriate metrics to evaluate model performance, considering the trade-offs between
accuracy, scalability, and computational resources.
9. Parallelization:
Use parallel processing techniques within the model training process to speed up computations, such
as parallelizing gradient computations in deep learning models.
10. Data Partitioning:
Partition the data into smaller subsets for training and validation to improve efficiency and reduce
memory requirements.
By employing these techniques, data scientists and machine learning engineers can build models
that are scalable, efficient, and capable of handling large datasets effectively (a brief incremental-
learning sketch follows below).
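A brief sketch of the incremental-learning idea using scikit-learn's partial_fit interface; the data here is synthetic and only for illustration:
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
model = SGDClassifier()
classes = np.array([0, 1])                  # all classes must be declared up front

# pretend the data arrives in batches that never fit in memory at once
for _ in range(10):
    X_batch = rng.randn(1000, 5)
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.randn(3, 5)))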
Presentation and automation are key aspects of dealing with large datasets to effectively
communicate insights and streamline data processing tasks. Here are some strategies for
presentation and automation:
1. Visualization:
Use data visualization tools like Matplotlib, Seaborn, or Tableau to create visualizations that help
stakeholders understand complex patterns and trends in the data.
2. Dashboarding:
Build interactive dashboards using tools like Power BI or Tableau that allow users to explore the
data and gain insights in real-time.
3. Automated Reporting:
Use tools like Jupyter Notebooks or R Markdown to create automated reports that can be generated
regularly with updated data.
4. Data Pipelines:
Implement data pipelines using tools like Apache Airflow or Luigi to automate data ingestion,
processing, and analysis tasks
5. Model Deployment:
Use containerization technologies like Docker to deploy machine learning models as scalable and
reusable components
7. Version Control:
Use version control systems like Git to track changes to your data processing scripts and models,
enabling collaboration and reproducibility
8. Cloud Services:
Leverage cloud services like AWS, Google Cloud Platform, or Azure for scalable storage, processing,
and deployment of large datasets and models.
By incorporating these strategies, organizations can streamline their data processes, improve
decision-making, and derive more value from their large datasets.
QUESTION BANK
(Regulation 2021)
UNIT I
INTRODUCTION
PART A QUESTIONS AND ANSWERS
1. What is Data Science?
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary
approach that combines principles and practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of data. This analysis helps data
scientists to ask and answer questions like what happened, why it happened, what will happen, and
what can be done with the results.
2. Differentiate between Data Science and Big Data.
Data Science:
It is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics.
It is about the collection, processing, analysing, and utilizing of data in various operations; it is more
conceptual.
It is a superset of Big Data, as it consists of data scraping, cleaning, visualization, statistics, and many
more techniques.
It broadly focuses on the science of the data and on extracting vital and valuable information from
huge amounts of data.
Big Data:
It is a technique to collect, maintain, and process huge amounts of information.
It is more involved with the handling of voluminous data.
It is a subset of Data Science, as the mining activities form one stage of the data science pipeline.
It is a technique for tracking and discovering trends in complex data sets; the goal is to make data
more vital and usable by extracting only the important information from the huge data within
existing traditional aspects.
PART B QUESTIONS
1. Explain in detail the Facets of Data?
2. State the differences between Big data and Data science. Mention the Benefits and uses of
data science and big data
3. Explain the data science Process and its steps in details.
4. Discuss about defining research goals and the steps involved in creating a project charter?
5. Explain in detailed about Retrieving data in Data science?
6. Explain about cleaning, integrating and transforming data in detail?
7. Explain about data Exploration which discuss about graphs?
8. Explain about basic statistical description of data?
9. Explain standard deviation and its formula with an example
10. Explain Data Mining?
11. Explain in detail about Text Mining with the steps involved in it?
UNIT II
DATA MANIPULATION
PART A QUESTIONS AND ANSWERS
1. What does a Python integer contain?
A single integer in Python actually contains four pieces:
ob_refcnt, a reference count that helps Python silently handle memory allocation and
deallocation
ob_type, which encodes the type of the variable
ob_size, which specifies the size of the following data members
ob_digit, which contains the actual integer value that we expect the Python variable to represent
Indexing of a pandas Series is relatively slow compared to indexing of NumPy arrays, which is very
fast.
Pandas offers a 2D table object called DataFrame, whereas NumPy provides multi-dimensional
arrays.
11. Define Pandas Objects.
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows
and columns are identified with labels rather than simple integer indices.
17. Define the methods used in null values.
There are several useful methods for detecting, removing, and replacing null values in Pandas data
structures. They are:
isnull() - Generate a Boolean mask indicating missing values
notnull()-Opposite of isnull()
dropna()-Return a filtered version of the data
fillna()-Return a copy of the data with missing values filled or imputed
18. Name the categories of Join.
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one,
and many-to-many joins.
19. Mention the term GroupBy and the steps involved in it.
The name "group by" comes from a command in the SQL database language, but it is perhaps more
illuminative to think of it in the terms first coined by Hadley, Wickham of Rstats fame: split, apply,
combine.
The split step involves breaking up and grouping a DataFrame depending on the value of the
specified key
apply to interes computing some function, usully an agregate Transformation, or filtering, within the
individual groups. The combine step merges the results of these operations into an output array.
20. Define Pivot Table
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that
operate on tabular data. The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional summarization of the data.
21. Define Hierarchical Indexing.
Hierarchical indexing, also known as multi-indexing, means setting more than one column as the
index.
PART B QUESTIONS
1. Explain about the Datatypes in Python?
2. Explain about the Basics of NumPy arrays with coding?
3. Describe the syntax for accessing subarrays in an array using array slicing?
4. Explain in detail about Reshaping of arrays?
5. Explain about the Aggregation operations in Python?
6. Explain about broadcasting and its rules with examples?
7. Explain about Boolean masking in detail?
8. Explain in detail about Fancy indexing?
9. Explain in detail why there is a need for NumPy’s structured arrays?
10. Describe the installation Procedure and using of Pandas?
11. Explain the different ways of constructing a Pandas data frame?
12. Explain about the data selection in series and data frames?
13. Explain about the different operations on data in Pandas?
14. Explain how handling of missing data is done?
15. Explain about the methods for detecting, removing and replacing Null values in Pandas data
structures.
16. Explain in detail about hierarchical indexing?
17. Describe data aggregations on Multiindices?
18. Explain in detail about combining datasets using Concat and Append?
19. Explain in detail about categories of Joins.
20. Explain about the steps involved in GroupBy with suitable diagram and coding?
UNIT III
MACHINE LEARNING
PART A QUESTIONS AND ANSWERS
1. What is Machine Learning?
Machine learning is a subfield of artificial intelligence that involves the development of algorithms
and statistical models that enable computers to learn from data and improve their performance in
tasks through experience. A computer is said to learn from a task T and improve its performance (P)
with experience E.
2. What are the steps defined in the modelling phase?
The modeling phase consists of four steps:
Feature engineering and model selection
Training the model
Model validation and selection
Applying the trained model to unseen data
Common model validation strategies:
Holdout: Dividing your data into a training set with X% of the observations and keeping the rest as a
holdout data set.
K-folds cross validation: This strategy divides the data set into k parts and uses each part one time as a
test data set while using the others as a training data set.
Leave-1 out: This approach is the same as k-folds but with k=1. Always leave one observation out and
train on the rest of the data.
7. What are the types of machine learning?
The two big types of machine learning techniques are:
1. Supervised: Learning that requires labeled data
2. Unsupervised: Learning that doesn't require labeled data but is usually less accurate or reliable
than supervised learning.
3. Semi-supervised learning is in between those techniques and is used when only a small portion
of the data is labeled.
8. What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify the category of
new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then classifies new observations into a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or
categories.
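A minimal illustrative sketch with scikit-learn (the iris data and the decision tree are chosen only as example inputs and classifier):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Learn from labeled training data, then classify unseen observations
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict(X_test[:5]))    # predicted class labels for new observations
print(clf.score(X_test, y_test))  # accuracy on held-out data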
9. Define Regression.
Regression is a measure of the relation between the mean value of one variable and the corresponding
values of other variables.
10.Define regression line.
The regression line is a straight line (rather than a curved line) that denotes the linear relationship between
the two variables.
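A small sketch of fitting a regression line with NumPy (the x and y values are invented for illustration):
import numpy as np

# Fit a straight line y = a*x + b by least squares
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

a, b = np.polyfit(x, y, deg=1)   # slope and intercept of the regression line
y_pred = a * x + b
print(a, b)
print(y - y_pred)                # residuals, i.e. the predictive errors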
11.What is a predictive error?
Predictive error refers to the difference between the values predicted by a model and the
actual values. Example: the predictive error is the difference between the expected number of cards to be
received and the number of cards actually received.
12.What is meant by Captcha Check?
Captcha checks were originally hard-to-read numbers that the human user must decipher and enter into a form field
before sending the form back to the web server, proving that the sender is a human rather than a program.
13.Define Clustering.
Clustering is the task of dividing unlabeled data or data points into different clusters such that similar
data points fall in the same cluster, while data points that differ from them fall in other clusters.
14. Define Principal Component Analysis.
Principal Component Analysis (PCA) is a technique to find the latent variables in a data set while retaining as much information as possible.
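For illustration, a minimal scikit-learn sketch (the iris data is only an example):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional measurements onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of information retained per component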
15. Define Semi Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large amount of
unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables, similar to supervised learning.
16.State the learners in classification problem. Mention about Lazy Learners.
Lazy Learners: a lazy learner first stores the training dataset and waits until it receives the test dataset.
In the lazy learner case, classification is done on the basis of the most related data stored in the training
dataset. It takes less time for training but more time for predictions. Example: K-NN algorithm, Case-based reasoning.
17.Mention the Types of ML Classification Algorithm:
Linear Models
  Logistic Regression
  Support Vector Machines
Non-linear Models
  K-Nearest Neighbours
  Kernel SVM
  Naïve Bayes
  Decision Tree Classification
  Random Forest Classification
18. State outlier analysis.
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a
different manner from them. Outliers can be caused by measurement or execution errors. The analysis of
outlier data is referred to as outlier analysis or outlier mining.
19. State the difference between Supervised and Unsupervised Learning.
Input Data: supervised learning uses known and labeled data as input; unsupervised learning uses unknown (unlabeled) data as input.
Computational Complexity: supervised learning has less computational complexity; unsupervised learning is more computationally complex.
Real Time: supervised learning uses off-line analysis; unsupervised learning uses real-time analysis of data.
Number of Classes: in supervised learning the number of classes is known; in unsupervised learning it is not known.
Accuracy of Results: supervised learning gives accurate and reliable results; unsupervised learning gives comparatively less accurate and reliable results.
DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an
example of a density-based model, in which areas of high density are separated by areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
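A rough scikit-learn sketch (the two-moons data set is a made-up example of non-spherical clusters):
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: clusters that are not spherical
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the points needed for a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))   # cluster labels; -1 marks noise points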
PART B QUESTIONS
1. Explain the steps involved in modeling process.
2. Explain the types of Machine learning.
3. Explain about Supervised learning with a suitable example.
4. Explain in detail about unsupervised learning with an example.
5. Explain the steps involved in discerning digits from images under supervised learning.
6. Explain in detail about unsupervised learning for finding latent variables in a wine quality data set.
7. Explain about Semi Supervised Learning in detail.
8. Explain Classification Algorithm in Machine Learning
9. Explain about main clustering methods used in Machine Learning.
10. Define regression. Explain about prediction of values using the regression line
11. Write short notes on (a) Clustering Algorithms (b) Applications of Clustering techniques
12. Explain in detail about Outlier analysis.
_____________________________________________________________________________________________________________________
UNIT IV
DATA VISUALIZATION
PART A QUESTIONS AND ANSWERS
1. Define Matplotlib.
Matplotlib is a cross-platform data visualization and graphical plotting library for Python and its
numerical extension NumPy.
2. How to import Matplotlib?
matplotlib.pyplot is a collection of command-style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In
matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things
like the current figure and plotting area, and the plotting functions are directed to the current axes.
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()
3. How to specify line colors?
The plt.plot() function takes additional arguments that can be used to control the line colors and styles.
To adjust the color, the color keyword is used, which accepts a string argument representing virtually
any imaginable color. The color can be specified in a variety of ways.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)

plt.plot(x, np.sin(x - 0), color='blue')           # specify color by name
plt.plot(x, np.sin(x - 1), color='g')              # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')           # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')        # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))  # RGB tuple, values between 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse')     # all HTML color names supported
plt.show()
4. Define Scatter plots.
Instead of points being joined by line segments, here in scatter plots the points are represented
individually with a dot, circle, or other shape
5. Mention the difference between plt.scatter and plt.plot.
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where
the properties of each individual point (size, face color, edge color, etc.) can be individually
controlled or mapped to data whereas the plt.plot function only draws a line from point to point.
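A short illustrative sketch (random data, invented for the example) showing per-point sizes and colors, which plt.plot cannot do:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x, y = rng.rand(2, 50)
sizes = 1000 * rng.rand(50)
colors = rng.rand(50)

# Each point gets its own size and color, mapped to data
plt.scatter(x, y, s=sizes, c=colors, alpha=0.4, cmap='viridis')
plt.colorbar()
plt.show()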
6. Define Contour Plot.
A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant
z slices, called contours, on a 2-dimensional format. A contour plot can be created with the
plt.contour function.
It takes three arguments:
a grid of x values,
a grid of y values, and
a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the
contour levels.
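A minimal sketch, assuming an arbitrary function z = f(x, y) chosen only for illustration:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)          # grids of x and y positions
Z = np.sin(X) ** 2 + np.cos(Y)    # grid of z values

contours = plt.contour(X, Y, Z, 20, cmap='RdGy')  # 20 contour levels
plt.colorbar(contours)
plt.show()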
7. Define Kernel Density Estimation.
One of the common methods of evaluating densities in multiple dimensions is Kernel Density
Estimation (KDE). KDE can be thought of as a way to "smear out" the points in space and add up the
result to obtain a smooth function.
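A rough sketch with SciPy's gaussian_kde (the sample data is generated only for illustration):
import numpy as np
from scipy.stats import gaussian_kde

# 1000 points drawn from two normal distributions (illustrative data)
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])

kde = gaussian_kde(data)        # "smears out" each point with a Gaussian kernel
grid = np.linspace(-6, 6, 200)
density = kde(grid)             # smooth density estimate evaluated on the grid
print(density.max())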
8. Define Plot Legends.
Plot legends give meaning to a visualization, assigning labels to the various plot elements. The
simplest legend can be created with the plt.legend() command, which automatically creates a legend
for any labeled plot elements.
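A minimal sketch; the sine and cosine curves are only example plot elements:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')   # labeled plot elements
plt.plot(x, np.cos(x), label='cos(x)')

plt.legend()   # builds the legend from the labels given above
plt.show()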
9. State the categories of colormaps
There are three different categories of colormaps
Sequential colormaps: these consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps: these usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps: these mix colors with no particular sequence (e.g., rainbow or jet).
PART B QUESTIONS
3. Explain in detail about simple Line plots with line colors, styles and axes limits.
4. Explain in detail about simple scatter plots.
5. Explain how Visualizing Errors is done.
6. Explain in detail about Density and Contour Plots.
7. Explain in detail about histogram functions with Binnings and Density.
8. Explain in detail about Customizing Plot Legends.
9. Explain in detail about Customizing Colorbars.
10. Explain in detail about Multiple Subplots.
11. Describe the example Effect of Holidays on US Births with Text and Annotation.
12. Explain in detail about Customizing Ticks.
13. Explain in detail the various built in styles in matplotlib stylesheets.
14. Explain in detail about Three-Dimensional Plotting in Matplotlib.
15. Explain in detail about Visualizing a Mobius Strip.
16. Explain about Geographic Data with Basemap with different Map Projections, Map
Background and Plotting Data on Maps.
17. Explain the example California Cities.
18. Explain the example Surface Temperature Data.
19. Explain in detail about Visualization with Seaborn.
UNIT V
HANDLING LARGE DATA
PART A QUESTIONS AND ANSWERS
1. What are the general problems you face while handling large data?
Managing massive volumes of data.
Ensuring data quality.
Keeping data secure.
Selecting the right big data tools.
Scaling systems and costs efficiently.
Lack of skilled data professionals.
Organizational resistance.
2. What are the problems encountered when working with more data than fits in memory?
Not enough memory
Processes that never end
Some components form a bottleneck while others remain idle
Not enough speed
5. Define Perceptron.
A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1).
6. State how the train_observation() function works.
This function has two large parts.
The first is to calculate the prediction of an observation and compare it to the actual value.
The second part is to change the weights if the prediction seems to be wrong.
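The textbook's exact train_observation() code is not reproduced here; the following is only a hypothetical sketch of the two parts described above (the function signature and the learning_rate parameter are assumptions made for the example):
import numpy as np

def train_observation(weights, x, y, learning_rate=0.1):
    # Hypothetical sketch, not the textbook implementation.
    # Part 1: calculate the prediction and compare it to the actual value
    prediction = 1 if np.dot(weights, x) > 0 else 0
    error = y - prediction

    # Part 2: change the weights only if the prediction was wrong
    if error != 0:
        weights = weights + learning_rate * error * x
    return weights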
Python Tools
Python has a number of libraries that can help to deal with large data. They range from smarter data
structures and code optimizers to just-in-time compilers.
The following is a list of libraries we like to use when confronted with large data:
Cython
Numexpr
Numba
Bcolz
Blaze
Theano
Dask
13.Mention the three general programming tips for dealing with large data sets.
1. Don't reinvent the wheel.
2. Get the most out of your hardware.
3. Reduce the computing need.
SVMLight is a machine learning software package developed by Thorsten Joachims. It is designed for
solving classification and regression problems using Support Vector Machines (SVMs), which are a
type of supervised learning algorithm.
20.What is a Tree Structure?
Trees are a class of data structure that allows you to retrieve information much faster than scanning
through a table. A tree always has a root value and subtrees of children, each with its own children, and so
on. Simple examples are a family tree or a biological tree and the way it splits into branches,
twigs, and leaves.
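A minimal sketch of such a structure in Python (the Node class is a generic illustration, not a specific library):
class Node:
    # A tree node: a value plus a list of child subtrees
    def __init__(self, value):
        self.value = value
        self.children = []

# A root with two subtrees, like a family or biological tree
root = Node("root")
branch1, branch2 = Node("branch-1"), Node("branch-2")
root.children.extend([branch1, branch2])
branch1.children.append(Node("leaf"))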
PART B QUESTIONS
1. Explain in detail about the problems that you face when handling large data.
2. Explain in detail about the general techniques for handling large volumes of data.
3. Explain the steps involved in training the perceptron by observation.
4. Explain the steps involved in the train function.
5. Explain in detail about block matrix calculations with bcolz and Dask libraries.
6. Explain in detail about general programming tips for dealing with large data sets.
7. Explain the case study Predicting Malicious URLs.
8. Explain the steps for creating a model to distinguish the malicious from the normal URLs.
9. Explain in detail about Building a recommender system inside a database.
10. Explain in detail the steps involved in creating the hamming distance.
11. Explain how a recommender system is used to find the customers in the database who have or have not
watched a given movie.
******************************************************************************************************