DWDM Lab Using Python

The document provides a comprehensive overview of data mining, detailing its processes, types, advantages, disadvantages, applications, and challenges. It also introduces the NumPy and Pandas modules, emphasizing their importance in data analysis and manipulation. Data mining is presented as a crucial tool for organizations to extract valuable insights from large datasets, while NumPy and Pandas are highlighted for their efficiency in handling data operations.


Y. Padmasri, CSE

INTRODUCTION:
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
Our data mining tutorial covers all the major topics of data mining, such as applications, data mining vs. machine learning, data mining tools, social media data mining, data mining techniques, clustering in data mining, challenges in data mining, and more.
WHAT IS DATA MINING?
Data mining is the process of extracting information from huge sets of data to identify patterns, trends, and useful facts that allow a business to make data-driven decisions.
In other words, data mining is the process of examining data from different perspectives to uncover hidden patterns and categorize it into useful information. This information is collected and assembled in repositories such as data warehouses, analyzed efficiently with data mining algorithms, and used to support decision-making, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information for trends and patterns that go beyond simple analysis procedures. It uses complex mathematical algorithms to segment data and evaluate the probability of future events.
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining, and it is done through software that ranges from simple to highly specialized. By outsourcing data mining, all the work can be done faster and at lower operating cost. Specialized firms can also use new technologies to collect data that would be impossible to locate manually. Tons of information is available on various platforms, but very little of it is usable knowledge. The biggest challenge is to analyze the data and extract the important information that can be used to solve a problem or develop a company. Many powerful instruments and techniques are available to mine data and find better insights in it.

TYPES OF DATA MINING:

Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than for transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure: for example, a group of databases in which an organization has kept various kinds of information.

Object-Relational Database:
A combination of an object-oriented database model and relational database model is
called an object-relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented practices frequently used in many programming languages, for example, C++, Java, and C#.

Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo (roll back) a database transaction if it is not completed appropriately. Although this was a unique capability long ago, today most relational database systems support transactional operations.
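As a minimal illustration of this rollback capability in Python, the sketch below uses the standard-library sqlite3 module; the table and balances are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('A', 100.0), ('B', 50.0)")
conn.commit()

try:
    # A transfer is two statements; both must succeed or neither should apply
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'A'")
    raise RuntimeError("simulated failure mid-transaction")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'B'")
except Exception:
    conn.rollback()  # undo the incomplete transaction

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('A', 100.0), ('B', 50.0)] -- the partial update was undone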
ADVANTAGES OF DATA MINING:
 Data mining enables organizations to obtain knowledge-based data.
 Data mining enables organizations to make lucrative modifications in operation and production.
 Compared with other statistical data applications, data mining is cost-efficient.
 Data mining helps the decision-making process of an organization.
 It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviours.
 It can be introduced in new systems as well as existing platforms.
 It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

DISADVANTAGES OF DATA MINING:


 There is a probability that organizations may sell useful customer data to other organizations for money. According to some reports, American Express has sold the credit card purchases of its customers to other organizations.
 Many data mining analytics tools are difficult to operate and need advanced training to work with.
 Different data mining instruments operate in distinct ways due to the different algorithms used in their design, so selecting the right data mining tool is a very challenging task.
 Data mining techniques are not perfectly accurate, which may lead to serious consequences in certain conditions.

DATA MINING APPLICATIONS:


Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and their impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.

These are the following areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics to gain better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category, and the resulting procedures help ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling technique based on the hypothesis that if you buy a certain group of products, you are more likely to buy another group of products. It enables the retailer to understand the purchase behavior of a buyer, which may assist the retailer in understanding the buyer's requirements and altering the store's layout accordingly. Analytical comparisons can also be made between different stores and between customers in different demographic groups.
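As a hypothetical sketch of this technique, the code below uses the third-party mlxtend library (an assumption; the document does not prescribe a tool), with made-up transactions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical point-of-sale transactions
transactions = [
    ['bread', 'milk', 'butter'],
    ['bread', 'butter'],
    ['milk', 'eggs'],
    ['bread', 'milk', 'butter', 'eggs'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Item sets bought together in at least half of the transactions,
# then "if X then Y" rules with at least 75% confidence
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])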
Data mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing techniques that discover knowledge in the data generated by educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing the science of learning. An institution can use data mining to make precise decisions and to predict student results; with those results, it can concentrate on what to teach and how to teach it.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset a manufacturing company possesses. Data mining tools can be beneficial for finding patterns in complex manufacturing processes. Data mining can be used in system-level design to discover the relationships between product architecture, product portfolio, and the data needs of customers. It can also be used to forecast the product development period, cost, and expectations, among other tasks.
Data Mining in CRM (Customer Relationship Management):
Customer relationship management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.

Data Mining in Fraud detection:


Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods start from a collection of sample records classified as fraudulent or non-fraudulent; a model is constructed from this data and then used to identify whether a new record is fraudulent or not.
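The supervised approach described above can be sketched with scikit-learn (introduced later in this document); the sample records and labels below are entirely synthetic stand-ins for real investigation data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical records: [amount, hour_of_day, transactions_in_last_24h]
X = np.array([[20, 14, 1], [5000, 3, 9], [35, 11, 2],
              [4200, 2, 12], [60, 16, 1], [3900, 4, 8]])
# Labels from past investigations: 0 = non-fraudulent, 1 = fraudulent
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# A model is constructed using the classified records ...
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# ... and then used to identify whether a new record looks fraudulent
print(model.predict([[4500, 3, 10]]))  # likely [1]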
Data Mining in Lie Detection:
Apprehending a criminal is relatively easy, but bringing out the truth is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This also includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives, because the volume of data is too large or it is produced too rapidly for experts to screen. Managers can use these findings to better target, acquire, retain, and segment profitable customers.

CHALLENGES OF IMPLEMENTING DATA MINING:


Although data mining is very powerful, it faces many challenges during its execution. These challenges can relate to performance, data, methods, techniques, and more. The data mining process becomes effective when the challenges or problems are correctly recognized and adequately resolved.

Incomplete and noisy data:


Data mining is the process of extracting useful data from large volumes of data. Real-world data is heterogeneous, incomplete, and noisy, and data in huge quantities is often inaccurate or unreliable. These problems may occur due to errors in measuring instruments or due to human error. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees enter that information into their system. A person may mistype a digit when entering a phone number, which results in incorrect data; some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also be changed due to human or system error. All of these issues (noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in distributed computing environments: in databases, on individual systems, or even on the internet. In practice, it is quite difficult to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it may not be feasible to store all the data from all the offices on one central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous; it can be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information from them is a tough task. Most of the time, new technologies, tools, and methodologies have to be developed to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it reveals
data about buying habits and preferences of the customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express, but representing the information to the end user in a precise and easy way is often difficult. Because the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.
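As a small, hypothetical illustration of presenting mined output to the end user, the sketch below uses the third-party matplotlib library (an assumption; any plotting tool would do), with invented counts:

import matplotlib.pyplot as plt

# A hypothetical pattern discovered by mining: purchases per product category
categories = ['Electronics', 'Grocery', 'Clothing', 'Toys']
purchases = [120, 340, 210, 90]

plt.bar(categories, purchases)
plt.xlabel('Product category')
plt.ylabel('Number of purchases')
plt.title('Purchases per category (mined from sales data)')
plt.show()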

NUMPY MODULE:
NumPy stands for Numerical Python; it is a Python package for the computation and processing of single-dimensional and multidimensional array elements.
Travis Oliphant created the NumPy package in 2005 by merging the features of the ancestor module Numeric into another module, Numarray.
It is an extension module of Python that is mostly written in C. It provides various functions capable of performing numeric computations at high speed.
NumPy provides powerful data structures implementing multi-dimensional arrays and matrices. These data structures are used for optimal computations on arrays and matrices.
THE NEED OF NUMPY:
With the revolution in data science, data analysis libraries like NumPy, SciPy, and Pandas have seen enormous growth. With a much easier syntax than other programming languages, Python is the first-choice language for data scientists.
NumPy provides a convenient and efficient way to handle vast amounts of data. It is also very convenient for matrix multiplication and data reshaping, and it is fast, which makes it practical to work with large data sets.
There are the following advantages of using NumPy for data analysis:
1. NumPy performs array-oriented computing.
2. It efficiently implements multidimensional arrays.
3. It performs scientific computations.
4. It is capable of performing Fourier transforms and reshaping the data stored in multidimensional arrays.
5. NumPy provides built-in functions for linear algebra and random number generation.

SYNTAX FOR INSTALLING NUMPY:

>>>py -m pip install numpy [in Command Prompt]

PROGRAM USING NUMPY:

# Python program for creation of arrays
import numpy as np

# Creating a rank 1 array
arr = np.array([1, 2, 3])
print("Array with Rank 1: \n", arr)

# Creating a rank 2 array
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print("Array with Rank 2: \n", arr)

# Creating an array from a tuple
arr = np.array((1, 3, 2))
print("\nArray created using passed tuple:\n", arr)

OUTPUT:
Array with Rank 1:
 [1 2 3]
Array with Rank 2:
 [[1 2 3]
 [4 5 6]]

Array created using passed tuple:
 [1 3 2]

PANDAS MODULE:
Pandas is defined as an open-source library that provides high-performance data manipulation in Python. It was developed by Wes McKinney in 2008 and is used for data analysis in Python. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets. Our tutorial provides all the basic and advanced concepts of Python Pandas, such as NumPy, data operations, and time series.
Data analysis requires lots of processing, such as restructuring, cleaning, merging, and so on. Different tools are available for fast data processing, such as NumPy, SciPy, Cython, and Pandas, but we prefer Pandas because working with Pandas is fast, simple, and more expressive than working with the other tools.

Pandas is built on top of the NumPy package, which means NumPy is required for Pandas to operate.
Before Pandas, Python was capable of data preparation but provided only limited support for data analysis. Pandas came into the picture and enhanced the capabilities of data analysis. It can perform the five significant steps required for processing and analyzing data, irrespective of the data's origin: load, manipulate, prepare, model, and analyze.
KEY FEATURES OF PANDAS:
 It has a fast and efficient DataFrame object with default and customized indexing.
 It is used for reshaping and pivoting data sets.
 It groups data for aggregations and transformations.
 It is used for data alignment and integration of missing data.
 It provides time series functionality.
 It processes a variety of data sets in different formats, such as matrix data, heterogeneous tabular data, and time series.
 It handles multiple operations on data sets, such as subsetting, slicing, filtering, group by, re-ordering, and re-shaping.
 It integrates with other libraries such as SciPy and scikit-learn.
 It provides fast performance, and if you want to speed it up even more, you can use Cython.
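A brief sketch of a few of these features (DataFrame creation, subsetting/filtering, group-by aggregation, and reshaping by pivoting); the sales figures are hypothetical:

import pandas as pd

# A small, made-up tabular data set
df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'year':   [2022, 2023, 2022, 2023],
    'sales':  [100, 150, 80, 120],
})

print(df[df['sales'] > 90])  # subsetting / filtering
print(df.groupby('region')['sales'].sum())  # group by for aggregation
print(df.pivot(index='region', columns='year', values='sales'))  # pivoting / reshaping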

BENEFITS OF PANDAS:
The benefits of Pandas over other tools are as follows:
Data representation: It represents data in a form suited to data analysis through its DataFrame and Series structures.
Clear code: The clear API of Pandas allows you to focus on the core part of the code, so it provides clear and concise code for the user.
SYNTAX FOR INSTALLING PANDAS:
>>>py -m pip install pandas [In Command Prompt]
PYTHON PANDAS DATA STRUCTURES:
Pandas provides two data structures for processing data, i.e., Series and DataFrame, which are discussed below:
1. SERIES:
A Series is defined as a one-dimensional array capable of storing various data types. The row labels of a Series are called the index. We can easily convert a list, tuple, or dictionary into a Series using the Series() method. A Series cannot contain multiple columns. It has one main parameter:
Data: It can be any list, dictionary, or scalar value.
Creating Series from Array:
To create a Series from an array, we first have to import the numpy module and then use its array() function in the program.
PROGRAM:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)

OUTPUT:
0    P
1    a
2    n
3    d
4    a
5    s
dtype: object

Explanation: In this code, we first imported the pandas and numpy libraries with the pd and np aliases. Then we created a variable named "info" holding an array of values, passed it to the Series() method, and assigned the result to the variable a. The Series is printed by calling print(a).

PYTHON PANDAS DATAFRAMES:
The DataFrame is a widely used pandas data structure that works with a two-dimensional array with labeled axes (rows and columns). A DataFrame is a standard way to store data and has two different indexes, i.e., a row index and a column index. It has the following properties:

 The columns can be heterogeneous types like int, bool, and so on.
 It can be seen as a dictionary of Series structure where both the rows and columns are
indexed. It is denoted as "columns" in case of columns and "index" in case of rows.

Create a DataFrame using a List:
We can easily create a DataFrame in Pandas from a list.
PROGRAM:
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling the DataFrame constructor on the list
df = pd.DataFrame(x)
print(df)

OUTPUT:
        0
0  Python
1  Pandas

Explanation: In this code, we defined a variable named "x" that consists of string values. The DataFrame constructor is called on the list to build the DataFrame, which is then printed.

EXERCISE-1
1. Demonstrate the following data processing tasks using Python libraries.
a. Loading a dataset.
b. Identifying the dependent and independent variables.
c. Dealing with missing data.

AIM: To demonstrate the following data processing tasks using Python libraries.
a. Loading a dataset.
b. Identifying the dependent and independent variables.
c. Dealing with missing data.

DESCRIPTION:
CSV FILE:
CSV stands for "comma-separated values", a simple file format that uses specific structuring to arrange tabular data. It stores tabular data, such as a spreadsheet or database, in plain text and is a common format for data interchange. A CSV file can be opened in an Excel sheet, where the rows and columns of data define its standard format.
PYTHON CSV MODULES:
The csv module is used to handle CSV files: reading, writing, and getting data from specified columns. Its main functions and constants are as follows:

 csv.field_size_limit - returns the current maximum field size allowed by the parser.
 csv.get_dialect - returns the dialect associated with a name.
 csv.list_dialects - returns the names of all registered dialects.
 csv.reader - reads data from a CSV file.
 csv.register_dialect - associates a dialect with a name; the name must be a string or a Unicode object.
 csv.writer - writes data to a CSV file.
 csv.unregister_dialect - deletes the dialect associated with the name from the dialect registry; if the name is not a registered dialect name, an error is raised.
 csv.QUOTE_ALL - instructs writer objects to quote all fields.
 csv.QUOTE_MINIMAL - instructs writer objects to quote only those fields which contain special characters such as the quotechar or delimiter.
 csv.QUOTE_NONNUMERIC - instructs writer objects to quote all non-numeric fields.
 csv.QUOTE_NONE - instructs writer objects never to quote fields.

READING A CSV FILE:

Python provides various functions to read CSV files; a few of them are described below.

 Using the csv.reader() function

In Python, the csv.reader() function is used to read a CSV file. It takes each row of the file and makes a list of all the columns.
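A minimal sketch of csv.reader() from the standard library, assuming a file like the TAB1.csv used in the exercise below:

import csv

with open('TAB1.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # the first row holds the column names
    print(header)
    for row in reader:  # each remaining row becomes a list of strings
        print(row)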
Sklearn MODULE:
Scikit-learn is an open-source Python package for implementing machine learning models in Python. The library supports modern algorithms like KNN, random forests, XGBoost, and SVC, and it is built on top of NumPy. Well-known software companies and Kaggle competitors frequently employ scikit-learn. It aids in various stages of model building, such as model selection, regression, classification, clustering, and dimensionality reduction (parameter selection).
Scikit-learn is simple to work with and delivers strong performance, though it does not enable parallel processing by default. Deep learning algorithms can be implemented in sklearn, though it is not a wise choice, especially if using TensorFlow is an available option.
SYNTAX FOR SKLEARN MODULE INSTALLATION:
>>>py -m pip install scikit-learn

EXCEL TABLE:
FILE NAME: TAB1.csv

      YearsExperience     Salary
 1          1.1           234444
 2          1.3           100000
 3          1.5          1250050
 4          2.0
 5          2.2            39891
 6          2.9            59234
 7                         45678
 8          3.2            34567
 9          8.2
10                         67890

(Blank cells are missing values; the program below fills them in by mean imputation.)

Source code:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data (not needed in this exercise)
from sklearn.model_selection import train_test_split  # used for splitting training and testing data (not needed in this exercise)
from sklearn.preprocessing import StandardScaler  # used for feature scaling (not needed in this exercise)

# a. Load the dataset
ds = pd.read_csv('TAB1.csv')
print(ds)

# b. Identify the variables: the independent variables (features) are all
# columns except the last; the dependent variable (target) is the last column
x = ds.iloc[:, :-1].values
y = ds.iloc[:, -1].values
print("independent variables:", x)

# c. Deal with missing data: replace each NaN with the mean of its column.
# fit_transform() performs the fit and transform operations on the input
# data in one step and converts the data points.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imp.fit_transform(x)
print("x values:", x)

print("dependent variable:", y)
y = y.reshape(-1, 1)  # SimpleImputer expects a 2-D array
y = imp.fit_transform(y)
print("y values:", y)
Output:
(The program prints the dataset loaded from TAB1.csv, followed by the feature matrix x and the target y, with every missing value replaced by the mean of its column.)
