DA Major Notes


Data Analytics

Unit - 1

Data Mining

● Simply stated, data mining refers to extracting or mining "knowledge" from large
amounts of data.
● Data mining is defined as the process of discovering patterns in data. The
process must be automatic or (more usually) semi-automatic. The patterns
discovered must be meaningful in that they lead to some advantage, usually an
economic advantage. The data is invariably present in substantial quantities.
● This information is used in applications such as fraud detection, market analysis,
scientific exploration, production control, etc.

Types of data that can be mined:


1. Flat Files - text or binary files, e.g. CSV files. Application: used in data
warehousing to store data, used in carrying data to and from servers, etc.
2. Relational Databases - tables. Application: data mining, ROLAP model, etc.
3. Data Warehouse - a collection of data integrated from multiple sources that
supports queries and decision making. Application: business decision making,
data mining, etc.
4. Transactional Databases - collections of data organized by time stamps, dates,
etc. to represent transactions in databases. Application: banking, distributed
systems, object databases, etc.
5. Multimedia Databases - consist of audio, video, images and text media.
6. Spatial Databases - store geographical information.
7. Time Series Databases - contain stock exchange data and user-logged
activities.
8. World Wide Web (WWW) - a collection of documents and resources like audio,
video, text, etc. which are identified by Uniform Resource Locators (URLs)
through web browsers.

Patterns in Data Mining

1. Associations find commonly co-occurring groupings of things, such as "ketchup
and jam" or "bread and butter" commonly purchased and observed together in a
shopping cart (i.e., market-basket analysis). Another type of association pattern
captures the sequences of things. These sequential relationships can discover
time-ordered events, such as predicting that an existing banking customer who
already has a checking account will open a savings account followed by an
investment account within a year.

2. Predictions tell the nature of future occurrences of certain events based on what
has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature on a particular day.

3. Clusters identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments based on their
demographics and past purchase behaviors.
Different types of Data Attributes

This is the first step of data preprocessing. We differentiate between different types of
attributes and then preprocess the data.

An attribute is an object's property or characteristic, for example a person's hair
color or air humidity. An attribute set defines an object. The object is also referred to
as a record, an instance, or an entity.

1. Qualitative (Nominal (N), Ordinal (O), Binary (B))

2. Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes
1. Nominal (Categorical) Attribute: provides only enough information to differentiate
one object from another, such as a student's roll number or the sex of a person.
There is no order or ranking among the values.
2. Ordinal Attribute: provides sufficient information to order the objects, but the
magnitude of the difference between values is not known. Examples: rankings,
grades.
3. Binary Attribute: takes only the values 0 and 1, where 0 indicates the absence of
a feature and 1 indicates its presence.
● Symmetric: both values are equally important, e.g. gender.
● Asymmetric: the two values are not equally important, e.g. medical test
results (positive/negative).

Quantitative Attributes

1. Numeric attribute: quantitative, i.e., the quantity can be measured and
represented as integer or real values. Numeric attributes are of two types:
● Interval-scaled attribute: measured on a scale of equal-sized units. The
values are ordered and can be compared, such as temperature in °C or °F,
but there is no true zero point.
● Ratio-scaled attribute: both differences and ratios are meaningful,
e.g. age, length, and weight.

2. Discrete: has a finite or countably infinite set of values; it can be numerical
or categorical.
E.g.: profession, zip codes (finite sets); customer IDs (countably infinite).

3. Continuous: has an infinite number of possible values. Continuous data is of float type.
E.g. height -> 5.4, 6.2, etc.; weight -> 48.9, 76.1, etc.

Discrete data key characteristics:

● You can count the data. It is usually counted in whole units.
● The values cannot be divided into smaller pieces that add additional meaning.
● You cannot measure the data. By nature, discrete data cannot be measured.
For example, you can measure your weight with the help of a scale, so your
weight is not discrete data.
● It has a limited number of possible values, e.g. days of the month.
● Discrete data is graphically displayed by a bar graph.
● Examples of discrete data: the number of students in a class, the number of
workers in a company, the number of parts damaged during transportation.

Continuous data key characteristics:

● In general, continuous variables are not counted.
● The values can be subdivided into smaller and smaller pieces and they still
have meaning.
● Continuous data is measurable.
● It has an infinite number of possible values within an interval.
● Continuous data is graphically displayed by histograms.
● Examples of continuous data: the amount of time required to complete a
project, the height of children.

Statistical Description of Data

Any situation can be analyzed in two ways in data mining:

● Statistical (Quantitative) Analysis: In statistics, data is collected, analyzed,
explored, and presented to identify patterns and trends. For example, 10 coffees sold
per day.

● Non-statistical (Qualitative) Analysis: This analysis provides generalized
information and includes sound, still images, and moving images. For example, coffees
are available in sizes tall, grande and venti.
Measures of central tendency

There are three main measures of central tendency: the mode, the median and the
mean. Each of these measures describes a different indication of the typical or central
value in the distribution.
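For a quick illustration, all three measures can be computed with Python's built-in
statistics module; a minimal sketch (the sales list is made up):

```python
import statistics

# Hypothetical data: number of coffees sold per day over two weeks
sales = [10, 12, 9, 10, 14, 10, 11, 13, 10, 9, 15, 10, 12, 11]

print("Mean:  ", statistics.mean(sales))    # arithmetic average
print("Median:", statistics.median(sales))  # middle value of the sorted data
print("Mode:  ", statistics.mode(sales))    # most frequently occurring value
```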
Skewed Distributions

● Positively (right) skewed: Mean > Median & Mode
● Negatively (left) skewed: Mean < Median & Mode
Measuring Dispersion of Data ( https://youtu.be/5_TuK1yCPD4 )

There are lots of techniques available to summarize and analyze the data. Mean is one
of the important statistics that are used to summarize the center of the data. It might be
possible that data is scattered, and the mean is not enough to express that. Thus, some
other measures are used which are termed measures of dispersion. These measures
allow us to measure the scatter in the data.
Dispersion measures the extent to which the items vary from some central value
(mean, median or mode). Dispersion can be computed around the mean, median or
mode, but since the mean is the most commonly used central value, dispersion is
usually measured around the mean.

NEED OF DISPERSION
● To check whether the central value we have taken is correct or not.
● To tell about the stability of the series (uniformity and non-uniformity).
● Dispersion is also known as scatter, spread or variation.

Note: for the median, the data should be sorted. There are 3 types of series:
● Individual (1, 2, 4, 2, 3)
● Discrete – individual series + frequency (an enhanced version of the individual series)
● Continuous series (inclusive and exclusive)
Best absolute measure: standard deviation.
Best relative measure: coefficient of standard deviation.
Relative measures are used more often in practice.

Range
Range = largest value − smallest value. For discrete and continuous series, just take
the largest and smallest values of x and ignore the frequency.

Quartile Deviation

The quartile deviation is defined mathematically as half of the difference between
the upper and lower quartile: QD = (Q3 − Q1) / 2, where Q3 denotes the upper
quartile and Q1 the lower quartile.
Quartile deviation is also known as the semi-interquartile range.
Variance

Variance is a simple measure of dispersion. Variance measures how far each number in
the dataset is from the mean. To compute the variance, first calculate the mean, then
the squared deviations from the mean.

Population variance: σ² = Σ(xᵢ − μ)² / N

Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)

Observations near the mean contribute smaller squared deviations; observations far
from the mean contribute larger ones.

Standard Deviation

Standard deviation is the square root of the variance, which brings the measure back
to the original units. A low standard deviation indicates that data points are close to
the mean.

The normal distribution is a conventional aid to understanding the standard deviation.

Standard deviation for the previous example with variance = 5 is √5 ≈ 2.24.
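A minimal sketch of these computations using Python's built-in statistics module
(the data list is hypothetical; note the population/sample distinction):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print("Mean:", statistics.mean(data))

# Population variance divides by N; sample variance divides by n - 1
print("Population variance:", statistics.pvariance(data))
print("Sample variance:   ", statistics.variance(data))

# Standard deviation is the square root of the variance
print("Population std dev:", statistics.pstdev(data))
print("Sample std dev:    ", statistics.stdev(data))
```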
Interquartile Range (IQR)

IQR is the range between Q1 (the boundary between the first and second quartile) and
Q3 (the boundary between the third and fourth quartile): IQR = Q3 − Q1. IQR is
preferred over the range because, unlike the range, IQR is not influenced by outliers.
IQR is used to measure variability by splitting a data set into four equal quartiles.

Formula to find outliers: values outside the interval

[Q1 – 1.5 * IQR, Q3 + 1.5 * IQR]

are treated as outliers.
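A minimal sketch of the 1.5 * IQR outlier fence, assuming NumPy is available (the
data array is made up):

```python
import numpy as np

data = np.array([3, 5, 6, 7, 8, 9, 10, 12, 13, 45])  # hypothetical; 45 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside [lower_fence, upper_fence] is flagged as an outlier
outliers = data[(data < lower_fence) | (data > upper_fence)]
print("IQR:", iqr, "Fences:", (lower_fence, upper_fence))
print("Outliers:", outliers)
```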
Graphic displays of basic statistical descriptions of data

Lecture Link: https://youtu.be/yOm1PWaGGoo


Lecture Link: https://youtu.be/W6R5llNcsg8

Data Visualization Techniques


Lecture Link : https://youtu.be/f79bJTZSAqc

1. Pixel oriented Visualization Technique

2. Geometric Projection Visualization Technique

3. Icon Based Visualization Technique


4. Hierarchical Visualization Technique

Visualizing Complex data and Relations

Lecture Link : https://youtu.be/6PEVIud6HFs


Data Similarity and Dissimilarity
Lecture Link : https://youtu.be/QTOBwmwLNkM

Proximity measures of Nominal Attributes : https://youtu.be/TgCU5JGLjdM

Proximity measures of Binary Attributes : https://youtu.be/_HDGP-MqHyU


Proximity Measures Of Numerical Attributes : https://youtu.be/EDvRw1zA8g4

Minkowski Distance: d(i, j) = (Σₖ |xᵢₖ − xⱼₖ|^p)^(1/p). With p = 1 this gives the
Manhattan distance; with p = 2, the Euclidean distance.
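A self-contained sketch of the Minkowski distance in plain Python (the vectors are
made up):

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [1, 2, 3], [4, 6, 8]
print(minkowski(x, y, 1))  # p = 1: Manhattan distance -> 3 + 4 + 5 = 12
print(minkowski(x, y, 2))  # p = 2: Euclidean distance -> sqrt(9 + 16 + 25) ≈ 7.07
```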

Proximity Measures Of Ordinal Attributes : https://youtu.be/wVM4SC5IQPs

Proximity Measures for Mixed Attribute : https://youtu.be/aVUMDYPXLxM


Cosine Similarity: https://youtu.be/L_MMg3xys8o

Unit - 2
DATA PREPROCESSING

MAJOR TASKS OF DATA PREPROCESSING:


1. DATA CLEANING
2. DATA INTEGRATION
3. DATA REDUCTION
4. DATA TRANSFORMATION
5. DATA DISCRETIZATION

DATA CLEANING LINK:

https://www.youtube.com/watch?v=QRZlYzxEFDg&list=PLV8vIYTIdSnb4H0JvSTt3PyCNFGGlO78u&index=46
Noisy data is data with a large amount of additional meaningless information, called
noise. It is data that is corrupted, distorted, or has a low signal-to-noise ratio. It
unnecessarily increases the amount of storage space required and can adversely affect
the results of any data mining analysis.

WAYS TO REMOVE NOISE (BINNING, REGRESSION, CLUSTERING) LINK:


https://www.youtube.com/watch?v=EC_IeIBlGto
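A minimal sketch of one of these noise-removal methods, smoothing by
equal-frequency binning with bin means, in plain Python (the sorted data list is
made up):

```python
# Sorted hypothetical data, partitioned into equal-frequency (equi-depth) bins
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bin_size = 4

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Smoothing by bin means: every value in the bin is replaced by the bin mean
    smoothed.extend([round(mean, 2)] * len(bin_values))

print(smoothed)
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```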

Data cleaning as a process:


● Step 1: Remove irrelevant data: also consider removing things like hashtags,
URLs, emojis, HTML tags, etc., unless they are necessarily a part of your analysis.

● Step 2: Deduplicate your data.

● Step 3: Fix structural errors: structural errors include things like misspellings,
incorrect word use, etc. For example, if you're running an analysis on different
data sets – one with a 'women' column and another with a 'female' column – you
would have to standardize the column title.

● Step 4: Deal with missing data (e.g., ignore the tuple, fill in the value manually,
or fill in with the attribute mean/median or a global constant).

● Step 5: Filter out data outliers: outliers are data points that fall far outside of
the norm and may skew your analysis too far in a certain direction.

● Step 6: Validate your data: questions that must be answered include: Do you
have enough data for your needs? Is it uniformly formatted in a design that your
analysis tools can work with?

A minimal pandas sketch of some of these steps is shown after this list.
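The sketch below covers steps 2, 4 and 5 (column names and values are hypothetical;
the outlier filter reuses the 1.5 * IQR rule from Unit 1):

```python
import pandas as pd

# Hypothetical dataset with a duplicate row, a missing value and an outlier
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "amount":   [120.0, 80.0, 80.0, None, 9000.0],
})

# Step 2: deduplicate
df = df.drop_duplicates()

# Step 4: fill missing values with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Step 5: filter outliers with the 1.5 * IQR rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)
```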

DATA INTEGRATION LINK

https://www.youtube.com/watch?v=UKUq7hZdZUw&list=PLV8vIYTIdSnb4H0JvSTt3PyCNFGGlO78u&index=47
1. ENTITY IDENTIFICATION PROBLEM :

The entity identification problem occurs during data integration. Schema matching and
object matching are important issues here. When data from multiple sources is
integrated, some attributes from different sources refer to the same real-world entity
and become redundant if they are integrated naively. For example: A.cust-id =
B.cust-number, where A and B are two different database tables, cust-id is an
attribute of table A and cust-number is an attribute of table B. There is no declared
relationship between these tables, but the cust-id and cust-number attributes take the
same values. This is an example of the entity identification problem.

2. REDUNDANCY AND CORRELATION ANALYSIS


3. TUPLE DUPLICATION

4. DATA VALUE CONFLICT DETECTION AND RESOLUTION


DATA REDUCTION:

DATA REDUCTION STRATEGIES


1. Dimension reduction: remove redundant attributes.
Two ways to do so (a code sketch follows this list):

SWFS (Step-Wise Forward Selection): start with an empty attribute set and repeatedly
add the most significant attribute.

SWBS (Step-Wise Backward Elimination): start with the full attribute set and
repeatedly delete the least significant attribute.
2. Data Compression: storing information in a compact form by applying data
encodings or transformations.
Compression can be either lossy (compressed data cannot be decompressed back to
the original form) or lossless (compressed data can be decompressed back to the
original form).

3. Numerosity Reduction: replacing the original data with a smaller form of data
representation.

Ways:
1. Parametric: only the parameters of a data model and the outliers are stored
instead of the actual data, e.g. regression and log-linear models.
2. Non-parametric: data is stored in a reduced form such as histograms,
clustering, or sampling.

4. Discretization and concept hierarchy


(https://www.youtube.com/watch?v=KFE6vLE1Xtg)
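A minimal sketch of step-wise forward selection as a greedy loop, assuming
scikit-learn is available (the dataset, model, and choice of 3 attributes are
illustrative, not part of the notes):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):  # pick the 3 most significant attributes
    # Score each candidate attribute when added to the current selection
    scores = {
        f: cross_val_score(LinearRegression(), X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)  # most significant attribute this round
    selected.append(best)
    remaining.remove(best)

print("Selected attribute indices:", selected)
```

Backward elimination works the same way in reverse: start from all attributes and
repeatedly drop the one whose removal hurts the score least.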

DATA TRANSFORMATION:

NUMERICAL EXAMPLE ON NORMALIZATION:
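The original worked example is not reproduced here; as a stand-in, a minimal sketch
of two common normalization formulas, min-max and z-score, on made-up values:

```python
data = [200, 300, 400, 600, 1000]  # hypothetical attribute values

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
lo, hi = min(data), max(data)
min_max = [(v - lo) / (hi - lo) for v in data]
print(min_max)  # [0.0, 0.125, 0.25, 0.5, 1.0]

# Z-score normalization: v' = (v - mean) / std
mean = sum(data) / len(data)
std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5
z_scores = [round((v - mean) / std, 3) for v in data]
print(z_scores)
```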
Decision tree: https://www.youtube.com/watch?v=oeKBs41MkNo

Topics not covered


1. Regression and log-linear models: parametric data reduction, sampling data
2. Discretization by clustering
3. Correlation analysis
4. Concept hierarchy generation for nominal data
Unit - 3

A Data Warehouse is a
● Subject-oriented - subject areas might be customers, products, orders, etc.
● Integrated - data obtained from several separate sources is standardized
● Time-variant - all data in the data warehouse is associated with a time stamp
● Non-volatile - no updates, only periodic refreshment with a new snapshot

collection of data for supporting management's decision-making needs.

Need of Data Warehouse

1. Information Crisis - available data is not readily usable for strategic
decision making.
– The types of information needed to make decisions in the formulation and
execution of business strategies and objectives are broad-based and encompass the
entire organization. We may combine all these types of essential information into
one group and call it strategic information.
( Characteristics of strategic information )

2. Inability to Provide Information – IT receives too many ad hoc requests,
resulting in a large overhead.
– Requests keep on changing all the time.
– The users need more reports to understand the earlier reports.

3. Failure of decision support systems – strategic information was gathered
(collected) from the operational systems, which are not designed or intended
to provide strategic information. ( Evolution of the Info Sys Environment )

Types of Systems in an organization

● Operational systems
● Informational systems

Operational Data + External Data == Informational Data

Heterogeneous Database Integration


1. Query Driven (on-demand) - when a query arrives, data is gathered from the
sources and the query is resolved on demand. (Inefficient and expensive for
aggregation.)

2. Update Driven - information from multiple, heterogeneous sources is
integrated in advance and stored in a warehouse for direct querying and
analysis. (Data warehousing uses the update-driven approach.)

Two perspectives of Data Warehousing

1. The Organizational Perspective


2. The Technological Perspective

********************** The Organizational Perspective **********************

IS Penetration into Organizations


● Operations - Transaction Processing Systems (OLTP), front office, etc.
● Management - data mining, statistical analysis, dimensional modeling and
analysis (OLAP)

Ques: What is data warehousing?
Data Warehousing :- Data warehousing is the process of constructing and using a
data warehouse.
– Not just a storehouse of data but a process, an architecture, an environment and
infrastructure.

Data Warehouse Modeling


● Data Warehouse == Dimensional Modeling
● Dimension eg:- Time, vendor, geography, product etc.

Data Warehouse models


1. Enterprise Data warehouse – An EDW is a data warehouse that
encompasses and stores all of an organization’s data from sources across the
entire business. A smaller data warehouse may be specific to a business
department or line of business (like a data mart). In contrast, an EDW is
intended to be a single repository for all of an organization’s data.

2. Virtual Data Warehouse – A virtual data warehouse is a set of separate
databases, which can be queried together, so a user can effectively access all
the data as if it was stored in one data warehouse.

3. Data Mart – is used for business-line specific reporting and analysis. In
this data warehouse model, data is aggregated from a range of source
systems relevant to a specific business area, such as sales or finance.

Data Cube or OLAP - https://www.geeksforgeeks.org/data-cube-or-olap-approach-in-data-mining/

Slice — one dimension fixed
Dice — two or more dimensions fixed
Pivot — rotation / viewpoint according to a different dimension
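A minimal sketch of slice, dice and pivot on a toy fact table using pandas (the
dimensions, measure and values are hypothetical):

```python
import pandas as pd

# Toy fact table with three dimensions (quarter, location, item) and one measure (sales)
df = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "location": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi", "Delhi"],
    "item":     ["phone", "phone", "laptop", "laptop", "laptop", "phone"],
    "sales":    [100, 150, 200, 250, 120, 90],
})

# Slice: fix one dimension (quarter = Q1)
print(df[df["quarter"] == "Q1"])

# Dice: fix two or more dimensions (quarter = Q1 and location = Delhi)
print(df[(df["quarter"] == "Q1") & (df["location"] == "Delhi")])

# Pivot: view total sales by location vs quarter
print(pd.pivot_table(df, values="sales", index="location",
                     columns="quarter", aggfunc="sum"))
```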

Uses of Data Warehouse


– Query & Reporting
– Statistical Analysis
– [ Dimensional ] Analytical Processing (OLAP, ...)
– Data Mining
– Decision support - trend analysis, ratios and rankings

** OLAP (Online Analytical Processing)

An OLAP system is designed to process large amounts of data quickly, allowing
users to analyze multiple data dimensions in tandem. Teams can use this data for
decision-making and problem-solving.
– Any data warehouse system is an OLAP system (e.g. Netflix recommendations).
– It is subject-oriented. Used for data mining, analytics, decision making, etc.
– It provides a multi-dimensional view of different business tasks.
– A large amount of data is stored, typically TBs to PBs.
– Relatively slow, as the amount of data involved is large. Queries may take hours.

** OLTP (Online transaction processing)

OLTP systems are designed to handle large volumes of transactional data
involving multiple users. OLTP administers the day-to-day transactions of an
organization.
– An ATM center is an OLTP application.
– It is application-oriented. Used for business tasks.
– It reveals a snapshot of present business tasks.
– The size of the data is relatively small, as historical data is archived, e.g.
MBs to GBs.
– Very fast, as the queries typically operate on a small fraction (around 5%) of
the data.
********************** The Technological Perspective **********************

IT Support To Organizations

Types of Data
1. Internal Data
2. External Data
3. Metadata - Data about data

Used as a

● Directory to locate the contents of the warehouse
● Guide to data mapping from the operational system
● Guide to summarisation algorithms used

Metadata is information about

● Structure of data
● Data extraction/transformation history
● Data Usage statistics
● Data summarisation/modeling algorithms

Types of MetaData

1. Business Metadata - It has the data ownership information, business
definitions, and changing policies.

2. Technical Metadata - It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes
and indices.

3. Operational Metadata - It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of the data's migrations and the
transformations applied to it.

Role of Metadata
Data Mart

A data mart is a subset of the information content of a warehouse designed for a
specific purpose, stored in its own separate storage area.
– created along functional/departmental lines
– queries need data locally available in the data mart
– improves query performance by reducing data size

Populating data marts is very expensive; therefore keep the number of marts small.
The reasons to create a data mart −

1. To partition data in order to impose access control strategies.
2. To speed up the queries by reducing the volume of data to be scanned.
3. To segment data into different hardware platforms.
4. To structure data in a form suitable for a user access tool.

Types of Data Mart:-

1. Dependent data mart - data comes from a central data warehouse that already
exists.
2. Independent data mart - data can come from operational databases, external
sources, or both.

To make data marting cost-effective −

● Identify the Functional Splits
● Identify User Access Tool Requirements
● Identify Access Control Issues
Top-Down Versus Bottom-Up Approach

Top-Down Approach
Advantages :
● An enterprise view of data
● Inherently architected—not a union of disparate data marts
● Single, central storage of data about the content
● Centralized rules and control
● May see quick results if implemented with iterations
Disadvantages are:
● Takes longer to build even with an iterative method
● High exposure/risk to failure
● Needs high level of cross-functional skills

Bottom-Up Approach

Advantages :
● Faster and easier implementation of manageable pieces
● Early return on investment
● Less risk of failure
● Inherently incremental; can schedule important data marts first
● Allows project team to learn and grow

Disadvantages :
● Each data mart has its own narrow view of data
● Redundant data in every data mart
● Perpetuates inconsistent and irreconcilable data
***************** IMPORTANT ********************

** ETL Process
** Operational Data Store vs Data Warehouse

Structure of Data Warehouse

** Fact and Dimensions (Dimension modeling)


** Data Warehouse Schemas
