DA Major Notes
Unit - 1
Data Mining
● Simply stated, data mining refers to extracting or mining "knowledge" from large amounts of data.
● Data mining is defined as the process of discovering patterns in data. The
process must be automatic or (more usually) semi-automatic. The patterns
discovered must be meaningful in that they lead to some advantage, usually an
economic advantage. The data is invariably present in substantial quantities.
● This information is used in applications such as fraud detection, market analysis, science exploration, production control, etc.
2. Predictions tell the nature of future occurrences of certain events based on what
has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature on a particular day.
Qualitative Attributes
1. Nominal (Categorical) Attribute: provides only enough information to distinguish one object from another, e.g. student roll number or sex of a person. There is no order or ranking among the values.
2. Ordinal Attribute: provides enough information to order the objects, but the magnitude of the difference between values is not known, e.g. rankings, grades, height categories (short/medium/tall).
3. Binary Attribute: takes only the values 0 and 1, where 0 indicates the absence of a feature and 1 indicates its presence.
● Symmetric : both values are equally important. E.g. gender.
● Asymmetric : the two values are not equally important. E.g. results (a pass and a fail are not equally informative).
Quantitative Attributes
1. Continuous: takes real-number values, e.g. height, weight, temperature.
2. Discrete: takes a finite or countably infinite set of values; it can be numerical and can also be in categorical form.
E.g.: number of professions, zip codes (both are countable sets of values)
There are three main measures of central tendency: the mode, the median and the
mean. Each of these measures describes a different indication of the typical or central
value in the distribution.
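A minimal Python sketch of the three measures, using the standard library's statistics module (the marks data is made up for illustration):

```python
from statistics import mean, median, mode

# Made-up marks of 7 students
marks = [45, 50, 50, 60, 65, 70, 90]

print(mean(marks))    # arithmetic mean: sum / count -> approx 61.43
print(median(marks))  # middle value of the sorted data -> 60
print(mode(marks))    # most frequent value -> 50
```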
Skewed Distributions
● Positively (right) skewed: Mean > Median & Mode
● Negatively (left) skewed: Mean < Median & Mode
Measuring Dispersion of Data ( https://youtu.be/5_TuK1yCPD4 )
There are many techniques available to summarize and analyze data. The mean is one of the important statistics used to summarize the center of the data. But the data may be scattered, and the mean alone is not enough to express that. Thus, other measures are used, termed measures of dispersion, which allow us to measure the scatter in the data.
Dispersion measures the extent to which the items vary from some central value (mean, median, or mode). Dispersion can be computed around the mean, median, or mode, but since the mean is the most commonly used central value, dispersion is usually checked around the mean.
NEED FOR DISPERSION
● To check whether the central value we have taken is representative of the data.
● To tell about the stability of a series (uniformity or non-uniformity).
Dispersion is also known as scatter, spread, or variation.
Note: for the median, the data should be sorted. There are 3 types of series:
● Individual series (e.g. 1, 2, 4, 2, 3)
● Discrete series: individual series + frequency (an enhanced version of the individual series)
● Continuous series (inclusive and exclusive class intervals)
Best absolute measure : standard deviation
Best relative measure : coefficient of standard deviation
Relative measures are used more often in practice.
Range
Range = Largest value - Smallest value. For discrete and continuous series, just take the largest and smallest values of x and ignore the frequencies.
Quartile Deviation
The quartile deviation can be defined mathematically as half of the difference between the upper and lower quartiles: QD = (Q3 - Q1) / 2, where Q3 denotes the upper quartile and Q1 the lower quartile. Quartile deviation is also known as the semi-interquartile range.
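A short NumPy sketch computing the range and the quartile deviation (the data is made up; note that NumPy's default percentile interpolation can differ slightly from hand-computed textbook quartiles):

```python
import numpy as np

data = np.array([2, 4, 5, 7, 8, 10, 12, 15])  # made-up individual series

# Range: largest minus smallest value (frequencies are ignored)
data_range = data.max() - data.min()

# Q1 and Q3 via percentiles; QD is half the interquartile range
q1, q3 = np.percentile(data, [25, 75])
qd = (q3 - q1) / 2

print(data_range, q1, q3, qd)
```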
Variance
Variance is a simple measure of dispersion. It measures how far each number in the dataset is from the mean. To compute the variance, first calculate the mean, then the squared deviations from the mean.
Population variance: σ² = Σ(xᵢ - μ)² / N
Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1)
Observations near the mean give a lower contribution, and observations far from the mean give a higher one.
Standard Deviation
Standard deviation is the square root of the variance, which brings the measure back to the original units of the data. A low standard deviation indicates that data points are close to the mean.
The normal distribution conventionally helps in understanding the standard deviation (for example, about 68% of values lie within one standard deviation of the mean).
For the previous example with variance = 5, the standard deviation is √5 ≈ 2.24.
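A minimal NumPy sketch contrasting the population and sample variance (the data is made up; the ddof argument switches the divisor between N and n - 1):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data, mean = 5

pop_var = np.var(data)           # ddof=0: divide by N   -> 4.0
samp_var = np.var(data, ddof=1)  # ddof=1: divide by n-1 -> approx 4.57

pop_std = np.sqrt(pop_var)       # standard deviation = sqrt(variance) -> 2.0
print(pop_var, samp_var, pop_std)
```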
Interquartile Range (IQR)
IQR is the range between Q1 (the boundary between the first and second quartile) and Q3 (the boundary between the third and fourth quartile): IQR = Q3 - Q1. The IQR is preferred over the range because, unlike the range, the IQR is not influenced by outliers. IQR measures variability by splitting a data set into four equal parts (quartiles).
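A small sketch of this robustness: appending one extreme value inflates the range but barely moves the IQR (made-up data):

```python
import numpy as np

def iqr(values):
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [95]  # one extreme value appended

# Range jumps from 8 to 85; IQR only moves from 3 to 3.75
print(max(clean) - min(clean), iqr(clean))
print(max(with_outlier) - min(with_outlier), iqr(with_outlier))
```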
Minkowski Distance
The Minkowski distance of order p between two points x = (x₁, ..., xₙ) and y = (y₁, ..., yₙ) is d(x, y) = (Σᵢ |xᵢ - yᵢ|^p)^(1/p). For p = 1 it reduces to the Manhattan distance, and for p = 2 to the Euclidean distance.
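A minimal Python sketch of the formula (the two vectors are made-up illustration data):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p)

a, b = [1, 2, 3], [4, 6, 8]
print(minkowski(a, b, 1))  # p=1, Manhattan: 3 + 4 + 5 = 12
print(minkowski(a, b, 2))  # p=2, Euclidean: sqrt(9 + 16 + 25) ≈ 7.07
```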
Unit - 2
DATA PREPROCESSING
● Remove irrelevant data such as URLs, emojis, HTML tags, etc., unless they are necessarily a part of your analysis.
● Step 3: Fix structural errors : Structural errors include things like misspellings, incorrect word use, etc. For example, if you're running an analysis on different data sets – one with a 'women' column and another with a 'female' column – you would have to standardize the title.
● Step 5: Filter out data outliers : Outliers are data points that fall far outside of the norm and may skew your analysis too far in a certain direction (see the sketch after this list).
● Step 6: Validate your data : Questions such as these must be answered: Do you have enough data for your needs? Is it uniformly formatted in a design that your analysis tools can work with?
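A hedged pandas sketch of steps 3 and 5, assuming a made-up table with a 'gender' label column and an 'income' column; the 1.5 × IQR fence is one common outlier-filtering rule:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["women", "female", "male", "Female"],  # inconsistent labels
    "income": [35000, 42000, 39000, 900000],          # one extreme value
})

# Step 3: fix structural errors - standardize label variants to one form
df["gender"] = df["gender"].str.lower().replace({"women": "female"})

# Step 5: filter out rows whose income lies outside the 1.5*IQR fences
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```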
The Entity Identification Problem occurs during data integration. Schema matching and object matching are important issues here. During the integration of data from multiple sources, some data sources match each other, and they become redundant if they are integrated blindly. For example: A.cust-id = B.cust-number, where A and B are two different database tables, cust-id is an attribute of table A, and cust-number is an attribute of table B. The attributes belong to different tables with no explicit relationship between them, yet cust-id and cust-number take the same values, i.e. they identify the same entity. This is an example of the entity identification problem in a relation.
DATA TRANSFORMATION:
NUMERICAL ON NORMALIZATION:
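A worked sketch of two standard normalization techniques, min-max and z-score (the income values are made up for illustration, not the numerical from class):

```python
import numpy as np

income = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # made-up values

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: v' = (v - mean) / std
z_score = (income - income.mean()) / income.std()

print(min_max)  # 200 -> 0.0, 1000 -> 1.0
print(z_score)
```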
Decision tree: https://www.youtube.com/watch?v=oeKBs41MkNo
A Data Warehouse is
● Subject-oriented - subject areas might be customers, products, orders, etc.
● Integrated - data obtained from several separate sources is standardized
● Time-variant - all data in the data warehouse is associated with a time stamp
● Non-volatile - no updates, only periodic refreshment with a new snapshot
IT Support To Organizations
● Operational systems
● Informational systems
Types of Data
1. Internal Data
2. External Data
3. Metadata - Data about data
Metadata is used to describe:
● Structure of data
● Data extraction/transformation history
● Data Usage statistics
● Data summarisation/modeling algorithms
Types of Metadata
Role of Metadata
Data Mart
Populating data marts is very expensive; therefore, keep the number of marts small.
Types of data marts:
1. Dependent data mart - data comes from a central data warehouse that already exists.
2. Independent data mart - data can come from an operational database, an external source, or both.
Top-Down Approach
Advantages :
● An enterprise view of data
● Inherently architected—not a union of disparate data marts
● Single, central storage of data about the content
● Centralized rules and control
● May see quick results if implemented with iterations
Disadvantages :
● Takes longer to build even with an iterative method
● High exposure/risk to failure
● Needs high level of cross-functional skills
Bottom-Up Approach
Advantages :
● Faster and easier implementation of manageable pieces
● Early return on investment
● Less risk of failure
● Inherently incremental; can schedule important data marts first
● Allows project team to learn and grow
Disadvantages :
● Each data mart has its own narrow view of data
● Redundant data in every data mart
● Perpetuates inconsistent and irreconcilable data
***************** IMPORTANT ********************
** ETL Process
** Operational Data Store vs Data Warehouse