Data Analytics Unit I
Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc., Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing.
Design Data Architecture:
Data architecture is typically designed at three levels: the conceptual, logical and physical models.
Conceptual model:
It is a business-level model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
Logical model:
It is a model in which the data is represented in logical structures such as rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model:
The physical model holds the database design details, such as which type of database technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few of the influences that can have an effect on data architecture are business requirements, business policies, the technology in use, business economics, and data processing needs.
➢ Business requirements
➢ Business policies
➢ Technology in use
➢ Business economics
➢ Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, and making use of raw data by converting it into image files and records and then storing it in data warehouses. Data warehouses are the main means of storing business transactions.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes following the example of previously completed data architecture designs, and also making use of existing licensed software purchases and database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design of the architecture.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
➢ Data management plays a significant role in an organization's ability to generate revenue and control costs.
➢ Data management helps organizations to mitigate risks.
➢ It enables decision making in organizations.
What are the benefits of good data management?
➢ Optimum data quality
➢ Improved user confidence
➢ Efficient and timely access to data
➢ Improves decision making in an organization
Managing data Resources:
➢ An information system provides users with timely, accurate, and relevant
information.
➢ The information is stored in computer files. When files are properly arranged and maintained, users can easily access and retrieve the information when they need it.
➢ If the files are not properly managed, they can lead to chaos in information
processing.
➢ Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Areas of Data Management:
Data Modeling: This is first creating a structure for the data that you collect and use, and then organizing this data in a way that is easily accessible and efficient for storing and pulling the data for reports and analysis.
Data warehousing: This is storing data effectively so that it can be accessed and used efficiently in the future.
Data Movement: This is the ability to move data from one place to another. For instance, data needs to be moved from where it is collected to a database and then to an end user.
Understand various sources of the Data:
Data are a special type of information, generally obtained through observations, surveys or inquiries, or generated as a result of human activity. Methods of data collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect and, as a result, there are different methods of collecting data which, when used on one particular set, will result in different kinds of data.
Collection of data refers to the purposive gathering of information relevant to the subject-matter of the study from the units under investigation. The method of collection of data mainly depends upon the nature, purpose and scope of the inquiry on the one hand, and on the availability of resources and time on the other.
Data can be generated from two types of sources, namely:
1. Primary sources of data
2. Secondary sources of data
1. Primary sources of data:
Primary data refers to first-hand data gathered by the researcher himself. Sources of primary data are surveys, observations and experimental methods.
Survey: The survey method is one of the primary sources of data, used to collect quantitative information about items in a population. Surveys are used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are contacted by the research person personally, telephonically or through mail. This method takes a lot of time, effort and money, but the data collected are of high accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is another primary source of data. It is a technique for obtaining information that involves measuring variables or gathering the data necessary for measuring the variable under investigation.
Observation is defined as the accurate watching and noting of phenomena as they occur in nature with regard to cause-and-effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used in carrying out an experiment. However, market researchers have used four experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of
receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.
Example:
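As a rough illustration (the treatment labels and plot names below are hypothetical, not taken from any particular study), a completely randomized assignment of treatments to homogeneous units can be sketched in Python:

```python
import random

# Hypothetical setup: 3 treatments, each replicated 4 times on
# 12 homogeneous experimental units (plots).
treatments = ["A", "B", "C"]
units = [f"plot_{i}" for i in range(1, 13)]

# Completely randomized design: shuffle the replicated treatment list
# so every unit has the same chance of receiving any treatment.
assignment = treatments * 4
random.shuffle(assignment)

for unit, treatment in zip(units, assignment):
    print(unit, "->", treatment)
```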
RBD - The term Randomized Block Design has originated from agricultural
research. In this design several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as a number of
treatments so that one plot from each is selected at random for each treatment.
The production of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance technique, so as to know the effect of various treatments like different doses of fertilizers, different types of irrigation, etc.
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme each letter from A to D occurs only once in each row and only once in each column. It may be noted that this balanced arrangement will not get disturbed if any row is interchanged with another.
The balanced arrangement achieved in a Latin square is its main strength. In this design, the comparisons among treatments will be free from both differences between rows and differences between columns. Thus the magnitude of error will be smaller than in any other design.
FD - Factorial Designs - This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyzes the impact of each of the variables. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias. An experiment which involves multiple independent variables is known as a factorial design.
A factor is a major independent variable. For example, consider an experiment with two factors: time in instruction and setting. A level is a subdivision of a factor; in this example, time in instruction has two levels and setting has two levels.
2. Secondary sources of data:
Secondary data are data that already exist, having been collected for some other purpose; they may be obtained from sources internal or external to the organization.
➢ Internal Sources:
If available, internal secondary data may be obtained with less time, effort and money than external secondary data. In addition, they may also be more pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources - These provide a great deal of information which can be used by the marketing researcher. They give information about internal factors.
Sales Force Reports - These give information about the sale of a product. The information provided comes from outside the organization.
Internal Experts - These are the people who head the various departments. They can give an idea of how a particular thing is working.
Miscellaneous Reports - These are the pieces of information you get from operational reports. If the data available within the organization are unsuitable or inadequate, the marketer should extend the search to external secondary data sources.
Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
• Minimum
• First quartile (Q1)
• Median
• Third quartile (Q3)
• Maximum
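A minimal sketch of computing this five-number summary with NumPy, together with the 1.5 x IQR fences that a boxplot commonly uses to flag outliers (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample data
data = np.array([7, 9, 10, 11, 12, 13, 14, 15, 18, 40])

minimum = data.min()
q1 = np.percentile(data, 25)      # first quartile (Q1)
median = np.median(data)
q3 = np.percentile(data, 75)      # third quartile (Q3)
maximum = data.max()
print(minimum, q1, median, q3, maximum)

# Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("Outliers:", outliers)
```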
Most common causes of outliers in a data set:
➢ Data entry errors (human errors)
➢ Measurement errors (instrument errors)
➢ Experimental errors (data extraction or experiment
planning/executing errors)
➢ Intentional (dummy outliers made to test detection methods)
➢ Data processing errors (data manipulation or data set unintended
mutations)
➢ Sampling errors (extracting or mixing data from wrong or various
sources)
➢ Natural (not an error, novelties in data)
How to remove outliers?
Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well because of its binning of variables. We can also use the process of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We can use the mean, median or mode imputation methods. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
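A short pandas sketch of the treatments described above, assuming a hypothetical numeric column named "value": trimming, capping at the IQR fences (one simple transformation), and median imputation:

```python
import pandas as pd

# Hypothetical data frame with one numeric column containing outliers
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95, 11, 10, -40, 12]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["value"] < low) | (df["value"] > high)

# 1. Deleting observations (trimming)
trimmed = df[~is_outlier]

# 2. Capping: clip extreme values to the IQR fences
capped = df["value"].clip(lower=low, upper=high)

# 3. Imputing: replace outliers with the median of the non-outlier values
median_value = df.loc[~is_outlier, "value"].median()
imputed = df["value"].where(~is_outlier, median_value)

print(trimmed, capped, imputed, sep="\n")
```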
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
Why does my data have missing values?
We looked at the importance of treatment of missing values in a dataset. Now,
let’s identify the reasons for occurrence of these missing values. They may occur
at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
➢ Missing completely at random: This is a case when the probability of a value being missing is the same for all observations. For example: respondents of a data collection process decide that they will declare their earnings after tossing a fair coin. If a head occurs, the respondent declares his / her earnings and vice versa. Here each observation has an equal chance of a missing value.
➢ Missing at random: This is a case when a variable is missing at random and the missing ratio varies for different values / levels of other input variables. For example: we are collecting data for age, and females have a higher missing-value rate compared to males.
➢ Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropping out of the study. This missing value is not at random unless we have included “discomfort” as an input variable for all patients.
➢ Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower income are likely to give a non-response about their earnings.
What are the methods to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion (a brief sketch follows this list).
➢ In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
➢ In pair-wise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One of its disadvantages is that it uses different sample sizes for different variables.
➢ Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.
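A brief sketch of list-wise deletion and a pair-wise style analysis in pandas, on made-up columns (the column names are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "income": [50, np.nan, 60, 65, np.nan],
    "score":  [3.2, 4.1, 3.8, np.nan, 4.5],
})

# List-wise deletion: drop any row with at least one missing value
listwise = df.dropna()

# Pair-wise style analysis: each statistic uses all cases where the
# variables involved are present, so sample sizes differ by variable.
pairwise_corr = df.corr()   # pandas excludes NaN pair-wise by default

print(listwise)
print(pairwise_corr)
```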
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by
the mean or median (quantitative attribute) or mode (qualitative attribute) of all
known values of that variable.
It can be of two types:
➢ Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. For example, if the variable “Manpower” has missing values, we take the average of all non-missing values of “Manpower” (28.33 in the original example) and then replace the missing values with it.
➢ Similar case Imputation: In this case, we calculate the average of the non-missing values for each gender individually, say “Male” (29.75) and “Female” (25), and then replace the missing values based on gender. For “Male”, we replace missing values of Manpower with 29.75 and for “Female” with 25.
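A short sketch of the two imputation styles in pandas, using made-up "Gender" and "Manpower" columns that mirror the example above (the actual fill values depend on the data, not on the figures quoted above):

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the Manpower example
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Manpower": [30, 29, 25, np.nan, np.nan, 30.5],
})

# Generalized imputation: fill with the overall mean of non-missing values
generalized = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill with the mean of the respondent's own group
similar_case = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)

print(generalized)
print(similar_case)
```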
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments (bins) of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task (see the sketch after the list below).
➢ Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
➢ Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
➢ Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
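A minimal NumPy sketch of smoothing by bin means and by bin boundaries on sorted data split into equal-size bins (the values and the bin size are made up):

```python
import numpy as np

# Hypothetical sorted data, split into equal-size bins of 3
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26]))
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3).reshape(bins.shape)

# Smoothing by bin boundaries: every value becomes the closer of
# the bin's minimum or maximum
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(np.abs(bins - lo) <= np.abs(bins - hi), lo, hi)

print(by_means)
print(by_boundaries)
```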
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
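A small sketch of smoothing one variable by fitting a simple linear regression with NumPy (the x and y values are made up):

```python
import numpy as np

# Hypothetical noisy observations of y against x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.4, 13.9, 16.2])

# Fit y = a*x + b (linear regression with one independent variable)
# and replace each y with its fitted value, smoothing out the noise.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b

print(y_smoothed)
```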
Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they fall outside the clusters.
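A brief sketch of this idea using scikit-learn's KMeans (the choice of KMeans and the sample values are assumptions of this sketch, not prescribed by the text); points lying far from their cluster centre are outlier candidates:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical one-dimensional data with one value far from the rest
X = np.array([[1.0], [1.2], [0.9], [1.1], [1.3], [9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point from its own cluster centre
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

print(kmeans.labels_)
print(distances)   # unusually large distances flag potential outliers
```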
Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of one another. This may happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates. In most cases, the duplicates are removed so as not to give that particular data object an advantage or bias when running machine learning algorithms.
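A one-step sketch of deduplication with pandas, using hypothetical form-submission records:

```python
import pandas as pd

# Hypothetical form submissions, where one person submitted twice
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
})

# Keep only the first occurrence of each duplicated record
deduplicated = df.drop_duplicates(subset=["name", "email"], keep="first")
print(deduplicated)
```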
Redundant data occurs when we merge data from multiple databases. If the redundant data is not removed, incorrect results will be obtained during data analysis. Redundancy typically arises during the data integration process.
There are two major approaches for data integration – one is the “tight coupling” approach and the other is the “loose coupling” approach.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
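Returning to the tight-coupling approach, a toy sketch of the ETL idea in Python (the source records, field names and the list standing in for a warehouse are purely illustrative):

```python
# Extraction: pull records from two different source systems
source_a = [{"cust_id": 1, "amt": 250.0}, {"cust_id": 2, "amt": 90.5}]
source_b = [{"customer": 3, "amount_inr": 410.0}]

# Transformation: map both source schemas onto one common format
def transform_a(rec):
    return {"customer_id": rec["cust_id"], "amount": rec["amt"]}

def transform_b(rec):
    return {"customer_id": rec["customer"], "amount": rec["amount_inr"]}

# Loading: combine into a single physical store (here, just a list)
warehouse = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
print(warehouse)
```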
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy detection, and the resolution of data value conflicts. These are explained in brief below.
➢ Schema Integration: Integrate metadata from different sources. Matching real-world entities from multiple sources is referred to as the entity identification problem.
➢ Redundancy: An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes. Inconsistencies in attributes can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis (see the sketch after this list).
➢ Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
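As referenced in the Redundancy item above, a brief sketch of detecting a redundant attribute through correlation analysis with pandas (the columns are made up, with one derived from another):

```python
import pandas as pd

# Hypothetical data: "price_with_tax" is derivable from "price"
df = pd.DataFrame({
    "price":          [100, 200, 150, 300],
    "price_with_tax": [118, 236, 177, 354],   # price * 1.18
    "quantity":       [3, 1, 4, 2],
})

# A correlation of (nearly) 1 between two attributes signals redundancy
print(df.corr())
```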
Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
➢ Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is the minimum and max_F is the maximum of an attribute F, and that the values are to be mapped into a new range [new_min_F, new_max_F]. A value v of F is normalized to v' using the formula:
v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new value you get after normalizing the old value.
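A small sketch of min-max normalization in Python, scaling a made-up attribute into the range 0.0 to 1.0 using the formula above:

```python
import numpy as np

# Hypothetical attribute values
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

new_min, new_max = 0.0, 1.0
v_min, v_max = values.min(), values.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
normalized = (values - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalized)
```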