Data Analytics Unit I
Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc., Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing.
Design Data Architecture:
Data architecture is typically designed at three levels: the conceptual, logical and physical models.
Conceptual model:
It is a business-level model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
Logical model:
It is a model in which the data is represented in logical structures such as rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model:
The physical model holds the database design details, such as which type of database technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few of the influences that can have an effect on data architecture are business requirements, business policies, the technology in use, business economics, and data processing needs.
➢ Business requirements
➢ Business policies
➢ Technology in use
➢ Business economics
➢ Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, and making use of raw data by converting it into image files and records and then storing it in data warehouses. Data warehouses are the main means of storing business transactions.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes following the example of previously completed data architecture designs, and also making use of existing licensed software purchases and database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design of the architecture.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
➢ Data management plays a significant role in an organization's ability to generate revenue and control costs.
➢ Data management helps organizations to mitigate risks.
➢ It enables decision making in organizations.
What are the benefits of good data management?
➢ Optimum data quality
➢ Improved user confidence
➢ Efficient and timely access to data
➢ Improves decision making in an organization
Managing data Resources:
➢ An information system provides users with timely, accurate, and relevant
information.
➢ The information is stored in computer files. When files are properly arranged and maintained, users can easily access and retrieve the information when they need it.
➢ If the files are not properly managed, they can lead to chaos in information
processing.
➢ Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Areas of Data Management:
Data Modeling: This is first creating a structure for the data that you collect and use, and then organizing this data in a way that is easily accessible and efficient for storing and pulling the data for reports and analysis.
Data warehousing: This is storing data effectively so that it can be accessed and used efficiently in the future.
Data Movement: This is the ability to move data from one place to another. For instance, data needs to be moved from where it is collected to a database and then to an end user.
Understand various sources of the Data:
Data are a special type of information, generally obtained through observations, surveys or inquiries, or generated as a result of human activity. Methods of data collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect and, as a result, there are different methods of collecting data which, when used on one particular set, will result in different kinds of data.
Collection of data refers to the purposive gathering of information relevant to the subject-matter of the study from the units under investigation. The method of collection of data mainly depends upon the nature, purpose and scope of the inquiry on the one hand, and on the availability of resources and time on the other.
Data can be generated from two types of sources, namely:
1. Primary sources of data
2. Secondary sources of data
1. Primary sources of data:
Primary data refers to first-hand data gathered by the researcher himself. Sources of primary data are surveys, observations and experimental methods.
Survey: The survey method is one of the primary sources of data, used to collect quantitative information about items in a population. Surveys are used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are contacted by the research person personally, telephonically or through mail. This method takes a lot of time, effort and money, but the data collected are of high accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is another primary source of data. It is a technique for obtaining information that involves measuring variables or gathering the data necessary for measuring the variable under investigation.
Observation is defined as the accurate watching and noting of phenomena as they occur in nature with regard to cause-and-effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used in carrying out an experiment. However, market researchers have used four experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of
receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.
Example:
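As a rough illustration (the treatment labels and plot names below are hypothetical, not taken from any particular study), a completely randomized assignment of treatments to homogeneous units can be sketched in Python:

```python
import random

# Hypothetical setup: 3 treatments, each replicated 4 times on
# 12 homogeneous experimental units (plots).
treatments = ["A", "B", "C"]
units = [f"plot_{i}" for i in range(1, 13)]

# Completely randomized design: shuffle the replicated treatment list
# so every unit has the same chance of receiving any treatment.
assignment = treatments * 4
random.shuffle(assignment)

for unit, treatment in zip(units, assignment):
    print(unit, "->", treatment)
```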
RBD - The term Randomized Block Design has originated from agricultural
research. In this design several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as a number of
treatments so that one plot from each is selected at random for each treatment.
The production of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance technique, so as to know the effect of various treatments like different doses of fertilizers, different types of irrigation, etc.
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme each letter from A to D occurs only once in each row and only once in each column. It may be noted that this balanced arrangement will not get disturbed if any row is interchanged with another.
The balanced arrangement achieved in a Latin square is its main strength. In this design, the comparisons among treatments will be free from both differences between rows and differences between columns. Thus the magnitude of error will be smaller than in any other design.
FD - Factorial Designs - This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyzes the impact of each of the variables. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias. An experiment which involves multiple independent variables is known as a factorial design.
A factor is a major independent variable. For example, consider an experiment with two factors: time in instruction and setting. A level is a subdivision of a factor; in this example, time in instruction has two levels and setting has two levels.
2. Secondary sources of data:
Secondary data are data that already exist, having been collected for some other purpose; they may be obtained from sources internal or external to the organization.
➢ Internal Sources:
If available, internal secondary data may be obtained with less time, effort and money than external secondary data. In addition, they may also be more pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources - These provide a great deal of information which can be used by the marketing researcher. They give information about internal factors.
Sales Force Reports - These give information about the sale of a product. The information provided comes from outside the organization.
Internal Experts - These are the people who head the various departments. They can give an idea of how a particular thing is working.
Miscellaneous Reports - These are the pieces of information you get from operational reports. If the data available within the organization are unsuitable or inadequate, the marketer should extend the search to external secondary data sources.
Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
• Minimum
• First quartile (Q1)
• Median
• Third quartile (Q3)
• Maximum
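A minimal sketch of computing this five-number summary with NumPy, together with the 1.5 x IQR fences that a boxplot commonly uses to flag outliers (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample data
data = np.array([7, 9, 10, 11, 12, 13, 14, 15, 18, 40])

minimum = data.min()
q1 = np.percentile(data, 25)      # first quartile (Q1)
median = np.median(data)
q3 = np.percentile(data, 75)      # third quartile (Q3)
maximum = data.max()
print(minimum, q1, median, q3, maximum)

# Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("Outliers:", outliers)
```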
Most common causes of outliers in a data set:
➢ Data entry errors (human errors)
➢ Measurement errors (instrument errors)
➢ Experimental errors (data extraction or experiment
planning/executing errors)
➢ Intentional (dummy outliers made to test detection methods)
➢ Data processing errors (data manipulation or data set unintended
mutations)
➢ Sampling errors (extracting or mixing data from wrong or various
sources)
➢ Natural (not an error, novelties in data)
How to remove outliers?
Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well because of its binning of variables. We can also use the process of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We can use the mean, median or mode imputation methods. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
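A short pandas sketch of the treatments described above, assuming a hypothetical numeric column named "value": trimming, capping at the IQR fences (one simple transformation), and median imputation:

```python
import pandas as pd

# Hypothetical data frame with one numeric column containing outliers
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95, 11, 10, -40, 12]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["value"] < low) | (df["value"] > high)

# 1. Deleting observations (trimming)
trimmed = df[~is_outlier]

# 2. Capping: clip extreme values to the IQR fences
capped = df["value"].clip(lower=low, upper=high)

# 3. Imputing: replace outliers with the median of the non-outlier values
median_value = df.loc[~is_outlier, "value"].median()
imputed = df["value"].where(~is_outlier, median_value)

print(trimmed, capped, imputed, sep="\n")
```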
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
Why does my data have missing values?
We looked at the importance of treatment of missing values in a dataset. Now,
let’s identify the reasons for occurrence of these missing values. They may occur
at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
➢ Missing completely at random: This is a case when the probability of a value being missing is the same for all observations. For example: respondents of a data collection process decide that they will declare their earnings after tossing a fair coin. If a head occurs, the respondent declares his / her earnings and vice versa. Here each observation has an equal chance of a missing value.
➢ Missing at random: This is a case when a variable is missing at random and the missing ratio varies for different values / levels of other input variables. For example: we are collecting data for age, and females have a higher missing-value rate compared to males.
➢ Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropping out of the study. This missing value is not at random unless we have included “discomfort” as an input variable for all patients.
➢ Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower income are likely to give a non-response about their earnings.
What are the methods to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion (a brief sketch follows this list).
➢ In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
➢ In pair-wise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One of its disadvantages is that it uses different sample sizes for different variables.
➢ Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.
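A brief sketch of list-wise deletion and a pair-wise style analysis in pandas, on made-up columns (the column names are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "income": [50, np.nan, 60, 65, np.nan],
    "score":  [3.2, 4.1, 3.8, np.nan, 4.5],
})

# List-wise deletion: drop any row with at least one missing value
listwise = df.dropna()

# Pair-wise style analysis: each statistic uses all cases where the
# variables involved are present, so sample sizes differ by variable.
pairwise_corr = df.corr()   # pandas excludes NaN pair-wise by default

print(listwise)
print(pairwise_corr)
```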
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by
the mean or median (quantitative attribute) or mode (qualitative attribute) of all
known values of that variable.
It can be of two types:
➢ Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. For example, if the variable “Manpower” has missing values, we take the average of all non-missing values of “Manpower” (28.33 in the original example) and then replace the missing values with it.
➢ Similar case Imputation: In this case, we calculate the average of the non-missing values for each gender individually, say “Male” (29.75) and “Female” (25), and then replace the missing values based on gender. For “Male”, we replace missing values of Manpower with 29.75 and for “Female” with 25.
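A short sketch of the two imputation styles in pandas, using made-up "Gender" and "Manpower" columns that mirror the example above (the actual fill values depend on the data, not on the figures quoted above):

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the Manpower example
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Manpower": [30, 29, 25, np.nan, np.nan, 30.5],
})

# Generalized imputation: fill with the overall mean of non-missing values
generalized = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill with the mean of the respondent's own group
similar_case = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)

print(generalized)
print(similar_case)
```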
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments (bins) of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task (see the sketch after the list below).
➢ Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
➢ Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
➢ Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
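A minimal NumPy sketch of smoothing by bin means and by bin boundaries on sorted data split into equal-size bins (the values and the bin size are made up):

```python
import numpy as np

# Hypothetical sorted data, split into equal-size bins of 3
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26]))
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3).reshape(bins.shape)

# Smoothing by bin boundaries: every value becomes the closer of
# the bin's minimum or maximum
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(np.abs(bins - lo) <= np.abs(bins - hi), lo, hi)

print(by_means)
print(by_boundaries)
```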
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
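A small sketch of smoothing one variable by fitting a simple linear regression with NumPy (the x and y values are made up):

```python
import numpy as np

# Hypothetical noisy observations of y against x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.4, 13.9, 16.2])

# Fit y = a*x + b (linear regression with one independent variable)
# and replace each y with its fitted value, smoothing out the noise.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b

print(y_smoothed)
```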
Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they fall outside the clusters.
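A brief sketch of this idea using scikit-learn's KMeans (the choice of KMeans and the sample values are assumptions of this sketch, not prescribed by the text); points lying far from their cluster centre are outlier candidates:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical one-dimensional data with one value far from the rest
X = np.array([[1.0], [1.2], [0.9], [1.1], [1.3], [9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point from its own cluster centre
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

print(kmeans.labels_)
print(distances)   # unusually large distances flag potential outliers
```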
Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of one another. This may happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates. In most cases, the duplicates are removed so as not to give that particular data object an advantage or bias when running machine learning algorithms.
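A one-step sketch of deduplication with pandas, using hypothetical form-submission records:

```python
import pandas as pd

# Hypothetical form submissions, where one person submitted twice
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
})

# Keep only the first occurrence of each duplicated record
deduplicated = df.drop_duplicates(subset=["name", "email"], keep="first")
print(deduplicated)
```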
Redundant data occurs when we merge data from multiple databases. If the redundant data is not removed, incorrect results will be obtained during data analysis. Redundancy typically arises during the data integration process.
There are two major approaches for data integration – one is the “tight coupling” approach and the other is the “loose coupling” approach.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
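Returning to the tight-coupling approach, a toy sketch of the ETL idea in Python (the source records, field names and the list standing in for a warehouse are purely illustrative):

```python
# Extraction: pull records from two different source systems
source_a = [{"cust_id": 1, "amt": 250.0}, {"cust_id": 2, "amt": 90.5}]
source_b = [{"customer": 3, "amount_inr": 410.0}]

# Transformation: map both source schemas onto one common format
def transform_a(rec):
    return {"customer_id": rec["cust_id"], "amount": rec["amt"]}

def transform_b(rec):
    return {"customer_id": rec["customer"], "amount": rec["amount_inr"]}

# Loading: combine into a single physical store (here, just a list)
warehouse = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
print(warehouse)
```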
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy detection, and the resolution of data value conflicts. These are explained in brief below.
➢ Schema Integration: Integrate metadata from different sources. Matching real-world entities from multiple sources is referred to as the entity identification problem.
➢ Redundancy: An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes. Inconsistencies in attributes can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis (see the sketch after this list).
➢ Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
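As referenced in the Redundancy item above, a brief sketch of detecting a redundant attribute through correlation analysis with pandas (the columns are made up, with one derived from another):

```python
import pandas as pd

# Hypothetical data: "price_with_tax" is derivable from "price"
df = pd.DataFrame({
    "price":          [100, 200, 150, 300],
    "price_with_tax": [118, 236, 177, 354],   # price * 1.18
    "quantity":       [3, 1, 4, 2],
})

# A correlation of (nearly) 1 between two attributes signals redundancy
print(df.corr())
```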
Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
➢ Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is the minimum and max_F is the maximum of an attribute F, and that the values are to be mapped into a new range [new_min_F, new_max_F]. A value v of F is normalized to v' using the formula:
v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new value you get after normalizing the old value.
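A small sketch of min-max normalization in Python, scaling a made-up attribute into the range 0.0 to 1.0 using the formula above:

```python
import numpy as np

# Hypothetical attribute values
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

new_min, new_max = 0.0, 1.0
v_min, v_max = values.min(), values.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
normalized = (values - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(normalized)
```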