Unit-1 - ADA - Notes
UNIT-I
(Data Analytics)
Data Management: Design Data Architecture and manage the data for analysis,
understand various sources of Data like Sensors/Signals/GPS etc. Data
Management, Data Quality (noise, outliers, missing values, duplicate data) and
Data Pre-processing & Processing.
Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to
represent the relationships between entities and their attributes.
Logical model: It is a model where problems are represented in the form of logic,
such as rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model:
The physical model holds the database design details, such as which type of
database technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business
requirements, business policies, the technology in use, business economics, and
data processing needs:
Business requirements
Business policies
Technology in use
Business economics
Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance
of system access, data management, transaction management, and making
use of raw data by converting it into image files and records and then
storing it in data warehouses. Data warehouses are the main means of
storing business transactions.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes following the example of previously completed data architecture
designs and also making use of existing licensed software purchases and
database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans,
the condition of the market, and the overall cost will also have an effect
on the design of the architecture.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
Data management plays a significant role in an organization's ability to
generate revenue and control costs.
Data management helps organizations to mitigate risks.
It enables decision making in organizations.
What are the benefits of good data management?
Optimum data quality
Improved user confidence
Efficient and timely access to data
Improves decision making in an organization
Managing data Resources:
An information system provides users with timely, accurate, and relevant
information.
The information is stored in computer files. When files are properly
arranged and maintained, users can easily access and retrieve the
information when they need it.
If the files are not properly managed, they can lead to chaos in information
processing.
Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Areas of Data Management:
Data Modeling: This is first creating a structure for the data that you collect and
use, and then organizing this data in a way that is easily accessible and efficient
to store, so that it can be pulled for reports and analysis.
Data warehousing: This is storing data effectively so that it can be accessed and
used efficiently in the future.
Data Movement: This is the ability to move data from one place to another. For
instance, data needs to be moved from where it is collected to a database and
then to an end user.
Understand various sources of the Data:
Data are a special type of information generally obtained through observations,
surveys and inquiries, or generated as a result of human activity. Methods of
data collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect of any study and, as a result, there are
different methods of collecting data which, when used on one particular set of
units, will result in different kinds of data.
Collection of data refers to a purposeful gathering of information relevant to the
subject matter of the study from the units under investigation. The method of
collection of data mainly depends upon the nature, purpose and scope of the
inquiry on the one hand, and the availability of resources and time on the other.
Data can be generated from two types of sources namely
1. Primary sources of data
2. Secondary sources of data
1. Primary sources of data:
Primary data refers to the first hand data gathered by the researcher himself.
Sources of primary data are surveys, observations, Experimental Methods.
Survey: The survey method is one of the primary sources of data; it is used
to collect quantitative information about items in a population. Surveys are
used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are
contacted by the research person personally, telephonically or through mail. This
method takes a lot of time, effort and money, but the data collected are of high
accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is another primary source of data. It is a technique
for obtaining information that involves measuring variables or gathering the
data necessary for measuring the variable under investigation.
Observation is defined as the accurate watching and noting of phenomena as they
occur in nature with regard to cause and effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used
in carrying out an experiment. However, market researchers have used four
experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are
assigned completely at random so that each experimental unit has the same
chance of receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.
Example:
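A minimal sketch of a CRD assignment, assuming 12 hypothetical homogeneous experimental units and three treatments (A, B, C) replicated four times each; complete randomization gives every unit the same chance of receiving any treatment.

```python
import random

# Hypothetical setup: 12 homogeneous experimental units (e.g. plots)
# and 3 treatments, each replicated 4 times.
units = list(range(1, 13))
treatments = ["A", "B", "C"] * 4

# Complete randomization: shuffle the treatment labels and assign them
# to the units, so every unit is equally likely to get any treatment.
random.shuffle(treatments)
assignment = dict(zip(units, treatments))

for unit, treatment in sorted(assignment.items()):
    print(f"Unit {unit:2d} -> Treatment {treatment}")
```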
RBD - The term Randomized Block Design originated from agricultural
research. In this design, several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as the number of
treatments, so that one plot from each block is selected at random for each
treatment. The production of each plot is measured after the treatment is given.
These data are then interpreted and inferences are drawn by using the Analysis
of Variance technique, so as to know the effect of various treatments like
different doses of fertilizers, different types of irrigation etc.
LSD - Latin Square Design - A Latin square is one of the experimental designs
which has a balanced two-way classification scheme, say, for example, a 4 X 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and also only once in each column. It may be noted that the balanced
arrangement will not be disturbed if any row is interchanged with another.
The balanced arrangement achieved in a Latin square is its main strength. In this
design, the comparisons among treatments are free from both differences
between rows and differences between columns. Thus the magnitude of the error
will be smaller than in any other design.
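A small illustrative sketch (not from the notes) that builds the 4 X 4 arrangement mentioned above by cyclically shifting the letters A to D, so each letter occurs exactly once in every row and every column.

```python
# Build a 4 x 4 Latin square by cyclic shifting: each row starts one letter
# further along, so every letter occurs once per row and once per column.
letters = ["A", "B", "C", "D"]
n = len(letters)
square = [[letters[(i + j) % n] for j in range(n)] for i in range(n)]

for row in square:
    print(" ".join(row))
# A B C D
# B C D A
# C D A B
# D A B C
```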
FD - Factorial Designs - This design allows the experimenter to test two or more
variables simultaneously. It also measures interaction effects of the variables and
analyzes the impacts of each of the variables. In a true experiment,
randomization is essential so that the experimenter can infer cause and effect
without any bias.
An experiment which involves multiple independent variables is known as a
factorial design.
A factor is a major independent variable; for example, consider a study with two
factors: time in instruction and setting. A level is a subdivision of a factor. In this
example, time in instruction has two levels and setting has two levels, giving a
2 x 2 design (see the sketch below).
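A brief sketch of the 2 x 2 factorial layout described above; the level labels are assumed for illustration, since the notes only name the two factors (time in instruction and setting).

```python
from itertools import product

# Two factors with two hypothetical levels each.
factors = {
    "time_in_instruction": ["1 hour/week", "4 hours/week"],
    "setting": ["in-class", "pull-out"],
}

# A full factorial design crosses every level of one factor with every
# level of the other, giving 2 x 2 = 4 experimental conditions.
for condition in product(*factors.values()):
    print(dict(zip(factors.keys(), condition)))
```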
2. Secondary sources of data:
Secondary data are data that have already been collected for some other purpose;
they may come from sources internal or external to the organization.
Internal Sources:
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources- These give a great deal of information which can be used
by the marketing researcher. They give information about internal factors.
Sales Force Report- It gives information about the sale of a product. The
information provided is from outside the organization.
Internal Experts- These are people who head the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous Reports- These are the pieces of information you get from
operational reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.
Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary:
Minimum
First quartile (Q1)
Median
Third quartile (Q3)
Maximum
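A short sketch that computes the five-number summary for a small made-up sample with NumPy; these are exactly the values a boxplot displays.

```python
import numpy as np

# Hypothetical sample of observations.
data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

summary = {
    "Minimum": np.min(data),
    "Q1": np.percentile(data, 25),
    "Median": np.median(data),
    "Q3": np.percentile(data, 75),
    "Maximum": np.max(data),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# The box of a boxplot spans Q1 to Q3 (the interquartile range, IQR);
# points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are usually drawn as outliers.
```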
Most common causes of outliers on a data set:
Data entry errors (human errors)
Measurement errors (instrument errors)
Experimental errors (data extraction or experiment
planning/executing errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or data set unintended
mutations)
Sampling errors (extracting or mixing data from wrong or various
sources)
Natural (not an error, novelties in data)
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing
values, like deleting observations, transforming them, binning them, treating
them as a separate group, imputing values and other statistical methods. Here,
we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry error
or data processing error, or if the outlier observations are very small in number.
We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate
outliers. Taking the natural log of a value reduces the variation caused by extreme
values. Binning is also a form of variable transformation. The Decision Tree
algorithm deals with outliers well because it bins the variables. We can also use
the process of assigning weights to different observations.
Imputing: As with the imputation of missing values, we can also impute outliers.
We can use mean, median or mode imputation methods. Before imputing values,
we should analyse whether the outlier is natural or artificial. If it is artificial, we
can go ahead with imputing values. We can also use a statistical model to predict
the values of outlier observations and then impute them with the predicted values.
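A hedged sketch of the three treatments above. The notes do not fix a detection rule, so the common 1.5 x IQR rule is assumed here to flag outliers before deleting, imputing or transforming them.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing one obvious outlier (120).
s = pd.Series([12, 14, 15, 13, 14, 16, 15, 120], dtype="float64")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Deleting observations: drop the flagged values.
deleted = s[~is_outlier]

# Imputing: replace the flagged values with the median of the clean data.
imputed = s.mask(is_outlier, deleted.median())

# Transforming: a natural log shrinks the variation caused by extremes.
logged = np.log(s)

print(deleted.tolist())
print(imputed.tolist())
print(logged.round(2).tolist())
```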
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
Why does my data have missing values?
We looked at the importance of treatment of missing values in a dataset. Now,
let’s identify the reasons for occurrence of these missing values. They may occur
at two stages:
1. Data Extraction: It is possible that there are problems with the extraction
process. In such cases, we should double-check for correct data with the data
guardians. Some hashing procedures can also be used to make sure the data
extraction is correct. Errors at the data extraction stage are typically easy to find
and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder
to correct. They can be categorized into four types:
Missing completely at random: This is a case when the probability
of a value being missing is the same for all observations. For example:
respondents of a data collection process decide that they will declare
their earnings only after tossing a fair coin. If a head occurs, the
respondent declares his / her earnings, and vice versa. Here each
observation has an equal chance of having a missing value (simulated
in the sketch after this list).
Missing at random: This is a case when a variable is missing at
random and the missing ratio varies for different values / levels of the
other input variables. For example: we are collecting data on age, and
females have a higher missing-value rate compared to males.
Missing that depends on unobserved predictors: This is a case
when the missing values are not random and are related to an
unobserved input variable. For example: in a medical study, if a
particular diagnostic test causes discomfort, then there is a higher
chance of dropping out of the study. This missing value is not at
random unless we have included “discomfort” as an input variable for
all patients.
Missing that depends on the missing value itself: This is a case
when the probability of a value being missing is directly correlated
with the missing value itself. For example: people with higher or lower
incomes are likely to provide a non-response about their earnings.
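A tiny sketch simulating the coin-toss example of "missing completely at random": each hypothetical respondent's earning has the same 50% chance of being missing, independent of everything else.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical earnings for 10 respondents.
earnings = pd.Series(rng.integers(20_000, 80_000, size=10), dtype="float64")

# Each respondent "tosses a fair coin": heads -> earning reported,
# tails -> earning missing. The probability is the same for everyone,
# which is exactly the MCAR mechanism described above.
heads = rng.random(10) < 0.5
mcar_earnings = earnings.where(heads, np.nan)

print(mcar_earnings)
```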
What are the methods to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete observations where any of the variables is
missing. Simplicity is one of the major advantages of this method, but it
reduces the power of the model because it reduces the sample size.
In pair-wise deletion, we perform the analysis with all cases in which the
variables of interest are present. The advantage of this method is that it
keeps as many cases as possible available for analysis. One of its
disadvantages is that it uses different sample sizes for different variables.
Deletion methods are used when the nature of the missing data is “missing
completely at random”; otherwise, non-random missing values can bias the
model output. A small sketch of both deletion types follows.
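A minimal sketch of list-wise versus pair-wise deletion on a hypothetical DataFrame; note that pandas' corr() already excludes missing values pair-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical data with scattered missing values.
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "income": [50_000, np.nan, 42_000, 61_000, 58_000],
    "score":  [7, 8, 6, np.nan, 9],
})

# List-wise deletion: drop every row that has at least one missing value.
listwise = df.dropna()
print("list-wise sample size:", len(listwise))

# Pair-wise deletion: each statistic uses all rows available for the
# variables involved, so different statistics use different sample sizes.
pairwise_corr = df.corr()
print(pairwise_corr)
```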
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the
missing values with estimated ones. The objective is to employ known
relationships that can be identified in the valid values of the data set to assist in
estimating the missing values. Mean / Mode / Median imputation is one of the
most frequently used methods. It consists of replacing the missing data for a
given attribute by the mean or median (quantitative attribute) or mode
(qualitative attribute) of all known values of that variable.
It can be of two types:
Generalized Imputation: In this case, we calculate the mean or median
of all non-missing values of that variable and then replace the missing
values with it. For example, if the variable “Manpower” has missing values,
we take the average of all non-missing values of “Manpower” (28.33) and
then replace the missing values with it.
Similar case Imputation: In this case, we calculate the average for gender
“Male” (29.75) and “Female” (25) individually over the non-missing values
and then replace the missing value based on gender. For “Male” we replace
missing values of Manpower with 29.75 and for “Female” with 25. Both
variants are sketched below.
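A short sketch of generalized versus similar-case imputation. The "Manpower" and "Gender" columns are made up to mirror the example the notes refer to, so the computed means will not match the quoted figures exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing "Manpower" values.
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male", "Female"],
    "Manpower": [29, 31, 25, 30, np.nan, np.nan, 25],
})

# Generalized imputation: fill with the overall mean of all non-missing values.
df["Manpower_general"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: fill with the mean of the same gender group.
df["Manpower_by_gender"] = df.groupby("Gender")["Manpower"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```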
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors etc. It can be handled
in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data set is
divided into segments (bins) of equal size and then various methods are performed
to complete the task. Each segment is handled separately. One can replace all the
data in a segment by its mean, or boundary values can be used to complete the
task, as sketched in the example after the list below.
Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
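A sketch of smoothing by bin means and by bin boundaries, assuming equal-size bins of three sorted values each (the bin size is an arbitrary choice for illustration).

```python
import numpy as np

# Sorted, hypothetical noisy values split into equal-size bins of 3.
data = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0, 26.0])
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value becomes whichever of the bin
# minimum or bin maximum is closer to it.
by_boundaries = bins.copy()
for row in by_boundaries:
    low, high = row[0], row[-1]
    row[:] = np.where(np.abs(row - low) <= np.abs(row - high), low, high)

print(by_means)
print(by_boundaries.ravel())
```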
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
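A brief sketch of the regression approach with one independent variable: fit a straight line with NumPy and use the fitted values as the smoothed data (the data here are synthetic).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy observations of y that roughly follow a line in x.
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# Fit a linear regression y = a*x + b and replace the noisy values with
# the fitted (smoothed) values on the line.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b

print(np.round(y, 2))
print(np.round(y_smoothed, 2))
```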
Clustering: This approach groups similar data into clusters. Outliers may then
either go undetected or fall outside the clusters.
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of
one another. It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of dealing with
duplicates. In most cases, the duplicates are removed so as to not give that
particular data object an advantage or bias, when running machine learning
algorithms.
Redundant data occurs while we merge data from multiple databases. If the
redundant data is not removed, incorrect results will be obtained during data
analysis. Redundant data occurs for reasons such as the same attribute having
different names in different databases, or one attribute being derivable from
another.
There are mainly 2 major approaches for data integration – one is “Tight coupling
approach” and another is “Loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration,
redundancy detection, and resolution of data value conflicts. These are
explained briefly below.
Schema Integration: Integrate metadata from different sources. Matching
up equivalent real-world entities from multiple sources is referred to as the
entity identification problem.
Redundancy: An attribute may be redundant if it can be derived or
obtained from another attribute or set of attributes. Inconsistencies in
attributes can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis.
Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
3. Data Transformation: This step is taken in order to transform the data into
forms appropriate for the mining process. This involves the following ways:
Normalization: It is done in order to scale the data values into a
specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is
the minimum and max_F is the maximum value of an attribute F, and that
[new_min_F, new_max_F] is the new range (for example 0.0 to 1.0).
We have the formula:
v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new
value you get after normalizing the old value.
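A small sketch applying the min-max formula above to a hypothetical attribute F, rescaling it into the range 0.0 to 1.0.

```python
import numpy as np

# Hypothetical values of an attribute F.
F = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

new_min_F, new_max_F = 0.0, 1.0     # the new, specified range
min_F, max_F = F.min(), F.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
F_scaled = (F - min_F) / (max_F - min_F) * (new_max_F - new_min_F) + new_min_F

print(F_scaled)   # approximately [0. 0.125 0.25 0.5 1.]
```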