Unit - 2 Notes - BADS
Data: Data Collection, Data Management, Big Data Management, Organization/sources of
data, Types of Data, Importance of data quality, Dealing with missing or incomplete data, Data
Visualization, Data Classification
Data Science Project Life Cycle: Business Requirement, Data Acquisition, Data Preparation,
Hypothesis and Modelling, Evaluation and Interpretation, Deployment, Operations,
optimization.
Data collection
Below are seven steps you can use to collect data:
1. Identify opportunities for data collection
The first step to collecting data is to find an opportunity where that process would be useful
for you or your organization. This can include situations such as needing sales information,
wanting to understand public opinion about a product, brand, or company, or wanting to
preserve certain information for later use. Sometimes, the opportunities for data collection
come from challenges presented in the workplace, such as inefficient processes or decreasing
sales, which you can address and overcome by collecting data and making changes through
data-based decision-making. There may also be more than one opportunity from which to
choose.
2. Select opportunities and set goals
Once you understand the opportunities available to you for data collection, you can choose
which ones you want to pursue. Some may be easier than others, but many have positive
benefits, such as increased sales or having more information to work with. After you select
the opportunities you want to pursue, set goals for how you want the data to be used and
why that data is important to you or your organization. This can help you further refine your
data collection and create short- and long-term projects you can assign to project teams.
3. Create a plan and set methods for data collection
Once you have your opportunities and goals prepared, you can create plans for short- and
long-term projects to complete. This includes assigning teams to work on those projects,
creating ways to store information and planning for how much information you want to
gather. This is also where you can choose which methods of data collection you use, which
depends on the type of data you want to collect. For example, using questionnaires is a good
way to gather information about public opinion, while monitoring social media and online
marketing can produce data as numbers you can use for other projects.
4. Validate your systems of measurement
Once you create your plans for data collection and decide on a method you want to use,
ensure that your systems of measurement are accurate and that you are gathering the
information you want. Accurate measurements are important because the data you collect
informs your decisions, and accurate measurements, especially for quantitative data, help
you make the best decisions for your organization.
This step is also important because it can help you understand the best methods of data
collection for your particular needs and help you create rigorous methods for your data.
5. Collect data
After creating plans and methods for collecting the data you want, you can enact those plans.
This can be the longest step in the data collection process because you may have thousands
of data points you want to collect depending on the type of information you want and the
method you use. Some methods, such as social media monitoring, are an ongoing process,
so setting a start and end date for data collection can help you process the information more
accurately. Others, such as interviews and research, can be quick to carry out but take longer
to interpret.
6. Analyze data
After you finish collecting your data or set a time to stop the data collection for those methods
that are ongoing, you can analyze your data. This includes organizing it in a way that you and
project teams can understand and then interpreting that data to inform your decision-
making. Often, a combination of qualitative and quantitative data is useful for decision-
making. For example, collecting opinions about your brand after noticing that its sales are
declining can help you make decisions that improve public opinion and increase sales once
you understand your data.
7. Act based on the data
The last step for data collection is to understand how to react to the information you gather.
For example, if your marketing data shows that sales are increasing steadily without further
input from the marketing team, then you can continue to act as you have been. Similarly, if
your data shows sales are decreasing, then you can form a plan to increase marketing efforts
or even create a new product that consumers may want. Either
decision is data-based decision-making and can help you make excellent choices for your
organization.
Data Management
Put simply, a data management strategy is an organization’s roadmap for using data to
achieve its goals. This roadmap ensures that all the activities surrounding data
management—from collection to collaboration—work together effectively and efficiently to
be as useful as possible and easy to govern. With a data management strategy in place, your
company can avoid some of these common data challenges:
Incompatible, duplicate, or missing data from undocumented or inconsistently
documented sources
Siloed projects that use the same data, yet duplicate the efforts and costs associated
with that data
Data activities that consume time and resources but do not contribute to overall
business objectives
A data management strategy will be the strong foundation needed for consistent project
approaches, successful integration, and business growth.
Prepare
How will you clean and transform raw data to prepare it for analysis?
How will you identify incomplete or disparate data?
What will be the guidelines for naming data, documenting lineage, and adding
metadata to increase discoverability?
Store
Where will you store your data?
Will you use XML, CSV, or relational databases for structured data?
Do you need a data lake for unstructured data?
How will you keep your data secure?
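As a rough illustration of the Prepare and Store questions above, the sketch below (hypothetical column names; pandas and Python's built-in sqlite3 module are assumed to be available) cleans a small structured dataset and then stores it both as a CSV file and in a relational (SQLite) table.

```python
import sqlite3

import pandas as pd

# Hypothetical raw customer records with whitespace, duplicates, and a gap.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": [" Alice ", "Bob", "Bob", None],
    "signup_date": ["2023-01-05", "2023/02/10", "2023/02/10", "2023-03-15"],
})

# Prepare: drop duplicates, trim whitespace, normalise the date format,
# and flag incomplete rows instead of silently discarding them.
clean = raw.drop_duplicates(subset="customer_id").copy()
clean["name"] = clean["name"].str.strip()
clean["signup_date"] = pd.to_datetime(clean["signup_date"].str.replace("/", "-"))
clean["is_complete"] = clean.notna().all(axis=1)

# Store: structured data as a CSV file and as a relational (SQLite) table.
clean.to_csv("customers_clean.csv", index=False)
with sqlite3.connect("customers.db") as conn:
    clean.to_sql("customers", conn, if_exists="replace", index=False)
```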
Big Data Management
IT vendors offer big data platforms and managed services that combine many big data
technologies in a single package, primarily for use in the cloud. For organizations that want
to deploy big data systems themselves, either on premises or in the cloud, various tools are
available in addition to Hadoop and Spark. They include the following categories of tools:
Storage repositories.
Cluster management frameworks.
Stream processing engines.
NoSQL databases.
Data lake and data warehouse platforms.
SQL query engines.
Sources of data
What are the sources of data?
In short, the sources of data are physical or digital places where information is stored in a
data table, data object, or some other storage format.
Secondary data
When data collection happens outside of the organization, the data comes from an external
data source. These sources exist entirely outside the company, although researchers can still
gather data from them.
Data from external sources is harder to collect because it is much more varied and comes
from many different places. External data can be grouped into the categories given below:
Government publications
Researchers can get a massive amount of information from government sources. Also, you
can get much of this information for free on the Internet.
Non-government publications
Researchers can also find industry-related information in non-government publications. The
only research problem with non-government publications is that their data may sometimes
be biased.
Syndicate services
Some companies offer syndicated services: they collect and organize the same marketing
information for all of their clients. They gather this information from households and
businesses through surveys, mail diary panels, electronic services, wholesalers, industrial
firms, retailers, and similar channels.
Types of data
Data is broadly classified as primary data, which the organization collects first-hand, and
secondary data, which is obtained from external sources as described above.
Importance of data quality
Failing to implement data quality standards can also result in GDPR fines. Under GDPR
compliance laws, businesses can face fines of up to £17.5 million or 4% of the preceding
financial year's global turnover, whichever is higher.
These fines highlight the importance of keeping data clean and of high quality. Not only for
better business performance, but also for data compliance.
Not every value needs to be flawless, and acceptable quality levels differ between datasets.
There is no universal criterion for a good-quality dataset, but a proactive approach to data
quality management and to improving poor-quality data is crucial. The main characteristics
of data quality are described below.
Uniqueness: Data is considered high quality when it is unique. This helps ensure that there is
no duplication in values across the dataset, keeping data clean and precise. Removing
duplicate entries can help avoid sending multiple marketing communications to the same
contact, reducing costs and protecting the brand image.
Completeness: Data is complete when the dataset contains all the necessary information
required to carry out specific activities. Completeness does not mean that every possible
entry has to be full – it is about fulfilling relevant data entries specific to the intended activity.
For example, an email marketing database would require a full set of email addresses in order
to be complete, but it would not require phone numbers in order to carry out the core
activities.
Consistency: Consistency refers to how well the data entries follow the same format
throughout the dataset. To ensure consistency, the same data values and formatting should
be used throughout. For example, phone numbers should all be presented in the same way
for each contact, such as 07 vs +44.
Accuracy: Accuracy is one of the most important characteristics of high-quality data. This
refers to how well the data reflects reality. For instance, a postcode that is not truly reflective
of the contact’s address would be inaccurate. Businesses need reliable information to make
informed decisions. Inaccurate data needs to be identified, documented, and fixed to make
sure they have the highest quality information possible. It is essential that the data used in
marketing and advertising is accurate in order to ensure that communications target active
customers and prevent mistakes.
Timeliness: Timeliness refers to how readily available the data is. Data needs to be easily
accessible in order to be useful. If not, then this can hinder the performance of campaigns –
especially where time is of the essence.
Validity: The validity of information refers to the format it is presented in. For example,
birthdays can be formatted in different ways: day/month/year or month/day/year. This
format can vary depending on the country, industry, or business standards. In order for data
to be valid, it needs to be entered in the way that the data system recognises. For instance,
the birthday 14/05/1998 would be invalid in a system that formats birthdays in the
month/day/year format – since months of the year do not exceed 12.
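To make these characteristics concrete, here is a minimal sketch (hypothetical contact data; pandas assumed) that checks a small contact list for uniqueness, completeness, consistency, and validity.

```python
import pandas as pd

# Hypothetical marketing contact list used to illustrate the quality checks.
contacts = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example", None],
    "phone": ["+44 7700 900123", "07700 900123", "+44 7700 900456", "07700 900789"],
    "birthday": ["14/05/1998", "05/14/1998", "01/12/1990", "30/02/2000"],
})

# Uniqueness: duplicate emails would trigger repeated communications.
duplicates = contacts["email"].duplicated(keep=False).sum()

# Completeness: the core field (email) must be present for an email campaign.
missing_emails = contacts["email"].isna().sum()

# Consistency: all phone numbers should share one format (here: +44 prefix).
inconsistent_phones = (~contacts["phone"].str.startswith("+44")).sum()

# Validity: birthdays must parse as real dates in day/month/year format.
parsed = pd.to_datetime(contacts["birthday"], format="%d/%m/%Y", errors="coerce")
invalid_birthdays = parsed.isna().sum()

print(f"duplicates={duplicates}, missing emails={missing_emails}, "
      f"inconsistent phones={inconsistent_phones}, invalid birthdays={invalid_birthdays}")
```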
Dealing with missing or incomplete data
Even in a well-designed and controlled study, missing data occurs in almost all research.
Missing data can reduce the statistical power of a study and can produce biased estimates,
leading to invalid conclusions. This section reviews the problems and types of missing data,
along with the techniques for handling them. The mechanisms by which missing data occur
are illustrated, the methods for handling missing data are discussed, and the section
concludes with recommendations for handling missing data.
Missing data (or missing values) is defined as the data value that is not stored for a variable
in the observation of interest. The problem of missing data is relatively common in almost all
research and can have a significant effect on the conclusions that can be drawn from the data.
Accordingly, some studies have focused on handling missing data, the problems it causes,
and the methods to avoid or minimize such problems, particularly in medical research.
However, until recently, most researchers have drawn conclusions based on the assumption
of a complete data set. The general topic of missing data has attracted little attention in the
field of anesthesiology.
Missing data present various problems. First, the absence of data reduces statistical power,
which refers to the probability that the test will reject the null hypothesis when it is false.
Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the
representativeness of the samples. Fourth, it may complicate the analysis of the study. Each
of these distortions may threaten the validity of the trials and can lead to invalid conclusions.
It is not uncommon to have a considerable amount of missing data in a study. One technique
of handling the missing data is to use the data analysis methods which are robust to the
problems caused by the missing data. An analysis method is considered robust to the missing
data when there is confidence that mild to moderate violations of the assumptions will
produce little to no bias or distortion in the conclusions drawn on the population. However,
it is not always possible to use such techniques.
Therefore, a number of alternative ways of handling missing data have been developed.
Listwise or case deletion
By far the most common approach to the missing data is to simply omit those cases with the
missing data and analyze the remaining data. This approach is known as the complete case
(or available case) analysis or listwise deletion.
Listwise deletion is the most frequently used method of handling missing data, and thus has
become the default option for analysis in most statistical software packages. Some
researchers argue that it may introduce bias in the estimation of the parameters. However, if
the assumption of MCAR (missing completely at random) is satisfied, listwise deletion is
known to produce unbiased estimates and conservative results. When the data do not fulfill
the assumption of MCAR, listwise deletion may bias the estimates of the parameters [9].
If there is a large enough sample, where power is not an issue, and the assumption of MCAR
is satisfied, the listwise deletion may be a reasonable strategy. However, when there is not a
large sample, or the assumption of MCAR is not satisfied, the listwise deletion is not the
optimal strategy.
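A minimal sketch of listwise (complete case) deletion, assuming a small hypothetical dataset and pandas; each row with any missing value is discarded before analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values scattered across variables.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [40_000, np.nan, 52_000, 61_000, 45_000],
    "score":  [7.1, 6.4, 8.0, np.nan, 6.9],
})

# Listwise deletion: keep only rows with no missing values at all.
complete_cases = df.dropna()
print(f"{len(df) - len(complete_cases)} of {len(df)} cases dropped")
print(complete_cases.mean())  # estimates based on the complete cases only
```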
Pairwise deletion
Pairwise deletion eliminates information only when the particular data-point needed to test
a particular assumption is missing. If there is missing data elsewhere in the data set, the
existing values are used in the statistical testing. Since a pairwise deletion uses all information
observed, it preserves more information than the listwise deletion, which may delete the case
with any missing data. This approach presents the following problems: 1) the parameters of
the model will be estimated on different subsets of the data, with different statistics such as
sample sizes and standard errors; and 2) it can produce an intercorrelation matrix that is not
positive definite, which is likely to prevent further analysis [10].
Pairwise deletion is known to be less biased for MCAR or MAR (missing at random) data,
provided that the appropriate mechanisms are included as covariates. However, if there are
many missing observations, the analysis will be deficient.
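A minimal sketch of pairwise deletion, again with hypothetical data: pandas computes each correlation from all rows where both variables are observed, so different cells of the matrix rest on different numbers of cases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 37],
    "income": [40_000, np.nan, 52_000, 61_000, 45_000, 58_000],
    "score":  [7.1, 6.4, 8.0, np.nan, 6.9, 7.5],
})

# Pairwise deletion: each correlation uses every row where *both* variables
# are observed, so the matrix cells rest on different sample sizes.
pairwise_corr = df.corr()           # pairwise-complete observations
listwise_corr = df.dropna().corr()  # listwise deletion, for comparison
print(pairwise_corr)
print(listwise_corr)
```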
Mean substitution
In a mean substitution, the mean value of a variable is used in place of the missing data value
for that same variable. This allows the researchers to utilize the collected data in an
incomplete dataset. The theoretical background of the mean substitution is that the mean is
a reasonable estimate for a randomly selected observation from a normal distribution.
However, with missing values that are not strictly random, especially in the presence of a
great inequality in the number of missing values for the different variables, the mean
substitution method may lead to inconsistent bias. Furthermore, this approach adds no new
information but only increases the sample size and leads to an underestimate of the errors.
Thus, mean substitution is not generally accepted.
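A minimal sketch of mean substitution with pandas on hypothetical data; note how the spread of the imputed variable shrinks, which illustrates why the errors are underestimated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [40_000, np.nan, 52_000, 61_000, 45_000],
})

# Mean substitution: replace each missing value with its column mean.
imputed = df.fillna(df.mean())

# The standard deviation shrinks after imputation, which is one reason
# mean substitution underestimates the errors.
print(df["income"].std(), imputed["income"].std())
```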
Regression imputation
Imputation is the process of replacing the missing data with estimated values. Instead of
deleting any case that has any missing value, this approach preserves all cases by replacing
the missing data with a probable value estimated by other available information. After all
missing values have been replaced by this approach, the data set is analyzed using the
standard techniques for a complete data.
In regression imputation, the existing variables are used to make a prediction, and the
predicted value is then substituted as if it were an actually observed value. This approach has
a number of advantages, because the imputation retains a great deal of data compared with
listwise or pairwise deletion and avoids significantly altering the standard deviation or the
shape of the distribution. However, as with mean substitution, regression imputation
substitutes values predicted from other variables, so no new information is added, while the
sample size is increased and the standard error is underestimated.
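A minimal sketch of regression imputation on hypothetical data, using scikit-learn's LinearRegression to predict the missing values from a fully observed variable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'income' has gaps, 'age' is fully observed.
df = pd.DataFrame({
    "age":    [25, 32, 38, 41, 29, 47],
    "income": [40_000, 44_000, np.nan, 61_000, np.nan, 66_000],
})

observed = df["income"].notna()

# Fit a regression of income on age using the observed cases only.
model = LinearRegression().fit(df.loc[observed, ["age"]], df.loc[observed, "income"])

# Regression imputation: substitute the predicted value for each missing case.
df.loc[~observed, "income"] = model.predict(df.loc[~observed, ["age"]])
print(df)
```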
Although simple, single imputation methods such as last observation carried forward (LOCF)
strongly assume that the value of the outcome remains unchanged by the missing data, which
seems unlikely in many settings (especially in anesthetic trials). They produce a biased
estimate of the treatment effect and underestimate the variability of the estimated result.
Accordingly, the National Academy of Sciences has recommended against the uncritical use
of simple imputation, including LOCF and baseline observation carried forward, stating that:
Single imputation methods like last observation carried forward and baseline observation
carried forward should not be used as the primary approach to the treatment of missing data
unless the assumptions that underlie them are scientifically justified.
Maximum likelihood
There are a number of strategies using the maximum likelihood method to handle the missing
data. In these, the assumption that the observed data are a sample drawn from a multivariate
normal distribution is relatively easy to understand. After the parameters are estimated using
the available data, the missing data are estimated based on the parameters which have just
been estimated.
When there are missing but relatively complete data, the statistics explaining the
relationships among the variables may be computed using the maximum likelihood method.
That is, the missing data may be estimated by using the conditional distribution of the other
variables.
Expectation-Maximization
Expectation-Maximization (EM) is a type of the maximum likelihood method that can be used
to create a new data set, in which all missing values are imputed with values estimated by the
maximum likelihood methods. This approach begins with the expectation step, during which
the parameters (e.g., variances, covariances, and means) are estimated, perhaps using the
listwise deletion. Those estimates are then used to create a regression equation to predict
the missing data. The maximization step uses those equations to fill in the missing data. The
expectation step is then repeated with the new parameters, where the new regression
equations are determined to "fill in" the missing data. The expectation and maximization
steps are repeated until the system stabilizes, when the covariance matrix for the subsequent
iteration is virtually the same as that for the preceding iteration.
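A minimal sketch of the idea behind EM-based imputation, on hypothetical data: parameters are estimated from the current completed data set, the missing values are re-filled from the fitted equation, and the two steps are repeated until the imputed values stabilize. This is only an illustration of the iteration, not a full EM implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is partially missing, x is fully observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.2, np.nan, 9.9, 12.3],
})
missing = df["y"].isna()

# Initial guess: fill the missing y values with the observed mean.
df.loc[missing, "y"] = df["y"].mean()

for iteration in range(100):
    previous = df.loc[missing, "y"].copy()

    # Estimate the parameters (here, a regression of y on x) from the
    # current completed data set.
    model = LinearRegression().fit(df[["x"]], df["y"])

    # Use the fitted equation to fill in the missing values again.
    df.loc[missing, "y"] = model.predict(df.loc[missing, ["x"]])

    # Repeat until the imputed values barely change between iterations.
    if np.allclose(previous, df.loc[missing, "y"], atol=1e-6):
        break

print(df)
```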
Multiple imputation
Multiple imputation is another useful strategy for handling the missing data. In a multiple
imputation, instead of substituting a single value for each missing data, the missing values are
replaced with a set of plausible values which contain the natural variability and uncertainty
of the right values.
This approach begins with a prediction of the missing data using the existing data from other
variables. The missing values are then replaced with the predicted values, creating a full data
set called the imputed data set. This process is repeated several times, producing multiple
imputed data sets (hence the term "multiple imputation"). Each imputed data set is then
analyzed using the standard statistical procedures for complete data, giving multiple analysis
results. Subsequently, these analysis results are combined to produce a single overall result.
The benefit of the multiple imputation is that in addition to restoring the natural variability
of the missing values, it incorporates the uncertainty due to the missing data, which results
in a valid statistical inference. Restoring the natural variability of the missing data can be
achieved by replacing the missing data with the imputed values which are predicted using the
variables correlated with the missing data. Incorporating uncertainty is made by producing
different versions of the missing data and observing the variability between the imputed data
sets.
Multiple imputation has been shown to produce valid statistical inference that reflects the
uncertainty associated with the estimation of the missing data. Furthermore, multiple
imputation turns out to be robust to the violation of the normality assumptions and produces
appropriate results even in the presence of a small sample size or a high number of missing
data.
Although the statistical principles of multiple imputation may be difficult to understand,
modern statistical software makes the approach easy to apply.
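A minimal sketch of the multiple imputation workflow using scikit-learn's IterativeImputer (an experimental API that must be enabled explicitly): several imputed data sets are created, each is analyzed with a standard complete-data method, and the results are pooled. The data and the choice of five imputations are illustrative.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn, so it must be
# enabled explicitly before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 47, 35, np.nan],
    "income": [40_000, 44_000, 52_000, np.nan, 45_000, 66_000, np.nan, 50_000],
})

estimates = []
# Create several imputed data sets; sample_posterior=True adds the random
# variation that distinguishes one imputed data set from the next.
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    # Analyze each completed data set with the usual complete-data method.
    estimates.append(completed["income"].mean())

# Pool the per-data-set results into a single overall estimate.
print("pooled mean income:", np.mean(estimates))
```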
Sensitivity analysis
Sensitivity analysis is the study of how the uncertainty in the output of a model can be
apportioned to the different sources of uncertainty in its inputs.
When analyzing missing data, additional assumptions about the reasons for the missingness
are made, and these assumptions are often applied to the primary analysis. However, the
assumptions cannot be definitively validated. Therefore, the National Research Council has
proposed that a sensitivity analysis be conducted to evaluate the robustness of the results to
deviations from the MAR assumption.
Recommendations
Missing data reduce the power of a trial. Some amount of missing data is expected, and the
target sample size is often increased to allow for it; however, this cannot eliminate the
potential bias. More attention should be paid to missing data in the design and conduct of
studies and in the analysis of the resulting data.
The best solution to missing data is to maximize data collection when the study protocol is
designed and the data are collected. Sophisticated statistical analysis techniques should be
applied only after maximal efforts have been made to reduce missing data through design
and prevention.
A statistically valid analysis which has appropriate mechanisms and assumptions for the
missing data should be conducted. Single imputation and LOCF are not optimal approaches
for the final analysis, as they can cause bias and lead to invalid conclusions. All variables which
present the potential mechanisms to explain the missing data must be included, even when
these variables are not included in the analysis. Researchers should seek to understand the
reasons for the missing data. Distinguishing what should and should not be imputed is usually
not possible using a single code for every type of the missing value. It is difficult to know
whether the multiple imputation or full maximum likelihood estimation is best, but both are
superior to the traditional approaches. Both techniques are best used with large samples. In
general, multiple imputation is a good approach when analyzing data sets with missing data.
Data visualization
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.
While traditional education typically draws a distinct line between creative storytelling and
technical analysis, the modern professional world also values those who can cross between
the two: data visualization sits right in the middle of analysis and visual storytelling.
Data visualization is used, or can be used, in almost every industry.
Data classification
What is Data Classification
Data classification tags data according to its type, sensitivity, and value to the organization if
altered, stolen, or destroyed. It helps an organization understand the value of its data,
determine whether the data is at risk, and implement controls to mitigate risks. Data
classification also helps an organization comply with relevant industry-specific regulatory
mandates such as SOX, HIPAA, PCI DSS, and GDPR.
Data Discovery
Classifying data requires knowing the location, volume, and context of data.
Most modern businesses store large volumes of data, which may be spread across multiple
repositories:
Databases deployed on-premises or in the cloud
Big data platforms
Collaboration systems such as Microsoft SharePoint
Cloud storage services such as Dropbox and Google Docs
Files such as spreadsheets, PDFs, or emails
Before you can perform data classification, you must perform accurate and comprehensive
data discovery. Automated tools can help discover sensitive data at large scale.
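As a rough illustration of automated discovery and classification, the sketch below (hypothetical column names and deliberately simple regular expressions, not a production scanner) scans a table for values that look like personal or payment data and tags each column with a sensitivity label.

```python
import pandas as pd

# Hypothetical table mixing sensitive and non-sensitive columns.
records = pd.DataFrame({
    "order_id": ["A100", "A101", "A102"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "card_number": ["4111111111111111", "5500005555555559", "340000000000009"],
})

# Deliberately simple detectors; a real discovery tool uses far more checks.
PATTERNS = {
    "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "payment_card": r"^\d{13,19}$",
}

def classify_column(values):
    """Tag a column as sensitive if most of its values match a known pattern."""
    for label, pattern in PATTERNS.items():
        if values.astype(str).str.match(pattern).mean() > 0.8:
            return f"sensitive:{label}"
    return "public"

for column in records.columns:
    print(column, "->", classify_column(records[column]))
```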
A data classification policy also determines the data classification process: how often data
classification should take place, for which data, which type of data classification is suitable
for different types of data, and what technical means should be used to classify data. The
data classification policy is part of the overall information security policy, which specifies
how to protect sensitive data.
Data Science Project Life Cycle
Example
Suppose a retail company wants to increase its sales by identifying the factors that influence
customer purchase decisions. The data science team will identify the problem and plan the
project by determining the data sources (e.g., transaction data, customer data), the data
collection process (e.g., data cleaning, data transformation), and the analytical methods (e.g.,
regression analysis, decision trees) that will be used to analyze the data.
Example
In the retail company example, the data science team will collect data on customer
demographics, transaction history, and product information.
Having acquired the data, data scientists have to clean and reformat it, either by manually
editing it in a spreadsheet or by writing code. This step of the data science project life cycle
does not itself produce meaningful insights. However, through careful data cleaning, data
scientists can identify what issues exist in the data acquisition process, what assumptions
they should make, and what models they can apply to produce analysis results. After
reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy
to load into a data science tool.
Exploratory data analysis is an integral part of this stage, as summarizing the clean data helps
identify outliers, anomalies, and patterns that can be used in subsequent steps. This is the
step that helps data scientists answer the question of what they actually want to do with the
data.
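A minimal sketch of this cleaning and exploratory step for the retail example, using pandas on hypothetical transaction records: duplicates are removed, categories are standardised, the data is summarized, a simple outlier check is run, and the reformatted data is written to CSV and JSON.

```python
import numpy as np
import pandas as pd

# Hypothetical raw transaction extract for the retail example.
transactions = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "amount": [25.0, 310.0, 310.0, np.nan, 19.5],
    "channel": ["online", "store", "store", "online", "ONLINE"],
})

# Clean and reformat: drop exact duplicates and standardise categories.
clean = transactions.drop_duplicates().copy()
clean["channel"] = clean["channel"].str.lower()

# Exploratory summary: data types, ranges, distributions, and missing values.
print(clean.dtypes)
print(clean.describe(include="all"))
print(clean.isna().sum())

# Simple outlier check: flag amounts more than 1.5 IQRs beyond the quartiles.
q1, q3 = clean["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = clean[(clean["amount"] < q1 - 1.5 * iqr) | (clean["amount"] > q3 + 1.5 * iqr)]
print(outliers)

# The reformatted data can then be written to CSV or JSON for the next stage.
clean.to_csv("transactions_clean.csv", index=False)
clean.to_json("transactions_clean.json", orient="records")
```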
Analyze the data to better understand its properties, such as data kinds, ranges, and
distributions. Identify any potential issues, such as missing values, exceptions, or errors.
Choose a representative sample of the dataset for validation. This technique is useful for
larger datasets because it minimizes processing effort.
Apply planned validation rules to the collected data. Rules may contain format checks, range
validations, or cross-field validations.
Identify records that do not fulfill the validation standards. Keep track of any flaws or
discrepancies for future analysis.
Correct identified mistakes by cleaning, converting, or entering data as needed. Maintaining
an audit record of modifications made during this procedure is critical.
Automate data validation activities as much as feasible to ensure consistent and ongoing data
quality maintenance.
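The sketch below applies the kinds of validation rules described above (a format check, a range check, and a cross-field check) to a hypothetical order table and keeps the failing records for later correction.

```python
import pandas as pd

# Hypothetical order records to validate.
orders = pd.DataFrame({
    "order_id": ["A100", "A101", "BAD!", "A103"],
    "quantity": [2, -1, 3, 5],
    "unit_price": [9.99, 4.50, 2.00, 0.00],
    "total": [19.98, 4.50, 6.00, 1.00],
})

# Planned validation rules: format, range, and cross-field checks.
format_ok = orders["order_id"].str.match(r"^A\d{3}$")
range_ok = orders["quantity"] > 0
cross_ok = (orders["quantity"] * orders["unit_price"]).round(2) == orders["total"]

orders["valid"] = format_ok & range_ok & cross_ok

# Flag records that fail any rule and keep them for later correction,
# so an audit trail of the problems (and any fixes) can be maintained.
flagged = orders[~orders["valid"]]
print(flagged)
```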
Example
In the retail company example, the data science team will remove any duplicate or missing
data from the customer and transaction datasets. They may also merge the datasets to create
a single dataset that can be analyzed.
The fourth step in the data science project life cycle is data analysis. This involves applying
analytical methods to the data to extract insights and patterns. The data science team may
use techniques such as regression analysis, clustering, or machine learning algorithms to
analyze the data.
Example
In the retail company example, the data science team may use regression analysis to identify
the factors that influence customer purchase decisions. They may also use clustering to
segment customers based on their purchase behavior.
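A minimal sketch of this analysis step for the retail example, using scikit-learn on simulated customer data (the features, coefficients, and cluster count are purely illustrative): a linear regression estimates how each factor relates to spending, and k-means groups customers by purchase behaviour.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Simulated customer-level features for the retail example.
customers = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "visits_per_month": rng.integers(1, 12, size=200),
    "discount_used": rng.integers(0, 2, size=200),
})
# Simulated monthly spend so the example is self-contained.
customers["monthly_spend"] = (
    20 + 0.8 * customers["age"] + 6 * customers["visits_per_month"]
    + 15 * customers["discount_used"] + rng.normal(0, 10, size=200)
)

# Regression analysis: which factors influence purchase amounts?
X = customers[["age", "visits_per_month", "discount_used"]]
reg = LinearRegression().fit(X, customers["monthly_spend"])
print(dict(zip(X.columns, reg.coef_.round(2))))

# Clustering: segment customers by their purchase behaviour.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["segment"] = kmeans.fit_predict(customers[["visits_per_month", "monthly_spend"]])
print(customers["segment"].value_counts())
```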
Example
In the retail company example, the data science team may build a predictive model that can
be used to predict customer purchase behavior based on demographic and product
information.
Step 6: Model Evaluation
The sixth step in the data science project life cycle is model evaluation. This involves
evaluating the performance of the predictive model to ensure that it is accurate and reliable.
The data science team will test the model using a validation dataset to determine its accuracy
and performance.
Different tasks call for different evaluation metrics. For instance, if the machine learning
model aims to predict a daily stock price, then RMSE (root mean squared error) should be
used for evaluation. If the model aims to classify spam emails, then metrics such as accuracy,
AUC, and log loss should be considered. A common question when evaluating a machine
learning model is which dataset should be used to measure its performance. Looking at the
metrics on the training dataset is informative but can be misleading, because the numbers
obtained may be overly optimistic: the model has already been fitted to the training data.
Model performance should therefore be measured and compared using validation and test
sets to identify the best model in terms of accuracy and over-fitting.
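A minimal sketch of evaluating a classifier on a held-out validation set with scikit-learn, using synthetic data; accuracy, AUC, and log loss are computed on the validation split rather than on the training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary classification data (e.g., spam vs. not spam).
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out a validation set; training-set scores alone tend to be optimistic.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

print("accuracy:", accuracy_score(y_valid, model.predict(X_valid)))
print("AUC:     ", roc_auc_score(y_valid, proba))
print("log loss:", log_loss(y_valid, proba))

# For a regression target (e.g., a daily stock price), RMSE could be computed
# instead, for example as np.sqrt(mean_squared_error(y_true, y_pred)).
```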
All of the steps from 1 to 4 above are iterated as data is acquired continuously and the
business understanding becomes clearer.
Example
In the retail company example, the data science team may test the predictive model using a
validation dataset to ensure that it accurately predicts customer purchase behavior.
Example
In the retail company example, the data science team may deploy the predictive model into
the company’s customer relationship management (CRM) system so that it can be used to
make targeted marketing campaigns.
Conclusion
The data science project life cycle provides a structured approach for data scientists to
develop data-driven solutions that address specific business problems.
By following the steps outlined in the data science project life cycle, data scientists can ensure
that their projects are completed efficiently and effectively. This methodology enables data
scientists to deliver high-quality solutions that provide real value to the business.