
UNIT II (8 hrs.)
Data: Data Collection, Data Management, Big Data Management, Organization/sources of
data, Types of Data, Importance of data quality, Dealing with missing or incomplete data, Data
Visualization, Data Classification
Data Science Project Life Cycle: Business Requirement, Data Acquisition, Data Preparation,
Hypothesis and Modelling, Evaluation and Interpretation, Deployment, Operations,
optimization.

Data collection
Below are seven steps you can use to collect data:
1. Identify opportunities for data collection
The first step to collecting data is to find an opportunity where that process would be useful
for you or your organization. This can include situations such as needing sales information, wanting to understand public opinion about a product, brand, or company, or wanting to preserve certain information for later use. Sometimes, the opportunities for data collection
come from challenges presented in the workplace, such as inefficient processes or decreasing
sales, which you can address and overcome by collecting data and making changes through
data-based decision-making. There may also be more than one opportunity from which to
choose.
2. Select opportunities and set goals
Once you understand the opportunities available to you for data collection, you can choose
which ones you want to pursue. Some may be easier to pursue than others, but many offer clear benefits, such as increased sales or having more information to work with. After you select
the opportunities you want to pursue, set goals for how you want the data to be used and
why that data is important to you or your organization. This can help you further refine your
data collection and create short- and long-term projects you can assign to project teams.
3. Create a plan and set methods for data collection
Once you have your opportunities and goals prepared, you can create plans for short- and
long-term projects to complete. This includes assigning teams to work on those projects,
creating ways to store information and planning for how much information you want to
gather. This is also where you can choose which methods of data collection you use, which
depends on the type of data you want to collect. For example, using questionnaires is a good way to gather information about public opinion, while monitoring social media and online marketing can produce numerical data you can use for other projects.
4. Validate your systems of measurement
Once you create your plans for data collection and decide on a method you want to use,
ensure that your systems of measurement are accurate and that you are gathering the
information you want. Accurate measurements are important for data collection because the data you collect informs your decisions, and accurate measurements, especially for quantitative data, help you make the best decisions for your organization.
This step is also important because it can help you understand the best methods of data
collection for your particular needs and help you create rigorous methods for your data.
5. Collect data
After creating plans and methods for collecting the data you want, you can enact those plans.
This can be the longest step in the data collection process because you may have thousands
of data points you want to collect depending on the type of information you want and the
method you use. Some methods, such as social media monitoring, are an ongoing process,
so setting a start and end date for data collection can help you process the information more
accurately. Others, such as interviews and research, can be quick to carry out but take longer to interpret.
6. Analyze data
After you finish collecting your data or set a time to stop the data collection for those methods
that are ongoing, you can analyze your data. This includes organizing it in a way that you and
project teams can understand and then interpreting that data to inform your decision-
making. Often, a combination of qualitative and quantitative data is useful for decision-
making. For example, collecting opinions about your brand after noticing that its sales are declining can help you make decisions that improve public perception and sales once you understand your data.
7. Act based on the data
The last step for data collection is to understand how to react to the information you gather.
For example, if your marketing data shows that sales are increasing steadily without further input from the marketing team, then you can continue to act as you have been. Similarly, if your data shows sales are decreasing, then you can form a plan to increase marketing efforts and even create a new product that consumers may want to use. Either way, the decision is data-based decision-making and can help you make excellent choices for your
organization.

Data Management
Put simply, a data management strategy is an organization’s roadmap for using data to
achieve its goals. This roadmap ensures that all the activities surrounding data
management—from collection to collaboration—work together effectively and efficiently to
be as useful as possible and easy to govern. With a data management strategy in place, your
company can avoid some of these common data challenges:
 Incompatible, duplicate, or missing data from undocumented or inconsistently
documented sources
 Siloed projects that use the same data, yet duplicate the efforts and costs associated
with that data
 Data activities that consume time and resources but do not contribute to overall
business objectives
A data management strategy will be the strong foundation needed for consistent project
approaches, successful integration, and business growth.

5 steps to an effective data management strategy


If your company faces these kinds of challenges, it’s time to develop an enterprise data
management strategy. Fine-tuning and finalizing a strategy that works best for your business
will take time, but you can start with these five steps.
1. Identify business objectives
If you don’t let your business objectives inform your data management strategy, you could
waste valuable time and resources collecting, storing, and analyzing the wrong types of data.
It is usually helpful to ask questions like:
 What are your organization’s overall objectives?
 What data is needed to meet these objectives?
 What types of insights and information are required to make progress against these
initiatives?
Focus on the three to five most critical use cases for your company’s data and build your
strategy from there. With your business objective at top of mind, these priorities will help
determine processes, tools, governance, and more.

2. Create strong data processes


Now that you know how you will use your data, it’s time to think through the processes in
place for collecting, preparing, storing, and distributing the data. Begin by identifying the
owners and stakeholders for each of the following data management activities. The questions
below are a great place to start as you consider each step of the process.
Collect
 What will be your data sources?
 Will you need access to both external and internal assets?
 Do you need structured data, unstructured data, or a combination of both?
 How will the data be collected?
 Is this a task that will be done manually as needed or will you set up extract scheduling?

Prepare
 How will you clean and transform raw data to prepare it for analysis?
 How will you identify incomplete or disparate data?
 What will be the guidelines for naming data, documenting lineage, and adding
metadata to increase discoverability?
Store
 Where will you store your data?
 Will you use XML, CSV, or relational databases for structured data?
 Do you need a data lake for unstructured data?
 How will you keep your data secure?

Analyze and Distribute


 Which teams or departments need the ability to collaborate?
 How can you make access to data and analysis easier for the end-user?
 How will you communicate any data insights?

Big Data management


What is big data?
Big data is a combination of structured, semi-structured and unstructured data that
organizations collect, analyze and mine for information and insights. It's used in machine
learning projects, predictive modeling and other advanced analytics applications.
Systems that process and store big data have become a common component of data
management architectures in organizations. They're combined with tools that support big
data analytics uses. Big data is often characterized by the three V's:
The large volume of data in many environments.
The wide variety of data types frequently stored in big data systems.
The high velocity at which the data is generated, collected and processed.

Big data management technologies


Hadoop, an open source distributed processing framework released in 2006, was initially at
the center of most big data architectures. The development of Spark and other processing
engines pushed MapReduce, the engine built into Hadoop, more to the side. The result is an
ecosystem of big data technologies that can be used for different applications but often are
deployed together.

IT vendors offer big data platforms and managed services that combine many of those
technologies in a single package, primarily for use in the cloud. For organizations that want
to deploy big data systems themselves, either on premises or in the cloud, various tools are
available in addition to Hadoop and Spark. They include the following categories of tools:
Storage repositories.
Cluster management frameworks.
Stream processing engines.
NoSQL databases.
Data lake and data warehouse platforms.
SQL query engines.
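As an illustration of how one of these processing engines is used in practice, below is a minimal PySpark (Spark's Python API) sketch. It assumes pyspark is installed and that a local transactions.csv file with category and amount columns exists; both the file and its columns are illustrative assumptions, not part of the source material.

# Minimal PySpark sketch (assumes pyspark is installed and an illustrative
# transactions.csv file with "category" and "amount" columns exists).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in a real deployment this would point at a cluster.
spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Read the (hypothetical) CSV file into a distributed DataFrame.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# A simple aggregation that Spark distributes across its workers:
# total sales per product category, largest first.
(df.groupBy("category")
   .agg(F.sum("amount").alias("total_amount"))
   .orderBy(F.desc("total_amount"))
   .show(10))

spark.stop()

In a real deployment the same code would run against a cluster, with the data held in HDFS, cloud object storage, or a data lake rather than a local file.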

Big data benefits


Organizations that use and manage large data volumes correctly can reap many benefits, such
as the following:
Enhanced decision-making. An organization can glean important insights, risks, patterns or
trends from big data. Large data sets are meant to be comprehensive and encompass as much
information as the organization needs to make better decisions. Big data insights let business
leaders quickly make data-driven decisions that impact their organizations.
Better customer and market insights. Big data that covers market trends and consumer habits
gives an organization the important insights it needs to meet the demands of its intended
audiences. Product development decisions, in particular, benefit from this type of insight.
Cost savings. Big data can be used to pinpoint ways businesses can enhance operational
efficiency. For example, analysis of big data on a company's energy use can help it be more
efficient.
Positive social impact. Big data can be used to identify solvable problems, such as improving
healthcare or tackling poverty in a certain area.

Big data challenges


There are common challenges for data experts when dealing with big data. They include the
following:

Architecture design. Designing a big data architecture focused on an organization's processing capacity is a common challenge for users. Big data systems must be tailored to an
organization's particular needs. These types of projects are often do-it-yourself undertakings
that require IT and data management teams to piece together a customized set of
technologies and tools.
Skill requirements. Deploying and managing big data systems also requires new skills
compared to the ones that database administrators and developers focused on relational
software typically possess.
Costs. Using a managed cloud service can help keep costs under control. However, IT
managers still must keep a close eye on cloud computing use to make sure costs don't get out
of hand.
Migration. Migrating on-premises data sets and processing workloads to the cloud can be a
complex process.
Accessibility. Among the main challenges in managing big data systems is making the data
accessible to data scientists and analysts, especially in distributed environments that include
a mix of different platforms and data stores. To help analysts find relevant data, data
management and analytics teams are increasingly building data catalogs that incorporate
metadata management and data lineage functions.
Integration. The process of integrating sets of big data is also complicated, particularly when
data variety and velocity are factors.

Sources of data
What are the sources of data?
In short, the sources of data are physical or digital places where information is stored in a
data table, data object, or some other storage format.

Data can be gathered in 2 ways: primary & secondary


For data analysis, data must be collected through either primary or secondary research. A data source is a pool of statistical and non-statistical facts that a researcher or analyst can draw on to carry out further work on their research. Data analytics and data analysis are closely related processes that involve extracting insights from data to make informed decisions.
Primary data
Primary data is data you collect yourself, or that you pay someone to collect on your behalf, specifically for your own research.

Secondary data
Secondary data is data that has already been collected by someone else, usually from outside the organization; such sources are called external data sources. As a researcher, you can still work with this externally collected data.
Data from external origins is harder to gather because it is much more varied and there can be many sources. External data can be grouped into the categories given below:
Government publications
Researchers can get a massive amount of information from government sources. Also, you
can get much of this information for free on the Internet.
Non-government publications
Researchers can also find industry-related information in non-government publications. The
only research problem with non-government publications is that their data may sometimes
be biased.
Syndicate services
Some companies offer syndicate services: they collect and organize the same marketing information for all their clients. They gather information from households through surveys, mail diary panels, electronic services, wholesalers, industrial firms, retailers, and similar channels.

Types of data
As defined above – primary and secondary

Importance of data quality


Increasingly, organisations use data to aid in the decision-making process, which has led to
an increased importance of data quality in a business. Data quality is important because it
ensures that the information used to make key business decisions is reliable, accurate, and
complete.
It is critical to ensure data quality throughout the data management process. Data accuracy
and reliability are key factors for executives to be able to trust data and make well-informed
decisions. When data quality practices are poor, there can be significant repercussions.
Imprecise analytics, profit loss, unreliable business strategies, and operational errors can all
be traced to poor-quality data.
Using high-quality data, businesses can analyse data, conduct marketing campaigns, and
create reliable strategies much more quickly and efficiently. This results in better return on
investment, and more precise marketing.
As well as improving the dataset itself, high-quality data can help reduce risks and costs and improve worker productivity. With quality data, marketers and data managers spend less time identifying and validating data errors, and more time using the data for its purpose.
Quality data can also help businesses engage with customers more effectively, ensuring that
those in the database are valid, active contacts. It can even help avoid brand damage. For
example, many organisations screen data for deceased contacts in order to avoid sending
marketing materials to the individual or their families, which could otherwise be viewed as
insensitive.

Data quality and data compliance


There is a direct crossover between data quality and data compliance. For example, data
protection laws, such as the General Data Protection Regulation (GDPR), require businesses
to correct inaccurate or incomplete personal information. To maintain high data quality
standards, businesses must ensure the accuracy of their information.
Data inaccuracies are often the leading cause of data leaks, accounting for 88% of UK data
breaches, which is one of the reasons these laws are in place.
In order to remain compliant and secure, businesses should undertake regular data quality
audits. One 2021 survey found that 80% of SMEs were aware of GDPR laws around clean and accurate personal data, which means that 20% were not.

Additionally, failing to implement data quality standards can result in GDPR fines. Under GDPR
compliance laws, businesses can face fines of up to £17.5 million or 4% of the preceding
financial year’s global turnover – whichever is higher.
These fines highlight the importance of keeping data clean and of high quality. Not only for
better business performance, but also for data compliance.

What does good data quality look like?


Good data quality can look different for every dataset. Data quality is less about hitting a
certain standardised criteria, and more so about ensuring that the data is suitable for its
specific purpose. For example, a healthcare company might require a list of complete,
accurate, and valid healthcare records in order for the data to be high quality. Whereas this
kind of data would not be relevant in other industries.

It is therefore not necessary for every value to be flawless; this is why there will be different levels of good quality in different datasets. Remember that good quality datasets do not have a universal criterion, but a proactive approach to data quality management and to improving poor quality data is crucial.

Uniqueness: Data is considered high quality when it is unique. This helps ensure that there is
no duplication in values across the dataset, keeping data clean and precise. Removing
duplicate entries can help avoid sending multiple marketing communications to the same
contact, reducing costs and protecting the brand image.
Completeness: Data is complete when the dataset contains all the necessary information
required to carry out specific activities. Completeness does not mean that every possible
entry has to be full – it is about fulfilling relevant data entries specific to the intended activity.
For example, an email marketing database would require a full set of email addresses in order
to be complete, but it would not require phone numbers in order to carry out the core
activities.
Consistency: Consistency refers to how well the data entries follow the same format
throughout the dataset. To ensure consistency, the same data values and formatting should
be used throughout. For example, phone numbers should all be presented in the same way
for each contact, such as 07 vs +44.
Accuracy: Accuracy is one of the most important characteristics of high-quality data. This
refers to how well the data reflects reality. For instance, a postcode that is not truly reflective
of the contact’s address would be inaccurate. Businesses need reliable information to make informed decisions. Inaccurate data needs to be identified, documented, and fixed so that the business has the highest quality information possible. It is essential that the data used in
marketing and advertising is accurate in order to ensure that communications target active
customers and prevent mistakes.
Timeliness: Timeliness refers to how readily available the data is. Data needs to be easily
accessible in order to be useful. If not, then this can hinder the performance of campaigns –
especially where time is of the essence.
Validity: The validity of information refers to the format it is presented in. For example,
birthdays can be formatted in different ways: day/month/year or month/day/year. This
format can vary depending on the country, industry, or business standards. In order for data
to be valid, it needs to be entered in the way that the data system recognises. For instance,
the birthday 14/05/1998 would be invalid in a system that formats birthdays in the
month/day/year format – since months of the year do not exceed 12.
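These dimensions can be screened programmatically. The following is a small pandas sketch, assuming hypothetical email, phone, and birthday columns, that checks a dataset for duplicates (uniqueness), missing values (completeness), format validity, and consistency.

# Pandas sketch of simple data quality checks; the columns and values
# are illustrative assumptions, not a prescribed schema.
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", None, "bad-address"],
    "phone": ["+44 7700 900123", "07700 900456", "+44 7700 900123", None],
    "birthday": ["14/05/1998", "1998-05-14", "31/02/2000", "02/11/1985"],
})

# Uniqueness: count fully duplicated rows.
duplicate_rows = df.duplicated().sum()

# Completeness: share of missing values per column.
missing_share = df.isna().mean()

# Validity: does each birthday parse in the expected day/month/year format?
valid_birthday = pd.to_datetime(df["birthday"], format="%d/%m/%Y", errors="coerce").notna()

# Consistency: do all phone numbers follow the same +44 convention?
consistent_phone = df["phone"].str.startswith("+44", na=False)

print("duplicate rows:", duplicate_rows)
print("missing share per column:\n", missing_share)
print("valid birthdays:", valid_birthday.tolist())
print("phones in +44 format:", consistent_phone.tolist())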

How to improve data quality


When considering how to improve data quality, the first step is to assess your data’s current
state. Take a look at what you have, and compare this to what you need to perform your
intended activities.
This will help you identify the main concerns and areas of improvement in your dataset. For
example, are there duplicate entries? Are there data inaccuracies? Is there missing
information?
Once you have identified your main data quality concerns, put together a list of clear
objectives. As an example, you might need to correct data inaccuracies, deduplicate data,
standardise its format, or discard data from a certain time.
Identifying these can sometimes be challenging, especially for those data errors that are
hidden in the dataset, which is where data cleansing services can really help.
Once you have defined your objectives, it’s time to implement these actions across your
datasets. It is also important that you assess data quality across all datasets in order to
improve data quality throughout your entire organisation.
After everything is set into motion, schedule regular data quality audits. This will help you
ensure consistent data quality practices moving forward, and ensure that new errors are
addressed as they occur. One way that businesses do this is with an online data management platform, which helps audit and identify data quality issues.

Dealing with missing or incomplete data

Even in a well-designed and controlled study, missing data occurs in almost all research.
Missing data can reduce the statistical power of a study and can produce biased estimates,
leading to invalid conclusions. This manuscript reviews the problems and types of missing
data, along with the techniques for handling missing data. The mechanisms by which missing
data occurs are illustrated, and the methods for handling the missing data are discussed. The
paper concludes with recommendations for the handling of missing data.
Missing data (or missing values) is defined as the data value that is not stored for a variable
in the observation of interest. The problem of missing data is relatively common in almost all
research and can have a significant effect on the conclusions that can be drawn from the data.
Accordingly, some studies have focused on handling the missing data, problems caused by
missing data, and the methods to avoid or minimize such in medical research.

However, until recently, most researchers have drawn conclusions based on the assumption
of a complete data set. The general topic of missing data has attracted little attention in the
field of anesthesiology.

Missing data present various problems. First, the absence of data reduces statistical power,
which refers to the probability that the test will reject the null hypothesis when it is false.
Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the
representativeness of the samples. Fourth, it may complicate the analysis of the study. Each
of these distortions may threaten the validity of the trials and can lead to invalid conclusions.

Types of Missing Data


Rubin first described and divided the types of missing data according to the assumptions
based on the reasons for the missing data [4]. In general, there are three types of missing
data according to the mechanisms of missingness.
Missing completely at random
Missing completely at random (MCAR) is defined as when the probability that the data are
missing is not related to either the specific value which is supposed to be obtained or the set
of observed responses. MCAR is an ideal but unreasonable assumption for many studies
performed in the field of anesthesiology. However, if data are missing by design, because of
an equipment failure or because the samples are lost in transit or technically unsatisfactory,
such data are regarded as being MCAR.
The statistical advantage of data that are MCAR is that the analysis remains unbiased. Power
may be lost in the design, but the estimated parameters are not biased by the absence of the
data.
Missing at random
Missing at random (MAR) is a more realistic assumption for the studies performed in the
anesthetic field. Data are regarded as MAR when the probability that the responses are missing depends on the set of observed responses, but is not related to the specific missing values which are expected to be obtained.
As we tend to consider randomness as not producing bias, we may think that MAR does not present a problem. However, MAR does not mean that the missing data can be ignored. If a dropout variable is MAR, we may expect that the probability of dropout for that variable in each case is conditionally independent of its current and future values, given the history of the variable obtained prior to that case.
Missing not at random
If the characteristics of the data do not meet those of MCAR or MAR, then they fall into the category of missing not at random (MNAR).
The cases of MNAR data are problematic. The only way to obtain an unbiased estimate of the
parameters in such a case is to model the missing data. The model may then be incorporated
into a more complex one for estimating the missing values.
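To make the distinction concrete, the toy simulation below (NumPy/pandas, with invented age and blood pressure variables) generates one MCAR and one MAR missingness pattern. Under MAR, the naive mean of the observed values is biased because older patients, who tend to have higher blood pressure, are missing more often.

# Toy illustration of MCAR vs. MAR missingness; data and probabilities are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(50, 10, n)                    # always observed
bp = 100 + 0.5 * age + rng.normal(0, 5, n)     # blood pressure, may go missing

df = pd.DataFrame({"age": age, "bp": bp})

# MCAR: every bp value has the same 20% chance of being missing,
# unrelated to age or to the bp value itself.
mcar_mask = rng.random(n) < 0.20
df["bp_mcar"] = df["bp"].where(~mcar_mask)

# MAR: the chance that bp is missing depends on the observed age
# (older patients skip the measurement more often), not on bp itself.
mar_prob = np.clip((age - 30) / 60, 0, 0.6)
mar_mask = rng.random(n) < mar_prob
df["bp_mar"] = df["bp"].where(~mar_mask)

# The MCAR mean stays close to the true mean; the MAR mean is biased low.
print(df[["bp", "bp_mcar", "bp_mar"]].mean())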

Techniques for Handling the Missing Data


The best possible method of handling the missing data is to prevent the problem by planning the study well and collecting the data carefully [5,6]. The following are suggested to minimize the amount of missing data in clinical research [7].
First, the study design should limit the collection of data to those who are participating in the study. This can be achieved by minimizing the number of follow-up visits, collecting only the essential information at each visit, and developing user-friendly case-report forms.
Second, before the beginning of the clinical research, a detailed documentation of the study
should be developed in the form of the manual of operations, which includes the methods to
screen the participants, protocol to train the investigators and participants, methods to
communicate between the investigators or between the investigators and participants,
implementation of the treatment, and procedure to collect, enter, and edit data.
Third, before the start of participant enrollment, training should be conducted to
instruct all personnel related to the study on all aspects of the study, such as the participant
enrollment, collection and entry of data, and implementation of the treatment or
intervention [8].
Fourth, if a small pilot study is performed before the start of the main trial, it may help to
identify the unexpected problems which are likely to occur during the study, thus reducing
the amount of missing data.
Fifth, the study management team should set a priori targets for the unacceptable level of
missing data. With these targets in mind, the data collection at each site should be monitored
and reported in as close to real-time as possible during the course of the study.
Sixth, study investigators should identify and aggressively, though not coercively, engage the
participants who are at the greatest risk of being lost during follow-up.
Finally, if a patient decides to withdraw from the follow-up, the reasons for the withdrawal
should be recorded for the subsequent analysis in the interpretation of the results.

It is not uncommon to have a considerable amount of missing data in a study. One technique
of handling the missing data is to use the data analysis methods which are robust to the
problems caused by the missing data. An analysis method is considered robust to the missing
data when there is confidence that mild to moderate violations of the assumptions will
produce little to no bias or distortion in the conclusions drawn on the population. However,
it is not always possible to use such techniques.

Therefore, a number of alternative ways of handling the missing data have been developed.
Listwise or case deletion
By far the most common approach to the missing data is to simply omit those cases with the
missing data and analyze the remaining data. This approach is known as the complete case
(or available case) analysis or listwise deletion.

Listwise deletion is the most frequently used method in handling missing data, and thus has
become the default option for analysis in most statistical software packages. Some
researchers insist that it may introduce bias in the estimation of the parameters. However, if
the assumption of MCAR is satisfied, a listwise deletion is known to produce unbiased
estimates and conservative results. When the data do not fulfill the assumption of MCAR,
listwise deletion may cause bias in the estimates of the parameters [9].

If there is a large enough sample, where power is not an issue, and the assumption of MCAR
is satisfied, the listwise deletion may be a reasonable strategy. However, when there is not a
large sample, or the assumption of MCAR is not satisfied, the listwise deletion is not the
optimal strategy.
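A minimal pandas sketch of listwise deletion is shown below; the small data frame is invented for illustration.

# Listwise (complete case) deletion with pandas; data are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, np.nan, 52, 29, 61],
    "income": [40_000, np.nan, 55_000, 61_000, 38_000, 70_000],
    "score":  [7.2, 6.8, 8.1, np.nan, 6.5, 7.9],
})

# Drop every row that has any missing value, then analyze only the
# remaining complete cases.
complete_cases = df.dropna()
print("rows before:", len(df), "rows after:", len(complete_cases))
print(complete_cases)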

Pairwise deletion
Pairwise deletion eliminates information only when the particular data-point needed to test
a particular assumption is missing. If there is missing data elsewhere in the data set, the
existing values are used in the statistical testing. Since a pairwise deletion uses all information
observed, it preserves more information than the listwise deletion, which may delete the case
with any missing data. This approach presents the following problems: 1) the parameters of
the model will stand on different sets of data with different statistics, such as the sample size
and standard errors; and 2) it can produce an intercorrelation matrix that is not positive
definite, which is likely to prevent further analysis [10].

Pairwise deletion is known to be less biased for MCAR or MAR data, provided the appropriate mechanisms are included as covariates. However, if there are many missing observations, the analysis will be deficient.
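The pandas sketch below contrasts listwise and pairwise deletion when computing a correlation matrix; the data are invented, and pandas' corr() uses pairwise-complete observations by default.

# Listwise vs. pairwise deletion when computing correlations; data are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.1, 4.2, 6.0, np.nan, 9.9, 12.2],
    "z": [0.5, np.nan, 1.4, 1.9, 2.6, 3.1],
})

# Listwise: only rows with no missing values at all are used.
listwise_corr = df.dropna().corr()

# Pairwise: for each pair of columns, every row where both columns
# are observed is used, so different cells rest on different samples.
pairwise_corr = df.corr()

print(listwise_corr, "\n")
print(pairwise_corr)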

Mean substitution
In a mean substitution, the mean value of a variable is used in place of the missing data value
for that same variable. This allows the researchers to utilize the collected data in an
incomplete dataset. The theoretical background of the mean substitution is that the mean is
a reasonable estimate for a randomly selected observation from a normal distribution.
However, with missing values that are not strictly random, especially in the presence of a
great inequality in the number of missing values for the different variables, the mean
substitution method may lead to inconsistent bias. Furthermore, this approach adds no new
information but only increases the sample size and leads to an underestimate of the errors.
Thus, mean substitution is not generally accepted.
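A minimal sketch of mean substitution using scikit-learn's SimpleImputer on an invented data frame is shown below; note how the standard deviation shrinks, which is the underestimation of error described above.

# Mean substitution with scikit-learn; data are illustrative.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [34, 45, np.nan, 52, 29],
    "income": [40_000, np.nan, 55_000, 61_000, 38_000],
})

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
# The column mean is preserved, but the spread is understated.
print("std before:", df["age"].std(), "std after:", imputed["age"].std())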

Regression imputation
Imputation is the process of replacing the missing data with estimated values. Instead of
deleting any case that has any missing value, this approach preserves all cases by replacing
the missing data with a probable value estimated by other available information. After all
missing values have been replaced by this approach, the data set is analyzed using the
standard techniques for a complete data.

In regression imputation, the existing variables are used to make a prediction, and then the predicted value is substituted as if it were an actually obtained value. This approach has a number of advantages, because the imputation retains a great deal of data over listwise or pairwise deletion and avoids significantly altering the standard deviation or the shape of the distribution. However, as in mean substitution, because a regression imputation substitutes a value that is predicted from other variables, no novel information is added, while the sample size is increased and the standard error is reduced.
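The sketch below shows a simple single-variable regression imputation using scikit-learn's LinearRegression; the variables (age, income) and the data are illustrative assumptions.

# Regression imputation: predict missing income values from observed age.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":    [34, 45, 41, 52, 29, 60, 37],
    "income": [40_000, 52_000, np.nan, 61_000, 38_000, np.nan, 45_000],
})

observed = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Fit income ~ age on the complete cases...
model = LinearRegression().fit(observed[["age"]], observed["income"])

# ...and substitute the predicted values for the missing entries.
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)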

Last observation carried forward


In the field of anesthesiology research, many studies are performed with the longitudinal or
time-series approach, in which the subjects are repeatedly measured over a series of time-
points. One of the most widely used imputation methods in such a case is the last observation
carried forward (LOCF). This method replaces every missing value with the last observed value
from the same subject. Whenever a value is missing, it is replaced with the last observed
value.
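A pandas sketch of LOCF on invented longitudinal data (subject, visit, pain score) is shown below.

# Last observation carried forward within each subject; data are illustrative.
import numpy as np
import pandas as pd

long_df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "visit":   [1, 2, 3, 1, 2, 3],
    "pain":    [6.0, np.nan, np.nan, 4.0, 3.0, np.nan],
})

# Carry each subject's last observed value forward within that subject only.
long_df = long_df.sort_values(["subject", "visit"])
long_df["pain_locf"] = long_df.groupby("subject")["pain"].ffill()
print(long_df)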

This method is advantageous as it is easy to understand and communicate between statisticians and clinicians or between a sponsor and the researcher.

Although simple, this method strongly assumes that the value of the outcome remains
unchanged by the missing data, which seems unlikely in many settings (especially in the
anesthetic trials). It produces a biased estimate of the treatment effect and underestimates
the variability of the estimated result. Accordingly, the National Academy of Sciences has
recommended against the uncritical use of the simple imputation, including LOCF and the
baseline observation carried forward, stating that:

Single imputation methods like last observation carried forward and baseline observation
carried forward should not be used as the primary approach to the treatment of missing data
unless the assumptions that underlie them are scientifically justified.

Maximum likelihood
There are a number of strategies using the maximum likelihood method to handle the missing
data. In these, the assumption that the observed data are a sample drawn from a multivariate
normal distribution is relatively easy to understand. After the parameters are estimated using
the available data, the missing data are estimated based on the parameters which have just
been estimated.

When there are missing but relatively complete data, the statistics explaining the
relationships among the variables may be computed using the maximum likelihood method.
That is, the missing data may be estimated by using the conditional distribution of the other
variables.

Expectation-Maximization
Expectation-Maximization (EM) is a type of the maximum likelihood method that can be used
to create a new data set, in which all missing values are imputed with values estimated by the
maximum likelihood methods. This approach begins with the expectation step, during which
the parameters (e.g., variances, covariances, and means) are estimated, perhaps using the
listwise deletion. Those estimates are then used to create a regression equation to predict
the missing data. The maximization step uses those equations to fill in the missing data. The
expectation step is then repeated with the new parameters, where the new regression
equations are determined to "fill in" the missing data. The expectation and maximization
steps are repeated until the system stabilizes, when the covariance matrix for the subsequent
iteration is virtually the same as that for the preceding iteration.

An important characteristic of the expectation-maximization imputation is that when the new data set with no missing values is generated, a random disturbance term for each imputed value is incorporated in order to reflect the uncertainty associated with the imputation.
However, the expectation-maximization imputation has some disadvantages. This approach can take a long time to converge, especially when there is a large fraction of missing data, and it is too complex to be acceptable to some statisticians. This approach can lead to biased parameter estimates and can underestimate the standard error.

For the expectation-maximization imputation method, a predicted value based on the variables that are available for each case is substituted for the missing data. Because a single
imputation omits the possible differences among the multiple imputations, a single
imputation will tend to underestimate the standard errors and thus overestimate the level of
precision. Thus, a single imputation gives the researcher more apparent power than the data
in reality.

Multiple imputation
Multiple imputation is another useful strategy for handling the missing data. In a multiple
imputation, instead of substituting a single value for each missing data, the missing values are
replaced with a set of plausible values which contain the natural variability and uncertainty
of the right values.
This approach begins with a prediction of the missing data using the existing data from other variables. The missing values are then replaced with the predicted values, and a full data set called the imputed data set is created. This process is repeated to produce multiple imputed data sets (hence the term "multiple imputation"). Each imputed
data set produced is then analyzed using the standard statistical analysis procedures for
complete data, and gives multiple analysis results. Subsequently, by combining these analysis
results, a single overall analysis result is produced.

The benefit of the multiple imputation is that in addition to restoring the natural variability
of the missing values, it incorporates the uncertainty due to the missing data, which results
in a valid statistical inference. Restoring the natural variability of the missing data can be
achieved by replacing the missing data with the imputed values which are predicted using the
variables correlated with the missing data. Uncertainty is incorporated by producing different versions of the missing data and observing the variability between the imputed data sets.

Multiple imputation has been shown to produce valid statistical inference that reflects the
uncertainty associated with the estimation of the missing data. Furthermore, multiple
imputation turns out to be robust to the violation of the normality assumptions and produces
appropriate results even in the presence of a small sample size or a high number of missing
data.

With the development of novel statistical software, although the statistical principles of
multiple imputation may be difficult to understand, the approach may be utilized easily.
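As one possible implementation, the sketch below uses scikit-learn's experimental IterativeImputer with posterior sampling to generate several imputed data sets and pool a simple estimate. This is only one of several tools for multiple imputation, and the data and the choice of five imputations are illustrative.

# A hedged sketch of multiple imputation with scikit-learn's IterativeImputer.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df.loc[rng.random(200) < 0.2, "b"] = np.nan   # introduce some missingness

m = 5                                          # number of imputed data sets
estimates = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    # Analyze each completed data set with the same standard procedure...
    estimates.append(completed["b"].mean())

# ...then pool the results; the spread across imputations reflects the
# uncertainty due to the missing data.
print("pooled estimate:", np.mean(estimates))
print("between-imputation SD:", np.std(estimates))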

Sensitivity analysis
Sensitivity analysis is defined as the study which defines how the uncertainty in the output of
a model can be allocated to the different sources of uncertainty in its inputs.

When analyzing the missing data, additional assumptions on the reasons for the missing data
are made, and these assumptions are often applicable to the primary analysis. However, the
assumptions cannot be definitively validated for the correctness. Therefore, the National
Research Council has proposed that the sensitivity analysis be conducted to evaluate the
robustness of the results to the deviations from the MAR assumption.

Recommendations
Missing data reduces the power of a trial. Some amount of missing data is expected, and the target sample size is often increased to allow for it. However, increasing the sample size cannot eliminate the potential bias. More attention should be paid to the missing data in the design and performance of the
studies and in the analysis of the resulting data.

The best solution to the missing data is to maximize the data collection when the study
protocol is designed and the data collected. Application of the sophisticated statistical
analysis techniques should only be performed after the maximal efforts have been employed
to reduce missing data in the design and prevention techniques.

A statistically valid analysis which has appropriate mechanisms and assumptions for the
missing data should be conducted. Single imputation and LOCF are not optimal approaches
for the final analysis, as they can cause bias and lead to invalid conclusions. All variables which
present the potential mechanisms to explain the missing data must be included, even when
these variables are not included in the analysis. Researchers should seek to understand the
reasons for the missing data. Distinguishing what should and should not be imputed is usually
not possible using a single code for every type of the missing value. It is difficult to know
whether the multiple imputation or full maximum likelihood estimation is best, but both are
superior to the traditional approaches. Both techniques are best used with large samples. In
general, multiple imputation is a good approach when analyzing data sets with missing data.

Data visualization

What is data visualization?


Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent
way for employees or business owners to present data to non-technical audiences without
confusion.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
What are the advantages and disadvantages of data visualization?
Something as simple as presenting data in graphic format may seem to have no downsides.
But sometimes data can be misrepresented or misinterpreted when placed in the wrong style
of data visualization. When choosing to create a data visualization, it’s best to keep both the
advantages and disadvantages in mind.
Advantages
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares
from circles. Our culture is visual, including everything from art and advertisements to TV and
movies. Data visualization is another form of visual art that grabs our interest and keeps our
eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see
something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a
massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be.

Some other advantages of data visualization include:


 Easily share information.
 Interactively explore opportunities.
 Visualize patterns and relationships.

Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.

Some other disadvantages include:


 Biased or inaccurate information.
 Correlation doesn’t always mean causation.
 Core messages can get lost in translation.

Why data visualization is important


The importance of data visualization is simple: it helps people see, interact with, and better
understand data. Whether simple or complex, the right visualization can bring everyone on
the same page, regardless of their level of expertise.
It’s hard to think of a professional industry that doesn’t benefit from making data more
understandable. Every STEM field benefits from understanding data—and so do fields in
government, finance, marketing, history, consumer goods, service industries, education,
sports, and so on.
Beyond the enthusiasm for data visualization itself, there are practical, real-life applications that are undeniable. And, since visualization is so prolific, it’s also one of the most useful professional skills to develop. The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make decisions and to use visuals to tell stories about how data informs the who, what, when, where, and how.

While traditional education typically draws a distinct line between creative storytelling and
technical analysis, the modern professional world also values those who can cross between
the two: data visualization sits right in the middle of analysis and visual storytelling.

Data visualization can be used in almost every industry.

General Types of Visualizations:


Chart: Information presented in a tabular, graphical form with data displayed along two axes.
Can be in the form of a graph, diagram, or map.
Table: A set of figures displayed in rows and columns.
Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables
in comparison to each other, usually along two axes at a right angle.
Geospatial: A visualization that shows data in map form using different shapes and colors to
show the relationship between pieces of data and specific locations.
Infographic: A combination of visuals and words that represent data. Usually uses charts or
diagrams.
Dashboards: A collection of visualizations and data displayed in one place to help with
analyzing and presenting data.
More specific examples
Area Map: A form of geospatial visualization, area maps are used to show specific values set
over a map of a country, state, county, or any other geographic location. Two common types
of area maps are choropleths and isopleths.
Bar Chart: Bar charts represent numerical values compared to each other. The length of the
bar represents the value of each variable.
Box-and-whisker Plots: These show a selection of ranges (the box) across a set measure (the
bar).
Bullet Graph: A bar marked against a background to show progress or performance against a
goal, denoted by a line on the graph.
Gantt Chart: Typically used in project management, Gantt charts are a bar chart depiction of timelines and tasks.
Heat Map: A type of geospatial visualization in map form which displays specific data values as different colors (this doesn’t need to be temperatures, but that is a common use).
Highlight Table: A form of table that uses color to categorize similar data, allowing the viewer to read it more easily and intuitively.
Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.
Pie Chart: A circular chart with triangular segments that shows data as a percentage of a
whole.
Treemap: A type of chart that shows different, related values in the form of rectangles nested
together.
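As a small illustration, the matplotlib sketch below produces two of the chart types listed above, a bar chart and a histogram, using invented data.

# Matplotlib sketch of a bar chart and a histogram; data are invented.
import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compare a numeric value across categories.
categories = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]
ax1.bar(categories, sales)
ax1.set_title("Sales by region")
ax1.set_ylabel("Units sold")

# Histogram: split a continuous measure into bins to see its distribution.
values = np.random.default_rng(0).normal(50, 10, 500)
ax2.hist(values, bins=20)
ax2.set_title("Distribution of order values")
ax2.set_xlabel("Order value")

plt.tight_layout()
plt.show()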

Data classification
What is data classification?
Data classification tags data according to its type, sensitivity, and value to the organization if
altered, stolen, or destroyed. It helps an organization understand the value of its data,
determine whether the data is at risk, and implement controls to mitigate risks. Data
classification also helps an organization comply with relevant industry-specific regulatory
mandates such as SOX, HIPAA, PCI DSS, and GDPR.

Data Sensitivity Levels


Data is classified according to its sensitivity level—high, medium, or low.

High sensitivity data—if compromised or destroyed in an unauthorized transaction, would have a catastrophic impact on the organization or individuals. For example, financial records, intellectual property, authentication data.
Medium sensitivity data—intended for internal use only, but if compromised or destroyed,
would not have a catastrophic impact on the organization or individuals. For example, emails
and documents with no confidential data.
Low sensitivity data—intended for public use. For example, public website content.
Data Sensitivity Best Practices
Since the high, medium, and low labels are somewhat generic, a best practice is to use labels
for each sensitivity level that make sense for your organization. Two widely-used models are
shown below.

SENSITIVITY | MODEL 1 | MODEL 2
High | Confidential | Restricted
Medium | Internal Use Only | Sensitive
Low | Public | Unrestricted
If a database, file, or other data resource includes data that can be classified at two different
levels, it’s best to classify all the data at the higher level.

Types of Data Classification


Data classification can be performed based on content, context, or user selections:
Content-based classification—involves reviewing files and documents, and classifying them based on their contents.
Context-based classification—involves classifying files based on meta data like the application
that created the file (for example, accounting software), the person who created the
document (for example, finance staff), or the location in which files were authored or
modified (for example, finance or legal department buildings).
User-based classification—involves classifying files according to a manual judgement of a
knowledgeable user. Individuals who work with documents can specify how sensitive they
are—they can do so when they create the document, after a significant edit or review, or
before the document is released.
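As a simplified illustration of content-based classification, the Python sketch below scans text for patterns that suggest sensitive content and assigns a sensitivity label. The regular expressions and labels are assumptions for demonstration only, not a production rule set.

# Toy content-based classifier: match simple patterns and return a label.
import re

RULES = [
    ("High",   re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),  # card-number-like
    ("High",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),                    # SSN-like
    ("Medium", re.compile(r"\bconfidential\b", re.IGNORECASE)),
]

def classify(text: str) -> str:
    """Return the first (highest) sensitivity label whose pattern matches."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "Low"

print(classify("Invoice paid with card 4111 1111 1111 1111"))   # High
print(classify("Internal memo: confidential draft"))             # Medium
print(classify("Our public product page is live"))               # Low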

Data States and Data Format


Two additional dimensions of data classifications are:
Data states—data exists in one of three states—at rest, in process, or in transit. Regardless of
state, data classified as confidential must remain confidential.
Data format—data can be either structured or unstructured. Structured data are usually
human readable and can be indexed. Examples of structured data are database objects and
spreadsheets. Unstructured data are usually not human readable or indexable. Examples of
unstructured data are source code, documents, and binaries. Classifying structured data is
less complex and time-consuming than classifying unstructured data.

Data Discovery
Classifying data requires knowing the location, volume, and context of data.

Most modern businesses store large volumes of data, which may be spread across multiple
repositories:
 Databases deployed on-premises or in the cloud
 Big data platforms
 Collaboration systems such as Microsoft SharePoint
 Cloud storage services such as Dropbox and Google Docs
 Files such as spreadsheets, PDFs, or emails
Before you can perform data classification, you must perform accurate and comprehensive
data discovery. Automated tools can help discover sensitive data at large scale.

The Relation Between Data Classification and Compliance


Data classification must comply with relevant regulatory and industry-specific mandates,
which may require classification of different data attributes. For example, the Cloud Security
Alliance (CSA) requires that data and data objects must include data type, jurisdiction of origin
and domicile, context, legal constraints, sensitivity, etc. PCI DSS does not require origin or
domicile tags.

Creating Your Data Classification Policy


A data classification policy defines who is responsible for data classification—typically by
defining Program Area Designees (PAD) who are responsible for classifying data for different
programs or organizational units.

The data classification policy should consider the following questions:


 Which person, organization or program created and/or owns the information?
 Which organizational unit has the most information about the content and context of
the information?
 Who is responsible for the integrity and accuracy of the data?
 Where is the information stored?
 Is the information subject to any regulations or compliance standards, and what are
the penalties associated with non-compliance?
Data classification can be the responsibility of the information creators, subject matter
experts, or those responsible for the correctness of the data.

The policy also determines the data classification process: how often data classification
should take place, for which data, which type of data classification is suitable for different
types of data, and what technical means should be used to classify data. The data
classification policy is part of the overall information security policy, which specifies how to
protect sensitive data.

Data Classification Examples


Following are common examples of data that may be classified into each sensitivity level.

Sensitivity Level Examples


High: Credit card numbers (PCI) or other financial account numbers, customer personal data, FISMA protected information, privileged credentials for IT systems, protected health information (HIPAA), Social Security numbers, intellectual property, employee records.
Medium: Supplier contracts, IT service management information, student education records (FERPA), telecommunication systems information, internal correspondence not including confidential data.
Low: Content of public websites, press releases, marketing materials, employee directory.

Data science project life cycle


The data science project life cycle is a methodology that outlines the stages of a data science
project, from planning to deployment. This methodology guides data scientists through a
structured process that enables them to develop data-driven solutions that address specific
business problems.
The project life cycle provides a framework that helps data scientists to manage projects
effectively and efficiently. In this article, we will explain the steps in data science project
lifecycle, and provide examples and references as necessary.
Step 1: Problem Identification and Planning
The first step in the data science project life cycle is to identify the problem that needs to be
solved. This involves understanding the business requirements and the goals of the project.
Once the problem has been identified, the data science team will plan the project by
determining the data sources, the data collection process, and the analytical methods that
will be used.
Business requirement process
 Identify the stakeholders
 Get the relevant information
 Set goals and objectives
 Summarize scope of capabilities
 Document impacts and constraints
 Describe Purpose and Requirements
 Identifying the goals and requirements for the data analysis project is the first step in the data preparation process. Consider the following:
 What is the goal of the data analysis project and how big is it?
 Which major inquiries or ideas are you planning to investigate or evaluate using the
data?
 Who are the target audience and end-users for the data analysis findings? What
positions and duties do they have?
 Which formats, types, and sources of data do you need to access and analyze?
 What requirements do you have for the data in terms of quality, accuracy,
completeness, timeliness, and relevance?
 What are the limitations and ethical, legal, and regulatory issues that you must take
into account?
 Answering these questions makes it simpler to define the data analysis project’s goals, parameters, and requirements, as well as highlighting any challenges, risks, or opportunities that may develop.

Example
Suppose a retail company wants to increase its sales by identifying the factors that influence
customer purchase decisions. The data science team will identify the problem and plan the
project by determining the data sources (e.g., transaction data, customer data), the data
collection process (e.g., data cleaning, data transformation), and the analytical methods (e.g.,
regression analysis, decision trees) that will be used to analyze the data.

Step 2: Data Collection / acquisition


The second step in the data science project life cycle is data collection. This involves collecting
the data that will be used in the analysis. The data science team must ensure that the data is
accurate, complete, and relevant to the problem being solved.
To do data science, you need data. The primary step in the lifecycle of a data science project is to identify the person who knows what data to acquire, and when to acquire it, based on the question to be answered. The person need not necessarily be a data scientist; anyone who understands the real differences between the various available data sets and can make hard-hitting decisions about the organization’s data investment strategy will be the right person for the job.
A data science project begins with identifying the various data sources, which could be logs
from web servers, social media data, data from online repositories such as the US Census
datasets, data streamed from online sources via APIs or web scraping, an Excel file, or any
other source. Data acquisition involves acquiring data from all the identified internal and
external sources that can help answer the business question.
A major challenge that data professionals often encounter in the data acquisition step is
tracking where each data slice comes from and whether it is up to date. It is important to
track this information during the entire life cycle of a data science project, as data might
have to be re-acquired to test other hypotheses or run updated experiments; a minimal
acquisition sketch follows the list of steps below.
Steps in data acquisition
 Hypothesizing – use your domain knowledge, creativity, and familiarity with the
problem to try and scope the types of data that could be relevant to your model.
 Generating a list of potential data providers – create a shortlist of sources (data
partners, open data websites, commercial entities) that actually provide the type of
data you hypothesized would be relevant.
 Data provider due diligence – an absolute must. Due diligence helps you disqualify
irrelevant data providers before you even get into the time-consuming and labor-intensive
process of checking the actual data.
 Data provider tests – set up a test with each provider that will allow you to measure
the data in an objective way.
 Calculate ROI – once you have a quantified number for the model’s improvement, ROI
can be calculated very easily.
 Integration and production – The last step in acquiring a new data source for your
model is to actually integrate the data provider into your production pipeline.
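To make the data acquisition step concrete, below is a minimal Python sketch that pulls one
data slice from a local CSV export and another from a hypothetical REST endpoint, and logs
where and when each slice was obtained so that it can be re-acquired later. The URL, file
names, and field layout are illustrative assumptions, not a prescribed setup.

# Sketch: acquire data slices from two sources and record their provenance.
# The endpoint URL and file names below are placeholders, not real services.
import csv
import json
from datetime import datetime, timezone

import requests  # third-party HTTP client, assumed to be installed


def acquire_from_api(url):
    """Fetch JSON records from a (hypothetical) REST endpoint."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    records = response.json()
    provenance = {"source": url, "acquired_at": datetime.now(timezone.utc).isoformat()}
    return records, provenance


def acquire_from_csv(path):
    """Load records from a local CSV export (e.g., a transaction dump)."""
    with open(path, newline="", encoding="utf-8") as f:
        records = list(csv.DictReader(f))
    provenance = {"source": path, "acquired_at": datetime.now(timezone.utc).isoformat()}
    return records, provenance


if __name__ == "__main__":
    transactions, tx_meta = acquire_from_csv("transactions_export.csv")
    customers, cust_meta = acquire_from_api("https://example.com/api/customers")
    # Keeping provenance alongside the data makes it easy to re-acquire or
    # refresh a slice when new hypotheses need to be tested.
    with open("acquisition_log.json", "w", encoding="utf-8") as f:
        json.dump([tx_meta, cust_meta], f, indent=2)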
Example
In the retail company example, the data science team will collect data on customer
demographics, transaction history, and product information.
Step 3: Data Preparation
The third step in the data science project life cycle is data preparation. This involves cleaning
and transforming the data to make it suitable for analysis. The data science team will remove
any duplicates, missing values, or irrelevant data from the dataset. They will also transform
the data into a format that is suitable for analysis.
This phase is often referred to as data cleaning or data wrangling. Data scientists often
complain that it is the most boring and time-consuming task, involving the identification of
various data quality issues. Data acquired in the acquisition step of a data science project is
usually not in a usable format for the required analysis and might contain missing entries,
inconsistencies and semantic errors.
Having acquired the data, data scientists have to clean and reformat it, either by editing it
manually in a spreadsheet or by writing code. This step of the data science project life cycle
does not itself produce meaningful insights. However, through regular data cleaning, data
scientists can easily identify what flaws exist in the data acquisition process, what
assumptions they should make and what models they can apply to produce analysis results.
After reformatting, the data can be converted to JSON, CSV or any other format that makes it
easy to load into one of the data science tools.
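A minimal cleaning-and-reformatting sketch with pandas is shown below; the file name and the
column names (customer_id, purchase_amount, purchase_date) are assumed purely for illustration.

# Sketch: basic cleaning and reformatting with pandas (assumed columns).
import pandas as pd

df = pd.read_csv("transactions_export.csv")

df = df.drop_duplicates()                        # remove duplicate rows
df = df.dropna(subset=["customer_id"])           # drop rows missing the key field
df["purchase_amount"] = df["purchase_amount"].fillna(0)  # impute a missing numeric field
df["purchase_date"] = pd.to_datetime(df["purchase_date"], errors="coerce")  # fix the type

# Reformat to CSV/JSON so the cleaned data loads easily into other tools.
df.to_csv("transactions_clean.csv", index=False)
df.to_json("transactions_clean.json", orient="records", date_format="iso")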
Exploratory data analysis forms an integral part of this stage, as summarization of the clean
data can help identify outliers, anomalies and patterns that can be used in the subsequent
steps. This is the step that helps data scientists answer the question of what they actually
want to do with the data.
Data Preparation Process
There are a few important steps in the data preparation process, and each one is essential to
making sure the data is prepared for analysis or other processing. The following are the key
stages related to data preparation:
Data Combining and Integration
Data integration requires combining data from multiple sources or dimensions in order to
create a full, logical dataset. Data integration solutions provide a wide range of operations,
such as combining, relating, joining and differencing datasets, and support a variety of data
schemas and architecture types.
To properly combine and integrate data, it is essential to store and arrange information in a
common standard format, such as CSV, JSON, or XML, for easy access and uniform
comprehension. Organizing data management and storage using solutions such as cloud
storage, data warehouses, or data lakes improves governance, maintains consistency, and
speeds up access to data on a single platform.
Audits, backups, recovery, verification, and encryption are all examples of strong security
procedures that help ensure reliable data management. Privacy controls protect data during
transmission and storage, whereas authorization and authentication control who can access it.
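A minimal integration sketch with pandas is given below; the file names, the shared key
customer_id, and the choice of a left join are assumptions made for this example.

# Sketch: combine transaction batches and join them with customer attributes.
import pandas as pd

customers = pd.read_csv("customers.csv")              # e.g., customer_id, age, gender, city
transactions = pd.read_csv("transactions_clean.csv")  # e.g., customer_id, product_id, purchase_amount
batch2 = pd.read_csv("transactions_batch2.csv")       # a second batch with the same schema

# Stack batches that share a schema, then join on the common key so every
# transaction carries its customer attributes.
all_transactions = pd.concat([transactions, batch2], ignore_index=True)
combined = all_transactions.merge(customers, on="customer_id", how="left")

combined.to_csv("retail_dataset.csv", index=False)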
Data Profiling
Data profiling is a systematic method for assessing and analyzing a dataset to understand its
quality, structure and content and to improve its accuracy within an organizational context.
Data profiling identifies inconsistencies, differences and null values by analyzing the source
data, looking for errors, and understanding file structure, content and relationships. It
helps to evaluate elements including completeness, accuracy, consistency, validity, and
timeliness.
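A quick profiling pass can be scripted in a few lines of pandas, as in the sketch below; the
file and column names are again illustrative assumptions.

# Sketch: quick data profiling of the integrated dataset.
import pandas as pd

df = pd.read_csv("retail_dataset.csv")

print(df.dtypes)                    # data type of each column
print(df.describe(include="all"))   # summary statistics for numeric and text columns
print(df.isnull().mean().round(3))  # share of null values per column
print(df.duplicated().sum())        # number of fully duplicated rows
print(df["customer_id"].nunique())  # cardinality of the key column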
Data Exploring
Data exploration means getting familiar with the data and identifying patterns, trends,
outliers, and errors in order to better understand it and evaluate the possibilities for
analysis. To explore the data, identify data types, formats, and structures, and calculate
descriptive statistics such as mean, median, mode, and variance for each numerical variable.
Visualizations such as histograms, boxplots, and scatterplots can provide an understanding of
the data distribution, while techniques such as clustering can reveal hidden patterns and
expose exceptions.
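The sketch below produces these three plot types with matplotlib, assuming the same
illustrative dataset and numeric columns (purchase_amount, age).

# Sketch: simple exploratory plots (assumed numeric columns).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("retail_dataset.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df["purchase_amount"].dropna(), bins=30)         # distribution
axes[0].set_title("Purchase amount distribution")
axes[1].boxplot(df["purchase_amount"].dropna())               # outliers
axes[1].set_title("Purchase amount outliers")
axes[2].scatter(df["age"], df["purchase_amount"], alpha=0.3)  # relationship between variables
axes[2].set_title("Age vs purchase amount")
plt.tight_layout()
plt.show()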
Data Transformations and Enrichment
Data enrichment is the process of improving a dataset by adding new features or columns,
enhancing its accuracy and reliability, and verifying it against third-party sources. Typical
transformation and enrichment activities include:
 Combining various data sources, such as CRM, financial, and marketing data, into a
comprehensive dataset, and incorporating third-party data such as demographics for
enhanced insights.
 Categorizing data into groups such as customers or products based on shared attributes,
using standard variables like age and gender to describe these entities.
 Engineering new features or fields from existing data, such as calculating customer age
from the birthdate, and estimating missing values (for example, absent sales figures) from
historical trends (see the sketch after this list).
 Identifying entities such as names and addresses within unstructured text, thereby
extracting actionable information from text that has no fixed structure.
 Assigning categories to unstructured text, such as product descriptions or customer
feedback, to facilitate analysis, using techniques like geocoding, sentiment analysis,
entity recognition, and topic modeling to enrich the data with additional context.
 Cleaning the data to remove or correct flaws such as duplicates, outliers, missing values,
typos, and formatting problems, and validating it with checksums, rules, constraints, and
tests to ensure that it is correct and complete.
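The sketch below illustrates a few of these enrichment steps (feature engineering, imputation,
and categorization) in pandas; the birthdate, purchase_date, and purchase_amount columns are
assumptions.

# Sketch: simple enrichment and feature engineering (assumed columns).
import pandas as pd

df = pd.read_csv("retail_dataset.csv", parse_dates=["birthdate", "purchase_date"])

# Engineer a new feature from existing data: customer age at time of purchase.
df["customer_age"] = (df["purchase_date"] - df["birthdate"]).dt.days // 365

# Estimate missing numeric values from the available data (here: the median).
df["purchase_amount"] = df["purchase_amount"].fillna(df["purchase_amount"].median())

# Categorize records into groups based on a shared attribute.
df["spend_segment"] = pd.cut(
    df["purchase_amount"],
    bins=[0, 50, 200, float("inf")],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

df.to_csv("retail_dataset_enriched.csv", index=False)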
Data Validation
Data validation is crucial for ensuring data accuracy, completeness, and consistency, as it
checks data against predefined rules and criteria that align with your requirements,
standards, and regulations.
A typical validation workflow looks like this:
 Analyze the data to understand its properties, such as data types, ranges, and
distributions, and identify any potential issues, such as missing values, exceptions, or
errors.
 Choose a representative sample of the dataset for validation; this is useful for larger
datasets because it minimizes processing effort.
 Apply the planned validation rules to the sampled data. Rules may include format checks,
range validations, or cross-field validations.
 Identify records that do not fulfil the validation rules, and keep track of any flaws or
discrepancies for later analysis.
 Correct identified mistakes by cleaning, converting, or re-entering data as needed, and
maintain an audit record of the modifications made during this procedure.
 Automate data validation activities as much as feasible to ensure consistent and ongoing
data quality, as in the sketch after this list.
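A minimal rule-based validation sketch in pandas might look like the following; the specific
rules and column names are illustrative assumptions.

# Sketch: apply simple validation rules and keep the failing records.
import pandas as pd

df = pd.read_csv("retail_dataset_enriched.csv")

rules = {
    "customer_id is present": df["customer_id"].notna(),
    "purchase_amount is non-negative": df["purchase_amount"] >= 0,
    "customer_age is in a plausible range": df["customer_age"].between(0, 120),
}

for name, passed in rules.items():
    print(f"{name}: {(~passed).sum()} failing record(s)")

# Keep the failing rows for follow-up instead of silently dropping them.
all_rules_pass = pd.concat(list(rules.values()), axis=1).all(axis=1)
df[~all_rules_pass].to_csv("validation_failures.csv", index=False)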
Example
In the retail company example, the data science team will remove any duplicate or missing
data from the customer and transaction datasets. They may also merge the datasets to create
a single dataset that can be analyzed.
Step 4: Data Analysis
The fourth step in the data science project life cycle is data analysis. This involves applying
analytical methods to the data to extract insights and patterns. The data science team may
use techniques such as regression analysis, clustering, or machine learning algorithms to
analyze the data.
Example
In the retail company example, the data science team may use regression analysis to identify
the factors that influence customer purchase decisions. They may also use clustering to
segment customers based on their purchase behavior.
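As a hedged illustration of the clustering part of this example, the sketch below segments
customers by their purchase behaviour using k-means; the input file, column names, and the
choice of three segments are assumptions.

# Sketch: segment customers by purchase behaviour with k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("retail_dataset_enriched.csv")

# Aggregate behaviour per customer: how often and how much they buy.
behaviour = df.groupby("customer_id").agg(
    n_purchases=("purchase_amount", "count"),
    total_spend=("purchase_amount", "sum"),
)

features = StandardScaler().fit_transform(behaviour)
behaviour["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(behaviour.groupby("segment").mean())  # average behaviour of each segment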
Step 5: Hypothesis & Model Building
The fifth step in the data science project life cycle is model building. This involves building a
predictive model that can be used to make predictions based on the data analysis. The data
science team will use the insights and patterns from the data analysis to build a model that
can predict future outcomes.
This is the core activity of a data science project that requires writing, running and refining
the programs to analyse and derive meaningful business insights from data. Often these
programs are written in languages like Python, R, MATLAB or Perl. Diverse machine learning
techniques are applied to the data to identify the machine learning model that best fits the
business needs. All the contending machine learning models are trained with the training
data sets.
Example
In the retail company example, the data science team may build a predictive model that can
be used to predict customer purchase behavior based on demographic and product
information.
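A minimal model-building sketch with scikit-learn is shown below; the prepared feature table,
its column names, and the two candidate algorithms are assumptions chosen for illustration.

# Sketch: train two candidate models to predict a 0/1 purchase outcome.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed prepared table: one row per customer, numeric features plus a target.
df = pd.read_csv("customer_features.csv")
X = df[["customer_age", "n_purchases", "total_spend"]]
y = df["will_purchase"]

# Hold out part of the data so the contending models can be compared fairly later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, "training accuracy:", round(model.score(X_train, y_train), 3))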
Step 6: Model Evaluation
The sixth step in the data science project life cycle is model evaluation. This involves
evaluating the performance of the predictive model to ensure that it is accurate and reliable.
The data science team will test the model using a validation dataset to determine its accuracy
and performance.
There are different evaluation metrics for different types of problems. For instance, if the
machine learning model aims to predict the daily stock price, then RMSE (root mean squared
error) has to be considered for evaluation. If the model aims to classify spam emails, then
performance metrics like average accuracy, AUC and log loss have to be considered. A common
question professionals have when evaluating a machine learning model is which dataset they
should use to measure its performance. Looking at the performance metrics on the training
dataset is helpful but not always reliable, because the numbers obtained might be overly
optimistic since the model is already adapted to the training data. Machine learning model
performance should be measured and compared using validation and test sets to identify the
best model based on model accuracy and over-fitting.
The above steps are iterated as data is acquired continuously and the business understanding
becomes clearer.
Example
In the retail company example, the data science team may test the predictive model using a
validation dataset to ensure that it accurately predicts customer purchase behavior.
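The sketch below illustrates evaluating a fitted classifier on a held-out validation set rather
than on the training data, reporting the accuracy, AUC, and log loss metrics mentioned above;
the feature table and column names are the same illustrative assumptions as before.

# Sketch: measure performance on held-out data, not the training set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_features.csv")
X = df[["customer_age", "n_purchases", "total_spend"]]
y = df["will_purchase"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

val_proba = model.predict_proba(X_val)[:, 1]
val_preds = model.predict(X_val)
print("validation accuracy:", round(accuracy_score(y_val, val_preds), 3))
print("validation AUC:", round(roc_auc_score(y_val, val_proba), 3))
print("validation log loss:", round(log_loss(y_val, val_proba), 3))
# For a regression task such as daily stock price prediction, RMSE (the square
# root of the mean squared error) would be reported instead.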
Step 7: Model Deployment, Operations, and Optimization
The final step in the data science project life cycle is model deployment. This involves
deploying the predictive model into production so that it can be used to make predictions in
real-world scenarios. The deployment process involves integrating the model into the existing
business processes and systems to ensure that it can be used effectively.
Machine learning models might have to be recoded before deployment, for example because the
data scientists favour Python while the production environment supports only Java. After this,
the machine learning models are first deployed in a pre-production or test environment before
actually deploying them into production.
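One common deployment pattern, though by no means the only one, is to persist the trained
model and expose it behind a small HTTP service that other systems, such as a CRM, can call.
The sketch below assumes a model saved with joblib, a Flask service, and the same illustrative
feature names as before.

# Sketch: serve the persisted model behind a minimal prediction endpoint.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
# Assumes the trained model was saved earlier, e.g. joblib.dump(model, "purchase_model.joblib").
model = joblib.load("purchase_model.joblib")


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [[payload["customer_age"], payload["n_purchases"], payload["total_spend"]]]
    probability = model.predict_proba(features)[0][1]
    return jsonify({"purchase_probability": round(float(probability), 3)})


if __name__ == "__main__":
    app.run(port=5000)  # in production this would sit behind a proper WSGI server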
This step also involves developing a plan for monitoring and maintaining the data science
project in the long run. Model performance is monitored in this phase, and any degradation in
performance is detected and addressed. Data scientists can archive their learnings from a
specific data science project for shared learning and to speed up similar data science projects
in the near future.
This is the final phase of any data science project; it involves retraining the machine
learning model in production whenever new data sources come in, or taking the necessary steps
to keep up the performance of the machine learning model.
A well-defined workflow makes any data science project less frustrating for data professionals
to work on. The life cycle of a data science project described above is not definitive and can
be altered to improve the efficiency of a specific data science project as per the business
requirements.
Example
In the retail company example, the data science team may deploy the predictive model into
the company’s customer relationship management (CRM) system so that it can be used to
make targeted marketing campaigns.
Conclusion
The data science project life cycle provides a structured approach for data scientists to
develop data-driven solutions that address specific business problems.
By following the steps outlined in the data science project life cycle, data scientists can ensure
that their projects are completed efficiently and effectively. This methodology enables data
scientists to deliver high-quality solutions that provide real value to the business.