The document discusses Exploratory Data Analysis (EDA), which is an approach for summarizing, visualizing, and understanding the important characteristics of a dataset. EDA helps to ensure the best outcomes for a machine learning project by determining which algorithms and features are most suitable. It involves univariate, bivariate, and multivariate visualization techniques as well as dimensionality reduction. Skipping EDA can lead to inaccurate models, choosing the wrong variables, or inefficient use of resources. Data profiling is an important part of EDA for assessing data quality.
EDA
What is Exploratory Data Analysis (EDA)
• How do you ensure you are ready to use machine learning algorithms in a project?
• How do you choose the most suitable algorithms for your data set?
• How do you define the feature variables that can potentially be used for machine learning?
Exploratory Data Analysis (EDA) helps to answer all these questions, which improves the chances of a good outcome for the project. It is an approach for summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.

Value of Exploratory Data Analysis
• It increases confidence that future results will be valid, correctly interpreted, and applicable to the desired business contexts.
• That level of confidence can be reached only after the raw data has been validated and checked for anomalies, confirming that the data set was collected without errors.
• EDA also helps to find insights that were not evident to business stakeholders and data scientists, or did not seem worth investigating, but that can be very informative about a particular business.

Methods of Exploratory Data Analysis
• Univariate visualization: provides summary statistics for each field in the raw data set.
• Bivariate visualization: is performed to find the relationship between each variable in the dataset and the target variable of interest.
• Multivariate visualization: is performed to understand interactions between different fields in the dataset.
• Dimensionality reduction: helps to identify the fields that account for the most variance between observations and allows a reduced volume of data to be processed.

Why is skipping Exploratory Data Analysis a bad idea?
Skipping EDA can lead to:
• generating inaccurate models;
• generating accurate models on the wrong data;
• choosing the wrong variables for the model;
• inefficient use of resources, including rebuilding the model.

One of the important parts of EDA is data profiling.
• Data profiling is concerned with summarizing your dataset through descriptive statistics. You want to use a variety of measurements to better understand your dataset. The goal of data profiling is to have a solid understanding of your data so that you can afterwards start querying and visualizing it in various ways.
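As a minimal sketch of what such a profile might look like in R, the snippet below summarizes each column of a small data frame with a handful of descriptive statistics. The data frame `df`, its columns, and the helper `profile_column` are hypothetical names introduced only for illustration; they do not come from the slides.

```r
# Hypothetical toy data set; column names and values are illustrative only.
df <- data.frame(
  age    = c(23, 35, 41, NA, 29, 35, 120),
  income = c(32000, 54000, 61000, 48000, NA, 54000, 58000),
  city   = c("A", "B", "B", "A", "C", "B", "A"),
  stringsAsFactors = FALSE
)

# Profile one column: type, missing values, distinct non-missing values,
# plus mean, median, and standard deviation for numeric columns.
profile_column <- function(x) {
  out <- list(
    type      = class(x)[1],
    n_missing = sum(is.na(x)),
    n_unique  = length(unique(x[!is.na(x)]))
  )
  if (is.numeric(x)) {
    out$mean   <- mean(x, na.rm = TRUE)
    out$median <- median(x, na.rm = TRUE)
    out$sd     <- sd(x, na.rm = TRUE)
  }
  out
}

# Apply the profile to every column and show the result compactly.
profile <- lapply(df, profile_column)
str(profile)
```

In this toy data the profile would surface the missing entries and the large gap between the mean and median of `age`, which is the kind of signal that prompts a closer look at data quality. Base R's `summary(df)` gives a similar overview with less control over what is reported.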
Data profiling is not a one-time step, however: you still have to iterate. Precisely because it summarizes your dataset, data profiling is frequently used to assess data quality. Depending on the result of the data profiling, you might decide to correct, discard, or handle your data differently.

Key Concepts of Exploratory Data Analysis
2 types of Data Analysis
• Confirmatory Data Analysis
• Exploratory Data Analysis
4 Objectives of EDA
• Discover Patterns
• Spot Anomalies
• Frame Hypotheses
• Check Assumptions
2 methods for exploration
• Univariate Analysis
• Bivariate Analysis

Key Concepts of Exploratory Data Analysis (cont.)
Things done during EDA
• Trends
• Distribution
• Mean
• Median
• Outlier
• Spread measurement (standard deviation, SD)
• Correlations
• Hypothesis testing
• Visual Exploration

SEMMA
SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by SAS Institute, one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is "rather a logical organization of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining".

Phases of SEMMA
• Sample. The process starts with data sampling, e.g., selecting the data set for modeling. The data set should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
• Explore. This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, and also abnormalities, with the help of data visualization.
• Modify. The Modify phase contains methods to select, create, and transform variables in preparation for data modeling.
• Model. In the Model phase the focus is on applying various modeling (data mining) techniques to the prepared variables in order to create models that possibly provide the desired outcome.
• Assess. The last phase is Assess. The evaluation of the modeling results shows the reliability and usefulness of the created models.

Mode function

# Return the most frequent value of a vector (numeric or character).
getmode <- function(v) {
  uniqv <- unique(v)                               # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]      # value with the highest count
}

# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)    # prints 2

# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)    # prints "it"
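Note that getmode returns a single value: when several values tie for the highest count, which.max picks the first of them in order of appearance. In the numeric vector above, 2 and 3 each occur four times, and only 2 is reported. A hypothetical variant, here called getmodes (a name not taken from the slides), might return every tied value instead:

```r
# Hypothetical variant of getmode: return all values tied for the highest count.
getmodes <- function(v) {
  uniqv  <- unique(v)                    # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniqv))    # occurrences of each distinct value
  uniqv[counts == max(counts)]           # keep every value that reaches the maximum
}

v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
print(getmodes(v))                       # 2 and 3 both occur four times

charv <- c("o","it","the","it","it")
print(getmodes(charv))                   # a single mode is returned as before
```

Which behavior is preferable depends on the analysis: a single mode is convenient for imputation, while reporting all ties avoids hiding a multimodal distribution during exploration.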