Exploratory Data Analysis - Python
In computing, data is information that has been translated into a form that is efficient
for movement or processing.
Data Science
Facets of data
In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques. The main categories of data
are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
• Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it’s often easy to store structured data in tables within
databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query
data that resides in databases.
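To make the structured-data idea concrete, here is a small sketch using Python's built-in sqlite3 module; the employees table, its columns, and its values are hypothetical:

```python
import sqlite3

# Structured data: fixed fields (name, salary) in a table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 90000), ("Grace", 95000), ("Alan", 88000)])

# SQL query against the fixed fields of the table
rows = conn.execute(
    "SELECT name FROM employees WHERE salary > 89000 ORDER BY name"
).fetchall()
print(rows)  # [('Ada',), ('Grace',)]
conn.close()
```

Because every record shares the same fields, the query engine can filter and sort on them directly.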
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your
regular email.
Natural language
• Natural language is a special type of unstructured data; it’s challenging
to process because it requires knowledge of specific data science
techniques and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize
well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of
every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its
high volume and speed. Examples of machine data are web server
logs, call detail records, network event logs, and telemetry.
Graph-based data
• “Graph data” can be a confusing term because any data can be shown in a graph.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and
the shortest path between two people.
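As a sketch of the shortest-path idea mentioned above, a tiny hypothetical social network can be searched with a plain breadth-first search (no graph library required); the names and edges are invented for illustration:

```python
from collections import deque

# Hypothetical social network as an adjacency list; an edge means "knows"
network = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dan"],
    "Cara": ["Ann", "Dan"],
    "Dan": ["Bob", "Cara", "Eve"],
    "Eve": ["Dan"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path of nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(network, "Ann", "Eve")
print(path)
```

Metrics such as a person's influence would similarly be computed from the node and edge structure (e.g. by counting or weighting connections).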
Streaming data
• The data flows into the system when an event happens instead of being loaded
into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events,
and the stock market.
EDA makes use of histograms, box plots, scatter plots, and many other techniques,
and exploring the data often takes considerable time. Through the process of EDA
we can define the problem statement for our data set, which is very important. EDA
involves a comprehensive range of activities, including data integration, analysis,
cleaning, transformation, and dimension reduction.
This approach commonly utilizes data visualization techniques to gain insights and
identify relevant patterns, anomalies, and hypotheses, ultimately facilitating the
manipulation of data sources in order to obtain desired answers. EDA plays a crucial
role in assisting data scientists in making informed decisions, testing hypotheses, and
validating assumptions.
Significance of EDA
Exploratory Data Analysis (EDA) holds immense significance in the realm of data
science and analytics.
In summary, EDA is a critical initial step in the data analysis process, enabling data
scientists and analysts to gain insights, clean and prepare the data, identify patterns
and relationships, and ultimately make informed decisions based on data-driven
evidence.
EDA techniques
1. Univariate Analysis: In EDA Analysis, univariate analysis examines individual
variables to understand their distributions and summary statistics.
2. Bivariate Analysis: This aspect of EDA explores the relationship between two
variables, uncovering patterns through techniques like scatter plots and
correlation analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to
encompass more than two variables. It aims to understand the complex
interactions and dependencies among multiple variables in a data set.
Techniques such as heatmaps, parallel coordinates, factor analysis, and
principal component analysis (PCA) are used for multivariate analysis.
4. Visualization Techniques: EDA relies heavily on visualization methods to
depict data distributions, trends, and associations using various charts and
graphs.
5. Outlier Detection: EDA involves identifying outliers within the data, anomalies
that deviate significantly from the rest, employing tools such as box plots and
z-score analysis.
6. Statistical Tests: EDA often includes performing statistical tests to validate
hypotheses or discern significant differences between groups, adding depth to
the analysis process.
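As a minimal sketch of bivariate analysis (technique 2 above), the Pearson correlation between two variables can be computed with Pandas; the study-hours and exam-score data here are hypothetical:

```python
import pandas as pd

# Hypothetical data set: hours studied vs. exam score
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
})

# Pearson correlation coefficient between the two variables;
# a value near +1 indicates a strong positive linear relationship
r = df["hours"].corr(df["score"])
print(round(r, 3))
```

A scatter plot of the same two columns would show the relationship visually; the correlation coefficient quantifies it.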
Excel is a widely used spreadsheet software that offers basic data analysis and
visualization capabilities. It provides features such as pivot tables, charts, and
functions for summarizing, filtering, and analysing data.
Other tools, such as RapidMiner and MATLAB, also provide data analysis and EDA
capabilities.
Python and R stand out as the most prevalent tools in data science for conducting
EDA. Python, an interpreted, object-oriented programming language, is a powerful
tool for EDA, aiding in identifying missing values for machine learning. R, an open-
source statistical computing language, is widely adopted by statisticians for data
science in order to facilitate statistical observations and analysis.
2. Installing Python:
• Python Installation: The most recent version of Python can be downloaded
from the official website, python.org, if it is not already installed on your
computer.
An extensive library ecosystem is available for data analysis in Python. Three essential
libraries for EDA will be used:
• Pandas: a robust data manipulation library that offers functions and data
structures for handling structured data. Install it with pip install pandas.
• Matplotlib: a visualization library for static, animated, and interactive
graphics. Install it with pip install matplotlib.
• Seaborn: a statistical visualization library built on top of Matplotlib. Install it
with pip install seaborn.
Open your Python environment and use the following commands in a Python script to
import these packages:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Your EDA toolkit is built on these three libraries. Now that you have Python and
these packages installed, you can begin examining and evaluating your data.
• Data Reading from CSV Files
Pandas makes it simple to read data stored in CSV (Comma-Separated Values) files:
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('your_data.csv')
It is important to inspect the data's structure and content as soon as you have loaded it:
print(data.head())
data.iloc[:, 5:10]
iloc selects by position. Here it selects all rows (:) and the columns from index 5
(column 6) through index 9 (column 10).
# Select the rows at positions 4 through 9
data.iloc[4:10]
# Get basic information about the dataset, including data types and missing values
print(data.info())
The first few rows of your dataset will be shown by this code, along with important
details like the number of non-null entries and the data types of each column. These
preliminary checks aid in determining the quality of the data and provide you with an
understanding of what you have.
print(data.shape)
print(data.count())
Missing data can lead to inaccurate insights in your analysis, so appropriate
handling of missing values is essential. Pandas offers multiple approaches to
deal with missing data:
missing_values = data.isnull().sum()
missing_values = data['column_name'].isnull().sum()
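Beyond counting missing values, the two most common approaches are dropping and filling. A minimal sketch, using a hypothetical 'age' column:

```python
import pandas as pd
import numpy as np

# Hypothetical column with one missing entry
data = pd.DataFrame({"age": [25, np.nan, 30, 35]})

# Option 1: drop rows that contain missing values
dropped = data.dropna()

# Option 2: fill missing values, e.g. with the column mean
filled = data["age"].fillna(data["age"].mean())

print(len(dropped))        # rows remaining after dropping
print(filled.tolist())     # [25.0, 30.0, 30.0, 35.0]
```

Which option is appropriate depends on how much data is missing and whether the missingness is random.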
Eliminating Duplicates
Duplicate records in your dataset can produce misleading results. To find
and eliminate duplicates:
data.drop_duplicates(inplace=True)
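A minimal illustration of deduplication on a hypothetical frame with one repeated record:

```python
import pandas as pd

# Hypothetical frame: the second "LA" row is an exact duplicate
df = pd.DataFrame({"id": [1, 2, 2, 3],
                   "city": ["NY", "LA", "LA", "SF"]})
df.drop_duplicates(inplace=True)
print(len(df))  # 3
```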
Pandas offers a quick and easy way to calculate summary statistics for your data. For
numerical columns, the describe() function provides a brief summary of important
statistics like the mean, median, standard deviation, and quartiles:
summary_stats = data.describe()
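A quick sketch of what describe() reports, on a hypothetical numeric column:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
summary = df.describe()

# describe() reports count, mean, std, min, the quartiles, and max;
# the 50% row is the median
print(summary.loc["mean", "x"])  # 2.5
print(summary.loc["50%", "x"])   # 2.5
```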
• Matplotlib:
Matplotlib is a complete Python visualization toolkit for static, animated, and
interactive graphics. It provides many customization options to create different
kinds of plots, ranging from simple charts to intricate visualizations.
• Seaborn:
Built on top of Matplotlib, Seaborn offers a high-level interface for producing visually
appealing and educational statistical graphics. By offering functions for frequent
tasks, it streamlines the process of producing intricate visualizations.
# Create a histogram of a numeric column
plt.hist(data['numeric_column'], bins=30)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Visualizations like box or scatter plots are frequently the first step in the outlier
detection process.
• Box Plots
Box plots, which show the distribution of the data and highlight data points outside
the "whiskers" (outliers), can be useful in locating possible outliers.
sns.boxplot(x=data['numeric_column'], color='purple')
plt.show()
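Outliers can also be flagged numerically with z-scores, as mentioned earlier. A sketch on a hypothetical series with one extreme value:

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 95])

# z-score: how many standard deviations each point lies from the mean
z = (s - s.mean()) / s.std()

# Flag points more than 2 standard deviations away
outliers = s[z.abs() > 2]
print(outliers.tolist())  # [95]
```

The cutoff (here 2) is a judgment call; 3 is also common for larger samples.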
Summary:
Exploratory Data Analysis (EDA) is a critical starting point in any data science project,
enabling us to gain a deep understanding of our data's quality, patterns, and
relationships. Through the use of essential Python libraries like Pandas, Matplotlib, and
Seaborn, we can efficiently load, clean, and visualize data, revealing hidden insights
and guiding feature engineering and modeling decisions. EDA empowers us to identify
and address issues like missing data and outliers, enabling the creation of informative
data visualizations and ultimately enhancing our ability to make data-driven decisions
and effectively communicate findings to stakeholders.