Using Pandas to Perform Data Analysis - IBM Developer
Using pandas to perform data analysis
Python overview
You might think that Python is only for developers and people with computer science degrees.
However, Python is great for beginners, even those with little coding experience because it's
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 1/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
free, open source, and runs on any platform. The Python packages documentation is great,
and after an introductory course, you have a good foundation to build on.
Python is a general-purpose, high-level programming language that is used for more than
working with data. For example, it's good for developing desktop GUI applications, websites,
and web applications. However, this tutorial focuses on data and covers only the basics of
getting started with it.
Introduction to pandas
pandas is an open source Python library that provides high-performance data manipulation
and analysis. With the combination of Python and pandas, you can accomplish five typical
steps in the processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze.
There are many options when working with the data using pandas. The following list shows
some of the things that can be done using pandas.
Creating visualizations
This list is far from complete. See the pandas documentation for more of what you can do.
This tutorial walks you through some of the most interesting features of pandas using
structured data that contains information about the boroughs in London. You can download
the data used in the tutorial from data.gov.uk.
Prerequisites
To complete this tutorial, you need:
An IBM Cloud account
IBM Watson Studio
Steps
2. Click Create Resource at the top of the Resources page. You'll see the resources under the
hamburger menu at the upper left.
5. Go back to the Resources list, click your Watson Studio service, and then click Get Started.
Notebook overview
The following list shows some of the capabilities within pandas that are covered in this
tutorial.
The notebook associated with this tutorial displays more elaborate functions of pandas.
Data exploration
Loading data
As long as the data is formatted consistently and has multiple records with numbers, text, or
dates, you can typically read the data with pandas. For example, a comma-separated text file
that is saved from a table in Excel can be loaded into a notebook with the following command.
import pandas as pd
df = pd.read_csv('data.csv')
You can load other data file formats, such as HTML, JSON, or Apache Parquet, in a similar
way.
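As a minimal sketch of loading another format, the following example reads JSON with pd.read_json. The JSON text, its field names, and its values are made up for illustration, and it is read from memory instead of from a file so that the example is self-contained.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON records standing in for a file on disk.
json_text = '[{"name": "first", "value": 1}, {"name": "second", "value": 2}]'

# read_json accepts a path, a URL, or a file-like object such as StringIO.
df_json = pd.read_json(StringIO(json_text))
print(df_json)

# pd.read_html and pd.read_parquet follow the same pattern, but they need
# optional dependencies (an HTML parser such as lxml, and pyarrow or
# fastparquet, respectively), so they are not called here.
```

Each reader returns a DataFrame (pd.read_html returns a list of them), so the rest of the analysis does not depend on the original file format.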
A DataFrame is a two-dimensional data structure. The data consists of rows and columns that
you can create in various ways. For example, you can load a file or use a NumPy array and a
date for the index. NumPy is a Python library for working with multidimensional arrays and
matrices with a large collection of mathematical functions to operate on these arrays.
The following code is an example of a DataFrame df1 with dates as the index, a 6x4 array of
random numbers as values, and column names A, B, C, and D.
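The original code listing did not survive in this copy of the tutorial. One way to build a DataFrame matching that description is sketched below; the start date and the random seed are assumptions chosen only so that the example runs.

```python
import numpy as np
import pandas as pd

# Six consecutive days to use as the index (the start date is arbitrary).
dates = pd.date_range("2023-01-01", periods=6)

# A 6x4 array of random numbers; the seed only makes the run repeatable.
rng = np.random.default_rng(seed=0)
values = rng.standard_normal((6, 4))

df1 = pd.DataFrame(values, index=dates, columns=list("ABCD"))
print(df1)
```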
Running the previous code generates an output similar to the following image. The notebook
shows a few more ways of creating a DataFrame.
Selecting data
To select data by label, you access a single row or groups of rows and columns using
.loc[]. (Row labels come from the index, so first set the index to the column that you want to
look rows up by.) Or, you select data by position with .iloc[].
boroughs = df.copy()
boroughs = boroughs.set_index(['Code'])
boroughs.loc['E09000001', 'Inland_Area_(Hectares)']
boroughs.iloc[0]
boroughs.iloc[:,1]
You can select rows that meet a certain condition with boolean indexing. This uses the
actual values of the data in the DataFrame as opposed to the row and column labels or index
positions. You build conditions with comparison operators such as == and combine conditions
on different columns with & (and) and | (or), wrapping each condition in parentheses.
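The following sketch shows boolean indexing on a small, made-up DataFrame. The column names and values are placeholders for illustration and are not taken from the London boroughs data set.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "Area": ["A", "B", "C", "D"],
    "Inner/Outer": ["Inner", "Outer", "Inner", "Outer"],
    "Population": [100, 250, 300, 50],
})

# A single condition produces a boolean Series that filters the rows.
large = toy[toy["Population"] > 200]

# Combine conditions with & (and) or | (or); each condition needs its own
# parentheses because & and | bind more tightly than comparisons.
inner_and_large = toy[(toy["Inner/Outer"] == "Inner") & (toy["Population"] > 200)]
inner_or_large = toy[(toy["Inner/Outer"] == "Inner") | (toy["Population"] > 200)]

print(inner_and_large)
```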
Data transformation
Cleaning data
When exploring data, there are always transformations needed to get it in the format that you
need for your analysis, visualization, or model. The best way to learn is to find a data set and
try to answer questions with the data. Some things to check when cleaning data are:
Is the data tidy, such as each variable forms a column, each observation forms a row, and
each type of observational unit forms a table?
Are all columns in the right data format?
Are there missing values?
Are there unrealistic outliers?
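Some of the checks above can be sketched in a few lines. The DataFrame here is made up for illustration, and the 0 to 100 range used to flag outliers is an assumption that fits a percentage column only.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, one missing value,
# and one implausible percentage.
raw = pd.DataFrame({
    "Area": ["A", "B", "C", "D"],
    "Population": ["100", "250", None, "50"],
    "Rate": [71.5, 68.0, 950.0, 73.2],
})

# Are all columns in the right data format?
print(raw.dtypes)

# Are there missing values? Count them per column.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Convert the text column to numbers; the missing entry becomes NaN.
raw["Population"] = pd.to_numeric(raw["Population"])

# Are there unrealistic outliers? A percentage outside 0-100 is suspect.
suspicious = raw[(raw["Rate"] < 0) | (raw["Rate"] > 100)]
print(suspicious)
```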
The following code shows how to add and drop a column that might not be required from the
data set using pandas.
boroughs['new'] = 1
boroughs.head()
boroughs = boroughs.drop(columns='new')
boroughs.head()
Merging data
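pandas has several different options to combine or merge data. The documentation covers
these examples. In this notebook, you create two new DataFrames to explore how to merge
data. Then, you combine these DataFrames. Older versions of pandas used
DataFrame.append() for this step, but that method was removed in pandas 2.0, so the
following code uses pd.concat() instead.
cities = pd.concat([cities, cities2])
cities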
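The definitions of the two DataFrames are not included in this copy of the tutorial. The following sketch uses placeholder DataFrames with the same names to show both stacking rows and a key-based merge; all column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames created in the notebook.
cities = pd.DataFrame({"city": ["X", "Y"], "country": ["P", "Q"]})
cities2 = pd.DataFrame({"city": ["Z"], "country": ["P"]})

# Stack the rows of both DataFrames; ignore_index=True renumbers the rows
# instead of keeping the original (duplicated) index values.
combined = pd.concat([cities, cities2], ignore_index=True)

# A second hypothetical table that shares the "country" key.
countries = pd.DataFrame({"country": ["P", "Q"], "continent": ["M", "N"]})

# A left merge keeps every row of `combined` and adds the matching columns.
merged = combined.merge(countries, on="country", how="left")
print(merged)
```

pd.concat is the natural choice when the DataFrames share the same columns, and merge is the better fit when they share a key but hold different columns.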
Grouping data
Grouping data is a quick way to calculate values for classes in your DataFrame.
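# numeric_only=True skips text columns, which raise a TypeError in pandas 2.0 and later
boroughs.groupby(['Inner/Outer']).mean(numeric_only=True)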
When you have multiple categorical variables, you can create a nested index.
# numeric_only=True restricts the sum to the numeric columns
boroughs.groupby(['Inner/Outer','Political control']).sum(numeric_only=True).head(8)
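To see what grouping returns without loading the boroughs data, the following sketch groups a small, made-up DataFrame. The two grouping column names mirror the ones used above, but every value is a placeholder.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "Inner/Outer": ["Inner", "Inner", "Outer", "Outer"],
    "Political control": ["P1", "P2", "P1", "P1"],
    "Population": [100, 300, 200, 400],
    "Area_name": ["A", "B", "C", "D"],
})

# One grouping column: one row per class, averaging only numeric columns.
means = toy.groupby("Inner/Outer").mean(numeric_only=True)
print(means)

# Two grouping columns: the result has a nested (two-level) index.
totals = toy.groupby(["Inner/Outer", "Political control"])["Population"].sum()
print(totals)
```

Selecting the column before aggregating, as in the second call, limits the calculation to the values you need.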
Data visualization
Visualization in pandas uses the Matplotlib library. This plotting library provides an
object-oriented API for embedding plots into applications. Supported plot types include line
plots, histograms, scatter plots, image plots, and 3D plots. This tutorial uses
matplotlib.pyplot, which is a collection of command-style functions that make Matplotlib work
like MATLAB.
The following example shows a visualization of the employment rate through a histogram. You
can change the number of bins to get the wanted output on your histogram.
%matplotlib inline
boroughs = boroughs.reset_index()
boroughs['Employment_rate_(%)_(2015)'].plot.hist(bins=10);
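The following sketch draws the same kind of histogram outside a notebook. The values are randomly generated placeholders, not real employment rates, and the Agg backend is selected so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd

# Hypothetical values standing in for a percentage column.
rng = np.random.default_rng(seed=0)
rates = pd.Series(rng.uniform(60, 80, size=33), name="rate")

# plot.hist returns a Matplotlib Axes, which you can customize further.
ax = rates.plot.hist(bins=10)
ax.set_xlabel("Rate (%)")
ax.set_title("Distribution of a made-up rate")
```

Passing a different value for bins changes how finely the values are grouped. To write the figure to disk, call ax.figure.savefig() with a file name.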
Conclusion
This tutorial walked you through the steps to set up IBM Cloud, Watson Studio, and a Jupyter
Notebook. It gave you an overview of the ways of analyzing data using pandas and a
notebook that you can run to try it yourself.