Python Data Analysis Cookbook
Python Data Analysis Cookbook: Clean, scrape, analyze, and visualize data with the power of Python!

eBook
AU$53.99 AU$60.99
Paperback
AU$75.99
Subscription
Free Trial
Renews at AU$24.99p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Chapter 2. Creating Attractive Data Visualizations

In this chapter, we will cover:

  • Graphing Anscombe's quartet
  • Choosing seaborn color palettes
  • Choosing matplotlib color maps
  • Interacting with IPython notebook widgets
  • Viewing a matrix of scatterplots
  • Visualizing with d3.js via mpld3
  • Creating heatmaps
  • Combining box plots and kernel density plots with violin plots
  • Visualizing network graphs with hive plots
  • Displaying geographical maps
  • Using ggplot2-like plots
  • Highlighting data points with influence plots

Introduction

Data analysis is more of an art than a science. Creating attractive visualizations is an integral part of this art. Obviously, what one person finds attractive, other people may find completely unacceptable. Just as in art, in the rapidly evolving world of data analysis, opinions and tastes change over time; however, in principle, nobody is absolutely right or wrong. As data artists and Pythonistas, we can choose from among several libraries, of which I will cover matplotlib, seaborn, Bokeh, and ggplot. Installation instructions for some of the packages we use in this chapter were already covered in Chapter 1, Laying the Foundation for Reproducible Data Analysis, so I will not repeat them. I will provide an installation script (which uses pip only) for this chapter; you can also use the Docker image I described in the previous chapter. I decided not to include the Proj cartography library and the R-related libraries in the image because of their size. So for the two recipes involved...

Graphing Anscombe's quartet

Anscombe's quartet is a classic example that illustrates why visualizing data is important. The quartet consists of four datasets with similar statistical properties. Each dataset has a series of x values and dependent y values. We will tabulate these metrics in an IPython notebook. However, if you plot the datasets, they look surprisingly different compared to each other.

How to do it...

For this recipe, you need to perform the following steps:

  1. Start with the following imports:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    from dautil import report
    from dautil import plotting
    import numpy as np
    from tabulate import tabulate
  2. Define the following function to compute the mean, variance, and correlation of x and y within a dataset, the slope, and the intercept of a linear fit for each of the datasets:
    df = sns.load_dataset("anscombe")

    agg = df.groupby('dataset')\
            .agg([np.mean, np.var])\
            ...
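
As a rough, self-contained sketch of the kind of summary this step produces, the snippet below hard-codes the commonly published Anscombe values instead of calling `sns.load_dataset("anscombe")` (which fetches the data over the network), and uses plain NumPy rather than the recipe's `dautil` helpers. The `summarize` helper and the `quartet` dictionary are hypothetical names chosen for illustration, not part of the book's code.

```python
import numpy as np

# Anscombe's quartet as commonly published (an assumption of this sketch;
# the recipe loads the same data with sns.load_dataset("anscombe")).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics named in the recipe for one dataset."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    return {
        "mean_x": x.mean(), "mean_y": y.mean(),
        "var_x": x.var(ddof=1), "var_y": y.var(ddof=1),  # sample variance
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope, "intercept": intercept,
    }


summary = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summary.items():
    print(name, {key: round(float(value), 2) for key, value in stats.items()})
```

The four rows printed are nearly identical, which is the point of the quartet: only a plot reveals how different the datasets are.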

Choosing seaborn color palettes

Seaborn color palettes are similar to matplotlib colormaps. Color can help you discover patterns in data and is an important visualization component. Seaborn has a wide range of color palettes, which I will try to visualize in this recipe.

How to do it...

  1. The imports are as follows:
    import seaborn as sns
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import numpy as np
    from dautil import plotting
  2. Use the following function that helps plot the palettes:
    def plot_palette(ax, plotter, pal, i, label, ncol=1):
        n = len(pal)
        x = np.linspace(0.0, 1.0, n)
        y = np.arange(n) + i * n
        ax.scatter(x, y, c=x, 
                    cmap=mpl.colors.ListedColormap(list(pal)), 
                    s=200)
        plotter.plot(x,y, label=label)
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(loc='best', ncol=ncol, fontsize=18)
  3. Categorical palettes are useful for categorical data, for instance, gender or blood type. The following function plots...
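
The `plot_palette()` helper in step 2 relies on `ListedColormap` to map evenly spaced values in [0, 1] onto the palette's colors, one color per point. The following minimal sketch isolates that mapping. It assumes a made-up three-color palette given as RGB tuples (a seaborn palette is a similar list-like of RGB tuples) and does not draw anything.

```python
import numpy as np
from matplotlib.colors import ListedColormap

# Hypothetical three-color palette; not one of seaborn's named palettes.
pal = [(0.9, 0.1, 0.1), (0.1, 0.6, 0.2), (0.1, 0.2, 0.8)]
n = len(pal)

cmap = ListedColormap(list(pal))
x = np.linspace(0.0, 1.0, n)  # same spacing plot_palette() uses
rgba = cmap(x)                # one RGBA row per value in x

for value, color in zip(x, rgba):
    rgb = [round(float(channel), 2) for channel in color[:3]]
    print(f"x={value:.2f} -> RGB {rgb}")
```

Because there are exactly as many sample points as palette entries, each point picks up a distinct palette color in order.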

Choosing matplotlib color maps

The matplotlib color maps are getting a lot of criticism lately because they can be misleading; however, most colormaps are just fine in my opinion. The defaults are getting a makeover in matplotlib 2.0 as announced at http://matplotlib.org/style_changes.html (retrieved July 2015). Of course, there are some good arguments that do not support using certain matplotlib colormaps, such as jet. In art, as in data analysis, almost nothing is absolutely true, so I leave it up to you to decide. In practical terms, I think it is important to consider how to deal with print publications and the various types of color blindness. In this recipe, I visualize relatively safe colormaps with colorbars. This is a tiny selection of the many colormaps in matplotlib.

How to do it...

  1. The imports are as follows:
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    from dautil import plotting
  2. Plot the datasets with the following code:
    fig, axes = plt.subplots(4, 4)
    cmaps = ['...
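
One practical way to reason about print safety, offered here only as a rough sketch and not as the recipe's method, is to check whether a colormap's approximate grayscale brightness changes monotonically from one end to the other. The helpers `luma_profile` and `is_monotonic` are hypothetical names, and the Rec. 601 luma weights are a simple approximation of perceived brightness, not a full color-vision model.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def luma_profile(name, samples=32):
    """Approximate grayscale brightness along a named colormap."""
    rgb = plt.get_cmap(name)(np.linspace(0.0, 1.0, samples))[:, :3]
    return rgb @ np.array([0.299, 0.587, 0.114])  # Rec. 601 luma weights


def is_monotonic(values):
    """True if the values never change direction."""
    diffs = np.diff(values)
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))


for name in ["gray", "jet"]:
    print(name, "monotonic brightness:", is_monotonic(luma_profile(name)))
```

A colormap whose brightness rises and falls, as jet's does, can make distinct values look alike once printed in grayscale.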

Interacting with IPython Notebook widgets

Interactive IPython notebook widgets are, at the time of writing (July 2015), an experimental feature. I, and as far as I know, many other people, hope that this feature will remain. In a nutshell, the widgets let you select values as you would with HTML forms. This includes sliders, drop-down boxes, and check boxes. As you will see, these widgets are very convenient for visualizing the weather data I introduced in Chapter 1, Laying the Foundation for Reproducible Data Analysis.

How to do it...

  1. Import the following:
    import seaborn as sns
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
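    # IPython 4.0 and later ship the widgets in the separate ipywidgets package;
    # on older versions use: from IPython.html.widgets import interact
    from ipywidgets import interact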
    from dautil import data
    from dautil import ts
  2. Load the data and request inline plots:
    %matplotlib inline
    df = data.Weather.load()
  3. Define the following function, which displays bubble plots:
    def plot_data(x='TEMP', y='RAIN', z='WIND_SPEED', f='A', size=10, cmap='Blues...

Viewing a matrix of scatterplots

If you don't have many variables in your dataset, it is a good idea to view all the possible scatterplots for your data. You can do this with one function call from either seaborn or pandas. These functions display a matrix of plots with kernel density estimation plots or histograms on the diagonal.

How to do it...

  1. Import the following:
    import pandas as pd
    from dautil import data
    from dautil import ts
    import matplotlib.pyplot as plt
    import seaborn as sns
    import matplotlib as mpl
  2. Load the weather data with the following lines:
    df = data.Weather.load()
    df = ts.groupby_yday(df).mean()
    df.columns = [data.Weather.get_header(c) for c in df.columns]
  3. Plot with the Seaborn pairplot() function, which plots histograms on the diagonal by default:
    %matplotlib inline
    
    # Seaborn plotting, issues due to NaNs
    sns.pairplot(df.fillna(0))

    The following plots are the result:

  4. Plot similarly with the pandas scatter_matrix() function and request kernel density estimation plots on...
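
For readers without the `dautil` weather data, the pandas call can be tried on synthetic data. This is only a sketch: the column names and values are made up, the `Agg` backend is selected so it runs without a display, and histograms are used on the diagonal because kernel density plots additionally require SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather data (hypothetical columns).
df = pd.DataFrame({
    "temp": rng.normal(10, 5, 200),
    "wind": rng.normal(4, 1, 200),
    "rain": rng.exponential(2, 200),
})
df.loc[::25, "rain"] = np.nan  # a few missing values, as in real data

# One call draws every pairwise scatterplot; rows with NaNs are dropped first.
axes = pd.plotting.scatter_matrix(df.dropna(), diagonal="hist", figsize=(6, 6))
print(axes.shape)
```

The call returns a square array of matplotlib axes, one per pair of columns, which you can customize further if needed.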

Visualizing with d3.js via mpld3

D3.js is a JavaScript data visualization library released in 2011, which we can also use in an IPython notebook. We will add hovering tooltips to a regular matplotlib plot. As a bridge, we need the mpld3 package. This recipe doesn't require any JavaScript coding whatsoever.

Getting ready

I installed mpld3 0.2 with the following command:

$ [sudo] pip install mpld3

How to do it...

  1. Start with the imports and enable mpld3:
    %matplotlib inline
    import matplotlib.pyplot as plt
    import mpld3
    mpld3.enable_notebook()
    from mpld3 import plugins
    import seaborn as sns
    from dautil import data
    from dautil import ts
  2. Load the weather data and plot it as follows:
    df = data.Weather.load()
    df = df[['TEMP', 'WIND_SPEED']]
    df = ts.groupby_yday(df).mean()
    
    fig, ax = plt.subplots()
    ax.set_title('Averages Grouped by Day of Year')
    points = ax.scatter(df['TEMP'], df['WIND_SPEED'],
                        s=30, alpha=0.3)
    ax.set_xlabel(data...

Creating heatmaps

Heat maps visualize data in a matrix using a set of colors. Originally, heat maps were used to represent prices of financial assets, such as stocks. Bokeh is a Python package that can display heatmaps in an IPython notebook or produce a standalone HTML file.

Getting ready

I have Bokeh 0.9.1 via Anaconda. The Bokeh installation instructions are available at http://bokeh.pydata.org/en/latest/docs/installation.html (retrieved July 2015).

How to do it...

  1. The imports are as follows:
    from collections import OrderedDict
    from dautil import data
    from dautil import ts
    from dautil import plotting
    import numpy as np
    import bokeh.plotting as bkh_plt
    from bokeh.models import HoverTool
  2. The following function loads temperature data and groups it by year and month:
    def load():
        df = data.Weather.load()['TEMP']
        return ts.groupby_year_month(df)
  3. Define a function that rearranges data in a special Bokeh structure:
    def create_source():
        colors = plotting.sample_hex_cmap()
    
        month...
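
The grouping in step 2 uses the `dautil` helper `ts.groupby_year_month()`, whose implementation is not shown here. As a hedged illustration of the general idea only, a year-by-month matrix suitable for a heatmap can be built with plain pandas. The synthetic series and the variable names below are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the TEMP column; each value is
# simply the month number, so the monthly means are easy to verify.
index = pd.date_range("2013-01-01", "2014-12-31", freq="D")
temp = pd.Series(np.asarray(index.month, dtype=float), index=index)

# Average by (year, month), then pivot months into columns.
monthly = temp.groupby([temp.index.year, temp.index.month]).mean()
matrix = monthly.unstack()
matrix.index.name = "year"
matrix.columns.name = "month"

print(matrix.round(1))
```

Each cell of `matrix` then corresponds to one rectangle of the heatmap, with the cell value determining its color.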

Combining box plots and kernel density plots with violin plots

Violin plots combine box plots and kernel density plots or histograms in one type of plot. Seaborn and matplotlib both offer violin plots. We will use Seaborn in this recipe on z-scores of weather data. The z-scoring is not essential, but without it, the violins will be more spread out.

How to do it...

  1. Import the required libraries as follows:
    import seaborn as sns
    from dautil import data
    import matplotlib.pyplot as plt
  2. Load the weather data and calculate z-scores:
    df = data.Weather.load()
    zscores = (df - df.mean())/df.std()
  3. Plot a violin plot of the z-scores:
    %matplotlib inline
    plt.figure()
    plt.title('Weather Violin Plot')
    # Monthly means; recent pandas versions need the explicit .mean()
    sns.violinplot(data=zscores.resample('M').mean())
    plt.ylabel('Z-scores')

    Refer to the following plot for the first violin plot:

  4. Plot a violin plot of rainy and dry (the opposite of rainy) days against wind speed:
    plt.figure()
    plt.title('Rainy Weather vs Wind Speed')
    categorical = df
    categorical...
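
To see what the z-scoring in step 2 does, here is a small sketch on synthetic data (the column values are random numbers, not the book's weather data). After the transformation every column has mean 0 and standard deviation 1, which is why the violins end up on a comparable scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Two synthetic columns on very different scales.
df = pd.DataFrame({
    "TEMP": rng.normal(10.0, 6.0, 365),
    "WIND_SPEED": rng.normal(4.0, 1.5, 365),
})

# Same expression as in the recipe: subtract the column mean and
# divide by the column (sample) standard deviation.
zscores = (df - df.mean()) / df.std()

print(zscores.describe().loc[["mean", "std"]].round(3))
```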

Visualizing network graphs with hive plots

A hive plot is a visualization technique for plotting network graphs. In hive plots, we draw edges as curved lines. We group nodes by some property and display them on radial axes. NetworkX is one of the most famous Python network graph libraries; however, it doesn't support hive plots yet (July 2015). Luckily, several libraries exist that specialize in hive plots. Also, we will use an API to partition the graph of Facebook users available at https://snap.stanford.edu/data/egonets-Facebook.html (retrieved July 2015). The data belongs to the Stanford Network Analysis Project (SNAP), which also has a Python API. Unfortunately, the SNAP API doesn't support Python 3 yet.
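
The layout idea can be sketched without any plotting library. The toy example below is an assumption-laden illustration, not the API of any hive plot package: the graph, the group assignment, and the `hive_positions` helper are all made up. It places each group of nodes on its own radial axis and orders the nodes along the axis by degree. Edges would then be drawn as curves between these positions.

```python
import math
from collections import defaultdict

# Toy graph and a hypothetical partition of its nodes into three groups.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"), ("d", "e"), ("e", "f")]
groups = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}

# Node degree = number of edges touching the node.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1


def hive_positions(groups, degree, n_axes=3, r_min=1.0, step=1.0):
    """Place each group on a radial axis, ordering nodes by degree."""
    by_axis = defaultdict(list)
    for node, axis in groups.items():
        by_axis[axis].append(node)

    positions = {}
    for axis, nodes in by_axis.items():
        angle = 2 * math.pi * axis / n_axes  # axes are evenly spaced
        ordered = sorted(nodes, key=lambda n: (degree[n], n))
        for rank, node in enumerate(ordered):
            r = r_min + rank * step  # higher degree sits further out
            positions[node] = (r * math.cos(angle), r * math.sin(angle))
    return positions


positions = hive_positions(groups, degree)
for node, (x, y) in sorted(positions.items()):
    print(f"{node}: ({x:.2f}, {y:.2f})")
```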

Getting ready

I have NetworkX 1.9.1 via Anaconda. The instructions to install NetworkX are at https://networkx.github.io/documentation/latest/install.html (retrieved July 2015). We also need the community package at https://bitbucket.org/taynaud/python-louvain (retrieved July...


Key benefits

  • Analyze Big Data sets, create attractive visualizations, and manipulate and process various data types
  • Packed with rich recipes to help you learn and explore amazing algorithms for statistics and machine learning
  • Authored by Ivan Idris, expert in Python programming and proud author of eight highly reviewed books

Description

Data analysis is a rapidly evolving field and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. As Python offers a range of tools and libraries for all purposes, it has slowly evolved into the primary language for data science, including topics such as data analysis, visualization, and machine learning. Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. You will then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will achieve parallelism to improve system performance by using multiple threads and speeding up your code. By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Who is this book for?

This book teaches Python data analysis at an intermediate level with the goal of transforming you from journeyman to master. Basic Python and data analysis skills and affinity are assumed.

What you will learn

  • Set up reproducible data analysis
  • Clean and transform data
  • Apply advanced statistical analysis
  • Create attractive data visualizations
  • Web scrape and work with databases, Hadoop, and Spark
  • Analyze images and time series data
  • Mine text and analyze social networks
  • Use machine learning and evaluate the results
  • Take advantage of parallelism and concurrency

Product Details

Publication date: Jul 22, 2016
Length: 462 pages
Edition: 1st
Language: English
ISBN-13: 9781785283857


Packt Subscriptions

See our plans and pricing

AU$24.99 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive Early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

AU$249.99 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive Early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
  • Exclusive print discounts

AU$349.99 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive Early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
  • Exclusive print discounts

Frequently bought together


  • Python Machine Learning Cookbook: AU$90.99
  • Python Data Analysis Cookbook: AU$75.99
  • Advanced Machine Learning with Python: AU$67.99

Total: AU$234.97

Table of Contents

17 Chapters
1. Laying the Foundation for Reproducible Data Analysis
2. Creating Attractive Data Visualizations
3. Statistical Data Analysis and Probability
4. Dealing with Data and Numerical Issues
5. Web Mining, Databases, and Big Data
6. Signal Processing and Timeseries
7. Selecting Stocks with Financial Data Analysis
8. Text Mining and Social Network Analysis
9. Ensemble Learning and Dimensionality Reduction
10. Evaluating Classifiers, Regressors, and Clusters
11. Analyzing Images
12. Parallelism and Performance
A. Glossary
B. Function Reference
C. Online Resources
D. Tips and Tricks for Command-Line and Miscellaneous Tools
Index

Customer reviews

Rating distribution
3 out of 5 stars (2 Ratings)
5 star 0%
4 star 0%
3 star 100%
2 star 0%
1 star 0%
Dimitri Shvorob Oct 31, 2016
3 out of 5 stars
The author has kindly made the code in the book publicly available - I will ungratefully advise interested readers to get the code and skip the book. (Python beginners should skip both). "Python Data Analysis Cookbook" is a typical low-quality Packt "book product" - I don't want to call these things "books" - which packages, but does not add much value to, a ragtag but large collection of Python code. The considerable page count should be heavily discounted - first, because a Packt page, ahem, packs less text than a page in a book from a regular publisher; second, because the author supplies a self-contained code sample, with a 2-by-2 plot visualization, for most of his "recipes". (It might take a single line to calculate a correlation coefficient, but the code will take a page, and the large chart will take another). Even so, you get a whole lot of code, doing all sorts of things, from Fourier transforms to manipulating database tables. (And, consistently, plotting - but using the author's own "dalib" Python package. This means that you will have to stick with "dalib" as well). This scattershot approach is a double-edged sword: it is likely that you will find *something* useful, but it is also likely that *most* of the book will be a dead weight. And then you ask how well that useful something is explained, and how easy would it be to find something similar online... A "pass" from me; review the table of contents to make up your mind.
Amazon Verified review
Tseliso Nov 03, 2016
3 out of 5 stars
Not as good as one expected.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and those of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal)
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats, such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.
