PMT2 23
Change History
This problem builds on your knowledge of Pandas, base Python data structures, and using new tools. (Some
exercises require you to use very basic features of the networkx package, which is well documented.) It has
9 exercises, numbered 0 to 8. There are 17 available points. However, to earn 100% the threshold is 14
points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this
threshold.)
Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you
can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the
exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more
difficult exercises.
Code cells starting with the comment ### define demo inputs load results from prior exercises applied
to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work
properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in
terms of human readability). You are free to print or otherwise use Python to explore them, but we did not
print them in the starter code.
Solution (mt1-sp22.html)
Exercise 0 (1 point):
Before we can do any analysis, we have to read the data from the file it is stored in. We have defined
load_data and are using it to read from the data file.
In [1]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
def load_data(path):
    import pandas as pd
    return pd.read_csv(path, names=['film_id', 'film_name', 'actor', 'year'], skiprows=1)
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging
under the following names in a dictionary format.
In [2]:
### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(load_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 1 (1 Point):
Next we need to explore our data. Complete the function explore_data to return a tuple, t, with the
following elements.
Each row in df indicates an instance of an actor starring in a film, so it is possible that there will be multiple
rows with the same 'film_name' and 'film_id'.
In [126]:
# 1 hour
## recap groupby and count
## convert df to dict using df.to_dict() - 'dict', 'list', 'series', 'split', 'records', 'index'
#
def explore_data(df):
    ###
    ### YOUR CODE HERE
    ###
    t = []
    # t[0]
    t.append(df.shape)
    # t[1]
    t.append(df.head(5))
    # t[2]
    # 0. remove duplicates
    df = df[['film_id', 'year']].drop_duplicates()
    # 1. group df and count
    film_grp = df.groupby(['year'])['film_id'].count()
    film_grp = film_grp.reset_index()  ## very impt!!
    # 2. set index
    film_grp.set_index('year', inplace=True)
    film_dict = film_grp.to_dict('dict')  # dictionary of dictionaries
    film_dict = film_dict['film_id']  # call out from nested dict
    # 3. append
    t.append(film_dict)
    return tuple(t)
((15, 4),
        film_id                              film_name                 actor  year
 8277      1599                          Before I Fall       Medalion Rahimi  2017
 6730      1150      A Million Ways to Die in the West       Seth MacFarlane  2014
 5770       934  The Mortal Instruments: City of Bones  Jamie Campbell Bower  2013
 10007     1883                 Avengers: Infinity War           Chris Pratt  2018
 9831      1855                           Isle of Dogs           Bob Balaban  2018,
 {2011: 2, 2012: 1, 2013: 2, 2014: 1, 2016: 1, 2017: 3, 2018: 4, 2019: 1})
In [127]:
In [128]:
Out[128]:
((15, 4),
        film_id                              film_name                 actor  year
 8277      1599                          Before I Fall       Medalion Rahimi  2017
 6730      1150      A Million Ways to Die in the West       Seth MacFarlane  2014
 5770       934  The Mortal Instruments: City of Bones  Jamie Campbell Bower  2013
 10007     1883                 Avengers: Infinity War           Chris Pratt  2018
 9831      1855                           Isle of Dogs           Bob Balaban  2018,
 {2011: 2, 2012: 1, 2013: 2, 2014: 1, 2016: 1, 2017: 3, 2018: 4, 2019: 1})
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging
under the following names in a dictionary format.
In [129]:
### test_cell_ex1
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 2 (2 Points):
We will continue our exploration by identifying prolific actors. Complete the function top_10_actors to
accomplish the following:
In [203]:
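def top_10_actors(df):
    ###
    ### YOUR CODE HERE
    ###
    import pandas as pd
    # The counting and sorting steps were missing from this cell; the two
    # statements below are a reconstruction that assumes one row per (actor, film).
    counts = df.groupby('actor').size().reset_index(name='count')
    sorted_df = counts.sort_values(['count', 'actor'], ascending=[False, True])
    # top 10 actor
    top10actor = sorted_df[:10]
    mincount = top10actor['count'].min()
    # keep every actor tied with the lowest count in the top 10
    finaldf = sorted_df[sorted_df['count'] >= mincount]
    return finaldf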
actor count
0 Chloë Grace Moretz 8
1 Anna Kendrick 7
2 Jennifer Lawrence 7
3 Kevin Hart 7
4 Kristen Wiig 7
5 Melissa Leo 7
6 Melissa McCarthy 7
7 Ryan Reynolds 7
8 Bill Hader 6
9 Bryan Cranston 6
10 Christina Hendricks 6
11 Dan Stevens 6
12 Danny Glover 6
13 Idris Elba 6
14 James McAvoy 6
15 Maya Rudolph 6
16 Morgan Freeman 6
17 Nicolas Cage 6
18 Rose Byrne 6
19 Sylvester Stallone 6
Notice how all of the actors appearing in 6 or more movies are included.
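The tie rule behind this output can be illustrated on a toy table. This is only a sketch under assumptions, not the graded solution: the sample data and the name `top_with_ties` are hypothetical, and it relies on the `keep='all'` option of `DataFrame.nlargest`, which retains every row tied with the last included value.

```python
import pandas as pd

# Hypothetical per-actor film counts.
toy = pd.DataFrame({
    "actor": ["Dee", "Ann", "Cal", "Bob", "Eve"],
    "count": [2, 5, 3, 3, 1],
})

# Ask for the top 2; keep='all' also keeps everyone tied with 2nd place.
top_with_ties = (
    toy.nlargest(2, "count", keep="all")
       .sort_values(["count", "actor"], ascending=[False, True])
       .reset_index(drop=True)
)
print(top_with_ties)
```

Here Bob and Cal are tied for second place, so three rows come back even though only two were requested.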
In [204]:
In [205]:
actor count
312 Chloë Grace Moretz 8
106 Anna Kendrick 7
859 Jennifer Lawrence 7
1068 Kevin Hart 7
1097 Kristen Wiig 7
1305 Melissa Leo 7
1306 Melissa McCarthy 7
1659 Ryan Reynolds 7
184 Bill Hader 6
252 Bryan Cranston 6
337 Christina Hendricks 6
405 Dan Stevens 6
428 Danny Glover 6
740 Idris Elba 6
791 James McAvoy 6
1284 Maya Rudolph 6
1383 Morgan Freeman 6
1428 Nicolas Cage 6
1637 Rose Byrne 6
1795 Sylvester Stallone 6
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging
under the following names in a dictionary format.
In [206]:
### test_cell_ex2
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 3 (1 Point):
We will continue our exploration with a look at which years an actor has appeared in movies. Complete the
function actor_years to determine which years the given actor has appeared in movies based off of the
data in df. Your output should meet the following requirements:
Output is a dict mapping the actor's name to a list of integers (int) containing the years in
which this actor appeared in films.
There should not be any duplicate years. If an actor has appeared in one or more films in a year,
that year should be included once in the list.
The list of years should be sorted in ascending order.
In [321]:
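# The signature is assumed here; the original cell did not show it.
def actor_years(df, actor):
    # years in which the given actor appears, de-duplicated
    years = df.loc[df['actor'] == actor, 'year'].unique()
    # return outputdict: actor name -> ascending list of int years
    outputdict = {actor: sorted(int(y) for y in years)}
    return outputdict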
In [322]:
In [323]:
Out[323]:
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging
under the following names in a dictionary format.
In [324]:
### test_cell_ex3
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 4 (2 Points):
For our last exercise in exploration, we want to see some summary statistics on how many actors
participated in a movie. Complete the function movie_size_by_year to accomplish the following:
Determine the size of each film in terms of the number of actors in that film. In other words, if there
are X actors in film Y, then the size of film Y is X.
For each year, determine the minimum, maximum, and mean sizes of films released that year. All
values in the "inner" dictionaries should be of type int.
Return the results as a nested dictionary
{year: {'min': minimum size, 'max': maximum size, 'mean': mean size (rounded to the
nearest integer)}}
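As a rough illustration of the nested structure described above, the following sketch builds it from a small hypothetical table. The sample data and variable names are assumptions, and the film sizes were chosen so that no mean lands exactly on .5 (pandas rounds such halves to the nearest even integer).

```python
import pandas as pd

# Hypothetical data: one row per (film, actor) credit.
toy = pd.DataFrame({
    "year":    [2011, 2011, 2011, 2011, 2012, 2012],
    "film_id": [1, 1, 1, 2, 3, 3],
    "actor":   ["A", "B", "C", "D", "E", "F"],
})

# Size of each film = number of distinct actors credited in it.
sizes = toy.groupby(["year", "film_id"])["actor"].nunique()

# Per-year min, max, and mean of those sizes.
stats = sizes.groupby(level="year").agg(["min", "max", "mean"])
stats["mean"] = stats["mean"].round().astype(int)

# Convert to {year: {'min': ..., 'max': ..., 'mean': ...}} with plain ints.
nested = {
    int(year): {name: int(value) for name, value in row.items()}
    for year, row in stats.to_dict("index").items()
}
print(nested)
```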
In [469]:
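# ~1hr.
# multiindex and aggregation
def movie_size_by_year(df):
    df = df[['year', 'film_id', 'actor']].drop_duplicates()
    # size of each film = number of actors in it (this step was missing from the cell)
    size = df.groupby(['year', 'film_id'])['actor'].count().reset_index()
    # year
    year = size.groupby(['year']).agg({'actor': ['min', 'max', 'mean']})
    year.columns = year.columns.droplevel(0)  # multiindex df after 2 groupbys
    # round first: astype(int) alone truncates toward zero
    year['mean'] = year['mean'].round(decimals=0).astype(int)
    d = year.to_dict('index')
    return d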
In [470]:
In [471]:
movie_size_by_year(demo_df_ex4)
Out[471]:
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging
under the following names in a dictionary format.
In [473]:
### test_cell_ex4
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 5 (4 Points):
We want to ultimately do some network analytics using this data. Our first task to that end is to define our
data in terms of a network. Here are the particulars of what we want in the network.
Complete the function make_network_dict to process the data from df into this graph structure. The
graph should be returned in a nested "dictionary of sets" structure.
The keys are actor names, and the values are a set of the key actor's co-stars.
To avoid storing duplicate data, all co-actors should be alphabetically after the key actor. If
following this rule results in a key actor having an empty set of costars, that actor should not be
included as a key actor. This means that actors who only appear in films without costars would not
be included.
For example {'Alice':{'Bob', 'Alice', 'Charlie'}, 'Bob':{'Alice',
'Bob', 'Charlie'}, 'Charlie': {'Alice', 'Bob', 'Charlie'}} indicates that
there is an edge between Alice and Bob, an edge between Bob and Charlie, and an edge
between Alice and Charlie. Instead of storing all the redundant information, we would store
just {'Alice': {'Bob', 'Charlie'}, 'Bob': {'Charlie'}}.
Hint: Think about how you could use merge to determine all pairs of costars. Once you have that,
you can worry about taking out the redundant information.
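The merge idea in the hint can be sketched as follows. This is a minimal illustration on hypothetical data, not the reference solution; the names `toy`, `pairs`, and `network` are placeholders.

```python
import pandas as pd

# Hypothetical data: one row per (film, actor) credit.
toy = pd.DataFrame({
    "film_id": [1, 1, 1, 2, 2],
    "actor":   ["Alice", "Bob", "Charlie", "Bob", "Dana"],
})

# Self-merge on film_id: every ordered pair of actors who share a film,
# including each actor paired with themselves.
pairs = toy.merge(toy, on="film_id", suffixes=("_x", "_y"))

# Keep each unordered pair once, with the alphabetically earlier actor as the key.
# The strict '<' also drops the self-pairs.
pairs = pairs[pairs["actor_x"] < pairs["actor_y"]]

network = {key: set(group["actor_y"]) for key, group in pairs.groupby("actor_x")}
print(network)
```

Charlie and Dana never appear as keys because nobody who shares a film with them comes later in the alphabet.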
In [591]:
# referred to answer
def make_network_dict(df):
    films = df[['film_id', 'actor']]
    films2 = films.copy()
    # pair up every two actors that share a film
    merged = films.merge(films2, how='outer', on='film_id')
    # get subset of merged where actor x is alphabetically before actor y
    merged = merged[merged['actor_x'] < merged['actor_y']]
    costars = {k: set(g["actor_y"]) for k, g in merged.groupby("actor_x")}
    return costars
In [592]:
In [593]:
Out[593]:
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging
under the following names in a dictionary format.
In [594]:
### test_cell_ex5
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 6 (1 Point):
Now that we have our dictionary which maps actor names to a set of that actor's costars, we are going to
use the networkx package to perform some graph analysis. The networkx framework is based on the
Graph object - a Graph holds data about the graph structure, which is made of nodes and edges among
other attributes. Your task for this exercise will be to add edges to a networkx.Graph object based on a
dict of sets.
Complete the function to_nx(dos). Your solution should iterate through the parameter dos, a dict which
maps actors to a set of their costars. For each costar pair implied by the input, add an edge to the Graph
object, g. We have provided some "wrapper" code to take care of constructing a Graph object, g, and
returning it. All you have to do is add edges to it.
Note: Check the networkx documentation to find how to add edges to a graph. Part of what this exercise
is evaluating is your ability to find, read, and understand information on new packages well enough to get
started performing its basic tasks. The information is easy to find and straight-forward in this case.
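One way to add the edges explicitly is `Graph.add_edge`, shown here on a small hypothetical dict of sets. This is a sketch of the idea rather than the expected answer; `Graph.add_edges_from` would work equally well.

```python
import networkx as nx

# Hypothetical input in the "dictionary of sets" layout from Exercise 5.
dos = {"Alice": {"Bob", "Charlie"}, "Bob": {"Charlie"}}

g = nx.Graph()
for actor, costars in dos.items():
    for costar in costars:
        # add_edge creates any missing endpoint nodes automatically
        g.add_edge(actor, costar)

print(g.number_of_nodes(), g.number_of_edges())
```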
In [619]:
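import networkx as nx
def to_nx(dos):
    g = nx.Graph()
    ###
    ### YOUR CODE HERE
    ###
    # build the graph directly from the adjacency dict
    # (networkx reads the sets as neighbor lists); this replaces the empty graph above
    g = nx.Graph(dos)
    return g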
In [620]:
In [621]:
Out[621]:
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging
under the following names in a dictionary format.
In [622]:
### test_cell_ex6
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 7 (2 Points):
One thing that the networkx package makes relatively easy is calculating the degree of each of the nodes
in our graph. Here degree would be interpreted as the number of unique costars each actor has. If you have
a graph g then g.degree() will return an object that maps each node to its degree (see note).
Complete the function high_degree_actors(g, n): Given the inputs described below, determine the
degree of each actor in the graph, g. Return a pd.DataFrame with 2 columns ('actor' and 'degree'),
indicating an actor's name and degree. The output should have records for only the actors with the n highest
degrees. In the case of ties (two or more actors having the same degree), all of the actors with the lowest
included degree should be included. (for example if there's a 3-way tie for 10th place and n=10 then all 3 of
the actors involved in the tie should be included in the output). If n is None, all of the actors should be
included.
Sort your results by degree (descending order) and break ties (multiple actors w/ same degree) by sorting
them in alphabetical order based on the actor's name.
input g - a networkx graph object having actor names as nodes and edges indicating whether the
actors were costars based on our data.
input n - int indicating how many actors to return. This argument is optional for the user and has a
default value of None.
Note: One complication is that g.degree() isn't a dict. Keep in mind that it can be cast to a dict.
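The cast mentioned in the note can be sketched like this, on a small hypothetical graph. The variable names are placeholders, and the sketch stops at a sorted table; it does not apply the top-n filter.

```python
import networkx as nx
import pandas as pd

# Hypothetical costar graph.
g = nx.Graph([("Alice", "Bob"), ("Alice", "Charlie"),
              ("Bob", "Charlie"), ("Charlie", "Dana")])

# g.degree() returns a view of (node, degree) pairs; dict() makes it a plain mapping.
degree_by_actor = dict(g.degree())

degrees = (
    pd.DataFrame(list(degree_by_actor.items()), columns=["actor", "degree"])
      .sort_values(["degree", "actor"], ascending=[False, True])
      .reset_index(drop=True)
)
print(degrees)
```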
In [813]:
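# referred to answer
# The signature, the sort, and the n >= len guard are reconstructed; the cell omitted them.
def high_degree_actors(g, n=None):
    import pandas as pd
    d = dict(g.degree())
    df = pd.DataFrame(list(d.items()), columns=['actor', 'degree'])
    # degree descending, ties broken alphabetically by actor
    sorted_df = df.sort_values(['degree', 'actor'], ascending=[False, True]).reset_index(drop=True)
    if n is None or n >= len(sorted_df):
        return sorted_df
    # keep every actor tied with the n-th highest degree
    top_n_min = sorted_df['degree'].iloc[n-1]
    return sorted_df[sorted_df['degree'] >= top_n_min]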
actor degree
0 Elizabeth Banks 9
1 Emma Stone 9
2 Bradley Cooper 8
3 Anthony Mackie 7
4 Michael Peña 7
5 Maya Rudolph 6
6 Richard Jenkins 6
7 Stanley Tucci 6
8 Steve Carell 6
In [814]:
In [815]:
actor degree
0 Elizabeth Banks 9
1 Emma Stone 9
2 Bradley Cooper 8
3 Anthony Mackie 7
4 Michael Peña 7
5 Maya Rudolph 6
6 Richard Jenkins 6
7 Stanley Tucci 6
8 Steve Carell 6
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging
under the following names in a dictionary format.
In [816]:
### test_cell_ex7
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
In [817]:
input_vars, original_input_vars
Out[817]:
In [818]:
true_output_vars
Out[818]:
{'top_actors': actor degree
0 Paul Rudd 6
1 Steve Carell 5
2 Amy Poehler 4
3 Harrison Ford 4
4 Mark Ruffalo 4
5 Nicole Kidman 4
6 Catherine Keener 2
7 John Goodman 2
8 Matt Damon 2
9 Michael Caine 2
10 Tom Hanks 2}
In [801]:
returned_output_vars
Out[801]:
{'top_actors': actor degree
4 Chris Evans 4}
Exercise 8 (3 Points):
Another place where networkx shines is in its built-in graph algorithms, like community detection. We have
calculated the communities using networkx (check the docs for info on how to do this yourself) and have
the communities variable set to a list of sets (you can iterate over communities like a list, and each
set is the names of all the actors in one community).
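The exam does not say which algorithm produced `communities`. As one possibility, the sketch below uses `greedy_modularity_communities` from `networkx.algorithms.community` on a small hypothetical graph; the choice of algorithm and the toy data are assumptions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical graph: two triangles joined by a single bridge edge.
g = nx.Graph([
    ("A", "B"), ("B", "C"), ("A", "C"),   # first triangle
    ("D", "E"), ("E", "F"), ("D", "F"),   # second triangle
    ("C", "D"),                           # bridge
])

# Each community comes back as a frozenset of node names.
communities = [set(c) for c in greedy_modularity_communities(g)]

for community in communities:
    print(sorted(community))
```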
Given
Complete the function notable_actors_in_comm. Your solution should accomplish the following:
We must handle cases where there are fewer than 10 actors in a community. In such
cases, all actors in the community should be included in the result without raising an error.
3. Output should be sorted in descending order of degree with ties (two or more actors with same
degree) broken by sorting alphabetically by actor name.
4. Include only actors with degree >= the 10th highest degree. This may mean that there are more
than 10 actors in the result.
5. The index of the result should be sequential numbers, starting with 0.
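The small-community requirement can be handled with an explicit length check before looking up the 10th-highest degree. The helper below is a hypothetical sketch: its name, its parameters, and the assumption that degrees arrive as a plain dict are not taken from the exam.

```python
import pandas as pd

def top_in_community(degree_by_actor, community, n=10):
    """Rank a community's actors by degree, keeping ties at the n-th place."""
    rows = [(actor, degree_by_actor[actor]) for actor in community]
    ranked = (
        pd.DataFrame(rows, columns=["actor", "degree"])
          .sort_values(["degree", "actor"], ascending=[False, True])
          .reset_index(drop=True)
    )
    # Fewer than n actors: return them all instead of indexing past the end.
    if len(ranked) <= n:
        return ranked
    cutoff = ranked["degree"].iloc[n - 1]
    return ranked[ranked["degree"] >= cutoff].reset_index(drop=True)

# Hypothetical degrees and community.
degree_by_actor = {"A": 3, "B": 5, "C": 3, "D": 1}
small = top_in_community(degree_by_actor, {"A", "B", "C"})             # only 3 actors, n=10
tied = top_in_community(degree_by_actor, {"A", "B", "C", "D"}, n=2)    # tie at 2nd place
print(small)
print(tied)
```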
In [822]:
actor degree
0 Bryan Cranston 135
1 Anthony Mackie 116
2 Johnny Depp 115
3 Idris Elba 112
4 Joel Edgerton 109
5 James Franco 107
6 Jessica Chastain 107
7 Jeremy Renner 105
8 Chris Hemsworth 104
9 Zoe Saldana 104
In [823]:
In [824]:
actor degree
0 Bryan Cranston 135
1 Anthony Mackie 116
2 Johnny Depp 115
3 Idris Elba 112
4 Joel Edgerton 109
5 James Franco 107
6 Jessica Chastain 107
7 Jeremy Renner 105
8 Chris Hemsworth 104
9 Zoe Saldana 104
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging
under the following names in a dictionary format.
In [825]:
### test_cell_ex8
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
Fin. This is the end of the exam. If you haven't already, submit your work.