Pandas_numpy_handling_data
• Look at the value counts of genres for songs longer than 2,000,000 ms:
• itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts()
• Filtering can also use the isin method, which checks whether each
value is in the list or set passed to it. Here the condition is also
negated with the tilde (~), so the non-music genres are excluded and
the only_music DataFrame contains only music genres, just as the
variable name suggests.
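A minimal sketch of both filters on this slide. The small DataFrame stands in for itunes_df, and the non_music_genres list is an assumption, since the original list of excluded genres is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for itunes_df; rows and genre names are invented.
itunes_df = pd.DataFrame({
    'Track': ['Song A', 'Episode 1', 'Song B', 'Episode 2', 'Song C'],
    'Genre': ['Rock', 'TV Shows', 'Jazz', 'Drama', 'Rock'],
    'Milliseconds': [210_000, 2_600_000, 2_100_000, 2_900_000, 180_000],
})

# Boolean mask keeps only tracks longer than 2,000,000 ms,
# then value_counts() tallies how often each genre appears.
long_tracks = itunes_df[itunes_df['Milliseconds'] > 2e6]
long_genre_counts = long_tracks['Genre'].value_counts()

# Assumed list of non-music genres (not taken from the slides).
non_music_genres = ['TV Shows', 'Drama']

# isin() marks rows whose genre is in the list; ~ flips the mask,
# so only rows with a music genre remain.
only_music = itunes_df[~itunes_df['Genre'].isin(non_music_genres)]

print(long_genre_counts)
print(only_music)
```

The parentheses-free form works because ~ applies to the whole boolean Series returned by isin.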
Missing values
• For example, the Composer column has several missing values.
We can use filtering to see some of these rows:
itunes_df[itunes_df['Composer'].isna()].sample(5, random_state=42)
• Another option is to drop the missing values. We can either
drop the entire column, as we did earlier, or drop every row
that has a missing value in any column:
itunes_df.dropna(inplace=True)
• Missing values can also be filled with a specific value, using either loc or fillna:
• itunes_df.loc[itunes_df['Composer'].isna(), 'Composer'] = 'Unknown'
• itunes_df['Composer'] = itunes_df['Composer'].fillna('Unknown')
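The three options above (inspect, drop, fill) can be compared side by side. This is a sketch on a hypothetical four-row frame, not the real iTunes data; each option works on its own copy so the results stay independent.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; composer names are invented.
df = pd.DataFrame({
    'Track': ['Song A', 'Song B', 'Song C', 'Song D'],
    'Composer': ['Ann Lee', np.nan, 'Bo Diaz', None],
})

# 1) Inspect: rows where Composer is missing (NaN and None both count).
missing_rows = df[df['Composer'].isna()]

# 2) Drop: dropna() without inplace returns a new DataFrame
#    and leaves df untouched.
dropped = df.dropna()

# 3a) Fill with loc: select the missing cells and assign a value.
filled_loc = df.copy()
filled_loc.loc[filled_loc['Composer'].isna(), 'Composer'] = 'Unknown'

# 3b) Fill with fillna: assign the result back to the column.
filled_fillna = df.copy()
filled_fillna['Composer'] = filled_fillna['Composer'].fillna('Unknown')

print(missing_rows)
print(dropped)
print(filled_fillna)
```

Assigning the fillna result back to the column avoids calling inplace=True on a column selection, which recent pandas versions warn about.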
Missing values
• Using KNN imputation
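The slide does not say which tool performs the imputation. One possible sketch uses scikit-learn's KNNImputer, which is an assumption here; the columns and values are invented. KNN imputation works on numeric columns: each missing value is replaced by the average of that column over the nearest rows, where distance is measured on the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer  # assumed library, not named in the slides

# Hypothetical numeric sample with one missing duration.
df = pd.DataFrame({
    'Milliseconds': [200_000.0, 210_000.0, np.nan, 400_000.0],
    'Bytes': [6_400_000.0, 6_700_000.0, 6_500_000.0, 12_800_000.0],
})

# Use the 2 nearest rows (by the non-missing columns) to fill each gap.
imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
```

In practice the columns are often scaled first, since distances are otherwise dominated by the column with the largest values.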
Duplicate values
• To check for duplicates, use the duplicated() method:
itunes_df.duplicated().sum()
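A small sketch of counting duplicates on a hypothetical frame. The drop_duplicates() call is a common follow-up step that the slide itself does not mention.

```python
import pandas as pd

# Hypothetical sample where the third row repeats the first.
df = pd.DataFrame({
    'Track': ['Song A', 'Song B', 'Song A', 'Song C'],
    'Genre': ['Rock', 'Jazz', 'Rock', 'Rock'],
})

# duplicated() marks a row True when an identical row appeared earlier;
# summing the boolean Series counts those repeats.
n_duplicates = df.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = df.drop_duplicates()

print(n_duplicates)
print(deduplicated)
```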
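• To lowercase the genre names, apply a custom function to the column:
def lowercase(x):
    return x.lower()
itunes_df['Genre'].apply(lowercase)
Or use the built-in string method:
itunes_df['Genre'].str.lower()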
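Both approaches should give the same values. The sketch below checks that on a hypothetical Series of genre names.

```python
import pandas as pd

# Hypothetical genre values with mixed capitalization.
genres = pd.Series(['Rock', 'JAZZ', 'Hip Hop'], name='Genre')


def lowercase(x):
    """Return the lowercase version of a single string."""
    return x.lower()


# apply() calls the Python function once per element.
via_apply = genres.apply(lowercase)

# The .str accessor runs the same operation on the whole column.
via_str = genres.str.lower()

print(via_apply.tolist())
print(via_str.tolist())
```

The .str version is usually preferred for simple string operations; it also passes missing values through, whereas the custom function would raise an error on a NaN.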
Groupby
• Groupby in pandas works like GROUP BY in SQL: rows are grouped
by the unique values in a column, then aggregated per group
itunes_df.groupby('Genre')['Seconds'].mean().sort_values().head()
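A sketch of the groupby on a hypothetical frame with invented durations. Selecting the Seconds column before calling mean() keeps the aggregation away from non-numeric columns, which recent pandas versions refuse to average.

```python
import pandas as pd

# Hypothetical sample; durations are invented.
df = pd.DataFrame({
    'Genre': ['Rock', 'Rock', 'Jazz', 'Pop'],
    'Seconds': [200.0, 300.0, 400.0, 180.0],
})

# Group rows by genre, average the Seconds column within each group,
# then sort ascending so the shortest genres come first.
mean_seconds = df.groupby('Genre')['Seconds'].mean().sort_values()

print(mean_seconds)
```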
Writing DataFrame to disk
• Pandas offers many ways to save data: CSV, Excel, HDF5, and others
itunes_df.to_csv('data/saved_itunes_data.csv', index=False)
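A sketch of a CSV round trip, assuming a hypothetical two-row frame. It writes to a temporary directory rather than data/, so it runs without that folder existing.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'Track': ['Song A', 'Song B'], 'Seconds': [200, 300]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'saved_itunes_data.csv'

    # index=False leaves the row index out of the file.
    df.to_csv(csv_path, index=False)

    # Read the file back to confirm what was written.
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

Without index=False, the index is written as an extra unnamed column that reappears when the file is read back.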