Essential Python for Data Scientists
A step-by-step roadmap
Dawn Choo
Did you know 92% of Data Science jobs require Python?
Here are essential Python skills for Data Scientists
1 Learn Python fundamentals
Key concepts
Variables and data types
type()
int(), float(), str()
list(), dict()
Control structures
if, elif, else
for loop
while loop
range()
Functions
def
return
*args
List comprehensions
[expression for item in iterable if condition]
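The concepts above fit into one short script. This is a minimal sketch; the names `describe` and `total` and the sample values are made up for illustration.

```python
# Variables and data types: convert between types and inspect them
count = int("42")        # str -> int
price = float("3.5")     # str -> float
label = str(count)       # int -> str
print(type(count), type(price), type(label))

letters = list("abc")    # a list built from an iterable
pairs = dict(a=1, b=2)   # a dict built from keyword arguments


# Control structures: if / elif / else
def describe(n):
    """Classify a number as negative, zero, or positive."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


# for loop over range(): collects the squares of 0..4
squares = []
for i in range(5):
    squares.append(i * i)

# while loop: counts down from 3 to 1
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1


# Functions: def, return, and *args for any number of positional arguments
def total(*args):
    """Sum any number of positional arguments."""
    result = 0
    for value in args:
        result += value
    return result


# List comprehension: [expression for item in iterable if condition]
even_squares = [x * x for x in range(10) if x % 2 == 0]
```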
Test your skills
Exercise 1
Implement a function to generate random even numbers.
Exercise 2
Create a list comprehension to extract vowels from a given string.
Exercise 3
Write a function that uses a loop to calculate the factorial of a number.
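Here is one possible way to approach the three exercises, if you want something to compare against. The function names and the inclusive `low`/`high` range are assumptions; the exercises do not specify them.

```python
import random


def random_even(low, high):
    """Return a random even integer n with low <= n <= high."""
    first_even = low if low % 2 == 0 else low + 1
    if first_even > high:
        raise ValueError("no even number in the requested range")
    # randrange with step 2 only visits even numbers from first_even upward
    return random.randrange(first_even, high + 1, 2)


def extract_vowels(text):
    """Return the vowels of text, in order, using a list comprehension."""
    return [ch for ch in text if ch.lower() in "aeiou"]


def factorial(n):
    """Compute n! with a loop (n must be a non-negative integer)."""
    if n < 0:
        raise ValueError("factorial is not defined for negative numbers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result
```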
2 Data Manipulation
Key concepts
Libraries: NumPy (np) & Pandas (pd)
Working with arrays
np.array()
np.reshape()
np.concatenate()
DataFrame operations
pd.DataFrame()
df.head(), df.tail()
df.info(), df.describe()
Data selection and filtering
df.loc[], df.iloc[]
Boolean indexing
df.query()
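A small sketch of the array, DataFrame, and selection calls listed so far. The table contents are invented sample data.

```python
import numpy as np
import pandas as pd

# Working with arrays
a = np.array([1, 2, 3, 4, 5, 6])
grid = np.reshape(a, (2, 3))             # 2 rows, 3 columns
stacked = np.concatenate([grid, grid])   # joins along axis 0 by default

# DataFrame operations
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "age": [34, 27, 45, 27],
    "city": ["Lima", "Oslo", "Lima", "Pune"],
})
print(df.head(2))      # first two rows
print(df.tail(2))      # last two rows
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

# Data selection and filtering
by_label = df.loc[0, "name"]       # label-based: row label 0, column "name"
by_position = df.iloc[-1, 1]       # position-based: last row, second column
over_30 = df[df["age"] > 30]       # Boolean indexing
lima_over_30 = df.query("age > 30 and city == 'Lima'")
```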
Data cleaning
df.dropna(), df.fillna()
df.drop_duplicates()
df.replace()
Merging and reshaping data
pd.merge()
df.pivot()
df.melt()
Grouping and aggregation
df.groupby()
df.agg()
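The cleaning, merging, reshaping, and grouping calls can be chained on a toy dataset like this. All tables and column names here are hypothetical, and filling missing amounts with 0.0 is just one option that may not suit real data.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "amount": [20.0, 20.0, np.nan, 15.0, 40.0],
    "status": ["ok", "ok", "ok", "N/A", "ok"],
})

# Data cleaning
deduped = orders.drop_duplicates()                   # removes the repeated first row
complete_only = deduped.dropna(subset=["amount"])    # option A: drop rows missing amount
filled = deduped.fillna({"amount": 0.0})             # option B: fill missing amount
cleaned = filled.replace({"status": {"N/A": "unknown"}})

# Merging on a common key
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})
merged = pd.merge(cleaned, customers, on="customer_id", how="left")

# Grouping and aggregation
summary = merged.groupby("region")["amount"].agg(["sum", "mean"])

# Reshaping: wide -> long with melt, long -> wide with pivot
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 30], "q2": [20, 40]})
long = wide.melt(id_vars="id", var_name="quarter", value_name="sales")
back = long.pivot(index="id", columns="quarter", values="sales")
```

The `melt` call is also a starting point for the wide-to-long exercise.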
Test your skills
For these exercises, use any dataset you like on Kaggle.
Exercise 1
Handle missing values and remove duplicates in a customer dataset.
Exercise 2
Combine multiple related datasets using a common key, then calculate summary statistics for each group.
Exercise 3
Transform a dataset from wide format to long format, creating new 'variable' and 'value' columns.
3 Exploratory Data Analysis
Key concepts
Libraries: Pandas (pd), Matplotlib (plt) & SciPy
Descriptive statistics
df.mean(), df.median(), df.mode()
df.std(), df.var()
df.min(), df.max(), df.quantile()
Data distribution
df.hist()
plt.hist()
scipy.stats.normaltest()
Correlation analysis
df.corr()
plt.imshow() (for heatmaps)
scipy.stats.pearsonr()
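A rough sketch of these calls on synthetic data. The data is randomly generated for illustration, and the `Agg` backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=5, size=200)})

# Descriptive statistics (each returns one value per column)
print(df.mean(), df.median(), df.std(), df.var(), sep="\n")
print(df.min(), df.max(), sep="\n")
print(df.quantile([0.25, 0.5, 0.75]))
print(df.mode().head(1))   # mode() returns a DataFrame; there may be several modes

# Data distribution
df.hist(bins=20)                        # one histogram per numeric column
fig, ax = plt.subplots()
ax.hist(df["x"], bins=20)               # same idea with Matplotlib directly
stat, p_value = stats.normaltest(df["x"])   # a small p-value suggests non-normality

# Correlation analysis
corr = df.corr()                        # pairwise Pearson correlations
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image)                     # heatmap legend

r, p = stats.pearsonr(df["x"], df["y"])     # coefficient and its p-value
```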
Outlier detection
plt.boxplot()
scipy.stats.zscore()
IQR method using np.percentile()
Time series analysis basics
df.resample()
df.rolling()
Plotting with plt.plot()
Basic hypothesis testing
scipy.stats.ttest_ind()
scipy.stats.chi2_contingency()
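The three topics above might look like this in practice. The numbers are invented, and the z-score cutoff of 2 and the 1.5 × IQR fences are common conventions assumed here, not rules from this roadmap.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Outlier detection
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 95])

fig, ax = plt.subplots()
ax.boxplot(values)                       # outliers show up as isolated points

z = stats.zscore(values)
z_outliers = values[np.abs(z) > 2]       # assumed cutoff of 2 standard deviations

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Time series analysis basics
dates = pd.date_range("2024-01-01", periods=14, freq="D")
ts = pd.DataFrame({"sales": np.arange(14, dtype=float)}, index=dates)
weekly = ts.resample("W").sum()          # weekly totals (weeks end on Sunday)
rolling = ts.rolling(window=3).mean()    # 3-day moving average

fig, ax = plt.subplots()
ax.plot(ts.index, ts["sales"], label="daily")
ax.plot(rolling.index, rolling["sales"], label="3-day mean")
ax.legend()

# Basic hypothesis testing
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=100)
group_b = rng.normal(loc=1.0, size=100)
t_stat, t_p = stats.ttest_ind(group_a, group_b)    # compares two group means

table = np.array([[30, 10], [15, 25]])             # 2x2 contingency table of counts
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
```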
Test your skills
For these exercises, use any dataset you like on Kaggle.
Exercise 1
Calculate and visualize basic descriptive statistics for numerical columns in the dataset.
Exercise 2
Analyze the distribution of key variables using histograms and test for normality.
Exercise 3
Identify and visualize correlations between variables, highlighting strong relationships.
4 Data Visualization
Key concepts
Libraries: Matplotlib (plt) & Pandas
Basic plotting
plt.plot() (line plots)
plt.scatter() (scatter plots)
plt.bar() (bar charts)
Histograms and density plots
plt.hist()
df.plot.kde() (Pandas; Matplotlib has no plt.kde())
Box plots
plt.boxplot()
Subplots and multiple charts
plt.subplots()
fig.add_subplot()
Customizing plots
plt.xlabel(), plt.ylabel(), plt.title()
plt.xscale(), plt.yscale()
plt.legend()
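A compact sketch that touches each plotting call above. The data is synthetic, and the density plot uses Pandas' `Series.plot.kde()`, which needs SciPy installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.arange(1, 51),
    "y": rng.normal(loc=10, scale=2, size=50),
})

# Basic plotting and customization with the pyplot interface
plt.figure()
plt.plot(df["x"], df["x"] ** 2, label="x squared")       # line plot
plt.scatter(df["x"], df["x"] * 10, s=10, label="10x")    # scatter plot
plt.xlabel("x")
plt.ylabel("value")
plt.title("Line and scatter")
plt.yscale("log")          # log-scaled y axis (all values here are positive)
plt.legend()
single_ax = plt.gca()      # the Axes the calls above drew on

# Subplots: several chart types in one figure
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].bar(["a", "b", "c"], [3, 7, 5])
axes[0, 0].set_title("Bar chart")
axes[0, 1].hist(df["y"], bins=10)
axes[0, 1].set_title("Histogram")
df["y"].plot.kde(ax=axes[1, 0])       # density plot via Pandas
axes[1, 0].set_title("Density")
axes[1, 1].boxplot(df["y"])
axes[1, 1].set_title("Box plot")
fig.tight_layout()

# Adding a subplot to an existing figure
fig2 = plt.figure()
ax = fig2.add_subplot(1, 1, 1)
ax.plot(df["x"], df["y"])
```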
Test your skills
For these exercises, use any dataset you like on Kaggle.
Exercise 1
Compare the distributions of several numerical variables using box plots and histograms.
Exercise 2
Visualize the relationship between two continuous variables with a scatter plot, adding a trend line and confidence interval.
Exercise 3
Design a stacked bar chart to show the composition of categories across different groups in the dataset.
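For the stacked bar chart, one possible starting point is to count category combinations with `pd.crosstab()` and plot the result with Pandas. The `region` and `plan` columns are hypothetical stand-ins for whatever groups and categories your dataset has.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "east"],
    "plan": ["basic", "pro", "basic", "basic", "pro", "pro"],
})

# Rows are groups, columns are categories, cells are counts
counts = pd.crosstab(df["region"], df["plan"])

# stacked=True piles each category on top of the previous one per group
ax = counts.plot(kind="bar", stacked=True)
ax.set_xlabel("region")
ax.set_ylabel("customers")
ax.set_title("Plan mix by region")
ax.legend(title="plan")
plt.tight_layout()
```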
5 Machine learning basics
Key concepts
Libraries: Scikit-learn (sklearn)
Model training and evaluation
sklearn.model_selection.train_test_split()
model.fit(), model.predict() (methods on each estimator, e.g. LinearRegression)
sklearn.model_selection.cross_val_score()
Regression models
sklearn.linear_model.LinearRegression()
sklearn.metrics.mean_squared_error()
sklearn.metrics.r2_score()
Classification models
sklearn.linear_model.LogisticRegression()
sklearn.metrics.accuracy_score()
sklearn.metrics.confusion_matrix()
Clustering
sklearn.cluster.KMeans()
sklearn.metrics.silhouette_score()
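A minimal end-to-end sketch of the calls above on synthetic data from scikit-learn's built-in generators. The dataset sizes, `random_state` values, and `n_clusters=3` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_squared_error, r2_score, silhouette_score)
from sklearn.model_selection import cross_val_score, train_test_split

# Regression: split, fit, predict, evaluate
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)   # R^2 per fold

# Classification
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.25, random_state=0, stratify=yc)
clf = LogisticRegression(max_iter=1000).fit(Xc_train, yc_train)
yc_pred = clf.predict(Xc_test)
acc = accuracy_score(yc_test, yc_pred)
cm = confusion_matrix(yc_test, yc_pred)   # rows: true class, columns: predicted

# Clustering
Xb, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(Xb)
sil = silhouette_score(Xb, labels)        # ranges from -1 to 1; higher is better
```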
Test your skills
For these exercises, use any dataset you like on Kaggle.
Exercise 1
Split the dataset into training and test
sets, then build and evaluate a linear
regression model to predict a continuous
target variable.
Exercise 2
Implement a logistic regression classifier,
use cross-validation to assess its
performance, and interpret the model
coefficients.
Exercise 3
Perform k-means clustering on the
dataset, determine the optimal number of
clusters, and visualize the results.
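For the clustering exercise, one common way to pick the number of clusters is to compare silhouette scores across candidate values of k. This sketch assumes that approach (the elbow method on inertia is another option) and uses synthetic, well-separated blobs rather than a real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (illustrative data only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.5, random_state=0)

# Score each candidate k; silhouette needs at least 2 clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the highest silhouette score

# Refit with the chosen k and plot points colored by cluster
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=final.labels_, s=10)
ax.scatter(final.cluster_centers_[:, 0], final.cluster_centers_[:, 1],
           marker="x", color="red", label="centers")
ax.set_title(f"k-means with k={best_k}")
ax.legend()
```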