Python made easy with CyberGiant
Study Guide: Correlation Analysis in Python
1. Introduction to Correlation
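Correlation measures the strength and direction of the
relationship between two variables. It tells us whether an
increase in one variable tends to go along with an increase or
a decrease in the other. On its own, it does not show that one
variable causes the other.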
● Positive Correlation: Both variables increase together.
● Negative Correlation: One variable increases while the
other decreases.
● No Correlation: No relationship between the variables.
The Pearson correlation coefficient (r) is a common measure:
● r = 1: Perfect positive correlation.
● r = -1: Perfect negative correlation.
● r = 0: No linear correlation.
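To make these values concrete, here is a minimal sketch that computes r directly from its definition: the covariance of the two variables divided by the product of their spreads. The helper name pearson_r and the small example lists are illustrative assumptions, not part of this guide's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson r: co-variation of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises with x: r is close to 1
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises: r is close to -1
r_flat = pearson_r(x, [2, 1, 3, 1, 2])   # no linear trend: r is close to 0

print("Positive:", r_up)
print("Negative:", r_down)
print("None:", r_flat)
```

In practice a library function such as scipy.stats.pearsonr does this work; the hand-written version is only meant to show where the number comes from.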
2. Setting Up Python
First, ensure you have Python installed. Then, install the
required libraries. Open your terminal or command prompt and
type:
pip install pandas seaborn matplotlib scipy
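After installing, a quick check like the sketch below can confirm that Python can find each library. It uses only the standard library; the variable names required and status are illustrative.

```python
import importlib.util

# Libraries this guide installs with pip
required = ["pandas", "seaborn", "matplotlib", "scipy"]

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If a library is reported as missing, rerun the pip command above using the same Python installation that runs your script.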
3. Writing the Python Code
Step 1: Import Libraries
We need several libraries for data manipulation, statistical
analysis, and visualization.
import pandas as pd # For data handling
import seaborn as sns # For data visualization
import matplotlib.pyplot as plt # For plotting graphs
from scipy.stats import pearsonr # For statistical analysis
Step 2: Create Sample Data
Let's create a simple dataset with two variables: StudyHours and
TestScores.
# Sample data
data = {
    # Number of hours studied
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    # Corresponding test scores
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
}
# Convert data to DataFrame
df = pd.DataFrame(data)
Step 3: Calculate Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear
relationship between two variables.
# Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(df['StudyHours'], df['TestScores'])
# Should print a value close to 1 (the sample data is perfectly linear)
print("Pearson Correlation Coefficient:", correlation)
# Should print a very small number (at or near 0), indicating significance
print("P-value:", p_value)
Explanation:
● correlation is a number between -1 and 1. Its sign gives the
direction of the relationship and its absolute value gives
the strength.
● p_value is the probability of seeing a correlation at least
this strong if the two variables were actually uncorrelated.
A small p-value (typically < 0.05) means the correlation is
statistically significant.
Step 4: Visualize the Data
Visualization helps us see the relationship between the
variables.
# Create scatter plot
sns.scatterplot(x='StudyHours', y='TestScores', data=df)
plt.title("Scatter Plot of Study Hours vs Test Scores")
plt.xlabel("Study Hours")
plt.ylabel("Test Scores")
plt.show()
Explanation:
● The scatter plot shows individual data points.
● If the points roughly form a straight line, it indicates a
strong linear relationship.
Step 5: Correlation Matrix
For datasets with more variables, a correlation matrix can show
the correlation between each pair of variables.
# Add more variables for demonstration
# Number of practice tests taken
df['PracticeTests'] = [2, 3, 1, 2, 3, 4, 2, 5, 4, 6]
# Calculate correlation matrix
corr_matrix = df.corr()
# Visualize correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlation Matrix")
plt.show()
Explanation:
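● df.corr() calculates the Pearson correlation between every
pair of numeric columns and returns the results as a matrix.
● The heatmap colors each cell by its coefficient. With the
coolwarm colormap, red cells mark the highest values in the
matrix and blue cells the lowest. By default the color scale
spans only the values present; passing vmin=-1 and vmax=1 to
sns.heatmap fixes it to the full range of r.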
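Besides plotting it, you can read values straight out of the correlation matrix. The sketch below rebuilds the same DataFrame so that it runs on its own, then looks up one pair and ranks the other variables by their correlation with TestScores. The names hours_vs_practice and ranked are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
    'PracticeTests': [2, 3, 1, 2, 3, 4, 2, 5, 4, 6],
})

corr_matrix = df.corr()  # Pearson is the default method

# Look up one pair by row label and column label
hours_vs_practice = corr_matrix.loc['StudyHours', 'PracticeTests']
print("StudyHours vs PracticeTests:", round(hours_vs_practice, 2))

# Rank the other variables by their correlation with TestScores
ranked = corr_matrix['TestScores'].drop('TestScores').sort_values(ascending=False)
print(ranked)
```

The matrix is symmetric, and its diagonal is always 1 because every variable correlates perfectly with itself.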
4. Full Example Code
Here is the complete example with all steps combined:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# Step 1: Create sample data
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
    'PracticeTests': [2, 3, 1, 2, 3, 4, 2, 5, 4, 6]
}
# Step 2: Convert data to a DataFrame
df = pd.DataFrame(data)
# Step 3: Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(df['StudyHours'], df['TestScores'])
print("Pearson Correlation Coefficient:", correlation)
print("P-value:", p_value)
# Step 4: Create scatter plot
sns.scatterplot(x='StudyHours', y='TestScores', data=df)
plt.title("Scatter Plot of Study Hours vs Test Scores")
plt.xlabel("Study Hours")
plt.ylabel("Test Scores")
plt.show()
# Step 5: Calculate and visualize correlation matrix
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlation Matrix")
plt.show()
5. Interpreting the Results
● Pearson Correlation Coefficient: A value close to 1 indicates
a strong positive linear relationship, a value close to -1 a
strong negative one, and a value near 0 little or no linear
relationship.
● P-value: A small p-value (< 0.05) suggests the correlation
is statistically significant.
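● Scatter Plot: Helps visualize the relationship between two
variables. Points that cluster tightly around a straight
line indicate a strong linear correlation.
● Heatmap: Visualizes the correlation matrix. With the coolwarm
colormap, red cells mark the highest coefficients and blue
cells the lowest; the annotated numbers give the exact
values.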