Strava Dataset Analysis by Python

The document provides a comprehensive analysis of Strava datasets using Python, SQL, and Power-BI, focusing on user activity metrics such as steps, distance, and calories. Key findings include high correlations between total steps and distance, insights into user behavior patterns, and suggestions for improving user engagement and subscription models. The analysis emphasizes the performance gap between casual and dedicated users, highlighting the importance of consistent activity for fitness tracking applications.

Uploaded by

ffhunter7666
Strava Dataset Analysis by Python

GitHub Link- https://github.com/akshithedau/Strava_App_Analysis/
Tools Used-
The following tools were used in the Python analysis-
1) Jupyter Notebook.
2) Python version 3.12 (64-bit).
3) Anaconda Shell Navigator.
4) GitHub.
5) Command Prompt.
Libraries Used-
pandas, numpy, matplotlib, matplotlib.pyplot, seaborn, missingno,
plotly.express, PIL (Image), IPython.display

Datasets Used-
1) ("dailyActivity_merged.csv")
2) ("dailyCalories_merged.csv")
3) ("dailyIntensities_merged.csv")
4) ("dailySteps_merged.csv")
5) ("sleepDay_merged.csv")
6) ("weightLogInfo_merged.csv")
All of the above datasets were loaded into Jupyter Notebook and then merged into
a single dataset named md.
Merging reduces six scattered files to one dataset that is simpler and easier to
analyze.
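
As a minimal sketch of the merge step, the snippet below joins two small stand-in tables with pandas. The join key (Id), the left join, and the sample values are assumptions made for illustration; the original notebook may merge differently. It does show where suffixed column names such as Calories_x and Calories_y come from: pandas appends _x and _y when both tables share a column name.

```python
import pandas as pd

# Tiny stand-ins for two of the CSV files; the values are made up.
activity = pd.DataFrame({
    "Id": [1, 2],
    "ActivityDate": ["4/12/2016", "4/12/2016"],
    "TotalSteps": [8236, 5000],
    "Calories": [1985, 1700],
})
calories = pd.DataFrame({
    "Id": [1, 2],
    "ActivityDay": ["4/12/2016", "4/12/2016"],
    "Calories": [1985, 1700],
})

# Assumed join key: Id. Columns present in both frames (here Calories)
# receive the default pandas suffixes _x (left) and _y (right).
md = activity.merge(calories, on="Id", how="left")

print(md.columns.tolist())
```

With the real files, each table would be read with pd.read_csv and merged into md one after another in the same way.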
Python Analysis-
The following analysis shows the Total Steps value for activity date 4/12/2016,
which is 8236.

The following analysis shows the Total Distance value for activity date
4/12/2016, which is 5.
The following analysis shows the Total Steps grouped by unique Id.

The following analysis shows the Total Distance grouped by unique Id.
The following analysis shows the distribution of the top 10 Total Step counts,
grouped by Id.
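
The grouped totals and the top 10 distribution described above can be sketched roughly as follows. The sample rows are invented, and the share is computed as each Id's percentage of the combined top 10 total, which is an assumption about how the chart percentages were derived.

```python
import pandas as pd

# Made-up rows standing in for the merged dataset md.
md = pd.DataFrame({
    "Id": [101, 101, 102, 103],
    "TotalSteps": [4000, 6000, 5000, 5000],
})

# Sum of steps per user.
steps_by_id = md.groupby("Id")["TotalSteps"].sum()

# Keep the ten largest totals and express each as a share of their sum.
top = steps_by_id.nlargest(10)
share = (top / top.sum() * 100).round(1)

print(share)
```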

The following analysis shows the count of sleep records grouped by unique Id,
which is 33.
The following analysis shows the comparison between numerical variables.
The following analysis shows the distribution of logged activity distance grouped
by unique tracker distance.

The following analysis shows the distribution of Calories_x for the top 10 total
step counts.
The following analysis shows the distribution of Calories_y for the top 10 total
step counts.

The following analysis shows the trend of the top 15 total step counts over a
period of time for activity date 4/12/2016.
The following analysis shows the trend of the top 15 total distance values over a
period of time for activity date 4/12/2016.

The following analysis shows the trend of the top 15 tracker distance values over
a period of time for activity date 4/12/2016.
The following analysis shows a heatmap, which is useful for showing the
correlation between different numerical variables.
The following are the insights from the heatmap chart below-
1. The highest correlation is between TotalDistance and TotalSteps, with a
value of 0.98.
2. The second highest correlation is between ModeratelyActiveDistance_x and
FairlyActiveMinutes_x, with a value of 0.97.
3. The third highest correlation is between VeryActiveDistance_x and
VeryActiveMinutes_x, with a value of 0.88.
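
A ranking like the one above can be read off a correlation matrix. The sketch below computes Pearson correlations with pandas on a few invented rows and lists the column pairs from strongest to weakest; the sample values are illustrative only and do not reproduce the reported 0.98.

```python
from itertools import combinations

import pandas as pd

# Invented sample rows; real values come from the merged dataset md.
md = pd.DataFrame({
    "TotalSteps": [1000, 4000, 8000, 12000],
    "TotalDistance": [0.7, 2.9, 5.6, 8.5],
    "Calories_x": [1500, 2100, 1900, 2600],
})

# Pearson correlation between every pair of numeric columns.
corr = md.corr()

# Visit each unordered pair once and sort by correlation, highest first.
ranked = sorted(
    ((corr.loc[a, b], a, b) for a, b in combinations(corr.columns, 2)),
    reverse=True,
)

for value, a, b in ranked:
    print(f"{a} vs {b}: {value:.2f}")
```

Passing corr to seaborn's heatmap function with annot=True draws the same matrix as an annotated chart.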
The following analysis shows a pair plot, which compares each numerical variable
with every other variable and with itself to show the trend.
The following are the insights from the pair plot below-
1. Each variable compared with itself shows a straight, steadily increasing
line, which shows consistency.
2. When variables are compared with each other, they show a declining trend
after a certain point.
Business Objectives and Problem Solution-
The following suggestions can help the business improve, reach its targets, and
solve its business problems-
1. The Strava app can encourage people to track their distance. Users complete
a lot of steps per day but do not log or track distance in the app, so the
app cannot generate a fair manual report at the end of the month.
2. The Strava app can organize campaigns, seminars, webinars, or a reward
system to encourage people to join the app and achieve their fitness targets,
which will ultimately make people fitter as well as increase profits for the app.
3. The Strava app can offer a discount on the paid membership/subscription when
a user achieves a certain target level. This will encourage people to stay
fit and healthy and help Strava gain extra premium users, ultimately
increasing profits for the company.

Conclusion-
The following are the conclusions from the above analysis-
1. There are 1004 rows, of which 6 are duplicate rows.
2. There are 39 columns.
3. The 6 duplicate rows were removed.
4. All missing values in the columns were replaced with the numerical value 0.
5. The Total Steps value for the activity date 4/12/2016 is 8236.
6. The Total Distance for the activity date is 5 KM.
7. The maximum step count distribution is for ID 0, with an 18% share.
8. The second highest is for ID 14172, with a 9% share.
9. The count of sleep records is 33.
10. The highest correlation is between Total Distance and Total Steps, with a
value of 0.98.
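
The cleaning steps in points 1 to 4 can be sketched with pandas as below. The frame is a made-up stand-in for md, so its counts differ from the 1004 rows and 6 duplicates reported above.

```python
import pandas as pd

# Made-up stand-in for md: one exact duplicate row and two missing weights.
md = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "TotalSteps": [8236, 8236, 5000, 7000],
    "WeightKg": [52.6, 52.6, float("nan"), float("nan")],
})

# Count fully identical rows (the first occurrence is not counted).
n_dupes = int(md.duplicated().sum())

# Drop the duplicates, then replace every missing value with 0.
md = md.drop_duplicates().fillna(0)

print(n_dupes, md.shape)
```

Filling with 0 keeps every row, but a 0 in a column such as WeightKg means "not recorded" rather than a real measurement, which matters when averaging.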
Strava Dataset Analysis Through SQL

About the Company-


Strava, Inc. is a U.S.-based company headquartered in San Francisco, California,
that operates a popular fitness tracking app and social network used primarily by
runners, cyclists, and other athletes. Founded in 2009 by Mark Gainey and Michael
Horvath, Strava lets users record physical activities using GPS data, share
workouts with friends and a community of athletes, and compete in challenges. It
tracks metrics such as distance, pace, elevation, and heart rate, so users can
monitor their progress and compare their performance with others. The platform
supports various sports and integrates with many fitness devices, and it is known
for features like segments, leaderboards, and challenges that promote competition
and motivation.

Tools Used-
1) MySQL Workbench 8.0 CE.
2) MySQL Installer Community.
3) MySQL Administrator.
4) Command Prompt.

Columns in dataset (39)-


[ 'Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
'LoggedActivitiesDistance', 'VeryActiveDistance_x','ModeratelyActiveDistance_x',
'LightActiveDistance_x','SedentaryActiveDistance_x','VeryActiveMinutes_x',
'FairlyActiveMinutes_x', 'LightlyActiveMinutes_x', 'SedentaryMinutes_x',
'Calories_x', 'ActivityDay_x', 'Calories_y', 'ActivityDay_y',
'SedentaryMinutes_y', 'LightlyActiveMinutes_y', 'FairlyActiveMinutes_y',
'VeryActiveMinutes_y', 'SedentaryActiveDistance_y','LightActiveDistance_y',
'ModeratelyActiveDistance_y','VeryActiveDistance_y', 'ActivityDay', 'StepTotal',
'SleepDay','TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed', 'Date',
'WeightKg', 'WeightPounds', 'Fat', 'BMI', 'IsManualReport', 'LogId' ]

SQL Analysis-
The first analysis uses a general SQL query to fetch all the rows and columns of
the Strava app dataset-
Select * from strava_dataset
The analysis below shows the sum of the total distance through the SQL query-
Select sum(TotalDistance) from strava_dataset

The analysis below shows the sum of the total steps through the SQL query-
Select sum(TotalSteps) from strava_dataset

The analysis below lists the unique activity dates through the SQL query-
Select distinct ActivityDate from strava_dataset
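
The aggregate queries can be tried without a MySQL server by running them against an in-memory SQLite table from Python. This is only a sketch: SQLite stands in for MySQL, the table holds three invented rows, and the COUNT(DISTINCT ...) query is an addition showing how to get a single count, since Select distinct on its own returns one row per date.

```python
import sqlite3

# In-memory SQLite table standing in for the MySQL strava_dataset table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE strava_dataset ("
    "Id INTEGER, ActivityDate TEXT, TotalSteps INTEGER, TotalDistance REAL)"
)
conn.executemany(
    "INSERT INTO strava_dataset VALUES (?, ?, ?, ?)",
    [
        (1, "4/12/2016", 8236, 5.0),
        (1, "4/13/2016", 4000, 2.5),
        (2, "4/12/2016", 6000, 4.0),
    ],  # invented sample rows
)

# Sum over the whole table.
total_steps = conn.execute(
    "SELECT SUM(TotalSteps) FROM strava_dataset"
).fetchone()[0]

# DISTINCT lists each date once; COUNT(DISTINCT ...) returns how many there are.
listed = conn.execute(
    "SELECT DISTINCT ActivityDate FROM strava_dataset"
).fetchall()
counted = conn.execute(
    "SELECT COUNT(DISTINCT ActivityDate) FROM strava_dataset"
).fetchone()[0]

print(total_steps, listed, counted)
```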
The analysis below shows the sum of the tracker distance through the SQL query-
Select sum(TrackerDistance) from strava_dataset

The analysis below shows the sum of calories_x through the SQL query-
Select sum(calories_x) from strava_dataset
The analysis below shows the sum of calories_y through the SQL query-
Select sum(calories_y) from strava_dataset

The analysis below shows the sum of VeryActiveDistance_x through the SQL query-
Select sum(VeryActiveDistance_x) from strava_dataset
The analysis below shows the sum of VeryActiveMinutes_x through the SQL query-
Select sum(VeryActiveMinutes_x) from strava_dataset

The analysis below shows the sum of FairlyActiveMinutes_x through the SQL query-
Select sum(FairlyActiveMinutes_x) from strava_dataset
The analysis below shows the sum of ModeratelyActiveDistance_x through the SQL
query-
Select sum(ModeratelyActiveDistance_x) from strava_dataset

The analysis below shows the sum of LightActiveDistance_x through the SQL query-
Select sum(LightActiveDistance_x) from strava_dataset
The analysis below shows the sum of LightlyActiveMinutes_x through the SQL query-
Select sum(LightlyActiveMinutes_x) from strava_dataset

The analysis below shows the sum of SedentaryMinutes_x through the SQL query-
Select sum(SedentaryMinutes_x) from strava_dataset
The analysis below shows the sum of Total Steps by Id through the SQL query-
Select sum(TotalSteps) from strava_dataset group by Id

The analysis below shows the sum of Total Distance by Id through the SQL query-
Select sum(TotalDistance) from strava_dataset group by Id
The analysis below shows the sum of calories_x by Id through the SQL query-
Select sum(calories_x) from strava_dataset group by Id
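
The grouped queries above return one sum per Id but do not select the Id itself, so a row cannot be traced back to a user. A possible variant, sketched here against an in-memory SQLite table with invented rows (SQLite standing in for MySQL), adds Id to the select list and orders users by their step total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strava_dataset (Id INTEGER, TotalSteps INTEGER)")
conn.executemany(
    "INSERT INTO strava_dataset VALUES (?, ?)",
    [(1, 8236), (1, 4000), (2, 6000)],  # invented sample rows
)

# Selecting Id alongside the sum labels each total with its user.
rows = conn.execute(
    "SELECT Id, SUM(TotalSteps) AS steps "
    "FROM strava_dataset GROUP BY Id ORDER BY steps DESC"
).fetchall()

for user_id, steps in rows:
    print(user_id, steps)
```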

The SQL analysis below sums several variables with the sum aggregate function and
displays the results horizontally, one column per variable.
The SQL analysis below applies the same sum aggregate function grouped by Id,
again displaying the results horizontally, one column per variable.
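
Placing several sum calls in one select list is what produces the horizontal layout described above. The exact column list used in the original queries is not given, so the three columns below are an assumption, and the rows are invented; SQLite again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE strava_dataset ("
    "Id INTEGER, TotalSteps INTEGER, TotalDistance REAL, Calories_x INTEGER)"
)
conn.executemany(
    "INSERT INTO strava_dataset VALUES (?, ?, ?, ?)",
    [
        (1, 8236, 5.0, 1985),
        (1, 4000, 2.5, 1700),
        (2, 6000, 4.0, 1800),
    ],  # invented sample rows
)

# One result row, one column per summed variable.
overall = conn.execute(
    "SELECT SUM(TotalSteps), SUM(TotalDistance), SUM(Calories_x) "
    "FROM strava_dataset"
).fetchone()

# The same sums, one row per user.
per_id = conn.execute(
    "SELECT Id, SUM(TotalSteps), SUM(TotalDistance), SUM(Calories_x) "
    "FROM strava_dataset GROUP BY Id ORDER BY Id"
).fetchall()

print(overall)
print(per_id)
```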
ER-Diagram-

Conclusion-
Holistic Dataset Overview:
• We began the analysis by retrieving the entire dataset using the SELECT *
query to understand the available attributes and the data structure.
• The dataset consists of 39 columns covering fitness metrics such as steps,
distance, activity minutes, calories, and sleep patterns.
Total Metrics Calculation:
• Total Steps and Total Distance were summed up to get an idea of
cumulative physical activity recorded.
• Tracker Distance and Logged Activities Distance were also aggregated to
analyze device-recorded activities.
Activity Intensity Insights:
• Metrics were separated into Very Active, Moderately Active, and Light
Active categories based on both distance and minutes.
• Aggregated sums of VeryActiveMinutes_x, FairlyActiveMinutes_x, and
LightlyActiveMinutes_x offered insights into user activity intensity.
Calorie Analysis:
• Total calories burned were calculated from both calories_x and calories_y
columns to understand energy expenditure.
User Behavior Patterns:
• Unique activity dates were identified to understand user engagement over
time.
• Grouping by Id (user-wise analysis) helped uncover individual user behavior
patterns in terms of:
o Total Steps
o Total Distance
o Calories Burned
Sedentary and Sleep Analysis:
• Summed values of SedentaryMinutes_x provided insights into inactive
durations.
• Sleep-related metrics like TotalMinutesAsleep and TotalTimeInBed were
reviewed (though less emphasized in SQL queries).
Strava Dataset Analysis by Power-BI

Tools Used-
The following tools are used in the Power-BI analysis-
1) Microsoft Power-BI Desktop app.
2) Command Prompt.
3) CSV Dataset File.

Functions and Tools Used for Visualization-


The following are the tools used for Power-BI Visualization-
1) Pie chart.
2) Funnel Chart.
3) Scatter Chart.
4) Count Card.
5) Stacked Column Chart.
6) Text Box.
7) Line Chart.

Business Objective-
The business objectives for the Strava datasets are as follows-
1. To analyze the dataset and find new trends that attract more customers to
download the app.
2. To analyze the dataset and find how to attract users to the subscription
model.
3. To analyze the dataset, find insights about the activities, and analyze the
behavior and patterns behind them.
4. To analyze the dataset and find the choices and feedback of customers while
using the app.
Power-BI Analysis-
The Strava Power-BI analysis below shows the following insights-
1) Sum of Total Distance = 197.43
2) Sum of Total Steps = 272K
3) Count for Activity Date (4/12/2016) = 1
4) Count for Log Id = 33
5) For Calories_x 3921, the sum of total steps is 25K, which is the highest
of all.
6) For Calories_x 2220, the sum of total steps is 6.1K, which is the lowest
of all.
7) Distance 20.399 has the highest distribution of 29%, with 23K Total Steps.
8) Distance 20.399 has the lowest distribution of 15%, with 12K Total Steps.
The Strava Power-BI analysis below shows the following insights-
1) The highest point on the scatter chart is where the sum of
FairlyActiveMinutes_x is 50 and the sum of FairlyActiveMinutes_y is 50.
2) The lowest point on the scatter chart is where the sum of
FairlyActiveMinutes_x is 2 and the sum of FairlyActiveMinutes_y is 2.
3) The highest distribution, 100%, is for Id 8877689391, with a sum of total
distance of 20.40 among the top 10 values.
4) The second highest distribution, 69.22%, is for Id 8053475328, with a sum
of total distance of 14.12 among the top 10 values.
5) The third highest distribution, 50.44%, is for Id 7007744171, with a sum
of total distance of 10.29 among the top 10 values.
6) The lowest distribution, 36.72%, is for Id 2320127002, with a sum of total
distance of 7.49 among the top 10 values.
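
The distribution percentages in points 3 to 6 are consistent with each user's distance divided by the top user's distance. The short check below reproduces them from the four reported values; treating the percentage as "percent of the top value" is an inference from the numbers, not something the Power-BI report states.

```python
# Sum of total distance per Id, as reported in the insights above.
distances = {
    "8877689391": 20.40,
    "8053475328": 14.12,
    "7007744171": 10.29,
    "2320127002": 7.49,
}

# Each value as a percentage of the largest one.
top = max(distances.values())
percent_of_top = {
    uid: round(d / top * 100, 2) for uid, d in distances.items()
}

# How many times further the top user went than the lowest of the top 10.
ratio = round(top / distances["2320127002"], 2)

print(percent_of_top)
print(ratio)
```

The ratio works out to about 2.7, so the top user covered a little under three times the distance of the 10th ranked user.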
Conclusion-
The Leaderboard Effect: There is a large performance gap between casual users
and the most dedicated athletes. The top user in our dataset covered 20.40 units
of distance, about 2.7 times the 7.49 units covered by the 10th most active user.
It is a powerful reminder of what dedication looks like!

Consistency is Key: The analysis revealed a strong positive correlation between
the "Fairly Active Minutes" metrics (x and y). This suggests that when users are
active, they are consistently active across different measurement periods. It is
not about sporadic bursts, but sustained effort.

1. Leaderboard Highlights Intense Usage Gap
o The top-performing user covered 20.40 distance units, significantly
outpacing the 10th ranked user with just 7.49 units, which highlights a
notable performance disparity.
2. Consistent Activity Behavior Observed
o A positive correlation between "Fairly Active Minutes (x and y
coordinates)" shows that users tend to maintain a consistent level of
physical activity over time.
3. Peak Day Performance Insight
o On April 12, 2016, users logged a combined 272,000 steps and
197.43 total distance units, showcasing high engagement on a single
day.
4. Activity Intensity Drives Step Count
o Users with higher calorie burn (e.g., 3921 calories) also recorded the
highest step counts (25K steps), indicating a strong link between
exertion level and engagement.
5. Top Distance Contributors are Key Segments
o The top three contributors covered 44.81 distance units combined
(20.40 + 14.12 + 10.29), which is at most about 46% of the top 10 total
given the reported values, suggesting a small group of highly active
users significantly influences overall metrics.
6. Engagement Trends Vary by Distance Brackets
o A specific distance category (e.g., 20.399 units) recorded both high
(29%) and low (15%) distributions of step counts, implying user
behavior can vary significantly even within similar distance levels.
7. Valuable for Business Decisions
o These insights can help target high-engagement users, improve
subscription offerings, and enhance user retention strategies by
focusing on consistency, dedication, and high-performance behavior.
