ML Project


Revenue Forecasting for E-Commerce Businesses

Utkarsh Rajauria
Amit Chaurasiya
Aadarsh

Department of ECE, IIITD


utkarsh19214@iiitd.ac.in,
aadarsh19131@iiitd.ac.in,
amit19142@iiitd.ac.in

Abstract

Analyzing revenue data is essential for understanding the performance of a company's products and overall
business strategy. By analyzing revenue data, businesses can identify top-performing products, monitor sales
trends, forecast future revenue, and identify underperforming products that may require adjustments to marketing
strategies or product development. This information is valuable for making informed decisions about resource
allocation, inventory management, and overall business planning. We compute each of our predictions
with three machine learning techniques: linear regression, neural networks, and decision trees. Each
technique is fitted to every prediction target. The models are tested with multiple hyperparameter
settings and their performance is analysed.

Keywords: Profit and Revenue Prediction, Machine Learning, Decision Tree, Neural Networks,
Linear Regression.

1. INTRODUCTION

1.1 Background

● What is e-commerce, and why do we need it? Why use an ML model for forecasting revenue?

E-commerce, short for electronic commerce, refers to the buying and selling of goods and services through
online channels such as websites, mobile applications, or social media platforms.
With the rapid growth of the internet and technological advances, e-commerce has become
an essential part of modern business. An ML model for forecasting revenue can give businesses
accurate and reliable predictions of future revenue based on historical data. ML models offer
a more sophisticated and automated approach to revenue forecasting, allowing businesses to
make data-driven decisions and optimize their performance.

1.2 Literature Survey

● Revenue forecasting is a critical aspect of managing e-commerce businesses, and numerous studies
have explored approaches to predicting revenue for online retailers. Some of these are surveyed
below:
● One popular approach to revenue forecasting is the use of time series analysis, which involves
analyzing past revenue data to identify patterns and trends and predict future revenue (Wei, Geng,
Ying, & Shuaipeng, 2014)[3].
● Another approach to revenue forecasting is the use of customer behavior data. By analyzing customer
behavior, such as purchase history, browsing patterns, and social media activity, businesses can
predict future revenue and tailor their marketing strategies accordingly (Patangia, 2020)[2].
● Some studies have also explored the use of predictive analytics, which involves using data mining and
machine learning techniques to identify patterns and trends in large data sets (Yin & Tao, 2021)[3].

1.3 Objectives

The objectives we aim to achieve are:

● Performing EDA on the features of the dataset.
● Predicting the net revenue and profit of the company for a given year.
● Calculating profit and revenue broken down by the following categories:
1. Customer’s gender
2. Country
3. Sub-category
4. Product
5. Age group

1.4 Scope

The scope of revenue forecasting for e-commerce businesses using ML is vast. ML models
can be used to forecast revenue for individual products and for sales channels such as online marketplaces,
social media platforms, and retail stores. This can help businesses optimize their product
offerings, pricing strategies, and sales and distribution strategies, and allocate resources effectively.

1.5 Impact

ML models are capable of analyzing vast amounts of data and identifying patterns that
traditional statistical models may miss. This can lead to more accurate revenue forecasts, helping businesses
make better-informed decisions. It also helps companies optimize their inventory levels, reducing the risk
of stockouts and overstocking and hence saving costs.

2. MATERIALS AND METHOD

2.1 Dataset

We use a dataset from a company that operates in multiple countries and sells a wide variety of
products. The dataset has around 18 columns, of which profit and revenue are our target
variables. Features include the date of purchase, product category and sub-category, customer age group,
quantity purchased, unit cost, country of purchase, and customer gender. It contains over 110,000 (1.1 lakh)
data points, which is enough to train our models and to filter out relevant subsets for the
different prediction types. The dataset is available at the link below.

Link: https://drive.google.com/file/d/1ExtvHACrwaiZ-AxxOz24EBIHN_uxkl_4/view?usp=share_link

2.2 Methodology

We will be extracting and using our dataset using the pandas library in python and all the o
2.2.1 Exploratory data analysis:

We performed the following analysis on our dataset:


● Mean, mode, standard deviation, minimum and maximum value of each feature
● Check for null values
● Correlation matrix
● Distribution graphs
● Year v/s Revenue
● Product v/s Revenue
● Sub-Category v/s Revenue
● Country v/s Revenue
● Age Group v/s Item Sales
● Favorite Category for Men
● Favorite Category for Women
● Country-wise Purchasing Power
● Age v/s Category Scatter Plot
● Month-wise Sale of Dataset



● Analysis after performing the EDA:

1. We normalized the dataset using the min-max scaler.
2. No null values were found.
3. After computing the correlation matrix, we removed the highly correlated features.
4. Gender is not an important feature, as no significant difference in sales was found between genders.
5. The Bike category has the highest revenue.
6. The highest sales were observed in December, June and May.
7. Sales in most sub-categories drop for customers aged over 65.
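
The EDA steps listed above (summary statistics, null check, correlation-based feature removal, and min-max scaling) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual notebook: the column names, the 0.9 correlation threshold, and the generated values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the sales data; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Order_Quantity": rng.integers(1, 30, n),
    "Unit_Cost": rng.uniform(1, 500, n),
    "Customer_Age": rng.integers(17, 80, n),
})
df["Unit_Price"] = df["Unit_Cost"] * 1.6  # deliberately correlated with Unit_Cost
df["Revenue"] = df["Order_Quantity"] * df["Unit_Price"]

summary = df.describe()          # count, mean, std, min, max per numeric column
null_counts = df.isnull().sum()  # number of missing values per column

# Correlation matrix of the features (target excluded).
features = df.drop(columns=["Revenue"])
corr = features.corr().abs()

# Keep only the upper triangle so each feature pair is inspected once,
# then drop the later column of any pair above the assumed threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = features.drop(columns=to_drop)

# Min-max scaling maps every remaining feature to the range [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(reduced), columns=reduced.columns)
```

In this toy data only Unit_Price is removed, because it is an exact multiple of Unit_Cost. On the real dataset the threshold would need to be chosen by inspecting the correlation matrix.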

3. MODEL SELECTION AND TRAINING


We have used the following models on our dataset:
● Linear Regression: The data was split into training (70%) and testing (30%) sets. After removing the
redundant features from our dataset, 10 features were left. We then fitted the linear regression
model and observed:
RMSE: 56.0017613663

● Decision Tree: We used a decision tree regressor to look for better results than the
linear regression model. Here the dataset was first divided into train (60%),
validation (20%) and test (20%) sets.
RMSE: 17.552048499545403
We then performed a grid search using the GridSearchCV module to find the best hyperparameters for
the regressor, which were:
max_depth : 10
min_samples_leaf : 1
min_samples_split : 2
● Random Forest with K-fold cross-validation: We tried to obtain a better RMSE, but after applying K-fold
cross-validation with k = 5 the RMSE increased.
Average RMSE: 44.31867454822607

● XGBoost:
Train RMSE error of best model: 8.230586751424866
Test RMSE error of best model: 8.267245545611074

● XGBoost Regressor with PCA:


Train RMSE error of best model: 25.57783189802092
Test RMSE error of best model: 76.49888789902742

We used grid search with the GridSearchCV module to find the best hyperparameters.
Based on the results, the XGBoost regressor without PCA performed significantly
better than the XGBoost regressor with PCA. One likely reason is that the PCA transformation
does not preserve enough information from the original features, leading to a loss of predictive
power.

● Elastic Net:
Train RMSE error of best model: 1.021254564952723
Test RMSE error of best model: 0.8255877717428844

● Elastic Net with PCA:


Train RMSE error of best model: 0.7208423018024691
Test RMSE error of best model: 0.6139800567406719

As we saw earlier, the LR model overfitted our data, which led to its poor performance.
To reduce overfitting we used Elastic Net regularization,
which combines L1 and L2 regularization. Among the models discussed
so far, Elastic Net with PCA gives the lowest test RMSE on our dataset.
● Neural Networks:
RMSE: 0.4944546993327445
The neural network has one hidden layer with 32 neurons and a ReLU activation function, and an
output layer with one neuron and a linear activation function. The model is trained for 100 epochs with a batch size
of 32 using the model's fit method.
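
The network described above can be approximated with scikit-learn's MLPRegressor, as sketched below. The report does not say which framework was used, so this is a swapped-in equivalent rather than the project's code. The synthetic data, the 70/30 split, and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in: 10 features already scaled to [0, 1] (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 10))
y = X @ rng.uniform(0.5, 2.0, size=10) + rng.normal(0, 0.05, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One hidden layer of 32 ReLU units; MLPRegressor always uses a single
# linear (identity) output unit for a one-dimensional target.
model = MLPRegressor(
    hidden_layer_sizes=(32,),
    activation="relu",
    batch_size=32,
    max_iter=100,  # with the default adam solver: at most 100 epochs
    random_state=0,
)
model.fit(X_train, y_train)

rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
```

scikit-learn may stop before 100 epochs if the training loss stops improving, and it may emit a convergence warning if it does not. A Keras model with an explicit epochs argument would run all 100.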

3.1 Inferences
In our analysis, linear regression did not perform well, with
overfitting being one of the causes. Random forest with cross-validation performed a little better.
The decision tree improved performance further, using the best hyperparameters found by grid search.
We also tested the XGBoost regressor to try gradient boosting and found its best
hyperparameters through grid search. Another drastic improvement came from applying Elastic Net with PCA,
where the dimensionality was reduced to 5, which reduced the complexity of the model; the combination of L1 and L2
regularization was applied to further prevent overfitting and obtain a much better fit on the test data.
Finally, the best performance was obtained with the neural network.
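
The Elastic Net with PCA setup, tuned by grid search, can be sketched as a scikit-learn pipeline. Only the reduction to 5 components and the use of GridSearchCV come from the text above. The synthetic data, the parameter grid, the 80/20 split, and the 5-fold setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 10 retained features (assumption).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 10))
y = X @ rng.uniform(0.5, 2.0, size=10) + rng.normal(0, 0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# PCA reduces the inputs to 5 components before the regularized regression.
pipe = Pipeline([
    ("pca", PCA(n_components=5)),
    ("enet", ElasticNet(max_iter=10_000)),
])

# alpha sets the overall penalty strength; l1_ratio mixes L1 (1.0) and L2 (0.0).
param_grid = {
    "enet__alpha": [0.001, 0.01, 0.1, 1.0],
    "enet__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(
    pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_train, y_train)

best = search.best_estimator_
train_rmse = float(np.sqrt(mean_squared_error(y_train, best.predict(X_train))))
test_rmse = float(np.sqrt(mean_squared_error(y_test, best.predict(X_test))))
```

Putting PCA inside the pipeline means it is refitted on each cross-validation training fold, so no information from the held-out fold leaks into the projection.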

2.2.2 Fitting the Model

First, our dataset will be split into training, validation and test sets in the proportions 75%, 15% and
10%, respectively, using the train_test_split function from sklearn.
We will compute all of our predictions with three machine learning techniques:
linear regression, neural networks and decision trees. All of these techniques will be fitted for
every desired prediction. The models will be tested with multiple hyperparameters and their
performance will be analysed.
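
Since train_test_split produces only two parts per call, a 75% / 15% / 10% split needs two calls. The sketch below shows one way to do it; the toy arrays and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples with 2 features each (assumption).
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

n = len(X)
n_test = round(0.10 * n)  # 10% of all samples
n_val = round(0.15 * n)   # 15% of all samples

# Step 1: hold out the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, random_state=42
)
# Step 2: carve the validation set out of the remainder; what is left
# (75% of the original data) becomes the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42
)
```

Passing absolute sample counts as test_size avoids having to rescale the validation fraction relative to the reduced set in the second call.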

2.2.3 Post prediction analysis

We will analyse the performance of each of these algorithms with multiple hyperparameters, and
their individual best performances will then be compared with each other. The algorithm with the best
performance will be used for plotting our predictions. Since this is a regression task, we will evaluate
our predictions using the R^2 score and the RMSE from sklearn's metrics module. Finally, we will produce
prediction plots for net profit and revenue for the year, country-wise profit and revenue predictions, etc.
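
As a small worked example of the evaluation step, the R^2 score and the RMSE reported for the models can be computed with sklearn as follows. The four revenue values are invented purely to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual and predicted revenues; every prediction is off by 10.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared error, in the same units as revenue.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # 10.0

# R^2: 1 - SS_res / SS_tot = 1 - 400 / 12500 = 0.968
r2 = float(r2_score(y_true, y_pred))
```

RMSE depends on the scale of the target, so it is only comparable between models evaluated on the same, identically scaled target. R^2 is scale-free.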

2.2.4 Novelty

Our dataset contains various features, for example the customer's age group, country, state, products
and their sub-categories, and their costs, filtered by date, month, and year. We will predict
the revenue and profit for each of these features, making our results more flexible, accurate, and detailed.
This will help e-commerce businesses and organizations get a detailed overview of their
revenue and the areas they need to work on.

4. CITATIONS
[1] Hsieh, P. H. (2019). A Study of Models for Forecasting E-Commerce Sales During a Price War
in the Medical Product Industry. HCI in Business, Government and Organizations. ECommerce and
Consumer Behavior, 3–21. https://doi.org/10.1007/978-3-030-22335-9_1

[2] Patangia, S. (2020). Sales Prediction of Market using Machine Learning. International
Journal of Engineering Research and Technology, 9(09). https://doi.org/10.17577/ijertv9is090345

[3] Wei, D., Geng, P., Ying, L., & Shuaipeng, L. (2014, May). A prediction study on e-commerce
sales based on structure time series model and web search data. The 26th Chinese Control and
Decision Conference (2014 CCDC). https://doi.org/10.1109/ccdc.2014.6852219
