
SARVAJANIK COLLEGE OF ENGINEERING AND TECHNOLOGY

INFORMATION AND TECHNOLOGY DEPARTMENT

B. E - VII, IT, SEM - 8


(Term : EVEN - 2022-23)

Presentation
on
“DATA ANALYTICS INTERN”

Subject Name : Internship (3181601)

Prepared and Presented by


Anuj Vaghani (Enrollment No : 190420116070)

Guided by
Prof. Apurva Bharat Mandalaywala
PRESENTATION OUTLINE
● Introduction
● Learning Data Science With Python - Libraries
● Methodology
● Machine Learning
● Outline of work
● Future Work
● References
● Conclusion
Introduction
● Background of the Internship and Company

My name is Anuj Vaghani and I am currently interning at Devotee Infotech Private Limited. During
my internship, I have been working with the data analytics and machine learning teams, and have
gained valuable insights into how these technologies can drive business success.

● Definition of Data Analytics and Machine Learning

Data analytics is the process of analysing and interpreting large sets of data to extract insights and
make informed decisions. Machine learning, on the other hand, is a subset of artificial intelligence that
enables computer systems to learn from data and improve their performance over time.
● Objective of Presentation
The objective of this presentation is to provide an overview of data analytics and machine learning, and
their importance in today's business landscape. I will discuss the key concepts, processes, and
techniques involved in data analytics and machine learning, as well as their applications in various
industries. Additionally, I will share my experiences working with the data analytics and machine
learning teams at Devotee Infotech Private Limited, and provide recommendations for how these
technologies can be leveraged to drive business success.
Learning Data Science With Python - Libraries

NumPy
NumPy is a powerful library for numerical computing in Python. It provides support for
multi-dimensional arrays, mathematical functions, and operations on arrays. Some of the key topics
that will be covered in this section include:
• Basics of NumPy arrays
• Array operations and calculations
• Mathematical functions and operations
• Random number generation

Pandas
Pandas is a library for data manipulation and analysis. It provides tools for working with
structured data, such as data frames and series, and supports a wide range of data formats. Some of
the key topics that will be covered in this section include:
• Basics of Pandas data frames and series
• Data manipulation and cleaning
• Data aggregation and summarization
• Merging and joining data frames
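
The topics listed above can be illustrated with a minimal sketch. The data, the column names (`customer_id`, `amount`, `city`), and the seed are made up for illustration and are not taken from the internship projects.

```python
import numpy as np
import pandas as pd

# NumPy: array creation, element-wise operations, and seeded random numbers
arr = np.array([[1, 2, 3], [4, 5, 6]])
doubled = arr * 2                      # element-wise multiplication
col_means = arr.mean(axis=0)           # mean of each column
rng = np.random.default_rng(seed=0)    # reproducible random generator
noise = rng.normal(size=3)             # three draws from a standard normal

# Pandas: cleaning, aggregation, and merging on made-up data
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, None, 30.0, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Surat", "Pune", "Surat"],
})

orders["amount"] = orders["amount"].fillna(0.0)                          # cleaning
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()   # aggregation
merged = totals.merge(customers, on="customer_id", how="left")           # joining
```

Here `merged` holds one row per customer with the total order amount and the city, which is the typical shape needed before any further analysis.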
Matplotlib
Matplotlib is a powerful library for data visualization in Python. It provides support for creating
a wide range of charts and plots, including line charts, scatter plots, histograms, and heatmaps.
Some of the key topics that will be covered in this section include:
• Basics of Matplotlib charts and plots
• Customizing charts and plots
• Adding labels and annotations
• Creating subplots and multiple charts

Seaborn
Seaborn is a library for data visualization that is built on top of Matplotlib. It provides a
higher-level interface for creating sophisticated and aesthetically pleasing visualizations. Some
of the key topics that will be covered in this section include:
• Basics of Seaborn charts and plots
• Customizing Seaborn visualizations
• Creating complex visualizations, such as heatmaps and violin plots
• Visualizing relationships between variables
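
As a rough illustration of subplots, labels, and annotations, the sketch below draws a line chart and a histogram side by side with Matplotlib. The data is synthetic, and the non-interactive `Agg` backend is an assumption made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
x = np.linspace(0, 10, 50)
y = 2 * x + 1

# One figure with two charts side by side
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Left: line chart with axis labels, a legend, and an annotation
ax_line.plot(x, y, label="y = 2x + 1")
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")
ax_line.set_title("Line chart")
ax_line.annotate("start", xy=(0, 1), xytext=(2, 15),
                 arrowprops={"arrowstyle": "->"})
ax_line.legend()

# Right: histogram of the same values
ax_hist.hist(y, bins=10)
ax_hist.set_title("Histogram")

fig.tight_layout()
```

Seaborn functions such as `heatmap` and `violinplot` draw onto Matplotlib axes in the same way, so they can be placed into subplots like these by passing `ax=`.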
Methodology
The methodology consists of the seven steps shown below.
(1) Introduction to Methodology
In order to effectively utilize data analytics and machine learning during my internship, I followed a
specific methodology to guide my work.
(2) Define Problem Statement
The first step was to clearly define the problem statement or objective that I wanted to achieve. This
involved identifying the business problem or opportunity and specifying the data sources and variables
that were relevant to the problem.
(3) Data Collection and Preparation
The next step was to collect and prepare the data for analysis. This involved identifying the relevant data
sources and extracting the data, cleaning and transforming the data, and ensuring that the data was ready
for analysis.
(4) Exploratory Data Analysis (EDA)
The next step was to conduct exploratory data analysis (EDA) to gain a better understanding of the data
and identify any patterns or anomalies. This involved using various statistical and visualization techniques
to explore the data and gain insights.
(5) Feature Selection and Engineering
The next step was to select and engineer the features that would be used for machine learning. This
involved identifying the relevant features and engineering them to improve their predictive power.
(6) Model Selection and Training
The next step was to select the appropriate machine learning model and train it on the prepared data.
This involved selecting the right algorithm, tuning the hyperparameters, and training the model using a
variety of techniques.
(7) Model Evaluation and Deployment
The final step was to evaluate the performance of the machine learning model and deploy it for use in the
real world. This involved evaluating the model's accuracy and performance, testing it on new data, and
deploying it in a way that could be easily integrated into the business process.
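
Steps (3) to (7) can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data is synthetic, the columns `hour`, `temperature`, and `demand` and the `is_daytime` feature are hypothetical, and scikit-learn's `LinearRegression` stands in for whichever model a real project would select.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Step 3: collect data (a synthetic stand-in, not the internship data)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "hour": rng.integers(0, 24, size=200),
    "temperature": rng.uniform(5, 35, size=200),
})
daytime = df["hour"].between(7, 19).astype(int)
df["demand"] = 3.0 * df["temperature"] + 20.0 * daytime + rng.normal(0, 2, size=200)
df.loc[::25, "temperature"] = np.nan   # simulate missing readings

# Step 3 (continued): clean the data by dropping incomplete rows
df = df.dropna().reset_index(drop=True)

# Step 4: exploratory data analysis, here the correlation of each column with demand
correlations = df.corr()["demand"]

# Step 5: feature engineering (hypothetical daytime flag derived from the hour)
df["is_daytime"] = df["hour"].between(7, 19).astype(int)

# Step 6: model selection and training
features = ["temperature", "is_daytime"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["demand"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Step 7: evaluation on data the model has not seen during training
test_r2 = r2_score(y_test, model.predict(X_test))
```

Holding out a test set before training is what makes the final score a fair estimate of how the model would behave on new data.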
Machine Learning
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
• Deep Learning
SUPERVISED LEARNING
• Common Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forests,
Support Vector Machines, Naive Bayes, k-Nearest Neighbours.
• Applications in Industry, such as Fraud Detection, Demand Forecasting, and Image Recognition
• Demo: Building a Regression Model to Predict Housing Prices Based on Features such as
Location, Square Footage, and Number of Bedrooms/Bathrooms.
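
A minimal sketch of such a regression demo is shown below. The listings and the pricing rule are invented for illustration, and one-hot encoding the `location` column with `pd.get_dummies` is one common choice rather than the method used in the original demo.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up listings; prices follow a simple rule so the fit is easy to check
homes = pd.DataFrame({
    "location": ["city", "suburb", "city", "suburb", "city", "suburb"],
    "sqft": [800, 1200, 1000, 1500, 1400, 900],
    "bedrooms": [2, 3, 2, 4, 3, 2],
})
homes["price"] = (100 * homes["sqft"]
                  + 5000 * homes["bedrooms"]
                  + 20000 * (homes["location"] == "city"))

# Encode the categorical location as a 0/1 column (location_suburb)
X = pd.get_dummies(homes[["location", "sqft", "bedrooms"]],
                   columns=["location"], drop_first=True, dtype=int)
y = homes["price"]
model = LinearRegression().fit(X, y)

# Predict the price of a new 1100 sq ft, 3-bedroom home in the city
new_home = pd.DataFrame({"sqft": [1100], "bedrooms": [3], "location_suburb": [0]})
predicted_price = model.predict(new_home[X.columns])[0]
# close to 145000, because the made-up prices follow the rule exactly
```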
UNSUPERVISED LEARNING
• Common Algorithms: k-Means Clustering, Hierarchical Clustering, Principal Component Analysis,
t-SNE.
• Applications in Industry, such as Customer Segmentation and Anomaly Detection
• Demo: Using k-Means Clustering to Segment Customer Data and Identify Groups with Similar
Behaviours and Characteristics.
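
The clustering demo could look roughly like the sketch below. The two customer groups are synthetic, the features (annual spend and visits per month) are hypothetical, and the choice of two clusters is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic customer groups: columns are [annual_spend, visits_per_month]
rng = np.random.default_rng(0)
low_spenders = rng.normal(loc=[200, 2], scale=[20, 0.5], size=(50, 2))
high_spenders = rng.normal(loc=[1000, 10], scale=[50, 1.0], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# Scale features so spend does not dominate the distance calculation
scaled = StandardScaler().fit_transform(customers)

# Fit k-means with two clusters and read back one label per customer
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_
```

Each entry of `labels` assigns a customer to a segment, which can then be profiled by averaging the original features per cluster.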
Outline of work
During my internship, I completed two major projects that together covered
the concepts described above.
(1) HOTEL BOOKING ANALYSIS
The purpose of this project was to analyse hotel booking data and gain insights into the factors that influence
hotel booking cancellations. The data was collected from a publicly available dataset on Kaggle, and the
analysis was performed using Python and its data analytics libraries. The project involved data cleaning and
pre-processing, exploratory data analysis, and data visualization to gain insights into the patterns and trends
in the data. The results of the analysis provide insights into the key factors that contribute to hotel booking
cancellations and offer recommendations to hotel operators to reduce cancellations and optimize revenue.
A. Data Collection and Preparation
• Identify Relevant Data Sources: Hotel Booking Dataset from Kaggle
• Collect Data and Store it in a Structured Manner
• Perform Data Cleaning and Pre-processing
• Exploratory Data Analysis
B. Data Modelling and Analysis
• Determine Relevant Variables and Features: Guest Demographics, Booking Details, Hotel
Information, etc.
• Choose Appropriate Modelling Techniques: Linear Regression, Random Forest Regression, etc.
• Evaluation of Models: Mean Squared Error, R-Squared Value, etc.
C. Results and Conclusion
• Insights Gained from Analysis: Key Drivers of Booking Cancellations, Popular Booking Channels,
etc.
• Potential Business Applications: Improve Booking Experience, Optimize Hotel Inventory
Management, etc.
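
The cleaning and exploration steps can be sketched on a tiny made-up table. The column names (`hotel`, `lead_time`, `children`, `market_segment`, `is_canceled`) are assumptions about the dataset's layout, and the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up bookings; the third and fourth rows are exact duplicates
bookings = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel"],
    "lead_time": [120, 5, 30, 30, 200],
    "children": [0.0, np.nan, 1.0, 1.0, 0.0],
    "market_segment": ["Online TA", "Direct", "Online TA", "Online TA", "Groups"],
    "is_canceled": [1, 0, 0, 0, 1],
})

# Cleaning: remove duplicate rows and fill missing child counts with zero
clean = bookings.drop_duplicates().copy()
clean["children"] = clean["children"].fillna(0).astype(int)

# Exploration: summary statistics and number of bookings per market segment
summary = clean.describe()
bookings_per_segment = clean["market_segment"].value_counts()
```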
OBSERVATION
'Direct' and 'Online TA' are contributing the most in both types of hotels. The Aviation segment
should focus on increasing the bookings of 'City Hotel'.
In conclusion, this project has successfully analysed the hotel booking dataset and
provided insights into the factors that contribute to booking cancellations. The analysis
revealed that the lead time between booking and arrival, the type of booking, and the
deposit type were the most significant factors contributing to cancellations. Moreover, it
was found that customers who book through online travel agencies are more likely to
cancel their bookings than those who book directly through the hotel website. The results
of this analysis can help hotel operators to optimize their booking processes, improve
customer experience, and reduce cancellations, ultimately leading to improved revenue
and profitability. Further research could be conducted to explore additional factors that
may influence hotel booking cancellations and to develop predictive models that can help
to forecast cancellations and adjust booking policies accordingly.
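
Comparisons of this kind are typically computed as cancellation rates per group. The sketch below shows the idea on invented rows; the column names (`distribution_channel`, `lead_time`, `is_canceled`) and the lead-time bands are assumptions, and the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented bookings for illustration only
bookings = pd.DataFrame({
    "distribution_channel": ["TA/TO", "TA/TO", "TA/TO", "Direct",
                             "Direct", "TA/TO", "Direct", "TA/TO"],
    "lead_time": [200, 150, 10, 5, 40, 300, 180, 20],
    "is_canceled": [1, 1, 0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 flag is the share of cancelled bookings in each group
cancel_rate_by_channel = (
    bookings.groupby("distribution_channel")["is_canceled"].mean()
)

# Bucket lead time (days between booking and arrival) into bands
bookings["lead_time_band"] = pd.cut(
    bookings["lead_time"], bins=[0, 30, 90, 365],
    labels=["0-30", "31-90", "91-365"],
)
cancel_rate_by_lead = (
    bookings.groupby("lead_time_band", observed=True)["is_canceled"].mean()
)
```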
(2) BIKE SHARING DEMAND PREDICTION
The purpose of this project was to predict the demand for bike sharing services based on historical data
using regression techniques. The data was collected from a publicly available dataset on Kaggle and the
analysis was performed using Python and its machine learning libraries. The project involved data cleaning
and pre-processing, feature engineering, model selection, and evaluation to develop an accurate regression
model that can predict bike demand. The results of the analysis provide insights into the key factors that
influence bike demand and offer a predictive model that can help bike-sharing operators to optimize their
service and improve customer satisfaction.
• Different algorithms gave different R2 scores:

LINEAR REGRESSION
The R2 score is 0.77, which means the model captures most of the variance in the data.

LASSO REGRESSION (L1 REGULARIZATION)
The R2 score is 0.40, which means the model does not capture most of the variance in the data.

RIDGE REGRESSION (L2 REGULARIZATION)
The R2 score is 0.77, which means the model captures most of the variance in the data.

ELASTIC NET REGRESSION
The R2 score is 0.62, which means the model captures most of the variance in the data.

Each score is saved in a data frame for later comparison.
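
A comparison like this can be set up as in the sketch below, which fits the four linear models and collects their test R2 scores in a data frame. The data is synthetic and the `alpha` and `l1_ratio` values are arbitrary assumptions, so the scores will not match the ones reported above. The R2 score returned by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: four features, the last one carries no signal
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([4.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Regularization strengths are illustrative, not tuned
models = {
    "Linear": LinearRegression(),
    "Lasso (L1)": Lasso(alpha=0.1),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit each model and store its test-set R2 score for later comparison
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({"model": name,
                 "r2_test": r2_score(y_test, model.predict(X_test))})
comparison = pd.DataFrame(rows)
```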
• No overfitting is seen.
• Random Forest Regressor and Gradient Boosting with GridSearchCV give the highest R2 scores:
99% and 95% respectively on the train set, and 92% on the test set.
• Feature importance values for Random Forest and Gradient Boosting differ.
• We can deploy this model.
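
A minimal sketch of hyperparameter tuning with `GridSearchCV` is shown below for a random forest. The data is synthetic and the parameter grid is a small assumed example, not the grid used in the project. Comparing the train and test scores is a simple way to check for overfitting, and `feature_importances_` gives the per-feature importance values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: the target depends on features 0 and 1, feature 2 is noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 3))
y = 10 * np.sin(np.pi * X[:, 0]) + 5 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2
)

# Small illustrative grid; every combination is scored with 3-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=2),
                      param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default
best_model = search.best_estimator_
train_r2 = best_model.score(X_train, y_train)   # R2 on training data
test_r2 = best_model.score(X_test, y_test)      # R2 on held-out data
importances = best_model.feature_importances_   # one value per feature
```

The same pattern applies to `GradientBoostingRegressor` by swapping the estimator and its grid.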
Future Work
DEEP LEARNING: Deep learning is a subfield of machine learning that uses artificial neural
networks to model and solve complex problems. With the increasing availability of large-scale data sets
and powerful computing resources, deep learning has the potential to transform many industries and
solve some of the world's most pressing problems.

REINFORCEMENT LEARNING: Reinforcement learning is a type of machine learning that focuses
on learning through trial and error. This approach has been used to develop sophisticated game-playing
algorithms, but has the potential to be applied to a wide range of fields, including robotics, finance, and
healthcare.

DATA VISUALIZATION: Data visualization is the art of communicating complex data through
visual representations. Future work in this area could focus on developing new techniques for
visualizing large-scale and high-dimensional data sets, as well as exploring the use of augmented and
virtual reality for data visualization.
References
Kaggle: https://www.kaggle.com/competitions

Google AI Residency Program: https://ai.google/education/research/ai-residency/

Microsoft AI Residency Program: https://www.microsoft.com/en-us/research/academic-program/microsoft-ai

IBM Data Science Elite Team: https://www.ibm.com/analytics/data-science-elite-team

Amazon Machine Learning Internship: https://www.amazon.jobs/en/teams/internships-for-students-machine-learning

Kaggle: Machine Learning Tutorials: https://www.kaggle.com/learn/machine-learning

Fast.Ai: Practical Deep Learning for Coders: https://course.fast.ai/


Conclusion
● Machine learning and data analytics have become essential tools for businesses and organizations of all
sizes and industries. They allow us to extract insights and knowledge from vast amounts of data, automate
repetitive tasks, and make better-informed decisions based on data-driven evidence.
● I have had the opportunity to work with a variety of machine learning algorithms and data analytics tools,
such as Python, TensorFlow, and Tableau. I have learned how to pre-process data, train models, and
visualize results, and I have gained a deep appreciation for the power and complexity of these
technologies.
● Overall, I am grateful for the opportunity to have worked on real-world projects in machine learning and
data analytics, and I am excited to continue learning and growing in these fields. Thank you for your
attention, and I am happy to answer any questions you may have.
Thank You!
