Internship Report Poorab
A Project Report
Submitted by:
of
BACHELOR OF TECHNOLOGY
IN
2024
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the summer training report entitled
“AI/ML INTERNSHIP”, in partial fulfillment of the requirements for the award of the degree of
“Bachelor of Technology” in the Department of Computer Science Engineering, and submitted to the
Department of Computer Science and Engineering, Amity School of Engineering & Technology, Amity
University, Rajasthan, is a record of my own training prepared under the guidance of
Name of Guide: -
ACKNOWLEDGEMENT
The history of all great works bears witness that no great work was ever done without the active or passive
support of a person's surroundings and close quarters. It is therefore not hard to see how much the active
assistance of seniors can positively influence the execution of a report. I am highly thankful to our
learned faculty for his active guidance throughout the completion of this project report. Last but not least, I
would also like to extend my appreciation to those who could not be mentioned here but played their
role well in inspiring and guiding me through the completion of my project report.
SEC-B
TABLE OF CONTENTS
Title
Abstract
Learning Objectives
Chapter 6: Conclusion
Bibliography
ABSTRACT
Artificial Intelligence and Machine Learning are revolutionizing industries by enabling systems to perform
tasks that typically require human intelligence, such as problem-solving, decision-making, and pattern
recognition. This report covers the concepts, tools, and methodologies of AI/ML, along with the projects I
completed to build practical knowledge of the field.
This report also explores real-world applications of AI/ML such as Breast Cancer Detection and Mail Spam
Detection. Moreover, it addresses ethical considerations, such as algorithmic bias and data privacy, which
are critical for ensuring responsible AI development.
Model Training: The process of fitting a machine learning algorithm to a dataset so that it learns the
patterns that map inputs to outputs. The resulting function is called the trained machine learning model.
Deployment: The process of making a trained ML model available for use in a production
environment. This involves integrating the model into an existing system or application so
that it can make predictions on new data.
Deep Learning: A type of artificial intelligence that uses artificial neural networks to teach
computers how to process data and solve complex problems.
Monitoring and Management: The practice of continuously tracking and evaluating the
performance of a deployed model in production, identifying potential issues such as data
drift, model degradation, or bias, and taking corrective action so that the model remains accurate
and reliable over time.
Natural Language Processing: A machine learning technology that gives computers the ability
to interpret, manipulate, and comprehend human language.
This report covers various aspects of artificial intelligence and machine learning and describes how the
projects helped me learn the algorithms needed to build and deploy a machine learning
project. I used Python libraries such as pandas, Matplotlib, NumPy, and seaborn to build my
projects. Beyond the tooling, correct analysis of the given dataset is also very important.
In an era defined by data-driven decision-making, AI/ML continues to push the boundaries of innovation,
offering solutions to some of humanity's most pressing challenges while raising important questions about
ethics, governance, and societal impact. This report aims to serve as a comprehensive guide for those
seeking to understand and contribute to this rapidly evolving domain.
LEARNING OBJECTIVES
The internship on AI/ML was a great learning opportunity: it gave me a deeper understanding of the
field, which will continue to help me in the future.
1. Understand the Fundamentals of AI/ML
Grasp the core concepts of Artificial Intelligence (AI) and Machine Learning (ML).
Learn the differences between supervised, unsupervised, and reinforcement learning.
Understand key algorithms like regression, classification, clustering, and neural networks.
CHAPTER 1
INTRODUCTION TO AI/ML
ARTIFICIAL INTELLIGENCE
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are
designed to think, reason, and solve problems like humans.
Reasoning and Decision-Making: Using logic to make informed decisions (e.g., fraud
detection systems).
MACHINE LEARNING
Machine learning is a subset of artificial intelligence. It focuses on using data and algorithms to enable
AI to imitate the way that humans learn, gradually improving its accuracy.
According to Arthur Samuel (1959), Machine learning is a field of study that gives computers the ability to
learn without being explicitly programmed.
2. UNSUPERVISED LEARNING: The data comes with inputs x but no output labels y, so the algorithm has
to find structure or patterns in the data on its own. It is called unsupervised learning because we do
not supervise the algorithm by giving it a right answer for every input; instead, we ask it to figure
out the structure by itself. For example, an unsupervised learning algorithm might decide that the
data can be assigned to two different groups, or clusters. An algorithm of this kind is called a
clustering algorithm because it places unlabeled data into different clusters, and it is used in many
applications.
Types of Unsupervised Learning:
Clustering: A clustering algorithm places unlabeled data into different clusters and is used in many
applications.
Example: Clustering is used in Google News. Each day, the clustering algorithm looks through the
hundreds and thousands of news articles on the internet, finds the articles that mention similar
words, and groups them into clusters. The algorithm works out on its own which words suggest
that certain articles belong in the same group; no Google News employee tells it which words to
look for. This is why clustering is a type of unsupervised learning.
Anomaly Detection: Used to find unusual data points. It is very important for fraud detection in
financial systems, where unusual transactions can be signs of fraud, and for many other applications.
Dimensionality Reduction: Compresses data so that it can be represented using fewer numbers.
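As a minimal sketch of the clustering idea described above (not the code used in the internship projects), the block below runs a tiny k-means loop on one-dimensional points using only the Python standard library. The function name kmeans_1d, the sample points, and the starting centroids are assumptions chosen for illustration; in practice a library class such as sklearn.cluster.KMeans would normally be used.

```python
from statistics import mean


def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around the given starting centroids (k-means)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Illustrative, made-up data: two visibly separate groups, no labels given.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points, which is what makes this an unsupervised method.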
3. REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to
maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an agent
learns to achieve a goal in an uncertain, potentially complex environment by performing actions and
receiving feedback through rewards or penalties.
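The agent, action, and reward loop described above can be sketched with a very small example. The block below is an illustrative assumption, not part of the internship work: an epsilon-greedy agent repeatedly chooses between two actions whose rewards are fixed (0.2 and 0.8, values made up for the example) and keeps a running estimate of how good each action is.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose actions pay fixed rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # agent's estimate of each action's value
    counts = [0] * len(rewards)
    total = 0.0
    for step in range(steps):
        if step < len(rewards):
            action = step  # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore a random action
        else:
            # Exploit: pick the action with the highest estimated value.
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Incremental average of the rewards seen so far for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, total


estimates, total_reward = run_bandit(rewards=[0.2, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, best_action)
```

The agent is never told which action is correct; it discovers the better one only from the rewards it receives, which is the difference from supervised learning noted above.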
Applications of AI/ML
1. Healthcare:
Early disease diagnosis using medical imaging.
Personalized treatment plans through predictive analytics.
2. Finance:
Fraud detection and algorithmic trading.
Credit scoring and risk assessment.
3. Transportation:
Route optimization and autonomous vehicles.
4. Education:
AI-driven adaptive learning platforms for personalized education.
5. Retail:
Inventory management and demand forecasting.
Recommendation engines for e-commerce.
Challenges of AI/ML
1. Bias in AI Models:
Training models on biased datasets can result in discriminatory outcomes.
2. Data Privacy:
Ensuring user data is protected in AI applications.
3. Explainability:
Understanding how complex models like deep neural networks make decisions.
4. Job Displacement:
Automating roles traditionally performed by humans can lead to workforce disruption.
CHAPTER 2
METHODOLOGIES USED
The methodologies in AI/ML projects encompass a systematic approach to handle data, select models,
and evaluate performance.
FEATURE ENGINEERING
Feature engineering is crucial for improving model performance by identifying and creating the most
relevant features.
1. Feature Selection:
Using correlation matrices to identify highly correlated variables.
Eliminating irrelevant or redundant features to reduce dimensionality.
2. Dimensionality Reduction:
Applying Principal Component Analysis (PCA) or t-SNE to reduce the number of features while
preserving essential information.
3. Feature Scaling:
Normalizing features to fit within a specific range (e.g., 0 to 1).
Standardizing data to have a mean of 0 and a standard deviation of 1.
4. Feature Creation:
Generating new features by combining or transforming existing ones. For instance, creating an
"Age Group" feature from continuous "Age" data.
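The feature scaling and feature creation steps listed under Feature Engineering can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the sample ages, and the age-group cut-offs (18 and 60) are assumptions made for the example. In practice, scikit-learn's MinMaxScaler and StandardScaler perform the two scaling steps.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalize values linearly into the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def age_group(age):
    """Create a categorical feature from a continuous age (cut-offs are illustrative)."""
    if age < 18:
        return "child"
    if age < 60:
        return "adult"
    return "senior"


ages = [10, 20, 30, 40, 60]  # made-up sample values
scaled = min_max_scale(ages)
standardized = standardize(ages)
groups = [age_group(a) for a in ages]
print(scaled)
print(groups)
```

Normalization keeps the shape of the distribution but fixes its range, while standardization fixes its mean and spread; which one is appropriate depends on the model being trained.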
GridSearchCV for exhaustive search.
RandomizedSearchCV for faster tuning.
4. Regularization:
Applying techniques like L1 (Lasso) and L2 (Ridge) to avoid overfitting.
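To show how L2 (Ridge) regularization discourages large weights, the sketch below fits a single weight w for the model y ≈ w·x with and without a penalty. It assumes one feature and no intercept so that the closed-form solution w = Σxy / (Σx² + λ) can be written in a few lines; the function name ridge_weight and the data are made up for the example, and scikit-learn's Ridge (which fits an intercept by default) would be the usual tool.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form weight for y ~ w * x with an L2 penalty lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # made-up data lying exactly on y = 2x

w_plain = ridge_weight(xs, ys, lam=0.0)   # ordinary least squares
w_ridge = ridge_weight(xs, ys, lam=14.0)  # penalized fit
print(w_plain, w_ridge)
```

With no penalty the fit recovers the true slope; with the penalty the weight is pulled toward zero, trading a little training accuracy for a simpler model that is less prone to overfitting.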
MODEL EVALUATION
After training, the model’s performance is evaluated using various metrics.
1. Evaluation Metrics:
For Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC Curve.
For Regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared
Error (RMSE), $R^2$.
For Clustering: Silhouette Score, Davies-Bouldin Index.
2. Validation:
Using cross-validation techniques like k-fold cross-validation to ensure consistent performance
across different data subsets.
3. Error Analysis:
Analysing the errors or misclassifications to identify model weaknesses.
Using confusion matrices for classification problems to understand false positives/negatives.
4. Comparison:
Comparing multiple models to choose the one with the best performance.
Plotting learning curves to analyse training and validation performance over epochs.
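The k-fold cross-validation mentioned under Validation can be sketched as an index-splitting routine. The block below is an illustration under simple assumptions (contiguous folds, no shuffling, hypothetical function name k_fold_indices); scikit-learn's KFold class provides the same idea with more options.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once and for training in the remaining folds, so the averaged score reflects performance across different data subsets rather than one lucky split.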
MODEL DEPLOYMENT
If the project involves real-world application, the trained model is deployed.
1. Saving the Model:
Using frameworks like TensorFlow or scikit-learn to export the model in formats such as .h5 or .pkl.
2. Deployment Tools:
Deploying the model using Flask/Django APIs.
Cloud services like AWS SageMaker, Google AI Platform, or Azure ML.
3. Integration: Connecting the model with front-end interfaces or mobile applications for interaction.
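The save-and-reload step behind a .pkl file can be sketched with Python's built-in pickle module. The example below is a simplified assumption: the "model" is just a dictionary holding one learned threshold, and the predict function is hypothetical. A real project would pickle the fitted estimator object itself and load it inside the Flask or Django request handler.

```python
import pickle


def predict(params, values):
    """Hypothetical model: label 1 when the input exceeds the learned threshold."""
    return [1 if v > params["threshold"] else 0 for v in values]


# Stands in for parameters learned during training (value is made up).
trained_params = {"threshold": 0.5}

# Serialize to bytes; writing these bytes to disk would give a .pkl file.
blob = pickle.dumps(trained_params)

# Later, in the deployed application, restore the model and serve predictions.
restored_params = pickle.loads(blob)
predictions = predict(restored_params, [0.2, 0.9])
print(predictions)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.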
CHAPTER -3
TOOLS AND FRAMEWORKS USED
AI/ML projects require a combination of programming languages, libraries, frameworks, and tools for data
preprocessing, model building, evaluation, and deployment.
PYTHON:
Primary language used for all tasks in the internship.
Features extensive libraries like NumPy, pandas, and Matplotlib for data analysis and visualization.
Easy-to-use syntax and compatibility with popular AI/ML frameworks.
R:
Occasionally used for statistical analysis and visualization.
Preferred for exploratory data analysis in specific projects.
scikit-learn (sklearn):
Essential for implementing traditional machine learning algorithms like Logistic Regression,
Decision Trees, and Random Forests.
Tools for preprocessing (e.g., StandardScaler, OneHotEncoder), feature selection, and
hyperparameter tuning (e.g., GridSearchCV).
Easy-to-integrate metrics like accuracy, precision, and recall.
1. TensorFlow:
Used for creating deep learning models, particularly Convolutional Neural Networks (CNNs) for
image recognition tasks.
Features the Keras API for simplified model building.
Includes TensorBoard for real-time visualization of training metrics.
2. PyTorch:
Preferred for research-oriented tasks due to its flexibility and dynamic computation graph.
Used to build Recurrent Neural Networks (RNNs) for Natural Language Processing (NLP) tasks.
Compatible with deployment frameworks like TorchServe.
1. TensorBoard:
Visualized loss and accuracy metrics during training.
Analyzed network structures and gradients.
2. PyCharm/Jupyter Notebooks:
PyCharm: IDE used for debugging and writing modular Python code.
Jupyter Notebooks: Interactive environment for running code cells and visualizing outputs
simultaneously.
CHAPTER -4
INTERNSHIP DISCUSSION
This internship taught me a great deal about AI/ML and many related aspects of the domain. Demand for
these skills is expected to grow, and learning the domain in depth requires more work on real-time
projects, which draw on many libraries, programming languages, and frameworks to build and deploy
an AI model.
Professional Skills:
Collaboration:
Worked in teams with domain experts and software developers.
Participated in brainstorming sessions to identify AI-driven solutions for real-world problems.
Enrolled in a machine learning course that taught me entirely new things about the domain and
helped me in building my projects.
Developers and users of AI systems should be held accountable for the decisions made by their
algorithms.
AI systems should be developed and used in a way that respects user privacy and protects sensitive
data.
Communication:
Presented project results through dashboards, reports, and visualizations.
I also gave a presentation on the topic to the team, which improved my communication skills.
Learned to communicate with the many people working in my team, which boosted my confidence.
Learned to explain technical concepts to non-technical stakeholders.
When I had questions about the project, I clarified my doubts promptly, and no communication
barrier held me back.
Time Management:
Managed deadlines for project milestones and deliverables efficiently.
Balanced learning new tools while meeting project requirements.
Keeping everything organized and meeting the project deadline was a challenge for me, but I still
managed to keep everything on track.
RESULTS/OBSERVATIONS/WORK EXPERIENCES GAINED IN THE
INTERNSHIP COMPANY
1. RESULTS ACHIEVED:
Developed a breast cancer detection model using the provided dataset, achieving 90% accuracy.
Built a house price predictor using regression analysis, which improved targeted marketing
strategies by 40%.
Designed a mail spam detector using the given dataset, which produced accurate results.
Developed an IPL score predictor using Logistic Regression, achieving 92% accuracy.
I was also certified for my contribution to the project, which motivated me further.
Developed robust data cleaning pipelines to handle missing values, outliers, and inconsistencies.
Implemented data imputation techniques to fill in missing data effectively.
Engineered new features to improve model performance and explainability.
Experimented with various ML algorithms (e.g., decision trees, random forests, support vector
machines, neural networks) to identify the most suitable model for the task.
Fine-tuned hyperparameters to optimize model performance.
Implemented techniques like regularization and early stopping to prevent overfitting.
Evaluated model performance using relevant metrics (e.g., accuracy, precision, recall, F1-score,
AUC-ROC) and visualized results.
Deployed models into production environments, such as cloud platforms or web applications.
Monitored model performance in real-world settings and retrained as needed.
Collaborated with cross-functional teams (e.g., software engineers, product managers) to ensure
smooth project execution.
Contributed to a positive and collaborative work environment.
WORK EXPERIENCES
Data Cleaning and Preprocessing:
o Handled missing data using techniques like imputation or deletion.
o Addressed outliers and anomalies in the data.
o Normalized and standardized numerical features.
o Engineered new features to improve model performance.
Feature Engineering:
o Extracted relevant features from raw data.
o Created feature interactions and transformations.
o Selected optimal features using techniques like feature importance and correlation analysis.
Data Visualization:
o Used libraries like Matplotlib, Seaborn etc. to visualize data distributions, trends, and
patterns.
o Created interactive visualizations to explore data insights.
Model Selection:
o Evaluated different ML algorithms (e.g., linear regression, logistic regression, decision
trees, random forests, support vector machines, neural networks) for suitability to the
problem.
o Considered factors like model complexity, interpretability, and performance metrics.
Hyperparameter Tuning:
o Used techniques like grid search, random search, or Bayesian optimization to find optimal
hyperparameters.
o Monitored training progress and adjusted hyperparameters as needed.
Model Training:
o Implemented training pipelines using frameworks like TensorFlow or PyTorch.
o Addressed overfitting and underfitting issues through regularization techniques.
o Utilized techniques like early stopping to prevent overtraining.
Model Evaluation:
o Assessed model performance using appropriate metrics (e.g., accuracy, precision, recall,
F1-score, AUC-ROC).
o Created confusion matrices and ROC curves to visualize model performance.
Model Deployment:
o Deployed models to production environments (e.g., cloud platforms, web applications,
mobile apps).
o Implemented monitoring systems to track model performance and detect degradation.
o Retrained models periodically to adapt to changes in data distribution or requirements.
Team Meetings and Discussions:
o Actively participated in team meetings to discuss project progress, challenges, and
solutions.
o Collaborated with team members to brainstorm ideas and share knowledge.
Code Reviews and Feedback:
o Conducted code reviews to improve code quality, readability, and efficiency.
o Incorporated feedback from team members to refine code and algorithms.
Version Control:
o Used Git to manage code versions and collaborate effectively with team members.
Novel Approaches:
o Explored innovative techniques and algorithms to address challenging problems.
Data-Driven Insights:
o Uncovered valuable insights from data to inform business decisions.
Prototyping and Experimentation:
o Rapidly prototyped and iterated on ML models.
Collaboration and Knowledge Sharing:
o Worked collaboratively with team members to share knowledge and solve problems.
CHAPTER -5
PROJECT DETAILS
Dataset Description
For the diabetes prediction project, I received a dataset as a .csv file with the columns
Pregnancies, Glucose, Blood Pressure, Insulin, BMI, Age, Diabetes Pedigree Function, and Outcome.
The values in its rows gave a rough idea of the range between the highest and lowest stages of
diabetes. I used those values to predict how persistent diabetes could become in the future, and
the model reached 91% accuracy, which suggests that my analysis of this dataset was reasonably
sound.
1. Classification Metrics
These metrics are used when the output variable is categorical.
Accuracy:
Definition: The ratio of correctly predicted instances to the total instances.
Formula: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Use Case: Good for balanced datasets but can be misleading if classes are imbalanced.
Precision:
Definition: The ratio of true positive predictions to the total predicted positives.
Formula: $\text{Precision} = \frac{TP}{TP + FP}$
Use Case: Important in scenarios where false positives are costly (e.g., spam detection).
Recall (Sensitivity):
Definition: The ratio of true positive predictions to the total actual positives.
Formula: $\text{Recall} = \frac{TP}{TP + FN}$
Use Case: Crucial when missing a positive instance is costly (e.g., disease detection).
F1 Score:
Definition: The harmonic mean of precision and recall, providing a balance between the two.
Formula: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
Use Case: Useful in imbalanced datasets where both false positives and false negatives
matter.
ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
Definition: A graphical representation of a model's diagnostic ability across various
threshold settings.
Interpretation: AUC value ranges from 0 to 1, where a value closer to 1 indicates better
performance.
Use Case: Helps compare classifiers and understand trade-offs between true positive rates
and false positive rates.
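The accuracy, precision, recall, and F1 formulas above can be checked with a short worked example. The block below is a sketch using only the Python standard library; the labels are made up for illustration and the function names are assumptions. The equivalent scikit-learn functions are accuracy_score, precision_score, recall_score, and f1_score.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Apply the four formulas; a ratio with a zero denominator is reported as 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 3, TN = 5, FP = 1, and FN = 1, so accuracy is 8/10 while precision and recall are both 3/4, which shows how accuracy can look better than the per-class ratios when negatives dominate.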
2. Regression Metrics
These metrics are used when the output variable is continuous.
Mean Absolute Error (MAE):
Definition: The average of absolute differences between predicted and actual values.
Formula: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Use Case: Provides a straightforward interpretation of prediction error.
Root Mean Squared Error (RMSE):
Definition: The square root of MSE, providing error in the same units as the target variable.
Formula: $\text{RMSE} = \sqrt{\text{MSE}}$
Use Case: Commonly used to interpret model accuracy in practical terms.
R-squared ($R^2$):
Definition: A statistical measure that represents the proportion of variance for a dependent
variable that's explained by an independent variable(s).
Interpretation: Typically ranges from 0 to 1, with higher values indicating better model fit; it can be negative when the model fits worse than always predicting the mean.
Use Case: Useful for understanding how well independent variables explain the variability
of the dependent variable.
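The regression metrics above can be worked through on a small made-up example. The sketch below assumes the usual definitions MSE = (1/n) Σ (y_i − ŷ_i)² and R² = 1 − SS_res / SS_tot; the function name and data are illustrative. scikit-learn offers mean_absolute_error, mean_squared_error, and r2_score for the same purpose.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

The errors are 1, 0, −1, and −2, so MAE is 1.0 and MSE is 1.5; RMSE is larger than MAE because squaring weights the single error of 2 more heavily.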
3. Clustering Metrics
Metrics used to evaluate clustering algorithms.
Silhouette Score:
Definition: A measure of how similar an object is to its own cluster compared to other
clusters.
Range: Values range from -1 to +1; higher values indicate better-defined clusters.
Davies-Bouldin Index:
Definition: A ratio of within-cluster distances to between-cluster distances; lower values
indicate better clustering.
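To make the Silhouette Score concrete, the sketch below computes it for four one-dimensional points under two different cluster assignments. It assumes absolute difference as the distance, at least two clusters, and no duplicate points; the function names and data are made up for illustration. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to any other cluster, and the score is (b − a) / max(a, b).

```python
def mean_distance(p, points, labels, cluster):
    """Mean absolute distance from p to every point assigned to `cluster`."""
    dists = [abs(p - q) for q, lab in zip(points, labels) if lab == cluster]
    return sum(dists) / len(dists)


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == label and j != i]
        if not same:
            scores.append(0.0)  # convention: a single-point cluster scores 0
            continue
        a = sum(same) / len(same)
        b = min(mean_distance(p, points, labels, other)
                for other in set(labels) if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]               # two obvious groups
good = silhouette(points, [0, 0, 1, 1])       # neighbours grouped together
bad = silhouette(points, [0, 1, 0, 1])        # neighbours split apart
print(good, bad)
```

The sensible grouping scores close to +1, while the assignment that splits neighbours apart scores below 0, matching the range described above.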
identify patterns or common characteristics among misclassified examples.
Root Cause Analysis: Investigate potential reasons for errors, such as data quality issues, feature
selection problems, or model complexity. This can lead to actionable insights for model refinement.
4. Feature Importance Assessment
Evaluating Feature Contributions: Use techniques like permutation importance or SHAP values
to assess the contribution of each feature to the model's predictions. This helps in understanding
which features drive model decisions and may guide future feature engineering efforts.
Dimensionality Reduction Insights: If dimensionality reduction techniques (like PCA) were used,
analyze how features were transformed and which components contribute most to variance.
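Permutation importance, mentioned above, can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's accuracy drops. The block below is an illustration under stated assumptions, not the analysis performed in the project. The model is a hypothetical rule that looks only at feature 0, the rows are made up, and a fixed random seed is used. scikit-learn provides sklearn.inspection.permutation_importance for real estimators.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for f in range(n_features):
        column = [r[f] for r in rows]
        rng.shuffle(column)  # break the link between this feature and the label
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


def threshold_model(row):
    """Hypothetical model that only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


rows = [[0.1, 5.0], [0.2, 3.0], [0.8, 4.0], [0.9, 1.0], [0.3, 2.0], [0.7, 6.0]]
labels = [threshold_model(r) for r in rows]
importances = permutation_importance(threshold_model, rows, labels, n_features=2)
print(importances)
```

Shuffling feature 1 never changes the predictions, so its importance is exactly zero, whereas shuffling feature 0 can only lower the accuracy; a feature the model ignores is revealed this way.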
5. Model Robustness Testing
Testing Under Adverse Conditions: Evaluate how well the model performs under various
conditions, such as noisy data or out-of-distribution samples. This helps assess the robustness and
generalizability of the model.
Cross-Validation Results: Review cross-validation results to ensure that the model performs
consistently across different subsets of data.
6. Visualization of Results
Graphical Representations: Utilize visualizations such as ROC curves, precision-recall curves,
and learning curves to provide intuitive insights into model performance.
Data Distribution Visuals: Create plots (e.g., histograms or box plots) to visualize data
distributions before and after modeling, highlighting shifts that may impact results.
7. User Feedback Integration
Incorporating User Insights: Gather feedback from end-users regarding model predictions and
usability. This qualitative data can provide context that quantitative metrics may overlook.
Iterative Improvement Process: Use user feedback as part of an iterative process for continuous
improvement of the model.
8. Deployment Considerations
Real-World Performance Monitoring: Once deployed, set up systems to monitor the model's
performance in real time against live data. This helps identify drift or degradation over time.
Scalability Assessment: Analyze how well the model scales with increased data volume or user
load, ensuring it remains efficient and effective as usage grows.
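A very simple form of the drift monitoring described above is to compare the mean of a feature in live data with its mean in the training data. The sketch below is one possible check, not the monitoring used in the project: the function names, the threshold of 2 standard deviations, and the glucose values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def drift_score(reference, live):
    """How many reference standard deviations the live mean has moved."""
    return abs(mean(live) - mean(reference)) / stdev(reference)


def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return drift_score(reference, live) > threshold


# Illustrative values only, not real patient data.
training_glucose = [90, 100, 110, 120, 130]
stable_batch = [95, 105, 115, 125]
shifted_batch = [170, 180, 190, 200]

print(has_drifted(training_glucose, stable_batch))
print(has_drifted(training_glucose, shifted_batch))
```

A flagged batch would be a signal to investigate the incoming data and, if the shift is genuine, to retrain the model on more recent data.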
CHAPTER -6
CONCLUSION
The field of Artificial Intelligence (AI) and Machine Learning (ML) has emerged as a [...]
processes and operational efficiencies. This report has explored the fundamental concepts of
AI/ML, including their methodologies, applications, and the underlying algorithms that drive
their functionality. Through comprehensive analysis, it is evident that AI/ML technologies are
not only reshaping traditional business models but also creating new avenues for
innovation.
One of the key findings of this report is the importance of data quality in training
effective AI/ML models. Clean, well-structured data is essential for achieving high accuracy
and reliability in predictions. Additionally, the report highlights the necessity of employing
[...] and F1 score for classification problems, or MAE and RMSE for regression tasks, to ensure a
[...] AI/ML deployment cannot be overstated. Issues such as data privacy, algorithmic bias, and
transparency must be addressed to foster trust among users and stakeholders. The report
emphasizes the need for continuous monitoring and evaluation of deployed models to mitigate
risks associated with model drift and ensure sustained performance over time.
In conclusion, while AI/ML technologies hold immense potential for driving progress and efficiency, their
[...] ongoing research and development will be crucial to harness their full capabilities while [...]
BIBLIOGRAPHY
https://www.ibm.com/topics/machine-learning
https://www.ibm.com/artificial-intelligence
Methodologies used –
https://www.javatpoint.com/machine-learning-techniques
Project details-
https://colab.research.google.com/drive/1Cz5hf5mncvSJIFwrFvhFNaOvImDvqQmV
GitHub link –
https://github.com/PoorabSumanth1234/2506myprojects