
INDUSTRIAL TRAINING REPORT

Automatic Action Recognition Using Deep Learning

Submitted By
Name: Deepanshu Nirvan
University Roll No. 2000321530044

SUBMITTED TO:

Department of Computer Science & Engineering (AIML)

ABES ENGINEERING COLLEGE

GHAZIABAD

DECLARATION

I hereby declare that the Industrial Training Report entitled “Automatic Action
Recognition Using Deep Learning” is an authentic record of my own work, carried
out as part of the requirements of Industrial Training during the period from
15 July 2023 to 30 August 2023, for the award of the degree of B.Tech. in
Computer Science & Engineering (AI & ML), ABES Engineering College, Ghaziabad,
under the guidance of Mr. Vikas Chaudhary.

Signature of Student

Deepanshu Nirvan
2000321530044
Date:…………….

CERTIFICATE

This is to certify that Mr. Deepanshu Nirvan has completed Industrial Training during the period from 15
July 2023 to 30 August 2023 in our Organization/Industry in partial fulfillment of the Degree of Bachelor
of Technology in Computer Science & Engineering (AIML). He was trained in the field of Machine
Learning.

Signature of

Mrs. Deepali Dev

HOD (CSE -AIML)

Date: …………….

ACKNOWLEDGEMENT

I would like to convey my sincere thanks to Mr. Vikas Chaudhary for providing
motivation, knowledge, and support throughout the course of the project. His
continuous support was instrumental in the successful completion of the
project, and the knowledge he shared has been invaluable.

I would also like to extend special thanks to the Department of CSE-AIML for
its continuous support and for providing the opportunities needed to complete
this project.

Signature of student

Deepanshu Nirvan

2000321530044

TABLE OF CONTENTS

CHAPTER NO.   CHAPTER NAME                                   PAGE NO.

              DECLARATION                                    2
              CERTIFICATE                                    3
              ACKNOWLEDGEMENT                                4
1             INTRODUCTION                                   6
2             REVIEW OF LITERATURE AND FEASIBILITY STUDY     9
3             PROPOSED METHODOLOGY                           11
4             FUTURE SCOPE AND USE CASES                     14
5             CONCLUSION                                     16
6             REFERENCES                                     17

1. INTRODUCTION

Human action recognition (HAR) is a challenging yet crucial task in computer
vision, with applications ranging from video surveillance and human-computer
interaction to healthcare and sports analytics. The ability to automatically
recognize and interpret human actions from videos holds immense potential for
various domains. Traditionally, HAR has relied on handcrafted features and
machine learning algorithms, which have shown limited success in capturing
the complex temporal and spatial dynamics of human actions. In recent years,
deep learning has emerged as a powerful tool for HAR, revolutionizing the
field with its ability to learn complex representations from raw data.
Convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) have become the dominant architectures for HAR, demonstrating
remarkable performance on various datasets and scenarios. CNNs excel at
extracting spatial features from images or frames of videos, while RNNs
effectively capture temporal dependencies in action sequences.
The combination of CNNs and RNNs has led to the development of
sophisticated HAR architectures, such as two-stream networks and
convolutional long short-term memory (ConvLSTM) networks. These
architectures leverage the strengths of both CNNs and RNNs, achieving
state-of-the-art performance in HAR.
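
As an illustration of the ConvLSTM idea, the following minimal sketch builds a
small clip classifier in Keras. The 16-frame clip length, 64x64 frame size,
layer widths, and number of classes are illustrative assumptions, not values
taken from this report.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # assumption: depends on the target dataset

model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),     # clip of 16 RGB frames, 64x64 each
    layers.ConvLSTM2D(32, kernel_size=3, return_sequences=True),
    layers.BatchNormalization(),
    layers.ConvLSTM2D(32, kernel_size=3),    # last layer keeps only the final state
    layers.GlobalAveragePooling2D(),         # collapse the spatial map to a vector
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
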
Despite significant advances, HAR remains a challenging task due to variations
in appearance, lighting, and background conditions. Additionally, the
complexity of human actions and the need for real-time processing pose further
challenges. Ongoing research is focused on addressing these challenges and
developing more robust, efficient, and real-time HAR systems.
This report delves into the application of deep learning for HAR, exploring the
underlying principles, architectures, and recent advancements. We present a
comprehensive overview of deep learning-based HAR methods, highlighting
their strengths, limitations, and applications. We also discuss the challenges
and future directions of HAR research.

1.1 PROBLEM STATEMENT

Human Action Recognition (HAR) is a fundamental pursuit in computer vision,
playing a pivotal role across diverse sectors such as video surveillance,
healthcare, and sports analytics. Traditionally, HAR relied on manually crafted
features and machine learning techniques, yet struggled to capture the intricate
temporal and spatial dynamics inherent in human actions. However, the advent of
deep learning has revolutionized HAR by harnessing the capabilities of
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs). CNNs adeptly extract spatial features from images or video frames,
while RNNs effectively model temporal dependencies within action sequences.
This synergy has given rise to sophisticated HAR architectures, notably two-
stream networks and Convolutional Long Short-Term Memory (ConvLSTM)
networks, showcasing state-of-the-art performance.

Despite these advancements, HAR encounters challenges due to varying lighting
conditions, diverse appearances, and the demand for real-time processing.
Addressing these hurdles remains a focus of ongoing research, aiming to fortify
HAR systems with robustness and efficiency. This report delves into the realm of
deep learning for HAR, investigating underlying principles, architectural
intricacies, recent progressions, limitations, and practical applications.
Additionally, it examines the persisting challenges in HAR, outlining avenues for
future research directions to bolster the resilience and real-time capabilities of
HAR systems in varied contexts and environments.

2. REVIEW OF LITERATURE

Deep learning, a technology enabling machines to learn patterns from data,
has greatly influenced action recognition in videos. Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs) stand as prominent
architectures in this domain. CNNs excel in image understanding, while RNNs
specialize in sequence analysis.

Methods in Action Recognition:

1. CNN-based Approaches:
CNNs, adapted for video analysis, demonstrate proficiency in extracting
spatial and temporal features from frames, enabling a detailed understanding
of motion patterns.
Benefits: Efficient at capturing fine-grained details and motion characteristics.
Challenges: High computational demands; limitations in processing
prolonged video sequences.

2. RNN-based Approaches:
RNNs, particularly Long Short-Term Memory (LSTM) or Gated Recurrent
Unit (GRU) networks, excel at modeling temporal dependencies, which is
beneficial for recognizing actions that evolve over time (a minimal sketch of
such a model follows this list).
Benefits: Proficient at understanding temporal sequences and long-range
dependencies.
Challenges: Complexity in training; difficulty in capturing extensive temporal
contexts.

3. Hybrid Architectures:
Hybrid models, merging CNNs and RNNs, aim to leverage both spatial and
temporal information to achieve a more comprehensive understanding of actions.
Benefits: Combined strengths in capturing static and dynamic features.
Challenges: Increased complexity, potential information redundancy, and
intricate model optimization.
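
To make the RNN-based approach concrete, the sketch below stacks GRU layers
over per-frame feature vectors (for example, ones produced by a pretrained
CNN). The 30-frame clip length, 2048-dimensional features, and class count
are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # assumption: depends on the target dataset

rnn = models.Sequential([
    layers.Input(shape=(30, 2048)),          # 30 frames, one feature vector each
    layers.GRU(256, return_sequences=True),  # hidden states for every frame
    layers.GRU(128),                         # final state summarises the clip
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
rnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])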

Challenges and Limitations:


Varied Perspectives: Models encounter difficulty when actions are seen from
diverse viewpoints or are partially occluded, impacting recognition accuracy.
Environmental Adaptation: Some models struggle with generalizing across
different settings or recognizing novel actions due to limited training data.
Computational Overhead: Deep learning models demand substantial
computational resources, hindering real-time deployment in resource-
constrained scenarios.

3. PROPOSED METHODOLOGY
The proposed framework for action recognition constitutes a sophisticated
integration of Convolutional Neural Networks (CNNs) and Long Short-Term
Memory (LSTM) networks, meticulously designed to holistically capture and
process both spatial and temporal features ingrained within video data. This
system aims to elevate the accuracy and efficacy of action recognition by
synergizing the unique strengths of CNNs and LSTMs.

In the realm of visual analysis, CNNs play a foundational role in dissecting the
visual components encapsulated within individual frames of the video. These
networks excel in discerning intricate visual attributes such as shapes, patterns,
and textures. By adeptly identifying and interpreting these essential visual cues,
CNNs pave the way for a nuanced understanding of distinct actions, thereby
laying the groundwork for subsequent recognition processes.
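
One common way to realise this per-frame visual analysis is to run a
pretrained CNN backbone over every frame. The report does not fix a specific
backbone, so the MobileNetV2 network and 224x224 input size used in this
sketch are assumptions.

import numpy as np
import tensorflow as tf

# Pretrained backbone with its classification head removed; average pooling
# turns each frame into a single 1280-dimensional feature vector.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet",
    input_shape=(224, 224, 3))

def extract_frame_features(frames: np.ndarray) -> np.ndarray:
    """frames: array of shape (num_frames, 224, 224, 3), RGB values in [0, 255]."""
    x = tf.keras.applications.mobilenet_v2.preprocess_input(frames.astype("float32"))
    return backbone.predict(x, verbose=0)    # shape: (num_frames, 1280)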

Complementing the visual analysis, LSTM networks contribute their expertise in
comprehending temporal sequences. Renowned for their capacity to retain and
interpret information across sequential frames, LSTMs are instrumental in
decoding the temporal progression of actions within video sequences. Their
capability to understand the sequential order and evolution of actions fortifies the
system's ability to perceive and recognize actions more accurately in a temporal
context.

The fusion of CNNs and LSTMs within this framework fosters a symbiotic
relationship between spatial and temporal comprehension. While CNNs
meticulously capture and encode visual intricacies within each frame, LSTMs
adeptly navigate and analyze these frames in a sequential continuum. This
collaboration empowers the system to synthesize holistic representations of
actions, leveraging both visual information and temporal context for more
nuanced and accurate recognition.

Following this comprehensive feature extraction process, the combined features
pass through a trained classification layer.
This layer harnesses learned patterns to classify observed actions, assigning
probabilities to potential actions and aligning them with predefined labels.
Ultimately, this orchestrated pipeline culminates in refined and precise action
recognition, substantiating the efficacy of this CNN-LSTM amalgamation in
deciphering human actions from video data.
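
The pipeline described above can be sketched end to end in Keras: a small CNN
applied to every frame through TimeDistributed, an LSTM over the resulting
feature sequence, and a softmax layer that assigns probabilities to the
predefined action labels. All shapes and layer sizes here are illustrative
assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, HEIGHT, WIDTH, NUM_CLASSES = 20, 112, 112, 10  # assumptions

# Small per-frame CNN, applied identically to every frame of the clip.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),          # one 64-dim vector per frame
])

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, HEIGHT, WIDTH, 3)),
    layers.TimeDistributed(frame_cnn),        # spatial features, frame by frame
    layers.LSTM(128),                         # temporal integration across frames
    layers.Dense(NUM_CLASSES, activation="softmax"),  # probabilities per label
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])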

4. FUTURE SCOPE AND USE CASES
Healthcare: In the healthcare sector, action recognition systems using
CNN+LSTM can play a pivotal role in patient monitoring, rehabilitation
assessment, and personalized treatment plans. By analyzing patient
movements and activities, these systems can provide valuable insights into
their physical condition, progress, and potential risks. This information can
be used to tailor rehabilitation exercises, monitor recovery from injuries
or surgeries, and identify early signs of neurological disorders.

Sports Analytics: For athletes and sports enthusiasts, action recognition
systems powered by CNN+LSTM can revolutionize sports analytics and
performance enhancement. By analyzing the movements of athletes during
training and competitions, these systems can provide real-time feedback,
identify areas for improvement, and optimize training strategies. This data-
driven approach can lead to enhanced performance, reduced injury risks,
and improved overall athletic outcomes.

Human-Computer Interaction: Action recognition systems using
CNN+LSTM can transform human-computer interaction by enabling
natural and intuitive interactions between humans and machines. By
understanding human gestures, postures, and movements, these systems
can control devices, navigate interfaces, and respond to user commands
without the need for traditional input methods. This can revolutionize the
way we interact with computers, making them more user-friendly and
accessible.

Surveillance and Security: In surveillance and security applications,
action recognition systems using CNN+LSTM can play a crucial role in
monitoring environments, detecting suspicious behavior, and preventing
unauthorized access. By analyzing the movements of individuals in real-
time, these systems can identify potential threats, track individuals of
interest, and alert security personnel. This can enhance security measures
in public spaces, airports, and other critical areas.

Robotics and Automation: Action recognition systems using
CNN+LSTM can empower robots and autonomous systems to interact with
the world in a more human-like manner. By understanding human actions
and intentions, robots can assist humans in various tasks, collaborate in
complex environments, and provide personalized services. This can lead to
advancements in robotics, automation, and human-robot collaboration.

Virtual Reality and Augmented Reality: Action recognition systems
using CNN+LSTM can enhance virtual reality (VR) and augmented reality
(AR) experiences by enabling natural and intuitive interactions within
virtual environments. By tracking user movements and gestures, these
systems can allow users to manipulate virtual objects, navigate virtual
spaces, and interact with virtual avatars. This can create more immersive
and engaging VR/AR experiences.

Accessibility and Assistive Technologies: Action recognition systems using
CNN+LSTM can improve accessibility for individuals with disabilities by
enabling alternative input methods for controlling devices and interacting
with technology. By recognizing gestures, facial expressions, and body
movements, these systems can provide personalized solutions for
communication, navigation, and control. This can enhance the quality of life
for individuals with physical or cognitive limitations.

5. CONCLUSION
Action recognition, whether of gestures or of whole-body movements, has
attracted considerable interest because of its usefulness in many areas, such
as video surveillance, human-computer interaction, and healthcare.
Nevertheless, many problems in the field remain unsolved.
In this study, we examined the different ways researchers are trying to solve
these problems. One major issue is that actions can look very different from
different viewpoints or when objects get in the way; some methods cope with
this reasonably well, but none are perfect. Likewise, camera motion and
cluttered backgrounds make it hard for models to interpret actions, and the
methods that try to address these conditions still have clear limits. There is
hope in new directions, such as building more robust recognition systems and
creating new datasets on which to test them, but at present no single solution
resolves all of these issues. New ideas and research areas must be explored to
build a system that can handle these problems together. By laying out the
problems that still need fixing, this study can guide researchers to focus on
them and, ultimately, to create a system that recognizes actions reliably no
matter what challenges arise.

6. REFERENCES
[1] M. Ryoo and J. Aggarwal, “Human activity analysis: A review,” ACM
Computing Surveys, vol. 43, Article 16, pp. 16:1–16:43, April 2011.
[2] D. Siewiorek, A. Smailagic, and A. Dey, “Architecture and applications of
virtual coaches,” Proceedings of the IEEE, vol. 100, pp. 2472–2488, August 2012.
[3] T. Kanade and M. Hebert, “First-person vision,” Proceedings of the IEEE,
vol. 100, pp. 2442–2453, August 2012.
[4] T. Shibata, “Therapeutic seal robot as biofeedback medical device:
Qualitative and quantitative evaluations of robot therapy in dementia care,”
Proceedings of the IEEE, vol. 100, pp. 2527–2538, August 2012.
[5] K. Yamazaki, R. Ueda, S. Nozawa, M. Kojima, K. Okada, K. Matsumoto,
M. Ishikawa, I. Shimoyama, and M. Inaba, “Home-assistant robot for an aging
society,” Proceedings of the IEEE, vol. 100, pp. 2429–2441, August 2012.
[6] P. Kelly, A. Healy, K. Moran, and N. E. O’Connor, “A virtual coaching
environment for improving golf swing technique,” in ACM Multimedia Workshop
on Surreal Media and Virtual Cloning, pp. 51–56, October 2010.
[7] L. Palafox and H. Hashimoto, “Human action recognition using wavelet
signal analysis as an input in 4W1H,” in IEEE Intl. Conf. on Industrial
Informatics, pp. 679–684, July 2010.
[8] R. Poppe, “A survey on vision-based human action recognition,” Image and
Vision Computing, vol. 28, pp. 976–990, June 2010.
[9] Y. Li and Y. Kuai, “Action recognition based on spatio-temporal interest
points,” in Intl. Conf. on BioMedical Engineering and Informatics, pp. 181–185,
October 2012.
[10] X. Ji and H. Liu, “Advances in view-invariant human motion analysis:
A review,” IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications
and Reviews, vol. 40, pp. 13–24, January 2012.
[11] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, “The i3DPost
multi-view and 3D human action/interaction database,” in Conf. for Visual
Media Production, pp. 159–168, November 2009.
[12] A. F. Bobick and J. W. Davis, “The recognition of human movement using
temporal templates,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 23, pp. 257–267, March 2001.
[13] T. Darrell and A. Pentland, “Space-time gestures,” in IEEE Conf. on
Computer Vision and Pattern Recognition, pp. 335–340, 1993.
[14] G. Rogez, J. Guerrero, and C. Orrite, “View-invariant human feature
extraction for video-surveillance applications,” in IEEE Conf. on Advanced
Video and Signal Based Surveillance, pp. 324–329, 2007.
[15] H. Ragheb, S. Velastin, P. Remagnino, and T. Ellis, “Human action
recognition using robust power spectrum features,” in IEEE Intl. Conf. on
Image Processing, pp. 753–756, October 2008.
[16] Y. Lu, Y. Li, Y. Chen, F. Ding, X. Wang, J. Hu, and S. Ding, “A human
action recognition method based on Tchebichef moment invariants and temporal
templates,” in Intl. Conf. on Intelligent Human-Machine Systems and
Cybernetics, pp. 76–79, August 2012.
