Industrial Training Report
on
"Automatic Action Recognition Using Deep Learning"

Submitted By
Name: Deepanshu Nirvan
University Roll No. 2000321530044

SUBMITTED TO:
ABES ENGINEERING COLLEGE, GHAZIABAD
DECLARATION
I hereby declare that the Industrial Training Report entitled "Automatic Action
Recognition Using Deep Learning" is an authentic record of my own work, carried out
as a requirement of Industrial Training during the period from 15 July 2023 to 30 August 2023
for the award of the degree of B.Tech. in Computer Science & Engineering (AI & ML), ABES
Engineering College, Ghaziabad, under the guidance of Mr. Vikas Chaudhary.
Signature of Student
Deepanshu Nirvan
2000321530044
Date:…………….
CERTIFICATE
This is to certify that Mr. Deepanshu Nirvan has completed Industrial Training during the period
from 15 July 2023 to 30 August 2023 in our Organization / Industry as partial fulfillment of the
degree of Bachelor of Technology in Computer Science & Engineering (AI & ML). He was trained
in the field of Machine Learning.
Signature of
Date : …………….
ACKNOWLEDGEMENT
Signature of student
Deepanshu Nirvan
2000321530044
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
1. INTRODUCTION
2. REVIEW OF LITERATURE AND FEASIBILITY STUDY
3. PROPOSED METHODOLOGY
4. FUTURE SCOPE AND USE CASES
5. CONCLUSION
6. REFERENCES
1. INTRODUCTION
Human action recognition (HAR), the task of identifying human actions from video
data, requires modeling the complex temporal and spatial dynamics of human
actions. In recent years, deep learning has emerged as a powerful tool for HAR,
revolutionizing the field with its ability to learn complex representations from
raw data. Convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) have become the dominant architectures for HAR, demonstrating
remarkable performance on various datasets and scenarios. CNNs excel at
extracting spatial features from images or video frames, while RNNs
effectively capture temporal dependencies in action sequences.
The combination of CNNs and RNNs has led to the development of
sophisticated HAR architectures, such as two-stream networks and
convolutional long short-term memory (ConvLSTM) networks. These
architectures leverage the strengths of both CNNs and RNNs, achieving
state-of-the-art performance in HAR.
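To make these ideas concrete, the short sketch below assembles a minimal ConvLSTM classifier in TensorFlow/Keras, where convolution and recurrence are fused inside a single layer. The clip shape, filter count, and number of classes are assumptions chosen only for illustration and do not correspond to any specific model discussed here.

```python
# Minimal ConvLSTM sketch; all shapes and sizes are illustrative assumptions.
from tensorflow.keras import layers, models

NUM_CLASSES = 10                     # assumed number of action classes
FRAMES, H, W, C = 16, 64, 64, 3      # assumed clip length and frame size

model = models.Sequential([
    layers.Input(shape=(FRAMES, H, W, C)),
    # ConvLSTM2D applies convolution inside the recurrent cell, so spatial
    # and temporal patterns are learned jointly in one layer.
    layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=False),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```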
Despite significant advances, HAR remains a challenging task due to variations
in appearance, lighting, and background conditions. Additionally, the
complexity of human actions and the need for real-time processing pose further
challenges. Ongoing research is focused on addressing these challenges and
developing more robust, efficient, and real-time HAR systems.
This report delves into the application of deep learning for HAR, exploring the
underlying principles, architectures, and recent advancements. We present a
comprehensive overview of deep learning-based HAR methods, highlighting
their strengths, limitations, and applications. We also discuss the challenges
and future directions of HAR research.
1.1 PROBLEM STATEMENT
2. REVIEW OF LITERATURE
intricate model optimization.
3. PROPOSED METHODOLOGY
The proposed framework for action recognition integrates Convolutional Neural
Networks (CNNs) and Long Short-Term Memory (LSTM) networks, and is designed to
capture and process both the spatial and the temporal features present in video
data. The system aims to improve the accuracy and effectiveness of action
recognition by combining the complementary strengths of CNNs and LSTMs.
In the visual analysis stage, CNNs examine the content of individual video
frames. These networks excel at discerning visual attributes such as shapes,
patterns, and textures. By identifying and interpreting these visual cues, the
CNN builds a detailed spatial understanding of each frame, which forms the
basis for the subsequent recognition stages.
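As a minimal sketch of this per-frame analysis, the snippet below (TensorFlow/Keras, with assumed clip length, frame size, and filter counts) applies the same small CNN to every frame of a clip through a TimeDistributed wrapper, producing one spatial feature vector per frame.

```python
# Per-frame spatial feature extraction sketch; all sizes are illustrative assumptions.
from tensorflow.keras import layers, models

FRAMES, H, W, C = 16, 112, 112, 3    # assumed clip length and frame size

# A small CNN that turns one RGB frame into a single feature vector.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),  # shapes, patterns, textures summarized per frame
])

clip = layers.Input(shape=(FRAMES, H, W, C))
# TimeDistributed applies the same CNN independently to every frame,
# yielding one spatial feature vector per frame.
frame_features = layers.TimeDistributed(frame_cnn)(clip)
print(frame_features.shape)           # (None, 16, 64)
```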
The fusion of CNNs and LSTMs within this framework combines spatial and
temporal understanding. While the CNN captures and encodes the visual content
of each frame, the LSTM processes the resulting frame-level features in
sequence, modeling how they evolve over time. This collaboration allows the
system to build holistic representations of actions, using both visual
information and temporal context for more accurate recognition.
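A minimal sketch of the temporal side of this fusion is given below; it feeds a sequence of per-frame feature vectors (random stand-ins for the CNN outputs sketched above) through an LSTM that summarizes the whole clip. The feature dimension and hidden size are assumed values.

```python
# Temporal modeling sketch: an LSTM reads per-frame features in order.
import tensorflow as tf
from tensorflow.keras import layers

FRAMES, FEATURE_DIM = 16, 64          # assumed clip length and CNN feature size

# Random stand-in for the per-frame CNN features of one clip.
frame_features = tf.random.normal((1, FRAMES, FEATURE_DIM))

# The LSTM walks over the frames in sequence and returns a single
# clip-level embedding that encodes how the features evolve over time.
clip_embedding = layers.LSTM(128)(frame_features)
print(clip_embedding.shape)           # (1, 128)
```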
After this feature extraction stage, the combined features are passed to a
classification layer. Using the patterns learned during training, this layer
assigns a probability to each candidate action and maps the result to one of
the predefined labels. The complete pipeline thus produces the final action
prediction, demonstrating the effectiveness of the CNN-LSTM combination for
recognizing human actions in video data.
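Tying the preceding sketches together, the following illustrative pipeline stacks the per-frame CNN, the LSTM, and a softmax classification layer, and runs an untrained prediction on a random clip purely to show how class probabilities map onto predefined labels. The label names, shapes, and layer sizes are all assumptions made for this example.

```python
# End-to-end CNN+LSTM sketch; shapes, layer sizes, and labels are assumptions.
import numpy as np
from tensorflow.keras import layers, models

FRAMES, H, W, C = 16, 112, 112, 3                             # assumed clip shape
ACTION_LABELS = ["walking", "running", "jumping", "waving"]   # assumed label set

frame_cnn = models.Sequential([                               # spatial features per frame
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
])

clips = layers.Input(shape=(FRAMES, H, W, C))
x = layers.TimeDistributed(frame_cnn)(clips)                  # (batch, FRAMES, 64)
x = layers.LSTM(128)(x)                                       # temporal context across frames
probs = layers.Dense(len(ACTION_LABELS), activation="softmax")(x)

model = models.Model(clips, probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Untrained prediction on a random clip, only to show the label mapping.
dummy_clip = np.random.rand(1, FRAMES, H, W, C).astype("float32")
prediction = model.predict(dummy_clip)[0]                     # one probability per label
print(ACTION_LABELS[int(np.argmax(prediction))])
```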
4. FUTURE SCOPE AND USE CASES
Healthcare: In the healthcare sector, action recognition systems using
CNN+LSTM can play a pivotal role in patient monitoring, rehabilitation
assessment, and personalized treatment plans. By analyzing patient
movements and activities, these systems can provide valuable insights into
their physical condition, progress, and potential risks. This information can
be used to tailor rehabilitation exercises, monitor recovery from injuries
or surgeries, and identify early signs of neurological disorders.
5. CONCLUSION
Action recognition, whether of gestures or full-body movements, has attracted
considerable interest because of its applications in areas such as video
surveillance, human-computer interaction, and healthcare. Nevertheless, many
problems in this field remain unsolved.
In this study, we examined the different approaches that have been proposed to
address these problems. One major issue is that actions can look very different
when viewed from different angles or when parts of the scene are occluded; some
methods handle this reasonably well, but none are fully robust. Camera motion
and cluttered backgrounds also make it difficult for systems to interpret
actions, and the methods that address them still have clear limitations. There
is hope in new directions, such as more robust recognition architectures and
new benchmark datasets for evaluating them, but at present no single solution
resolves all of these issues. New ideas and research areas must be explored to
build a system that can handle all of these challenges. This study highlights
the problems that remain open, which can guide researchers toward them and,
ultimately, toward a recognition system that performs well regardless of the
challenges it encounters.
6. REFERENCES
[1] M. Ryoo and J. Aggarwal, "Human activity analysis: A review," ACM Computing Surveys, vol. 43, article 16, pp. 16:1–16:43, April 2011.
[2] D. Siewiorek, A. Smailagic, and A. Dey, "Architecture and applications of virtual coaches," Proceedings of the IEEE, Invited Paper, vol. 100, pp. 2472–2488, August 2012.
[3] T. Kanade and M. Hebert, "First-person vision," Proceedings of the IEEE, Invited Paper, vol. 100, pp. 2442–2453, August 2012.
[4] T. Shibata, "Therapeutic seal robot as biofeedback medical device: Qualitative and quantitative evaluations of robot therapy in dementia care," Proceedings of the IEEE, Invited Paper, vol. 100, pp. 2527–2538, August 2012.
[5] K. Yamazaki, R. Ueda, S. Nozawa, M. Kojima, K. Okada, K. Matsumoto, M. Ishikawa, I. Shimoyama, and M. Inaba, "Home-assistant robot for an aging society," Proceedings of the IEEE, Invited Paper, vol. 100, pp. 2429–2441, August 2012.
[6] P. Kelly, A. Healy, K. Moran, and N. E. O'Connor, "A virtual coaching environment for improving golf swing technique," in ACM Multimedia Workshop on Surreal Media and Virtual Cloning, pp. 51–56, October 2010.
[7] L. Palafox and H. Hashimoto, "Human action recognition using wavelet signal analysis as an input in 4W1H," in IEEE Intl. Conf. on Industrial Informatics, pp. 679–684, July 2010.
[8] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, pp. 976–990, June 2010.
[9] Y. Li and Y. Kuai, "Action recognition based on spatio-temporal interest points," in Intl. Conf. on BioMedical Engineering and Informatics, pp. 181–185, October 2012.
[10] X. Ji and H. Liu, "Advances in view-invariant human motion analysis: A review," IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 40, pp. 13–24, January 2012.
[11] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, "The i3DPost multi-view and 3D human action/interaction database," in Conf. for Visual Media Production, pp. 159–168, November 2009.
[12] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, pp. 257–267, March 2001.
[13] T. Darrell and A. Pentland, "Space-time gestures," in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 335–340, 1993.
[14] G. Rogez, J. Guerrero, and C. Orrite, "View-invariant human feature extraction for video-surveillance applications," in IEEE Conf. on Advanced Video and Signal Based Surveillance, pp. 324–329, 2007.
[15] H. Ragheb, S. Velastin, P. Remagnino, and T. Ellis, "Human action recognition using robust power spectrum features," in IEEE Conf. on Image Processing, pp. 753–756, October 2008.
[16] Y. Lu, Y. Li, Y. Chen, F. Ding, X. Wang, J. Hu, and S. Ding, "A human action recognition method based on Tchebichef moment invariants and temporal templates," in Intl. Conf. on Intelligent Human-Machine Systems and Cybernetics, pp. 76–79, August 2012.