Project Report Format DBUU
<TITLE>
Submitted in Partial Fulfilment of the Requirements for the Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
by
<Student Name>
Under the Guidance of
<Guide Name>, <Designation>
Submitted to the
Department of Computer Science and Engineering
School of Engineering & Computing (SoEC)
DEV BHOOMI UTTARAKHAND UNIVERSITY, UTTARAKHAND - 248001
DECEMBER 2023
CANDIDATE’S DECLARATION
I hereby declare that the work presented in this project titled “Title Name”, submitted by me
in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology (B.Tech.) to the Department of Computer Science & Engineering,
Uttarakhand Technical University, Dehradun, is an authentic record of my work carried out
under the guidance of <Guide Name>, <Designation>, Department of Computer Science
and Engineering, SoEC, Dev Bhoomi Uttarakhand University, Dehradun.
CERTIFICATE
This is to certify that the thesis entitled “Title Name”, which is being submitted by <Student Name>
to Uttarakhand University, Dehradun, in fulfilment of the requirements for the award of
the degree of Bachelor of Technology (B.Tech.), is a record of bonafide research work carried
out by him under my guidance and supervision. The matter presented in this thesis has not been
submitted, either in part or in full, to any University or Institute for the award of any degree.
<Guide Name>
<Designation>
Department of Computer Science and Engg.
Dev Bhoomi Institute of Technology, Dehradun
(Uttarakhand) INDIA
ABSTRACT
If you have ever gone back to rewatch the key moments of a football or cricket match long
after it was over, you know how impactful and vital highlight videos can be. But that is not
the only thing highlight videos are confined to. From showcasing the high points of a
wedding ceremony to meeting recordings, they come in handy in several ways. Highlight
videos essentially sum up the main takeaways and noteworthy moments of an event into
short, easily digestible pieces, so that viewers do not have to watch the whole thing in order
to get an understanding of what happened. But just because these are short, they do not have
to be boring.
ACKNOWLEDGEMENT
At this ecstatic time of presenting this dissertation, the author first bows to the almighty God for
the blessing of enough patience and strength to go through this challenging phase of life.
I would like to express a deep sense of gratitude and thanks to the people who have helped
me in the accomplishment of this B.Tech. project.
First and foremost, I would like to thank my supervisor, Mr. Dhajvir Singh Rai, for his
expertise, guidance, enthusiasm, and patience. His insightful guidance was invaluable to the
successful completion of this dissertation, and he spent many hours patiently answering
questions and troubleshooting problems.
Beyond all this, I would like to give special thanks to my parents, husband, and daughter for
their unbounded affection, sweet love, constant inspiration, and encouragement. Without their
support this research would not have been possible.
Finally, I would like to thank all faculty, college management, administrative and technical
staff of School of Engineering & Computing, Uttarakhand Technical University,
Dehradun for their encouragement, assistance, and friendship throughout my candidature.
TABLE OF CONTENTS
Page No.
Candidate’s Declaration i
Certificate ii
Abstract iii
Acknowledgements
Contents
List of Figures
List of Tables
References
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 Overview
The invention targets the video segment domain, where content is growing in size/length
and in number with the recent surge in online meetings brought on by the pandemic. We
are now in a world where hybrid work is the norm, and meetings therefore mostly happen online on
platforms like Microsoft Teams, Zoom, Google Meet, etc. This has led to the generation of a lot of
video content in the form of recordings, which organizations deal with in different ways;
some choose to record all meetings automatically. When meetings are recorded, they help
the audience a great deal in revisiting the content, and they help people who missed a meeting
to catch up on it offline.
In this context, the problem at hand is to make available a concept of “video highlights” that
could help extract useful sections of a video recording for quick access and consumption. These
highlights can be defined for the audience based on their domain of work and their need for
information. More on this has been covered in the user scenario section of this document.
The highlight generation system can be broken down into components such as Pre-processing,
Feature Extraction, Classification, Highlight Recognition, and Highlight Extraction, as shown in
Figure 1.
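To make the flow between these components concrete, below is a minimal Python sketch of the pipeline; every function is only a named placeholder for the corresponding stage in Figure 1, not the actual implementation described later in this report.

# Placeholder pipeline mirroring the components in Figure 1; each stage is a stub.
def preprocess(video_path): ...                  # noise and silence removal
def extract_features(clean_audio): ...           # per-segment audio/transcript features
def classify(features): ...                      # sentiment/speaker labelling
def recognize_highlights(labelled): ...          # score segments as highlight candidates
def extract_highlights(spans, video_path): ...   # cut the selected clips

def generate_highlights(video_path):
    clean_audio = preprocess(video_path)
    features = extract_features(clean_audio)
    labelled = classify(features)
    spans = recognize_highlights(labelled)
    return extract_highlights(spans, video_path)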
The following user scenarios are the real use cases we believe this applies to most strongly,
and they show how it will be used; each domain is discussed in some detail:
Education:
In the EDU sector, the amount of content generated in the form of recordings is huge, and it
has picked up more steam with the pandemic and the online mode of education. There has been
an increasing shift towards digital education, where teachers from different parts of the world
are bringing classes to the web to make sure we use technology to its fullest. With this, the problem
that came up is the amount of content that got generated versus consumed. This solution will help
bridge that gap and introduce a way for consumers to evaluate what they need to spend time
on and what can be skipped based on their needs. It would also help educators understand and
analyze the type of content consumers are interested in.
Corporates:
In the business domain, meetings are one of the major use cases, not just due to the pandemic,
which has certainly surged the numbers, but even before that. In corporations there are many
meetings and the majority of those are recorded, to the extent that organizations develop their
CHAPTER 2
LITERATURE REVIEW
One of the approaches for extracting highlights from a badminton video was suggested by (Tao, Luo,
Shang, & Wang, 2020). In this approach, the distinct views of badminton recordings are first
classified in order to segment the video, using a classification model built upon transfer learning,
which achieves high precision along with real-time segmentation. Next, relying on the object
detection model YOLOv3, the players in a video segment are located and their average speed is
calculated so as to extract the highlights from a badminton match video. Video segments with a
higher average player speed point towards the intense scenes/portions of a badminton game, so
they can be regarded as the sought highlights. Highlights are then extracted by sorting the
badminton video segments by the players’ average speed identified earlier, which saves users
considerable time in enjoying the highlights of a whole video. Alongside, the proposed strategy is
assessed by confirming whether a segment contained objective points of interest, for example an
excited reaction from the audience and a positive or optimistic assessment from the commentators.

Along similar lines, (Ringer, Nicolaou, & Walker, 2022) suggested that the raw data is generally
passed through feature extraction frameworks, e.g. a CNN (Convolutional Neural Network) or
audio frequency analysis methods. While it is theoretically possible to act on the raw data itself,
there are generally numerous inputs in both cases, i.e. audio as well as visual data, which makes
this a lot more challenging. Hence, employing a feature extractor tasked with detecting a smaller
number of salient features is preferred; these features are then used by the systems downstream.
Towards the end of the pipeline, there has to be a decision-making mechanism that decides whether
or not the incoming features form part of a highlight. The output of this decision-making mechanism
is usually the final output of the system and consists of a signal, a time series over the entire length
of the video, wherein each point of the signal relates to the probability that a single video segment
may be a highlight. Also, at some point between raw input and final output, the information from
every input modality, for example visual data and audio data, ought to be coalesced or merged to
bind the framework together. Precisely how the fusion component is implemented may vary among
models. For example, for specific data formats it is conceivable to fuse the raw inputs, e.g.
concatenating an RGB image and an optical-flow image with each other at the raw data level.
Alternatively, concatenation can also happen after feature extraction, so that the decision-making
process has a single set of features with which to make a choice, which we may refer to as a
‘fusion feature’. Finally, each modality may be processed entirely independently, including
decision making, and the decisions made for each modality are then aggregated in some form.
This aggregation is what we call ‘model fusion’.
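As a small illustration of the fusion points described above, the following Python sketch uses PyTorch tensors as stand-ins for per-modality data; the tensor shapes and the averaging of decisions are arbitrary assumptions made only to show where ‘fusion feature’ and ‘model fusion’ differ.

import torch

audio_feat = torch.randn(1, 128)    # features from an audio extractor (assumed size)
visual_feat = torch.randn(1, 512)   # features from a CNN over video frames (assumed size)

# "Fusion feature": concatenate after feature extraction, then decide once.
fusion_feature = torch.cat([audio_feat, visual_feat], dim=1)   # shape (1, 640)

# "Model fusion": decide per modality, then aggregate the decisions.
audio_score = torch.sigmoid(audio_feat.mean())
visual_score = torch.sigmoid(visual_feat.mean())
highlight_prob = (audio_score + visual_score) / 2
print(fusion_feature.shape, float(highlight_prob))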
CHAPTER 3
PROPOSED METHODOLOGY
As we concluded in the previous chapter and learnt from the literature review of the work
done so far in this space, there is limited work on the subject and domain of meeting
recordings; the focus has mostly been on generating highlights for sports recordings, and
largely based on the expression depicted in the audio of the recording. With that thought
in mind, the proposed methodology is multi-fold and is detailed in the following sections of
this chapter. Here we will try to create a complete view of what we are trying to do.
We have worked with video recordings of meetings, where we start by doing a lot of refinement
and pre-processing on the subject video, intended to remove any noise from the recording,
including the sections of the video where no conversation was happening. We also extract the
audio from the video recording, to be used as a specific attribute for our algorithm further in
the process. With the pre-processing out of the way, the next goal of the proposed method is to
get the transcript of the audio file and run speaker diarization over it. Speaker diarization as a
concept is explained in detail in the following sections of this chapter, but in summary it is the
process of figuring out the number of speakers in a conversation; we then use this information
to divide the audio/video file into segments based on the speaker who spoke in each section.
This helps us create a collection of smaller audio/video sections of the original recording, based
on when each of the identified speakers spoke during the meeting, as sketched below.
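A minimal sketch of this segmentation step is shown below, assuming the diarization output is already available as (start, end, speaker) tuples and using MoviePy to cut the recording; the file names and timings are placeholders, not data from this work.

from moviepy.editor import VideoFileClip

# Assumed diarization output: who spoke from when to when (seconds).
turns = [(0.0, 12.5, "SPEAKER_00"), (12.5, 40.2, "SPEAKER_01")]

clip = VideoFileClip("meeting.mp4")
segments = []
for i, (start, end, speaker) in enumerate(turns):
    part = clip.subclip(start, end)                        # per-speaker segment
    part.write_videofile(f"segment_{i}_{speaker}.mp4", audio_codec="aac")
    segments.append({"speaker": speaker, "start": start, "end": end})
clip.close()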
Our next step is to use input from the user to understand what highlight context he or she is
looking for in the video. This is needed to make the system more configurable and
personalized, as the user can define the context in which they intend to generate the
highlights from the recordings. Once the context is provided by the user, we use the Naïve
Bayes algorithm to compute, for each section of the video recording, the probability that it
depicts that sentiment, and we also rate the sections by the probability with which they
demonstrate the sentiment we are trying to narrow down to. In this step we then use this
information, together with the data from the previous step/component, to narrow down to even
smaller sections within the audio/video segments created earlier and find the required sentiment
in those smaller groups.
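As a minimal sketch of this step, the snippet below scores transcript segments with a Naïve Bayes classifier from scikit-learn; the tiny training corpus, the “conflict” context label, and the segment texts are illustrative assumptions rather than the data used in this work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumed labelled examples of the sentiments a user may ask for.
train_texts = ["I strongly disagree with this plan", "great work, well done team"]
train_labels = ["conflict", "praise"]
# Assumed transcripts of the diarized segments from the previous step.
segment_texts = ["this proposal is unacceptable", "thanks everyone, nice session"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Probability that each segment depicts the user-chosen context, e.g. "conflict".
idx = list(model.classes_).index("conflict")
for text, p in zip(segment_texts, model.predict_proba(segment_texts)[:, idx]):
    print(f"{p:.2f}  {text}")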
This is followed by our highlighter engine, which gives a highlight score to all the
audio/video segments generated so far by the previous two components. This is done using
two different criteria. First, we use the probability assigned to each video segment based on
the sentiment it depicts; this is the first parameter of the highlight formula presented in this
proposed methodology. Second, in our experiments we have observed that in such recorded
meetings the speakers who have spoken more are given lower weightage, as they do not
generate a lot of value from a highlight perspective. As a use case to understand this, in an
executive-level meeting where there is a disagreement between parties over a topic, the most
important section may be when one of the executives uses strong words to express that emotion
and then chooses to stay silent; in this case that executive is given a higher weightage in our
score. We use this to generate a score for all the video segments created so far, and the details
of generating this highlight score are given in the following sections of this chapter. We then
use this highlight score to order the video segments in decreasing order of their score, to be
treated as highlights and returned as results to our users.
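The exact highlight score is detailed later in this chapter; as an illustration only, the sketch below combines the sentiment probability with an inverse weight on the speaker's total talk time, which is one simple way to encode the observation above (an assumed weighting, not the formula used by the engine).

# Illustrative highlight score: sentiment probability scaled down for speakers
# who talked a lot overall (assumed weighting, not the exact formula).
def highlight_score(sentiment_prob, speaker_talk_time, total_talk_time):
    speaker_share = speaker_talk_time / total_talk_time   # fraction of the meeting
    return sentiment_prob * (1.0 - speaker_share)          # quieter speakers weigh more

segments = [
    {"id": 1, "sentiment_prob": 0.9, "speaker_talk_time": 120},
    {"id": 2, "sentiment_prob": 0.7, "speaker_talk_time": 600},
]
total = sum(s["speaker_talk_time"] for s in segments)
ranked = sorted(segments, key=lambda s: highlight_score(
    s["sentiment_prob"], s["speaker_talk_time"], total), reverse=True)
print([s["id"] for s in ranked])   # segment ids in decreasing highlight score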
The invention can be broken down into the following broader areas of work/implementation. We
shall discuss the role of each individual component and finally talk about how the system comes
together to deliver the objective.
This component deals with taking the video file in the supported formats and subjecting it to
the extraction algorithm to obtain the audio in known formats that we can use for further
processing. Once the audio file is obtained, it undergoes pre-processing to remove noise, which
can be in the form of silent sections or background disturbances. After the pre-processing step
we run feature extraction, which helps us train the model and obtain the number of speakers
and the sections of audio in which each of those speakers was talking. We call these sections of
audio speaker conversations; they are mapped to each speaker in the form of start and end times
within the whole audio track where that speaker was talking in the meeting.
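A minimal sketch of the audio extraction part of this component is shown below, using the MoviePy library that the algorithm later in this chapter also references; the file names are placeholders.

from moviepy.editor import VideoFileClip

clip = VideoFileClip("meeting.mp4")            # meeting recording in a supported format
clip.audio.write_audiofile("meeting.wav")      # audio track used in the later steps
clip.close()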
Background noise reduction is the process of improving a noisy audio segment by isolating the
main audio content and removing everything else. Background noise elimination is used in
almost all areas, including video conferencing systems, software used for editing video and
audio files, and headphones with noise cancellation features. Reducing background noise is
still a fast-growing and evolving area in the technology space, and artificial intelligence has
opened up a whole new range of methods for doing it better.
Recurrent neural networks are models with the capability to recognise and comprehend
sequential data. To understand what sequential data is, consider examples such as the location
of an object across time, music, or text.
RNNs are especially good at eliminating background noise because they can recognise patterns
over long periods of time, which is necessary for interpreting audio.
A feed-forward neural network has an input layer, a hidden layer, and an output layer as its
three primary layers. As the model goes through each item in a sequence, a recurrent neural
network additionally has a feedback loop, which can be thought of as a hidden state derived
from the hidden layer that keeps updating itself.
An audio sample may be divided into a series of equally spaced time segments. As each sample
of the sequence is fed into the recurrent neural network, the hidden state is updated at every
iteration, keeping track of the prior steps. After each cycle, the output is routed through a
feed-forward neural network to create a new audio stream that is free of background noise.
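The following is a minimal PyTorch sketch of this idea, a GRU running over equally spaced audio frames followed by a feed-forward output layer; the frame size, hidden size, and random input are illustrative assumptions, not the trained denoiser itself.

import torch
import torch.nn as nn

class RNNDenoiser(nn.Module):
    def __init__(self, frame_size=256, hidden_size=128):
        super().__init__()
        self.rnn = nn.GRU(frame_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, frame_size)   # feed-forward output layer

    def forward(self, frames):                 # frames: (batch, num_frames, frame_size)
        hidden_states, _ = self.rnn(frames)    # hidden state updated frame by frame
        return self.out(hidden_states)         # denoised frames, same shape as input

# Split a 1-second, 16 kHz signal into 256-sample frames and run the denoiser.
signal = torch.randn(1, 16000)
frames = signal.unfold(1, 256, 256)            # shape (1, 62, 256)
print(RNNDenoiser()(frames).shape)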
Speaker Diarization:
When an audio recording involves multiple human beings, the task gets complex, and the most
relevant and needful solution available is speaker diarization.
When an audio conversation file or recording can be broken into segments in which we can
uniquely identify which speaker is talking, it becomes much simpler for human understanding
as well as for the field of artificial intelligence; this enables both to comprehend the context
and flow of the dialogue in question.
Speaker diarization can be achieved through the following two-step process:
Finding Speakers:
Speaker segmentation is another term for the same. This step mostly analyses the
characteristics and zero-crossing rates of each voice to determine who is speaking and when.
The gender of each speaker can be identified from features such as pitch.
Clustering Speakers:
Once a speaker is recognized, the audio is divided into separate segments so that the whole
conversation can be correctly marked or tagged and easily understood; all the non-speech
sections are skipped. Probabilistic analysis is used to identify the number of people
contributing to the dialogue at a particular point in time.
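As a minimal sketch of both steps, the snippet below uses the pyannote.audio toolkit, which is one possible library choice rather than necessarily the one used in this work; the pretrained model name and the access token are assumptions about that library's setup.

from pyannote.audio import Pipeline

# Assumed pretrained diarization pipeline; requires a Hugging Face access token.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="HF_TOKEN")
diarization = pipeline("meeting.wav")

# Each track is a (start, end) turn with a speaker label, i.e. the
# "speaker conversations" described earlier.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")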
Algorithm:
Step 1: Take the video file as input and extract the audio using the MoviePy library.
Step 2: Take the audio file and subject it to pre-processing to remove noise.
Step 3: The file is then passed to the voice activity detector, which separates the speech
sections from the non-speech sections, thus trimming silences from the audio
recording/file.
Step 4: We then break the audio into segments of varied length, created based on the
statements made in the conversation. Let us call them the audio segments of the recording.
CHAPTER 4
EXPERIMENTAL RESULTS
There is an impressive wealth of recordings made individually by fans, alongside content that is
livestreamed and recorded. The livestream content or recording can be considered an
amalgamation of highlighted sections, which are not really highlights but intentionally captured
sections of the recording, from which the final highlighted segments are shown or treated as the
highlights of the recording. In the ideal case, and assuming the right context, highlight extraction
from any recording can be done easily using manual intervention and inference. An example of
this is that we could select a set of people to watch the recorded content, pick the sections they
liked best, and report those to us as highlights. As desirable as this is, the manual step is very
costly and time-consuming, which encourages us to take advantage of the available cutting-edge
technology and datasets that can help us achieve the same thing quickly and with minimal or no
manual intervention.
An alternative sensible way of handling this could be to align the interesting video segments with
the ones in the original livestream recording, e.g. by recognizing those frames in the livestream
content that also appear in the highlight video in question. However, while less expensive than
human annotation, this again carries a significant data collection penalty, since the conventions
for the highlight and livestream recordings together are unpredictable and require human
pre-processing. Moreover, the frame matching process is expensive from a computation
standpoint. This kind of approach needs a corresponding highlight video for every livestream
recorded video and a livestream recorded video for every highlight video in the dataset. Such
constraints and difficulties make automated matching of recorded content to highlights
impractical for large-scale datasets.
Instead, the positive-unlabelled approach suggested by (Xiong, Kalantidis, Ghadiyaram, &
Grauman, 2019) prescribes collecting two datasets: one that contains mixed labels, from
livestream recordings, and another with positive labels, from curated highlight recordings, even
though there is no relation between the various datasets used.
Following the same approach, we collected many meeting video recordings, some self-curated
and others gathered from various meeting recording sources. The dataset was then divided into
training and test sets.
Figure 4.1 Screenshot from a recording on which the experiment was conducted
The above results show the F1 scores for our highlight generation algorithm. The score reflects
how effectively the generated highlight denoted the required sentiment in the video and whether
the extracted segment also demonstrated the required expression.
We ran the algorithm on various sets of video recordings, ranging from meetings that had a lot
of conflicts and arguments to video recordings that did not have any specific sentiment
highlighted. With this we found that our algorithm does not perform well for videos that do not
have any specific highlighted sentiment.
Also, when the algorithm was run on technical training recordings, it was observed that it
generated too many video segments that were considered highlights and was thus not very
effective at creating a summary. We choose to take this up as a future improvement to our work.
We then ran the experiments with different splits of training/validation and test data and
observed our scores for each of these combinations, ending up with the following results. The
figures below show the results observed for the algorithm with various combinations of the
dataset subsets.
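A minimal sketch of this evaluation loop is shown below, assuming per-segment feature vectors and ground-truth highlight labels are available; the 40/20/40 split mirrors one of the dataset combinations reported in the figures, while the classifier choice and feature assumptions are illustrative.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

def evaluate(segment_features, highlight_labels, train=0.4, val=0.2, test=0.4):
    # First split off the training portion, then split the rest into validation/test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        segment_features, highlight_labels, train_size=train, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=test / (val + test), random_state=0)
    model = MultinomialNB().fit(X_train, y_train)   # assumes non-negative features
    return f1_score(y_test, model.predict(X_test))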
Figure 4.3 Graph demonstrating results with an accuracy of 0.85
Figure 4.5 Results obtained with the 40/20/40 dataset combination
CHAPTER 5
CONCLUSION & FUTURE SCOPE
5.1 Conclusion
REFERENCES
1. Agyeman, R., Muhammad, R., & Choi, G. S. (2019). Soccer Video Summarization
using Deep Learning. IEEE, 270-273.
2. Bertini, M., Bimbo, A. D., & Nunziati, W. (2004). Common Visual Cues for Sports
Highlights Detection. IEEE, 1399-1402.
3. Chakraborty, P. R., Tjondronegoro, D., Zhang, L., & Chandran, V. (n.d.). Using
Viewer’s Facial Expression and Heart Rate for Sports Video Highlights Detection. 371-
378.
4. Chakraborty, R. P., Tjondronegoro, D., Zhang, L., & Chandran, V. (2016). Automatic
Identification of Sports Video Highlights using Viewer Interest Features. 55-62.
5. Ching, W.-S., Toh, P.-S., & Er, M.-H. (n.d.). A New Specular Highlights Detection
Algorithm Using Multiple Views. 474-478.
7. Gao, X., Liu, X., Yang, T., Deng, G., Peng, H., Zhang, Q., ... Liu, J. (2020). Automatic
Key Moment Extraction And Highlights Generation Based On Comprehensive Soccer
Video Understanding. IEEE, 1-6.
8. Gygli, M., Grabner, H., & Gool, L. V. (2015). Video Summarization by Learning
Submodular Mixtures of Objectives. IEEE, 3090-3098.
11. Hanjalic, A. (2005). Adaptive Extraction of Highlights From a Sport Video Based on
Excitement Modeling. IEEE, 1114-1122.
12. Hsieh, J.-T. T., Li, C. E., Liu, W., & Zeng, K.-H. (n.d.). Spotlight: A Smart Video
Highlight Generator. stanford.edu, 1-7.
13. Hu, L., He, W., Zhang, L., Xiong, H., & Chen, E. (2021). Detecting Highlighted Video
Clips Through Emotion-Enhanced Audio-Visual Cues. IEEE.
14. Jiang, K., Chen, X., & Zhao, Q. (2011). Automatic composing soccer video highlights
with core-around event model. IEEE, 183-190.
15. Jiang, R., Qu, C., Wang, J., Wang, C., & Zheng, Y. (2020). Towards Extracting
Highlights From Recorded Live Videos: An Implicit Crowdsourcing Approach. IEEE,
1810-1813.
16. Kostoulas, T., Chanel, G., Muszynski, M., Lombardo, P., & Pun, T. (2015). Identifying
aesthetic highlights in movies from clustering of physiological and behavioral signals.
IEEE.
17. Kudi, S., & Namboodiri, A. M. (2017). Words speak for Actions: Using Text to find
Video Highlights. Asian Conference on Pattern Recognition.
18. Li, Q., Chen, J., Xie, Q., & Han, X. (2020). Detecting boundaries of absolute highlights
for sports videos.
19. Liu, C., Huang, Q., Jiang, S., & Zhang, W. (2006). Extracting Story Units In Sports
Video Based On Unsupervised Video Scene Clustering. IEEE, 1605-1608.
20. Longfei, Z., Yuanda, C., Gangyi, D., & Yong, W. (2008). A Computable Visual
Attention Model for Video Skimming. IEEE, 667-672.
21. Ma, Y.-F., & Zhang, H. J. (2005). Video Snapshot: A Bird View of Video Sequence.
IEEE.
22. Marlow, S., Sadlier, D. A., O’Connor, N., & Murphy, N. (2002). Audio Processing for
Automatic TV Sports Program Highlights Detection. ISSC.
23. Merler, M., Joshi, D., Nguyen, Q.-B., Hammer, S., Kent, J., Smith, J. R., & Feris, R.
S. (2017). Automatic Curation of Golf Highlights using Multimodal Excitement
Features. IEEE, 57-65.
24. Merler, M., Mac, K.-H. C., Joshi, D., Nguyen, Q.-B., Hammer, S., Kent, J., ... Feris,
R. S. (2018). Automatic Curation of Sports Highlights using Multimodal Excitement
Features. IEEE Transactions on Multimedia, 1-16.
25. Ngo, C.-W., Ma, Y.-F., & Zhang, H.-J. (2005). Video Summarization and Scene
Detection by Graph Modeling. IEEE, 296-305.
26. Pun, H., Beek, P. v., & Sezan, M. I. (2001). Detection Of Slow-Motion Replay Segments
In Sports Video For Highlights Generation. IEEE, 1649-1652.
27. Ringer, C., Nicolaou, M. A., & Walker, J. A. (2022). Autohighlight: Highlight
detection in League of Legends esports broadcasts via crowd-sourced data. Machine
Learning with Applications, 1-15.
28. Shih, H.-C., & Huang, C.-L. (2004). Detection Of The Highlights In Baseball Video
Program. IEEE, 595-598.
29. Tang, H., Kwatra, V., Sargin, M. E., & Gargi, U. (n.d.). Detecting Highlights In Sports
Videos: Cricket As A Test Case.
30. Tang, H., Kwatra, V., Sargin, M., & Gargi, U. (2011). Detecting Highlights In Sports
Videos: Cricket As A Test Case. IEEE.
31. Tang, K., Bao, Y., Zhao, Z., Zhu, L., Lin, Y., & Peng, Y. (2018). AutoHighlight:
Automatic Highlights Detection and Segmentation in Soccer Matches. IEEE, 4619-
4624.
32. Tao, S., Luo, J., Shang, J., & Wang, M. (2020). Extracting Highlights from a
Badminton Video Combine Transfer Learning with Players’ Velocity. International
Conference on Computer Animation and Social Agents, 82-91.
33. Tjondronegoro, D. W., Chen, Y.-P. P., & Pham, B. (2004). Classification of Self-
Consumable Highlights for Soccer Video Summaries. IEEE, 579-582.
34. Wan, K., Yan, X., & Xu, C. (2005). Automatic Mobile Sports Highlights. IEEE.
35. Wang, H., Yu, H., Chen, P., Hua, R., Yan, C., & Zuo, L. (2018). Unsupervised Video
Highlight Extraction via Query-related Deep Transfer. 24th International Conference
on Pattern Recognition, 2971-2976.
36. Wu, P. (2004). A Semi-automatic Approach to Detect Highlights for Home Video
Annotation. IEEE, 957-960.
37. Wung, P., Cui, R., & Yang, S.-Q. (2004). Contextual Browsing For Highlights In Sports
Video. IEEE, 1951-1954.
38. Xiao, B., Yin, X., & Kang, S.-C. (2021). Vision-based method of automatically
detecting construction video highlights by integrating machine tracking and CNN
feature extraction. Automation in Construction, 1-13.
39. Xiong, B., Kalantidis, Y., Ghadiyaram, D., & Grauman, K. (2019). Less Is More:
Learning Highlight Detection From Video Duration. IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 1258-1267.
40. Xiong, Z., Radhakrishnan, R., Divakaran, A., & Huang, T. S. (2005). Highlights
Extraction From Sports Video Based On An Audio-Visual Marker Detection
Framework. IEEE.
41. Xiong, Z., Radhakrishnan, R., Divakaran, A., & Huang, T. S. (2004). Effective And
Efficient Sports Highlights Extraction Using The Minimum Description Length
Criterion In Selecting GMM Structures. IEEE, 1947-1950.
42. Yang, H., Wang, B., Lin, S., Wipf, D., Guo, M., & Guo, B. (2015). Unsupervised
Extraction of Video Highlights Via Robust Recurrent Auto-encoders. IEEE
International Conference on Computer Vision, 4633-4641.
43. Yao, T., Mei, T., & Rui, Y. (n.d.). Highlight Detection with Pairwise Deep Ranking for
First-Person Video Summarization. IEEE, 982-990.