
Research Article
Volume-1 | Issue-1 | Jan-Jun-2024
JOURNAL OF
Artificial Intelligence and
Imaging
Double Blind Peer Reviewed Journal
DOI: https://doi.org/10.48001/JoAII

Implementation of Simple and Efficient Picture Caption Generator

Vijay Mane1*, Riddhi Selkar1

1Department of Electronics and Telecommunication Engineering, Vishwakarma Institute of Technology, Pune, Maharashtra, India
*Corresponding Author's Email: vijay.mane@vit.edu

ARTICLE HISTORY:
Received: 8th Jan, 2024
Revised: 25th Jan, 2024
Accepted: 8th Feb, 2024
Published: 20th Feb, 2024

KEYWORDS: Caption, CNN (Convolutional Neural Networks), Deep learning, Image captioning, LSTM (Long Short-Term Memory)

ABSTRACT: Image captioning, or picture captioning, has become one of the most widely used technologies in applications that generate and provide captions for specific photographs. This is done with the help of deep neural networks, which identify the specific objects in an image along with their attributes and relationships. The purpose of this research is to find different objects in a photograph, determine their relationships, and write captions. The proposed system is implemented in Python on the Flickr8k dataset. The input images are pre-processed, and features are then extracted from them using a CNN. An LSTM is used in the implementation to translate the features and objects extracted by the CNN into a natural English sentence. Different types of images are tested with the proposed system, and the results are presented with the generated image captions, showing the accuracy of the system. The presented method has potential for applications where image captioning is essential.

1. INTRODUCTION

With captions for every photograph on the internet, true photograph discovery and indexing can be done faster and more accurately. Image captioning is used in a wide range of fields, including medicine, business, internet search, and the military, to name a few. Captions can be generated automatically on Instagram, Facebook, and other social media platforms (Wang et al., 2020; Sharma et al., 2019).

The main goal of this study is to gain a fundamental understanding of deep learning approaches. Even when employing both Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, automatically summarizing image content in well-constructed English words poses a formidable challenge. Despite its complexity, this task holds significant promise, particularly in aiding visually impaired individuals to gain a clearer comprehension of the visual content present in web images (Hossain et al., 2019; Alahmadi et al., 2019).
DOI: https://doi.org/10.48001/JoAII.2023.1111-18 Copyright (c) 2024 QTanalytics India (Publications)



Figure 1: Input and Output.

1.1 CNN

Convolutional Neural Networks (CNN) are designed to process inputs that arrive in a specific spatial format, such as a 2D image. When it comes to working with grid-structured (lattice) data such as images, CNNs are particularly helpful (Mathur et al., 2017; Pranay et al., 2017).

1.2 LSTM

LSTM is a type of Recurrent Neural Network (RNN) that can learn order dependence in sequence prediction problems (Bai & An, 2018). It is mostly used in complex tasks such as machine translation and speech recognition. LSTM addresses the shortcomings of traditional RNNs by effectively preserving pertinent information across input processing while filtering out extraneous details. After the images are analyzed by the CNN, meaningful sentences can be generated using the LSTM.

Figure 3: LSTM.
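The gating behaviour described above can be made concrete with a single LSTM cell step in plain Python. This is a numeric illustration with scalar states and a single hand-picked weight, not the trained network: the forget, input, and output gates decide what old memory to keep, what new information to add, and what to emit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=1.0):
    """One scalar LSTM step: gates filter what is kept, added, and emitted."""
    f = sigmoid(w * (x + h_prev))                 # forget gate: keep old cell state?
    i = sigmoid(w * (x + h_prev))                 # input gate: accept new candidate?
    g = math.tanh(w * (x + h_prev))               # candidate cell value
    c = f * c_prev + i * g                        # filtered memory update
    h = sigmoid(w * (x + h_prev)) * math.tanh(c)  # output gate * squashed state
    return h, c

# Feed a short sequence through the cell, carrying state between steps.
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c)
print(round(h, 3), round(c, 3))
```

With zero input and zero state, every gate sits at 0.5 and the candidate at 0, so the cell state stays at zero; non-zero inputs accumulate in the bounded cell state across steps, which is the "preserving pertinent information" property.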

1.3 CNN-LSTM Architecture

The CNN-LSTM architecture combines Convolutional Neural Network (CNN) layers, which extract features from the input data, with Long Short-Term Memory (LSTM) units, which predict the output word sequence.

Figure 2: CNN-LSTM Architecture.
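The generation loop implied by this architecture can be sketched in plain Python. Here `predict_next_word` is a hypothetical stand-in for the trained CNN-LSTM model (a real implementation would call the model on the image features plus the token sequence so far); the loop itself is the standard greedy decoding pattern with conventional `startseq`/`endseq` tokens, which are our own illustrative choices.

```python
# Greedy decoding loop for a CNN-LSTM captioner (sketch).
# `predict_next_word` is a hypothetical stand-in for the trained model.

def make_stub_model(script):
    """Return a fake predictor that replays a fixed word sequence."""
    state = {"i": 0}
    def predict_next_word(image_features, partial_caption):
        word = script[state["i"] % len(script)]
        state["i"] += 1
        return word
    return predict_next_word

def generate_caption(image_features, predict_next_word, max_len=20):
    caption = ["startseq"]                    # conventional start token
    for _ in range(max_len):
        word = predict_next_word(image_features, caption)
        if word == "endseq":                  # conventional end token
            break
        caption.append(word)
    return " ".join(caption[1:])              # drop the start token

predict = make_stub_model(["brown", "dog", "is", "running", "endseq"])
print(generate_caption([0.1, 0.7], predict))  # brown dog is running
```

At each step the model sees the image features and the words generated so far, so the LSTM can condition the next word on both the visual content and the sentence produced up to that point.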



2. LITERATURE SURVEY

Earlier work looked into the issue of describing visual data automatically using natural language processing, for example by using a visual primitive recognizer and a formal language to convert visual content to text. The conversion of And-Or Graphs or logical systems into natural language is typically achieved through rule-based systems. However, these methods prove to be less effective and find limited application, primarily in specific domains such as sports and traffic scenarios. In contrast, the field of representing graphical information through natural language processing has experienced a surge in interest and popularity in recent years. Natural language processing techniques have lately been used to identify objects based on their characteristics and locations (Wu & Zhou, 2019; Bahdanau et al., 2014).

3. SYSTEM DESCRIPTION

Such a model was helpful in identifying the objects in an image, but it was unable to tell us how those objects related to one another (that is just basic image classification). In this study, we develop a generative model that uses an RNN to efficiently produce meaningful sentences. The system consists of two parts:

• Image based Model (CNN): extracts the features from the image (Albawi et al., 2017).

• Language based Model (LSTM): converts the features and objects identified by the Convolutional Neural Network (CNN) into a coherent English sentence.

Figure 4: Flowchart.

3.1 Dataset

• We use the Flickr8k dataset, which contains 8,000 images with 5 captions for each image in the collection.

• A training set, Flickr_8k.trainImages.txt, lists 6,000 images.

• A development set of 1,000 images, Flickr_8k.devImages.txt, is available.

• A test set, Flickr_8k.testImages.txt, lists 1,000 photos.

3.2 Methodology

• Preprocessing of the images.

• Creating a vocabulary for the images:
  a. Loading the data.
  b. Creating a lexicon of descriptions that maps to the images.
  c. Eliminating punctuation, converting all text to lowercase, and deleting words that contain numbers.
  d. Collecting all the distinct words and constructing a vocabulary from the descriptions.
  e. Creating a descriptions.txt file to keep track of all the captions.

• Training the model on the dataset.

• Evaluating the model.

• Testing on individual images.
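The caption-cleaning steps (a–e) above can be sketched in plain Python. The tab-separated "image id, caption" line format follows the Flickr8k token file convention, and the function names are our own illustration, not the paper's code.

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, and drop words containing digits (step c)."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())

def build_descriptions(lines):
    """Map each image id to its cleaned captions (steps a-b)."""
    descriptions = {}
    for line in lines:
        image_id, caption = line.split("\t", 1)
        image_id = image_id.split(".")[0]          # strip the ".jpg#0" suffix
        descriptions.setdefault(image_id, []).append(clean_caption(caption))
    return descriptions

def build_vocabulary(descriptions):
    """Collect all distinct words across every description (step d)."""
    return {w for caps in descriptions.values() for cap in caps for w in cap.split()}

lines = ["1000268201.jpg#0\tA child in a pink dress is climbing stairs ."]
descs = build_descriptions(lines)
print(descs["1000268201"])   # ['a child in a pink dress is climbing stairs']
```

Step e then amounts to writing each `image_id` with its cleaned captions back out, one per line, to descriptions.txt.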

3.3 Preprocessing

Images and their related captions are processed independently. Each input image is passed through the Xception model from the Keras applications API, which runs on top of TensorFlow, to extract its features. Xception was pre-trained on ImageNet, so with the help of transfer learning we were able to process the images quickly. The captions are cleaned and tokenized using Keras' Tokenizer class, which fits on the text and stores it in its dictionary. Then, for each word in the vocabulary, a unique reference number is assigned.
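The word-to-index assignment performed by the tokenizer can be illustrated in a few lines of plain Python. This is a simplified stand-in for Keras' Tokenizer, not the Keras API itself: each vocabulary word gets a unique integer, most frequent words first, with index 0 conventionally reserved for padding.

```python
def fit_word_index(captions):
    """Assign a unique integer to each word, most frequent first (0 reserved for padding)."""
    counts = {}
    for cap in captions:
        for w in cap.split():
            counts[w] = counts.get(w, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def texts_to_sequences(captions, word_index):
    """Replace each word with its assigned reference number."""
    return [[word_index[w] for w in cap.split() if w in word_index] for cap in captions]

caps = ["dog runs", "dog jumps"]
idx = fit_word_index(caps)
print(idx["dog"])                        # 1 (the most frequent word)
print(texts_to_sequences(["dog jumps"], idx))
```

These integer sequences, padded to a common length, are what the LSTM consumes alongside the Xception feature vector during training.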

4. RESULTS

Figures 5–15 show the eleven test images with their generated captions; the original and predicted descriptions are compared in Table 1.

Figure 5: Image 1.

Figure 6: Image 2.

Figure 7: Image 3.

Figure 8: Image 4.

Figure 9: Image 5.

Figure 10: Image 6.

Figure 11: Image 7.

Figure 12: Image 8.

Figure 13: Image 9.

Figure 14: Image 10.

Figure 15: Image 11.


Table 1: Comparison Between Original and Predicted Values.

Image | Original Description | Predicted Description
Image1 | White crane is standing in the water | White crane is flying over the water
Image2 | Men in red shirt and black pants is walking down the snowy hill | Men in red shirt and black pants is walking down the snowy hill
Image3 | Man is snowboarding on the side of mountain | Man is snowboarding on the side of mountain
Image4 | Man is standing on the rock | Man in red shirt is standing on the rock
Image5 | Man with red helmet is riding bike on road | Man in red shirt is riding bike on the side of road
Image6 | Young girl is playing in water | Young boy is jumping into the air to climb into the water
Image7 | Ship is standing in the water | Man in red kayak is walking on the beach
Image8 | Man is kayaking in the water | Man is kayaking in the water
Image9 | Man in red shirt is sitting on the bench | Man in red shirt is walking on the street
Image10 | 5 people are standing on grass | Man in the red shirt walking on the street
Image11 | Brown dog is running through the grass | Brown dog is running through the grass
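A crude way to quantify the agreement shown in Table 1 is the word overlap between the original and predicted descriptions. This metric is our own illustration (the paper reports no numeric score); standard caption metrics such as BLEU refine the same idea.

```python
def word_overlap(original, predicted):
    """Fraction of the original caption's distinct words found in the prediction."""
    ref = set(original.lower().split())
    hyp = set(predicted.lower().split())
    return len(ref & hyp) / len(ref)

rows = [
    ("Man is kayaking in the water", "Man is kayaking in the water"),           # Image8
    ("White crane is standing in the water", "White crane is flying over the water"),  # Image1
]
for ref, hyp in rows:
    print(round(word_overlap(ref, hyp), 2))   # 1.0 and 0.71
```

Exact matches such as Image8 score 1.0, while near misses such as Image1 (standing vs. flying) keep a high but imperfect overlap, matching the qualitative pattern in the table.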

5. CONCLUSION

This paper presents the implementation of an image caption generator using a CNN-LSTM. The field is seeing a growing number of applications across the Computer Vision (CV) and NLP domains. The accuracy of the model in generating captions is limited: sometimes it may generate wrong or incomplete captions. This is due to the small dataset; by using a larger dataset of around 100,000 images, we can train more accurate models. The model depends on the dataset we use, so it cannot predict words that are outside its vocabulary. We could try other algorithms and methodologies to increase the accuracy of the generated captions. We may also include a text-to-speech converter so that the computer speaks every generated caption, for applications used by blind people. The outcomes are displayed alongside the automatically generated image captions, showcasing the system's accuracy. The demonstrated method holds promise for applications where image captioning is a crucial requirement.


REFERENCES

Alahmadi, R., Park, C. H., & Hahn, J. (2019, March). Sequence-to-sequence image caption generator. In Eleventh International Conference on Machine Vision (ICMV 2018) (Vol. 11041, pp. 85-91). SPIE. https://doi.org/10.1117/12.2523174

Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1-6). IEEE. https://doi.org/10.1109/ICEngTechnol.2017.8308186

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. https://doi.org/10.48550/arXiv.1409.0473

Bai, S., & An, S. (2018). A survey on automatic image caption generation. Neurocomputing, 311, 291-304. https://doi.org/10.1016/j.neucom.2018.05.080

Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys, 51(6), 1-36. https://doi.org/10.1145/3295748

Mathur, P., Gill, A., Yadav, A., Mishra, A., & Bansode, N. K. (2017, June). Camera2Caption: A real-time image caption generator. In 2017 International Conference on Computational Intelligence in Data Science (ICCIDS) (pp. 1-6). IEEE. https://doi.org/10.1109/ICCIDS.2017.8272660

Pranay, M., Aman, G., Aayush, Y., Anurag, M., & Nand, B. (2017). Camera2Caption: A real-time image caption generator. In International Conference on Computational Intelligence in Data Science (ICCIDS). https://doi.org/10.1109/ICCIDS.2017.8272660

Sharma, G., Kalena, P., Malde, N., Nair, A., & Parkar, S. (2019, April). Visual image caption generator using deep learning. In 2nd International Conference on Advances in Science & Technology (ICAST). http://dx.doi.org/10.2139/ssrn.3368837

Wang, H., Zhang, Y., & Yu, X. (2020). An overview of image caption generation methods. Computational Intelligence and Neuroscience, 2020. https://doi.org/10.1155/2020/3062706

Wu, Q., & Zhou, Y. (2019, May). Real-time object detection based on unmanned aerial vehicle. In 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS) (pp. 574-579). IEEE. https://doi.org/10.1109/DDCLS.2019.8908984
