
Research Article
Volume-1 | Issue-1 | Jan-Jun-2024
JOURNAL OF
Artificial Intelligence and
Imaging
Double Blind Peer Reviewed Journal
DOI: https://doi.org/10.48001/JoAII

Implementation of Simple and Efficient Picture Caption Generator

Vijay Mane1*, Riddhi Selkar1

1Department of Electronics and Telecommunication Engineering, Vishwakarma Institute of Technology, Pune, Maharashtra, India
*Corresponding Author's Email: vijay.mane@vit.edu

ARTICLE HISTORY:
Received: 8th Jan, 2024
Revised: 25th Jan, 2024
Accepted: 8th Feb, 2024
Published: 20th Feb, 2024

KEYWORDS: Caption, CNN (Convolutional Neural Networks), Deep learning, Image captioning, LSTM (Long Short-Term Memory)

ABSTRACT: Image captioning, or picture captioning, has become one of the most widely used technologies in applications that generate and provide captions for specific photographs. This is done with the help of deep neural networks, which identify the specific objects in an image along with their attributes and relationships. The purpose of this research is to find different objects in a photograph, determine their relationships, and write captions. The proposed system is implemented in Python on the Flickr8k dataset. The input images are pre-processed, and features are then extracted from them using a CNN. An LSTM is used in the implementation to translate the features and objects extracted by the CNN into a natural English sentence. Different types of images are tested with the proposed system, and the results are presented with the generated image captions, showing the accuracy of the system. The presented method has potential for applications where image captioning is essential.

1. INTRODUCTION

With captions for every photograph on the internet, true photograph discovery and indexing can be done faster and more accurately. Image captioning is used in a wide range of fields, including medicine, business, internet search, and the military, to name a few. Captions can be generated automatically on Instagram, Facebook, and other social media platforms (Wang et al., 2020; Sharma et al., 2019).

The main goal of this study is to gain a fundamental understanding of deep learning approaches. Even when employing both Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, automatically summarizing image content in well-constructed English words poses a formidable challenge. Despite its complexity, this task holds significant promise, particularly in aiding visually impaired individuals to gain a clearer comprehension of the visual content present in web images (Hossain et al., 2019; Alahmadi et al., 2019).
DOI: https://doi.org/10.48001/JoAII.2023.1111-18 Copyright (c) 2024 QTanalytics India (Publications)



Figure 1: Input and Output.

1.1 CNN

Convolutional Neural Networks (CNN) are designed to process inputs that arrive in a specific spatial format, such as a 2D image. When it comes to working with grid-structured (lattice) data such as images, CNNs are particularly helpful (Mathur et al., 2017; Pranay et al., 2017).

1.2 LSTM

LSTM is a type of Recurrent Neural Network (RNN) that can learn order dependence in sequence prediction problems (Bai & An, 2018). It is mostly used in complex tasks such as machine translation and speech recognition. LSTM addresses the shortcomings of traditional RNNs by effectively preserving pertinent information across input processing while filtering out extraneous details. After the images are analyzed by the CNN, meaningful sentences can be generated using the LSTM.

Figure 3: LSTM.
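The gating behaviour described above can be made concrete with a single LSTM cell step in plain Python. This is a numeric illustration with scalar states and a single hand-picked weight, not the trained network: the forget, input, and output gates decide what old memory to keep, what new information to add, and what to emit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=1.0):
    """One scalar LSTM step: gates filter what is kept, added, and emitted."""
    f = sigmoid(w * (x + h_prev))                 # forget gate: keep old cell state?
    i = sigmoid(w * (x + h_prev))                 # input gate: accept new candidate?
    g = math.tanh(w * (x + h_prev))               # candidate cell value
    c = f * c_prev + i * g                        # filtered memory update
    h = sigmoid(w * (x + h_prev)) * math.tanh(c)  # output gate * squashed state
    return h, c

# Feed a short sequence through the cell, carrying state between steps.
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c)
print(round(h, 3), round(c, 3))
```

With zero input and zero state, every gate sits at 0.5 and the candidate at 0, so the cell state stays at zero; non-zero inputs accumulate in the bounded cell state across steps, which is the "preserving pertinent information" property.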

1.3 CNN-LSTM Architecture

The CNN-LSTM architecture combines Convolutional Neural Network (CNN) layers, which extract features from the input data, with Long Short-Term Memory (LSTM) units, which predict the output word sequence.

Figure 2: CNN-LSTM Architecture.
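The generation loop implied by this architecture can be sketched in plain Python. Here `predict_next_word` is a hypothetical stand-in for the trained CNN-LSTM model (a real implementation would call the model on the image features plus the token sequence so far); the loop itself is the standard greedy decoding pattern with conventional `startseq`/`endseq` tokens, which are our own illustrative choices.

```python
# Greedy decoding loop for a CNN-LSTM captioner (sketch).
# `predict_next_word` is a hypothetical stand-in for the trained model.

def make_stub_model(script):
    """Return a fake predictor that replays a fixed word sequence."""
    state = {"i": 0}
    def predict_next_word(image_features, partial_caption):
        word = script[state["i"] % len(script)]
        state["i"] += 1
        return word
    return predict_next_word

def generate_caption(image_features, predict_next_word, max_len=20):
    caption = ["startseq"]                    # conventional start token
    for _ in range(max_len):
        word = predict_next_word(image_features, caption)
        if word == "endseq":                  # conventional end token
            break
        caption.append(word)
    return " ".join(caption[1:])              # drop the start token

predict = make_stub_model(["brown", "dog", "is", "running", "endseq"])
print(generate_caption([0.1, 0.7], predict))  # brown dog is running
```

At each step the model sees the image features and the words generated so far, so the LSTM can condition the next word on both the visual content and the sentence produced up to that point.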



2. LITERATURE SURVEY

Earlier work looked into the issue of describing visual data automatically using natural language processing, for example by using a visual primitive recognizer and a formal language to convert visual content to text. The conversion of And-Or Graphs or logical systems into natural language is typically achieved through rule-based systems. However, these methods prove to be less effective and find limited application, primarily in specific domains such as sports and traffic scenarios. In contrast, the field of representing graphical information through natural language processing has experienced a surge in interest and popularity in recent years. Natural language processing techniques have lately been used to identify objects based on their characteristics and locations (Wu & Zhou, 2019; Bahdanau et al., 2014).

3. SYSTEM DESCRIPTION

Such a model was helpful in identifying the objects in an image, but it was unable to tell us how those objects related to one another (that is just basic image classification). In this study, we develop a generative model that uses an RNN to efficiently produce meaningful sentences. The system consists of two parts:

• Image based Model (CNN): extracts the features from the image (Albawi et al., 2017).

• Language based Model (LSTM): converts the features and objects identified by the Convolutional Neural Network (CNN) into a coherent English sentence.

Figure 4: Flowchart.

3.1 Dataset

• We use the Flickr8k dataset, which contains 8,000 images with 5 captions for each image in the collection.

• A training set, Flickr_8k.trainImages.txt, lists 6,000 images.

• A development set of 1,000 images, Flickr_8k.devImages.txt, is available.

• A test set, Flickr_8k.testImages.txt, lists 1,000 photos.

3.2 Methodology

• Preprocessing of the images.

• Creating a vocabulary for the images:
  a. Loading the data.
  b. Creating a lexicon of descriptions that maps to the images.
  c. Eliminating punctuation, converting all text to lowercase, and deleting words that contain numbers.
  d. Collecting all the distinct words and constructing a vocabulary from the descriptions.
  e. Creating a descriptions.txt file to keep track of all the captions.

• Training the model on the dataset.

• Evaluating the model.

• Testing on individual images.
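The caption-cleaning steps (a–e) above can be sketched in plain Python. The tab-separated "image id, caption" line format follows the Flickr8k token file convention, and the function names are our own illustration, not the paper's code.

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, and drop words containing digits (step c)."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())

def build_descriptions(lines):
    """Map each image id to its cleaned captions (steps a-b)."""
    descriptions = {}
    for line in lines:
        image_id, caption = line.split("\t", 1)
        image_id = image_id.split(".")[0]          # strip the ".jpg#0" suffix
        descriptions.setdefault(image_id, []).append(clean_caption(caption))
    return descriptions

def build_vocabulary(descriptions):
    """Collect all distinct words across every description (step d)."""
    return {w for caps in descriptions.values() for cap in caps for w in cap.split()}

lines = ["1000268201.jpg#0\tA child in a pink dress is climbing stairs ."]
descs = build_descriptions(lines)
print(descs["1000268201"])   # ['a child in a pink dress is climbing stairs']
```

Step e then amounts to writing each `image_id` with its cleaned captions back out, one per line, to descriptions.txt.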

3.3 Preprocessing

Images and their related captions are processed independently. Each input image is passed through the Xception model from the Keras applications API, which runs on top of TensorFlow, to extract its features. Xception was pre-trained on ImageNet, so with the help of transfer learning we were able to process the images quickly. The captions are cleaned and tokenized using Keras' Tokenizer class, which fits on the text and stores it in its dictionary. Then, for each word in the vocabulary, a unique reference number is assigned.
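The word-to-index assignment performed by the tokenizer can be illustrated in a few lines of plain Python. This is a simplified stand-in for Keras' Tokenizer, not the Keras API itself: each vocabulary word gets a unique integer, most frequent words first, with index 0 conventionally reserved for padding.

```python
def fit_word_index(captions):
    """Assign a unique integer to each word, most frequent first (0 reserved for padding)."""
    counts = {}
    for cap in captions:
        for w in cap.split():
            counts[w] = counts.get(w, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def texts_to_sequences(captions, word_index):
    """Replace each word with its assigned reference number."""
    return [[word_index[w] for w in cap.split() if w in word_index] for cap in captions]

caps = ["dog runs", "dog jumps"]
idx = fit_word_index(caps)
print(idx["dog"])                        # 1 (the most frequent word)
print(texts_to_sequences(["dog jumps"], idx))
```

These integer sequences, padded to a common length, are what the LSTM consumes alongside the Xception feature vector during training.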

4. RESULTS

Figures 5–15 show the eleven test images with their generated captions; the original and predicted descriptions are compared in Table 1.

Figure 5: Image 1.

Figure 6: Image 2.

Figure 7: Image 3.

Figure 8: Image 4.

Figure 9: Image 5.

Figure 10: Image 6.

Figure 11: Image 7.

Figure 12: Image 8.

Figure 13: Image 9.

Figure 14: Image 10.

Figure 15: Image 11.


Table 1: Comparison Between Original and Predicted Values.

Image | Original Description | Predicted Description
Image1 | White crane is standing in the water | White crane is flying over the water
Image2 | Men in red shirt and black pants is walking down the snowy hill | Men in red shirt and black pants is walking down the snowy hill
Image3 | Man is snowboarding on the side of mountain | Man is snowboarding on the side of mountain
Image4 | Man is standing on the rock | Man in red shirt is standing on the rock
Image5 | Man with red helmet is riding bike on road | Man in red shirt is riding bike on the side of road
Image6 | Young girl is playing in water | Young boy is jumping into the air to climb into the water
Image7 | Ship is standing in the water | Man in red kayak is walking on the beach
Image8 | Man is kayaking in the water | Man is kayaking in the water
Image9 | Man in red shirt is sitting on the bench | Man in red shirt is walking on the street
Image10 | 5 people are standing on grass | Man in the red shirt walking on the street
Image11 | Brown dog is running through the grass | Brown dog is running through the grass
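A crude way to quantify the agreement shown in Table 1 is the word overlap between the original and predicted descriptions. This metric is our own illustration (the paper reports no numeric score); standard caption metrics such as BLEU refine the same idea.

```python
def word_overlap(original, predicted):
    """Fraction of the original caption's distinct words found in the prediction."""
    ref = set(original.lower().split())
    hyp = set(predicted.lower().split())
    return len(ref & hyp) / len(ref)

rows = [
    ("Man is kayaking in the water", "Man is kayaking in the water"),           # Image8
    ("White crane is standing in the water", "White crane is flying over the water"),  # Image1
]
for ref, hyp in rows:
    print(round(word_overlap(ref, hyp), 2))   # 1.0 and 0.71
```

Exact matches such as Image8 score 1.0, while near misses such as Image1 (standing vs. flying) keep a high but imperfect overlap, matching the qualitative pattern in the table.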

5. CONCLUSION

This paper presents the implementation of an image caption generator using a CNN-LSTM. The field is seeing a growing number of applications across the Computer Vision (CV) and NLP domains. The accuracy of the model in generating captions is limited: sometimes it may generate wrong or incomplete captions. This is due to the small dataset; by using a larger dataset of around 100,000 images, we can train more accurate models. The model depends on the dataset we use, so it cannot predict words that are outside its vocabulary. We could try other algorithms and methodologies to increase the accuracy of the generated captions. We may also include a text-to-speech converter so that the computer speaks every generated caption, for applications used by blind people. The outcomes are displayed alongside the automatically generated image captions, showcasing the system's accuracy. The demonstrated method holds promise for applications where image captioning is a crucial requirement.


REFERENCES

Alahmadi, R., Park, C. H., & Hahn, J. (2019, March). Sequence-to-sequence image caption generator. In Eleventh International Conference on Machine Vision (ICMV 2018) (Vol. 11041, pp. 85-91). SPIE. https://doi.org/10.1117/12.2523174

Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1-6). IEEE. https://doi.org/10.1109/ICEngTechnol.2017.8308186

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. https://doi.org/10.48550/arXiv.1409.0473

Bai, S., & An, S. (2018). A survey on automatic image caption generation. Neurocomputing, 311, 291-304. https://doi.org/10.1016/j.neucom.2018.05.080

Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys, 51(6), 1-36. https://doi.org/10.1145/3295748

Mathur, P., Gill, A., Yadav, A., Mishra, A., & Bansode, N. K. (2017, June). Camera2Caption: A real-time image caption generator. In 2017 International Conference on Computational Intelligence in Data Science (ICCIDS) (pp. 1-6). IEEE. https://doi.org/10.1109/ICCIDS.2017.8272660

Pranay, M., Aman, G., Aayush, Y., Anurag, M., & Nand, B. (2017). Camera2Caption: A real-time image caption generator. In International Conference on Computational Intelligence in Data Science (ICCIDS). https://doi.org/10.1109/ICCIDS.2017.8272660

Sharma, G., Kalena, P., Malde, N., Nair, A., & Parkar, S. (2019, April). Visual image caption generator using deep learning. In 2nd International Conference on Advances in Science & Technology (ICAST). http://dx.doi.org/10.2139/ssrn.3368837

Wang, H., Zhang, Y., & Yu, X. (2020). An overview of image caption generation methods. Computational Intelligence and Neuroscience, 2020. https://doi.org/10.1155/2020/3062706

Wu, Q., & Zhou, Y. (2019, May). Real-time object detection based on unmanned aerial vehicle. In 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS) (pp. 574-579). IEEE. https://doi.org/10.1109/DDCLS.2019.8908984
