
Image-Captioning-Keras

A photo caption generator that produces a text description of a given image.

How it works

1. First, features are extracted from the image dataset using a pretrained model (VGG16 in this case) and stored in a file called 'features.pkl'. This is done in 'features.py'.
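As a rough sketch, this step might look like the following (the dataset path and helper name are assumptions, not necessarily what 'features.py' uses):

```python
import os
import pickle

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img

def extract_features(directory):
    # Drop VGG16's final classification layer so the 4096-d fc2 output is kept
    vgg = VGG16()
    model = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)
    features = {}
    for name in os.listdir(directory):
        image = load_img(os.path.join(directory, name), target_size=(224, 224))
        array = preprocess_input(img_to_array(image).reshape((1, 224, 224, 3)))
        features[name.split('.')[0]] = model.predict(array, verbose=0)[0]
    return features

features = extract_features('Flickr8k_Dataset')  # dataset path is an assumption
pickle.dump(features, open('features.pkl', 'wb'))
```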

2. The text is then cleaned so that it is easier for the model to learn. Cleaning involves removing all punctuation, converting uppercase letters to lowercase, and removing one-letter words; the description of each image is then stored in 'descriptions.txt'. This is done in 'text.py'.
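The cleaning rules described above amount to something like this (a sketch; the function name is illustrative):

```python
import string

def clean_description(desc):
    # Lowercase, strip punctuation, and drop one-letter words
    table = str.maketrans('', '', string.punctuation)
    words = desc.lower().translate(table).split()
    return ' '.join(word for word in words if len(word) > 1)

print(clean_description('A dog runs, quickly!'))  # -> 'dog runs quickly'
```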

3. For the model to start generating we need a kick-off word, and for the sentence to end we need a closing word, so "startseq" is added at the beginning of each caption and "endseq" at the end. Every word is assigned a number, so every sentence is converted into a vector of integers. This is done using Keras's built-in Tokenizer, and the created tokenizer is stored in 'tokenizer.pkl'. This is done in 'tok.py'.
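A short sketch of this step (the caption list is a stand-in for the real descriptions):

```python
import pickle

from keras.preprocessing.text import Tokenizer

# Stand-in for the cleaned descriptions, already wrapped in startseq/endseq
captions = ['startseq dog is running through the grass endseq']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1  # +1 because word ids start at 1

print(tokenizer.texts_to_sequences(captions))  # every word becomes an integer id
pickle.dump(tokenizer, open('tokenizer.pkl', 'wb'))
```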

4. Sequences are then created, because the LSTM needs each sentence divided into prefix arrays. For example, the sentence "startseq dog is running through the grass endseq" is divided into:

   x1      x2                                            y
   photo   startseq                                      dog
   photo   startseq dog                                  is
   photo   startseq dog is                               running
   photo   startseq dog is running                       through
   photo   startseq dog is running through               the
   photo   startseq dog is running through the           grass
   photo   startseq dog is running through the grass     endseq

Each prefix is then converted into an integer sequence using the previously created tokenizer and finally fed into the neural network, as sketched below.
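A sketch of how the (photo, prefix, next-word) triples from the table above could be built; the max_length of 34 matches the input size mentioned below, while the function and argument names are assumptions:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def create_sequences(tokenizer, max_length, caption, photo_feature, vocab_size):
    X1, X2, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        prefix, next_word = seq[:i], seq[i]
        # Pad every prefix to a fixed length so the LSTM sees equal-sized inputs
        prefix = pad_sequences([prefix], maxlen=max_length)[0]
        X1.append(photo_feature)  # x1: the 4096-d image feature vector
        X2.append(prefix)         # x2: the caption prefix as word ids
        y.append(to_categorical([next_word], num_classes=vocab_size)[0])  # y: next word
    return np.array(X1), np.array(X2), np.array(y)
```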
The output of VGG16 is a 4096-dimensional vector, which is processed by a Dense layer of size 256 to give a 256-dimensional output. The language model expects an input sequence of length 34, which is fed into an Embedding layer that outputs 256-dimensional vectors; these are decoded by an LSTM. The two branches are merged and passed through a Dense layer of size 256, and a final Dense layer with a softmax activation makes the prediction of the next word.
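Put together, the description above corresponds to a Keras merge model along these lines (a sketch, assuming vocab_size comes from the saved tokenizer):

```python
from keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from keras.models import Model

def define_model(vocab_size, max_length=34):
    # Image branch: 4096-d VGG16 features compressed to 256-d
    inputs1 = Input(shape=(4096,))
    fe = Dense(256, activation='relu')(Dropout(0.5)(inputs1))
    # Text branch: a length-34 sequence of word ids, embedded and decoded by an LSTM
    inputs2 = Input(shape=(max_length,))
    se = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se = LSTM(256)(Dropout(0.5)(se))
    # Merge the branches, then predict the next word with a softmax over the vocabulary
    decoder = Dense(256, activation='relu')(add([fe, se]))
    outputs = Dense(vocab_size, activation='softmax')(decoder)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```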

Requirements

Python 3
Keras (2.2.4), GPU version with CUDA and cuDNN installed
TensorFlow (1.9.0)
NumPy
Graphics card: GeForce GTX 1050 Ti (4 GB)
RAM: 16 GB

Network Structure

[model.jpg: diagram of the network architecture]

Using the Caption generator

  1. Clone the repository

  2. Change to the directory where generate_caption.py is located.

  3. Download the pretrained model and place it in the current working directory.

  4. To generate a caption, enter the following command:

     python generate_caption.py /path/to/image/
    

Result

Image: 390671130_09fdccd52f

Generated text: startseq dog is running through the grass endseq

Image: bike

Generated text: startseq man in red helmet is riding bike endseq

Image: fight

Generated text: startseq two men are playing soccer on the grass endseq

Image: play

Generated text: startseq two girls are playing instruments endseq

The trained model can be found at model.

References

CS 231n: http://cs231n.stanford.edu/reports/2016/pdfs/362_Report.pdf
Andrej Karpathy talk: https://cs.stanford.edu/people/karpathy/sfmltalk.pdf
Machine Learning Mastery: https://machinelearningmastery.com/
