This is a photo caption generator: given an image, it generates a text description of it.
1) First, features are extracted from the image dataset using a pretrained model (VGG16 in this case) and stored in a file called 'features.pkl'. This is done in 'features.py'.
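A minimal sketch of this extraction step (using the tf.keras API for portability; the repository's 'features.py' targets standalone Keras 2.2.4, and the function names here are illustrative):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import img_to_array, load_img


def build_extractor(weights="imagenet"):
    """VGG16 with the final classification layer removed, so the model
    outputs the 4096-dimensional activations of the fc2 layer."""
    base = VGG16(weights=weights)
    return Model(inputs=base.input, outputs=base.layers[-2].output)


def extract_feature(model, image_path):
    """Load one image, preprocess it for VGG16, and return its
    4096-dimensional feature vector."""
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return model.predict(x, verbose=0)[0]


# Usage (sketch): build the dict of features and pickle it, e.g.
#   features = {image_id: extract_feature(model, path) for image_id, path in dataset}
#   pickle.dump(features, open("features.pkl", "wb"))
```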
2) Then the text is cleaned so that it is easier for the model to learn. Cleaning involves removing all punctuation, converting uppercase letters to lowercase, and removing one-letter words; the description of each image is then stored in 'descriptions.txt'. This is done in 'text.py'.
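The cleaning step can be sketched in plain Python (a simplified stand-in for the logic in 'text.py'; the function name is illustrative):

```python
import string


def clean_description(desc):
    """Lowercase the text, strip punctuation from each word,
    and drop one-letter words."""
    table = str.maketrans("", "", string.punctuation)
    words = [w.translate(table) for w in desc.lower().split()]
    return " ".join(w for w in words if len(w) > 1)


# Example: clean_description("A dog, RUNNING fast!") -> "dog running fast"
```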
3) For the model to start generating we need a first kick-off word, and for the sentence to end we need a last kick-off word, so "startseq" is added at the beginning of each description and "endseq" at the end. Each word is assigned a number, and hence every sentence is converted into a vector. This is done using Keras's built-in Tokenizer. The created tokenizer is then stored in "tokenizer.pkl". This is done in "tok.py".
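The mapping that Keras's Tokenizer builds can be sketched in plain Python (a simplified stand-in, not the real Tokenizer: most frequent words get the lowest indices, and index 0 is reserved, as in Keras):

```python
from collections import Counter


def fit_tokenizer(descriptions):
    """Assign each word an integer index, most frequent words first
    (index 0 is reserved for padding, mirroring Keras's Tokenizer)."""
    counts = Counter(w for d in descriptions for w in d.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def texts_to_sequences(word_index, descriptions):
    """Convert each description into its vector of word indices."""
    return [[word_index[w] for w in d.split() if w in word_index]
            for d in descriptions]
```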
4) Then sequences are created, because for the LSTM to work we need to divide each sentence into prefix arrays. For example, the sentence "startseq dog is running through the grass endseq" is divided into:
x1      x2 (text sequence)                          y (next word)
photo   startseq                                    dog
photo   startseq dog                                is
photo   startseq dog is                             running
photo   startseq dog is running                     through
photo   startseq dog is running through             the
photo   startseq dog is running through the         grass
photo   startseq dog is running through the grass   endseq
Each prefix is then converted into an integer sequence using the previously created tokenizer, and finally it is fed into the neural network.
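The prefix splitting above can be sketched as follows (names are illustrative; the real code would also pad each prefix to the maximum caption length before training):

```python
def create_sequences(word_index, caption, photo_feature):
    """Split one encoded caption into (photo, prefix, next-word)
    training pairs, as in the table above."""
    seq = [word_index[w] for w in caption.split() if w in word_index]
    pairs = []
    for i in range(1, len(seq)):
        # x1 = photo feature, x2 = prefix so far, y = next word
        pairs.append((photo_feature, seq[:i], seq[i]))
    return pairs
```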
The output of VGG16 is a 4096-dimensional vector, which is passed through a Dense layer of size 256 to give a 256-dimensional output. The language model takes input sequences of length 34, which are fed into an Embedding layer that outputs 256-dimensional vectors; these go through the LSTM decoder. The two 256-dimensional representations are merged and passed through a Dense layer of size 256, and a final Dense layer with a softmax activation over the vocabulary makes the prediction of the next word.
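The architecture described above can be sketched as follows (using the tf.keras API for portability; layer sizes follow the text, while the vocabulary size and the merge-by-addition step are assumptions typical of this kind of model):

```python
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input, add
from tensorflow.keras.models import Model


def define_model(vocab_size, max_length=34):
    # Photo feature branch: 4096-d VGG16 vector -> 256-d representation.
    inputs1 = Input(shape=(4096,))
    fe = Dense(256, activation="relu")(inputs1)

    # Language branch: padded word-index sequence -> embedding -> LSTM.
    inputs2 = Input(shape=(max_length,))
    se = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se = LSTM(256)(se)

    # Merge both 256-d vectors, then predict the next word over the vocabulary.
    decoder = add([fe, se])
    decoder = Dense(256, activation="relu")(decoder)
    outputs = Dense(vocab_size, activation="softmax")(decoder)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```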
Python 3
Keras 2.2.4 (GPU version, with CUDA and cuDNN installed)
TensorFlow 1.9.0
NumPy
Graphics card: GeForce GTX 1050 Ti (4 GB)
RAM: 16 GB
- Clone the repository.
- Change directory to the directory where generate_caption.py is located.
- Download the pretrained model and place it in the current working directory.
- To generate a caption, enter the following command:

  python generate_caption.py /path/to/image/
The trained model can be found at model.
- CS231n report: http://cs231n.stanford.edu/reports/2016/pdfs/362_Report.pdf
- Andrej Karpathy talk: https://cs.stanford.edu/people/karpathy/sfmltalk.pdf
- Machine Learning Mastery: https://machinelearningmastery.com/