PDF Image Caption Technical Report - Compress
Submitted by:
(101403022) Ajay Kumar Chhimpa
(101403023) Akash Gupta
(101403024) Akash Kumar Sikarwar
(101583005) Ayush Garg
Aim
The aim of this project is to develop a digital assistant that can generate descriptive
captions for images using neural language models. A digital assistant helps the user by
answering questions that are given in speech form as commands.
Intended audience
This project can act as vision for visually impaired people: it can identify nearby
objects through the camera and give the output in audio form. The app provides a highly
interactive platform for specially abled people.
Project Scope
The goal is to design an Android application that covers all the functions of image
description and provides a digital assistant interface to the user. A digital assistant
helps the user by answering questions that are given in speech form as commands.
Using deep learning techniques, the project performs:
The purpose of this model is to encode the visual information from an image and the semantic
information from a caption into an embedding space; this embedding space has the property
that vectors that are close to each other are visually or semantically related. For a batch of
images and captions, we can use the model to map them all into this embedding space,
compute a distance metric, and for each image and for each caption find its nearest neighbors.
If you rank the neighbors by which examples are closest, you have ranked how relevant
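The ranking step described above can be sketched as follows. This is a minimal illustration only: it assumes the embeddings have already been produced by some model, it uses cosine similarity as the distance metric (the report does not name one), and the vectors and function names (`cosine_similarity`, `rank_neighbors`) are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(query, candidates):
    """Return candidate indices ordered from most to least similar to query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings (made-up values, not outputs of the project's model).
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
caption_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.6]]

# For each image, rank all captions; for each caption, rank all images.
captions_per_image = [rank_neighbors(img, caption_embeddings)
                      for img in image_embeddings]
images_per_caption = [rank_neighbors(cap, image_embeddings)
                      for cap in caption_embeddings]
```

With these toy vectors, the first image ranks caption 0 first and the second image ranks caption 1 first, because those caption vectors point in nearly the same direction as the corresponding image vectors.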
Use Case: User login.
Description:
The user enters the username and password for authentication.
Level: Low Level
Primary Actor:
Application User
Pre-Conditions:
User should be registered.
User should have entered the username and password.
Post Conditions:
Minimal Guarantee:
The user's username and password are encrypted.
Trigger:
Unauthorized user opens the app.
Frequency:
Once, unless the user logs out.
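The minimal guarantee for the password could be met in several ways; the report does not say which. The sketch below assumes one common approach, salted password hashing with PBKDF2 from Python's standard library, instead of reversible encryption. The iteration count and the function names are assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for the target hardware


def hash_password(password, salt=None):
    """Derive a digest from the password; generate a random salt if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# At registration: store (salt, digest), never the plain password.
stored_salt, stored_digest = hash_password("example-password")

# At login: check the password the user typed against the stored values.
login_ok = verify_password("example-password", stored_salt, stored_digest)
```

Only the salt and digest are stored, so the original password cannot be read back from the stored values.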
Use Case: User registration.
Description:
The user creates an account in the application.
Primary Actor:
Application User
Pre-Conditions:
The app is open and no other user is logged in.
The user information entered in the registration form is valid.
Post Conditions:
Minimal Guarantee:
The user is registered only if the details are valid.
Two users cannot register with the same username.
Trigger:
i. Unauthorized user opens the app.
ii. Enter details.
iii. Hit Register.
Frequency:
Once, unless another user wants to create an account.
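The two registration guarantees (valid details only, unique usernames) can be illustrated with a small sketch. The validation rules here (allowed username characters, minimum password length) are assumptions, since the report does not define what counts as valid, and the in-memory dictionary stands in for whatever user store the application actually uses.

```python
import re

# Assumed rules for illustration; the report does not specify them.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,20}")
MIN_PASSWORD_LENGTH = 8


class RegistrationError(Exception):
    """Raised when a registration request is rejected."""


def register(users, username, password):
    """Add a user to the `users` dict, enforcing validity and uniqueness."""
    if not USERNAME_PATTERN.fullmatch(username):
        raise RegistrationError("invalid username")
    if len(password) < MIN_PASSWORD_LENGTH:
        raise RegistrationError("password too short")
    if username in users:
        raise RegistrationError("username already taken")
    # A real application would store a salted hash, not the password itself.
    users[username] = password


users = {}
register(users, "alice_01", "long-enough-password")

try:
    register(users, "alice_01", "another-password")
    duplicate_rejected = False
except RegistrationError:
    duplicate_rejected = True
```

The second call is rejected because the username is already present, which corresponds to the guarantee that two users cannot share a username.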
Use Case: Image upload by the user.
Description:
The user selects an image from the phone gallery or captures one with the
camera.
Level: User Goal
Primary Actor:
Application User
Pre-Conditions:
User must be logged in.
Post Conditions:
Minimal Guarantee:
The file is uploaded only if it is valid.
Trigger:
The user starts the image captioning process by clicking the Image Captioning button.
Frequency:
About 10 times per hour.
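The report does not define what makes an uploaded file valid. As one possible interpretation, the sketch below accepts a file only if it is non-empty, under an assumed size limit, and starts with the standard JPEG or PNG file signature. The size limit and function names are assumptions for illustration.

```python
JPEG_SIGNATURE = b"\xff\xd8\xff"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit of 10 MiB


def detect_image_type(data):
    """Return 'jpeg' or 'png' based on the leading bytes, else None."""
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    if data.startswith(PNG_SIGNATURE):
        return "png"
    return None


def is_valid_upload(data):
    """Accept only non-empty, size-limited files with a known image signature."""
    if not 0 < len(data) <= MAX_UPLOAD_BYTES:
        return False
    return detect_image_type(data) is not None


# Example: a byte string that begins like a PNG file passes the check,
# while arbitrary text does not.
png_like = PNG_SIGNATURE + b"\x00" * 16
accepted = is_valid_upload(png_like)
rejected = is_valid_upload(b"not an image")
```

A signature check only confirms the file type claimed by the first bytes; fully decoding the image would be a stricter test.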
Section 4: Design Specifications
[1] COLLOBERT, R., WESTON, J., BOTTOU, L., KARLEN, M., KAVUKCUOGLU, K.,
AND KUKSA, P. Natural language processing (almost) from scratch. The Journal of Machine
[3] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image
descriptions. arXiv preprint arXiv:1412.2306 (2014).
[4] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. Multimodal neural language models.
In Proceedings of the 31st International Conference on Machine Learning (ICML-14)
(2014), T. Jebara and E. P. Xing, Eds., JMLR Workshop and Conference Proceedings,
pp. 595–603.
[5] SIMONYAN, K., AND ZISSERMAN, A. Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014).