
Artificial Intelligence System for Humans

Capstone Project Report

Submitted by:
(101403022) Ajay Kumar Chhimpa
(101403023) Akash Gupta
(101403024) Akash Kumar Sikarwar
(101583005) Ayush Garg

BE Third Year, CSE


Lab Group: COE1, Project Team No. _____

Under the Mentorship of 


Dr. Sanmeet Kaur
Assistant Professor, CSED, Thapar University

Computer Science and Engineering Department


Thapar University, Patiala
May 2017
Introduction

Aim

The aim of this project is to develop a Digital assistant that can generate descriptive
captions for images using neural language models. The Digital assistant answers the
user's questions, which are given as spoken commands.

Intended audience
This project can act as a form of vision for visually impaired people, as it can identify
nearby objects through the camera and give the output in audio form. The app provides a
highly interactive platform for specially abled people.

Project Scope

The goal is to design an Android application that covers all the functions of image
description and provides a Digital assistant interface to the user. The Digital assistant
answers the user's questions, which are given as spoken commands.
By using deep learning techniques, the project performs:

 Image Captioning: recognising the different types of objects in an image and creating a

meaningful sentence that describes the image to visually impaired persons.

 Text-to-speech conversion.

 Speech-to-text conversion, and identifying the result for the user's query (a sketch of
the speech interface follows this list).
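
To make the last two functions concrete, here is a minimal desktop Python sketch of the
speech interface, using the pyttsx3 and SpeechRecognition libraries as stand-ins; on the
Android app itself the platform's TextToSpeech and SpeechRecognizer APIs would play these
roles.

```python
# Minimal sketch of the speech interface using desktop Python libraries
# (pyttsx3 and SpeechRecognition) as stand-ins for the Android APIs.
import pyttsx3
import speech_recognition as sr

def speak(text):
    """Text-to-speech: read a generated caption or answer aloud."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def listen():
    """Speech-to-text: capture a spoken query from the microphone."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    # Google's free web recognizer; raises sr.UnknownValueError on failure.
    return recognizer.recognize_google(audio)

if __name__ == "__main__":
    query = listen()
    speak("You said: " + query)
```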

Approach used in carrying out the project objectives

Deep learning is used extensively to recognize images and to generate captions. In
particular, a Convolutional Neural Network (CNN) is used to recognize objects in an image,
and a variation of the Recurrent Neural Network, Long Short-Term Memory (LSTM), is used
to generate sentences.
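
A minimal Keras sketch of this encoder-decoder arrangement is shown below; the VGG16
encoder, the 256-dimensional layer sizes, and the vocab_size/max_len values are
illustrative assumptions, not the project's fixed configuration.

```python
# Sketch of the captioning model: a pretrained CNN encodes the image,
# and an LSTM decodes the caption word by word.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 34   # hypothetical vocabulary size and caption length

# Encoder: pretrained VGG16; its 4096-d fc2 layer serves as the image feature.
cnn = VGG16(weights='imagenet')
encoder = Model(cnn.input, cnn.layers[-2].output)

# Decoder input 1: the encoded image, projected to 256 dimensions.
img_input = Input(shape=(4096,))
img_feat = Dense(256, activation='relu')(Dropout(0.5)(img_input))

# Decoder input 2: the caption generated so far, run through an LSTM.
seq_input = Input(shape=(max_len,))
seq_feat = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(seq_input))

# Merge both branches and predict the next word; generation repeats this step.
hidden = Dense(256, activation='relu')(add([img_feat, seq_feat]))
output = Dense(vocab_size, activation='softmax')(hidden)

captioner = Model([img_input, seq_input], output)
captioner.compile(loss='categorical_crossentropy', optimizer='adam')
```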
Gantt Chart:
Literature Review
Generating captions for images is an intriguing task lying at the intersection of
Computer Vision and Natural Language Processing. This task is central to the problem of
understanding a scene.

The purpose of this model is to encode the visual information from an image and the
semantic information from a caption into an embedding space; this embedding space has the
property that vectors close to each other are visually or semantically related. For a
batch of images and captions, we can use the model to map them all into this embedding
space, compute a distance metric, and find the nearest neighbours of each image and each
caption. Ranking the neighbours by which examples are closest ranks how relevant the
images and captions are to each other (a minimal sketch follows).
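
A minimal NumPy sketch of this retrieval step, assuming the images and captions have
already been encoded into a common 256-dimensional space (the arrays here are random
stand-ins, and cosine similarity is one common choice of distance metric):

```python
# Sketch of nearest-neighbour retrieval in the joint embedding space.
import numpy as np

def rank_captions(img_emb, cap_emb):
    """For each image vector, rank all caption vectors by cosine similarity."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)
    sim = img @ cap.T                # pairwise cosine similarities
    return np.argsort(-sim, axis=1)  # caption indices, best match first

images = np.random.randn(5, 256)    # stand-in for encoded images
captions = np.random.randn(5, 256)  # stand-in for encoded captions
print(rank_captions(images, captions)[:, 0])  # nearest caption per image
```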


Requirement Analysis:
Use Case Diagram:

Use Case Templates:

Use Case: User Login

Id: UC-001

Description:
The user enters a username and password for authentication.
Level: Low Level

Primary Actor:
Application User

Pre-Conditions:
 User should be registered.
 User should have entered the username and password.

Post Conditions:

Success end condition:

The user is successfully authenticated.

Failure end condition:


 User’s username may be incorrect.
 User’s password may be incorrect.
 User may not be registered.

Minimal Guarantee:
The user's username and password are stored encrypted (a hashing sketch follows this use
case).

Trigger:
Unauthorized user opens the app.

Main Success Scenario


1. Open the app.

2. If not logged in:

a. Enter the username and password.

b. Hit Login.

Otherwise, the user is automatically logged in.

Frequency:
Once, unless logged out.
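
The minimal guarantee above promises that stored credentials are protected; a common way
to realise this is a salted one-way hash rather than reversible encryption. A minimal
sketch using Python's standard library (the function names are illustrative):

```python
# Sketch of the "credentials are never stored in plain text" guarantee,
# using a salted PBKDF2 hash from Python's standard library.
import hashlib, hmac, os

def hash_password(password, salt=None):
    """Derive a salted digest; store (salt, digest), never the password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Re-derive the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)
```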

Use Case: User registration


Id: UC-002

Description:
The user creates an account in the application.

Level: Low Level

Primary Actor:
Application User

Pre-Conditions:
 The app is opened and no other user is logged in.
 The user information in the registration form is valid.

Post Conditions

Success end condition:

The user is successfully registered.

Failure end condition:


User does not get registered.

Minimal Guarantee:
 The user is registered only with valid details.
 Two users cannot register with the same username (a sketch of this check follows this
use case).

Trigger:
Unauthorized user opens the app.

Main Success Scenario


1. Open the app.

2. If not logged in:

i. Hit the Create Account button.

ii. Enter details.

iii. Hit Register.

Otherwise, log out the current user and follow step 2 above.

Frequency:
Once, unless another user wants to create an account.
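
The uniqueness guarantee above is typically enforced by the data store itself; a minimal
sketch using a UNIQUE primary key in SQLite (the schema and function names are
illustrative, not the app's actual backend):

```python
# Sketch of the "two users can't register with the same username" guarantee,
# enforced here by a PRIMARY KEY (hence UNIQUE) constraint in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, pw_digest BLOB)")

def register(username, pw_digest):
    """Insert a new user; reject duplicates via the UNIQUE constraint."""
    try:
        conn.execute("INSERT INTO users VALUES (?, ?)", (username, pw_digest))
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # duplicate username
        return False
```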
Use Case: Image upload by the user.

Id: UC-003

Description:
The user selects an image from the phone gallery or captures one through the camera.

Level: User Goal

Primary Actor:
Application User

Pre-Conditions:
User must be logged in.

Post Conditions

Success end condition:

A valid image file is selected and uploaded successfully.

Failure end condition:

An invalid file is selected, so the file is not uploaded.

Minimal Guarantee:
The file is uploaded only if it is valid (a validity-check sketch follows this use case).

Trigger:
User starts the Image Captioning process by clicking the Image Captioning button.

Main Success Scenario


1. Open the app.

2. Click Image Captioning Button.

3. Upload the Image successfully.

Frequency:
About 10 times per hour.
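
The minimal guarantee above hinges on a validity check before upload; a minimal sketch
using Pillow, assuming "valid" means the file parses as an image:

```python
# Sketch of the upload-validity check: only files Pillow can parse as
# images are accepted for captioning.
from PIL import Image

def is_valid_image(path):
    """Return True if the file at `path` is a readable image."""
    try:
        with Image.open(path) as img:
            img.verify()  # cheap integrity check, no full decode
        return True
    except Exception:
        return False
```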
Section 4: Design Specifications

Flowchart of the proposed system


