PROJECT REPORT
ON
AN EFFICIENT DEEP LEARNING BASED HYBRID MODEL FOR
IMAGE CAPTION GENERATION
Submitted in partial fulfilment of the requirement for the award of the Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING (AI & ML)
BY
P.SAI PRAKASH 20N71A6628
S.HARIKA 20N71A6634
N.SHRAVAN KUMAR 21N75A6604
DRK INSTITUTE OF SCIENCE AND TECHNOLOGY
CERTIFICATE
This is to certify that the project report entitled “AN EFFICIENT DEEP LEARNING
BASED HYBRID MODEL FOR IMAGE CAPTION GENERATION” is submitted by
P.SAI PRAKASH (20N71A6628), S.HARIKA (20N71A6634), and N.SHRAVAN KUMAR
(21N75A6604) in partial fulfillment of the requirements for the award of the B.Tech. Degree
in COMPUTER SCIENCE AND ENGINEERING (AI & ML), JNTUH University
Hyderabad, for the academic year 2023-2024.
EXTERNAL EXAMINER
DECLARATION
We hereby declare that the project report entitled “AN EFFICIENT DEEP LEARNING
BASED HYBRID MODEL FOR IMAGE CAPTION GENERATION”, submitted to the
Department of COMPUTER SCIENCE AND ENGINEERING (AI & ML) in partial
fulfillment of the requirements for the award of the degree of BACHELOR OF
TECHNOLOGY, is the result of our own effort and that it has not been submitted
to any other University or Institution for the award of any degree or diploma other than
specified above.
S.HARIKA 20N71A6634
ACKNOWLEDGMENT
This report would not be complete without mentioning certain individuals whose
guidance and encouragement have been of immense help in completing this thesis.
We express a deep sense of gratitude to our guide, Mr. K.PRAVEEN, Assistant
Professor, Department of CSE (AI & ML), for his able guidance and cooperation
throughout our project. We are highly grateful to him for providing all the facilities for
the completion of the project work.
We are very thankful to Mr. K.PRAVEEN, Head of the Department of CSE, for
providing the necessary resources for the successful completion of the project work.
We would like to thank our parents and friends, who have the greatest
contribution in all our achievements, for their great care and blessings in making us
successful in all our endeavors.
S.HARIKA 20N71A6634
ABSTRACT
In recent years, with the increase in the use of different social media platforms, image
captioning approaches play a major role in automatically describing a whole image in a natural
language sentence. Image captioning plays a significant role in a computer-based society. It is
the process of automatically generating a natural language textual description of an image
using artificial intelligence techniques. Computer vision and natural language processing are
the key aspects of the image processing system: a Convolutional Neural Network (CNN), a part
of computer vision, is used for object detection and feature extraction, while Natural Language
Processing (NLP) techniques help in generating the textual caption of the image. Generating a
suitable image description by machine is a challenging task, as it is based on object detection,
localization, and the objects' semantic relationships expressed in a human-understandable
language such as English. In this paper, our aim is to develop an encoder-decoder based hybrid
image captioning approach using VGG16, ResNet50, and YOLO. VGG16 and ResNet50 are
pre-trained feature extraction models trained on millions of images, while YOLO is used for
real-time object detection. The approach first extracts the image features using VGG16,
ResNet50, and YOLO and concatenates the results into a single file. Finally, LSTM and BiGRU
are used to generate the textual description of the image. The proposed model is evaluated
using BLEU, METEOR, and ROUGE scores.
TABLE OF CONTENTS
CONTENTS
1. INTRODUCTION
2. SYSTEM SPECIFICATIONS
   2.1 HARDWARE REQUIREMENTS
   2.2 SOFTWARE REQUIREMENTS
3. SOFTWARE AND HARDWARE SPECIFICATIONS
   3.1 REQUIREMENT ANALYSIS
   3.2 REQUIREMENT SPECIFICATIONS
       3.2.1 Functional Requirements
       3.2.2 Software Requirements
       3.2.3 Hardware Requirements
4. LITERATURE SURVEY
5. SYSTEM ANALYSIS
   5.1 EXISTING SYSTEM
   5.2 PROPOSED SYSTEM
6. MODULES
   6.1 MODULES
   6.2 MODULES DESCRIPTION
7. SYSTEM DESIGN
   7.1 SYSTEM ARCHITECTURE
   7.2 DATA FLOW DIAGRAM
   7.3 UML DIAGRAM
       7.3.1 USE CASE DIAGRAM
       7.3.2 CLASS DIAGRAM
       7.3.3 SEQUENCE DIAGRAM
       7.3.4 ACTIVITY DIAGRAM
8. SOURCE CODE
9. SYSTEM STUDY
   9.1 FEASIBILITY STUDY
       9.1.1 ECONOMICAL FEASIBILITY
       9.1.2 TECHNICAL FEASIBILITY
       9.1.3 SOCIAL FEASIBILITY
10. SYSTEM TEST
    10.1 TYPES OF TESTS
    10.2 TEST CASES
11. OUTPUT SCREENS
12. CONCLUSION
13. FURTHER ENHANCEMENTS
14. REFERENCES
CHAPTER – 1
INTRODUCTION
In this World Wide Web era, every day we all experience a huge number of images in the real
world, which individual human beings interpret on their own using their wisdom. Humans are
naturally programmed to convert a natural scene into text, but it is a complex task for machines,
as they are not as efficient as humans. Still, human-generated captions are considered better,
as machines need human intervention and must be programmed accordingly for better results.
Due to recent developments in deep learning-based techniques, computers are capable of
handling the challenges of image captioning, such as detection of objects, attributes, and their
relationships, image feature extraction, and generation of syntactically and semantically correct
image captions [1]. With the advancement of AI, many new ideas have revolutionized the area
of image processing, and it has transformed the world in a surprising way. The image captioning
approach (Fig. 1) has wide application in the real world, as it provides a better platform for
human-computer interaction. Due to its emerging applications in image processing, image
captioning has become a topic of interest for academicians and researchers. Looking at the
picture in Fig. 2, someone may guess that two dogs are playing with a toy, someone might say
two dogs are hauling in a floating toy from the ocean, and another that two dogs run through
the water with a rope in their mouths; all of these captions are appropriate to describe this
picture. Our brain is so well trained and advanced that it can describe a picture almost
accurately, but the same is not the case with machines. Hence, the main aim of image captioning
is to first identify the different objects and their relationships present in the image using deep
learning-based techniques, then generate the textual description using natural language
processing, and finally evaluate the performance of the natural language description using
different performance metrics. Object detection and segmentation are part of computer vision
and are done with the help of the popular CNN and DNN, while generating the image
description (Fig. 3) is part of natural language processing and is done by RNN and LSTM.
CNN works on understanding the objects of the image or scene and provides answers to various
questions about the objects in the image, such as what, where, and how.
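To make the pipeline concrete, the following is a minimal sketch of the CNN-encoder / RNN-decoder idea described above, assuming Keras. The vocabulary size, feature dimension, and caption length are illustrative placeholders rather than values taken from the proposed model.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000   # assumed vocabulary size
max_len = 34        # assumed maximum caption length
feat_dim = 4096     # e.g. the VGG16 fully-connected feature size

# Encoder branch: pre-extracted CNN image features
img_in = Input(shape=(feat_dim,))
img_emb = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Decoder branch: the partial caption, fed word by word
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_feat = LSTM(256)(txt_emb)

# Merge both branches and predict the next word of the caption
merged = add([img_emb, txt_feat])
out = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')

At inference time, the caption is generated word by word: the model is repeatedly fed the image features plus the words generated so far until an end-of-sequence token is produced.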
CHAPTER - 2
SYSTEM SPECIFICATIONS
❖ System : Intel i3
❖ RAM : 4 GB
❖ Designing : HTML, CSS, JavaScript
CHAPTER - 3
SOFTWARE AND HARDWARE SPECIFICATIONS
3.1 REQUIREMENT ANALYSIS
The project involved analyzing the design of a few applications so as to make the
application more user friendly. To do so, it was really important to keep the navigation from
one screen to the other well ordered and at the same time reduce the amount of typing the
user needs to do. In order to make the application more accessible, the browser version had to
be chosen so that it is compatible with most browsers.
3.2 REQUIREMENT SPECIFICATIONS
For developing the application, the following are the Software Requirements:
1. Python
2. Django
3. Windows 10 64-bit OS
For developing the application, the following are the Hardware Requirements:
▪ Processor: Intel i3
▪ RAM: 4 GB
▪ Space on Hard Disk: minimum 1 TB
CHAPTER - 4
LITERATURE SURVEY
Language Models based on recurrent neural networks have dominated recent image caption
generation tasks. In this paper, we introduce a Language CNN model which is suitable for
statistical language modeling tasks and shows competitive performance in image captioning.
In contrast to previous models which predict next word based on one previous word and hidden
state, our language CNN is fed with all the previous words and can model the long-range
dependencies of history words, which are critical for image captioning. The effectiveness of
our approach is validated on two datasets, MS COCO and Flickr30K. Our extensive
experimental results show that our method outperforms the vanilla recurrent neural network-
based language models and is competitive with the state-of-the-art methods.
Image captioning is an important but challenging task, applicable to virtual assistants, editing
tools, image indexing, and support of the disabled. Its challenges are due to the variability and
ambiguity of possible image descriptions. In recent years significant progress has been made
in image captioning, using Recurrent Neural Networks powered by long short-term memory
(LSTM) units. Despite mitigating the vanishing gradient problem, and despite their compelling
ability to memorize dependencies, LSTM units are complex and inherently sequential across
time. To address this issue, recent work has shown benefits of convolutional networks for
machine translation and conditional image generation. Inspired by their success, in this paper,
we develop a convolutional image captioning technique. We demonstrate its efficacy on the
challenging MSCOCO dataset and demonstrate performance on par with the baseline, while
having a faster training time per number of parameters. We also perform a detailed analysis,
providing compelling reasons in favor of convolutional language generation approaches.
Image captioning is a fundamental task which requires semantic understanding of images and
the ability of generating description sentences with proper and correct structure. In
consideration of the problem that language models are always shallow in modern image caption
frameworks, a deep residual recurrent neural network is proposed in this work with the
following two contributions. First, an easy-to-train deep stacked Long Short Term Memory
(LSTM) language model is designed to learn the residual function of output distributions by
adding identity mappings to multi-layer LSTMs. Second, in order to overcome the over-fitting
problem caused by larger-scale parameters in deeper LSTM networks, a novel temporal
Dropout method is proposed for LSTM. The experimental results on the benchmark
MSCOCO and Flickr30K datasets demonstrate that the proposed model achieves state-of-
the-art performance, with 101.1 CIDEr on MSCOCO and 22.9 BLEU-4 on Flickr30K,
respectively.
CHAPTER - 5
SYSTEM ANALYSIS
5.1 EXISTING SYSTEM
• MFCC + Softmax Regression: Extract MFCC features and feed them into a softmax
regression model for genre classification.
• CQT + Softmax Regression: Use the Constant-Q Transform instead of the STFT to get
spectrogram features, and feed them into softmax regression.
• FFT + Softmax Regression: Take the FFT directly on the audio and feed the amplitude
spectrum into softmax regression.
• MFCC + MLP: Use MFCCs as input to a multilayer perceptron (MLP) model with a
softmax output for classification.
• CQT + MLP: Use the CQT spectrogram as input to an MLP model.
• FFT + MLP: Use the FFT amplitude spectrum as input to an MLP.
But they did not use convolutional neural networks or other deep learning approaches. The
input features were hand-engineered rather than learned.
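As an illustration, a hedged sketch of the first baseline above (MFCC + softmax regression) is given below, assuming librosa for feature extraction and scikit-learn's LogisticRegression as the softmax classifier; X_paths and y_genres stand for a labelled list of audio clips and are assumptions for illustration, not names from the paper.

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mfcc_features(path, n_mfcc=13):
    # Summarise the frame-wise MFCCs of a 30-second clip by their mean and std
    y, sr = librosa.load(path, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

X = np.array([mfcc_features(p) for p in X_paths])  # X_paths: assumed clip paths
clf = LogisticRegression(max_iter=1000)            # softmax over the genre labels
clf.fit(X, y_genres)                               # y_genres: assumed genre labels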
Based on the typical audio feature extraction and classification approaches used in the existing
systems described in the paper, some potential disadvantages or limitations could be:
• Hand-crafted audio features like MFCC may not capture all the relevant information
for genre classification. They are engineered based on human assumptions rather than
learned from data.
• Features like MFCC are extracted from short frames independently, without
considering temporal context. This ignores useful temporal patterns in the audio.
• Simple linear models like softmax regression have limited modeling capacity to
capture complex patterns in audio features.
• Non-linear MLPs are able to model complex patterns, but their performance still
relies on the quality of input features.
• Most systems use a pipeline approach - feature engineering, feature selection, then
classifier training. This is not end-to-end learning.
• Lack of shift/translation invariance - small variations in pitch or tempo can degrade
accuracy of systems relying on fixed audio features.
• Unable to effectively learn from raw audio - most systems rely on engineered features
rather than learning directly from spectrograms/waveforms.
• Inability to scale up - unlike deep learning approaches, traditional methods can't
benefit from larger datasets.
The key limitations are reliance on engineered features rather than end-to-end feature
learning, lack of modeling temporal context, limited invariance properties, and disjoint
training of feature extraction and classifier components. Deep learning approaches can help
overcome some of these disadvantages.
Algorithm:
Here are some of the key existing algorithms and techniques that were used prior to this
work:
• Using hand-crafted audio features like MFCCs, chroma features, and spectral contrast,
and feeding them into machine learning classifiers like SVM, KNN, and Random
Forests.
• Using aggregation and statistics of low-level features, e.g., mean, variance, and
histograms.
• Applying dimensionality reduction techniques like PCA and ICA on hand-crafted
features before classification.
• Using mid-level representations like bag-of-words on audio features.
• Combining multiple features at feature level or decision level via techniques like
feature concatenation, early fusion, and late fusion.
• Using deep neural networks like Deep Belief Networks (DBNs) and stacked
autoencoders for unsupervised pre-training before classification.
• Applying recurrent neural networks like LSTMs on top of pre-extracted features for
sequence modeling.
• Using 1D convolutional neural networks on raw waveform or spectrogram for feature
learning.
The key existing techniques relied heavily on hand-crafted audio features or 1D convolution,
rather than 2D convolutional feature learning directly from spectrograms as proposed in this
paper. The deep learning approaches focused more on unsupervised pre-training rather than
end-to-end feature learning.
5.2 PROPOSED SYSTEM
Here are the key points of the proposed approach to music genre classification:
• Motivation: Develop better feature representations directly from audio rather than
using hand-crafted features like MFCCs for music genre classification.
• Approach: Use 2D convolutional neural network applied on spectrograms to learn
features that capture timbral and temporal patterns.
• Input: 30-second audio clips converted to spectrograms using the Short-Time Fourier
Transform (STFT).
• Feature Learning: Designed 4 filters to detect patterns related to percussion, harmony,
pitch slides, etc., and convolved these filters with the spectrogram to obtain 4 feature maps.
• Subsampling: Applied 2x2 max pooling on feature maps for dimensionality reduction
and translation invariance.
• Classification: Flattened feature maps and fed them into a Multilayer Perceptron
(MLP) with softmax output for 10-way genre classification.
• Results: Achieved 72.4% accuracy on GTZAN dataset, outperforming MFCC+MLP
(46.8%) and other baseline systems relying on hand-crafted features.
• Conclusion: Features learned from spectrograms using 2D CNNs capture more
relevant information for genre classification than engineered MFCC features. End-to-end
feature learning shows promise over pipeline systems.
The key ideas are - using 2D CNN on spectrograms for feature learning, end-to-end training,
and demonstrating superior performance over traditional methods relying on MFCC and
other hand-crafted audio features for music classification.
Some of the key problems this work tries to address for music genre classification are:
1. MFCCs lack dynamic analysis capability, as they are extracted from single frames.
2. MFCCs may not capture all the information relevant for genre classification.
3. Feature learning: rather than using hand-crafted features, features are learned directly
from the spectrogram using convolutional neural networks. The 2D convolutional filters can
capture patterns across both the time and frequency dimensions of the spectrogram, unlike
MFCCs.
4. Translation invariance: the max pooling provides some invariance to pitch shifting or
tempo changes.
5. End-to-end learning: compared to systems relying on engineered features, the feature
extraction and classification are learned together end-to-end.
In summary, the goals are:
• Finding better features from raw audio data rather than relying on hand-crafted features
• Learning features that capture temporal/spectral patterns
• Achieving some translation invariance
• End-to-end learning of features and classifier
The goal is to show that convolutional neural networks can achieve better music genre
classification from raw audio compared to approaches using traditional audio features.
Algorithm:
The proposed algorithm for music genre classification can be summarized as follows (see the
sketch after these steps):
Input: 30-second audio clips converted to spectrograms using the Short-Time Fourier
Transform (STFT).
Feature Extraction: Convolve the spectrogram with the designed 2D filters to obtain feature
maps capturing timbral and temporal patterns.
Subsampling: Apply 2x2 max pooling on the feature maps for dimensionality reduction and
translation invariance.
Classification: Flatten the feature maps and feed them into a Multilayer Perceptron (MLP)
with a softmax output for 10-way genre classification.
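A minimal sketch of these four steps, assuming Keras, is shown below; the input shape and filter counts are illustrative rather than the paper's exact values.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 646, 1)),            # spectrogram: frequency bins x time frames
    layers.Conv2D(4, (3, 3), activation='relu'),  # feature learning over time-frequency patterns
    layers.MaxPooling2D((2, 2)),                  # subsampling for invariance and dimensionality reduction
    layers.Flatten(),
    layers.Dense(128, activation='relu'),         # MLP hidden layer
    layers.Dense(10, activation='softmax'),       # 10-way genre classification
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])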
CHAPTER - 6
MODULES
IMPLEMENTATION
6.1 MODULES
• User
• Admin
• Data Preprocessing
• Machine Learning
User
The user can register first. While registering, a valid email and mobile number are required
for further communication. Once the user registers, the admin can activate the user, and only
then can the user log into our system. The user can upload a dataset whose columns match our
dataset; for algorithm execution, the data must be in float format. Here we used the
Employment Scam Aegean Dataset (EMSCAD) containing 18,000 samples. The user can also
add new data to the existing dataset through our Django application. The user can click
Classification on the web page so that accuracy, macro average, and weighted average are
calculated for the algorithms. The user can display the ML results and also the prediction
results.
Admin
The admin can log in with his login details and activate the registered users; only after
activation can a user log into our system. The admin can view the overall data in the browser.
The admin can click Results on the web page so that the calculated accuracy, macro average,
and weighted average for the algorithms are displayed. Once all algorithm executions are
complete, the admin can see the overall accuracy on the web page, along with the
classification results.
Data Preprocessing
The dataset is handled in three steps: data pre-processing, feature selection, and fraud
detection using a classifier. In the preprocessing step, noise and HTML tags are removed from
the data so that the general text pattern remains preserved. A feature selection technique is
applied to reduce the number of attributes effectively and efficiently: a Support Vector
Machine is used for feature selection, and an ensemble classifier based on random forest is
used to detect fake job posts in the test data. The random forest classifier is a tree-structured
classifier that works as an ensemble with the help of a majority voting technique. This
classifier showed 97.4% classification accuracy in detecting fake job posts.
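A hedged sketch of this pipeline, assuming scikit-learn with TF-IDF text features, is shown below; the variable names (train_texts, train_labels, and so on) are assumptions for illustration.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),                  # cleaned text -> features
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False))),  # SVM-based feature selection
    ('rf', RandomForestClassifier(n_estimators=100)),                  # majority-voting ensemble
])
pipeline.fit(train_texts, train_labels)         # assumed EMSCAD training split
print(pipeline.score(test_texts, test_labels))  # fraction of posts classified correctly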
Machine Learning
This module uses different data mining techniques and classification algorithms, such as
KNN, decision tree, support vector machine, naive Bayes classifier, random forest classifier,
multilayer perceptron, and deep neural network, to predict whether a job post is real or
fraudulent. The accuracy, macro average, and weighted average of the classifiers are
calculated and displayed in the results. The classifier that bags the highest accuracy is
determined to be the best classifier.
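The comparison can be sketched as follows, assuming scikit-learn; X_train, y_train, X_test, and y_test are an assumed train/test split of the vectorized job posts.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

classifiers = {
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'Naive Bayes': MultinomialNB(),
    'Random Forest': RandomForestClassifier(),
    'MLP': MLPClassifier(max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # classification_report prints accuracy plus macro and weighted averages
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))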
CHAPTER - 7
SYSTEM DESIGN
7.1 SYSTEM ARCHITECTURE
7.2 DATA FLOW DIAGRAM
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used
to represent a system in terms of the input data to the system, the various processing carried
out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components: the system process, the data used by the process, the external
entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction and may be
partitioned into levels that represent increasing information flow and functional detail.
7.3 UML DIAGRAM
7.3.1 USE CASE DIAGRAM
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
7.3.2 CLASS DIAGRAM
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among the classes. It
explains which class contains information.
7.3.3 SEQUENCE DIAGRAM
7.3.4 ACTIVITY DIAGRAM
Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.
CHAPTER - 8
SOURCE CODE
# Imports needed by these views (module layout assumed from the project structure)
from django.shortcuts import render
from django.contrib import messages
from django.conf import settings
from .models import UserRegistrationModel


def UserLoginCheck(request):
    if request.method == "POST":
        loginid = request.POST.get('loginid')
        pswd = request.POST.get('pswd')
        print("Login ID = ", loginid, ' Password = ', pswd)
        try:
            check = UserRegistrationModel.objects.get(loginid=loginid, password=pswd)
            status = check.status
            print('Status is = ', status)
            if status == "activated":
                # Store the user's details in the session and open the home page
                request.session['id'] = check.id
                request.session['loggeduser'] = check.name
                request.session['loginid'] = loginid
                request.session['email'] = check.email
                print("User id At", check.id, status)
                return render(request, 'users/UserHomePage.html', {})
            else:
                messages.success(request, 'Your account is not yet activated')
                return render(request, 'UserLogin.html')
        except Exception as e:
            print('Exception is ', str(e))
    messages.success(request, 'Invalid login id and password')
    return render(request, 'UserLogin.html', {})


def UserHome(request):
    return render(request, 'users/UserHomePage.html', {})
def DatasetView(request):
    import pandas as pd
    path = settings.MEDIA_ROOT + "//" + 'DataSet.csv'
    # Show only the first 100 rows of the dataset as an HTML table
    df = pd.read_csv(path, nrows=100, index_col=False)
    df = df.to_html()  # to_html must be called, not passed as a bound method
    return render(request, 'users/viewdataset.html', {'data': df})


def preProcessData(request):
    from .utility.PreprocessedData import preProcessed_data_view
    data = preProcessed_data_view()
    return render(request, 'users/preproccessed_data.html', {'data': data})


def Model_Results(request):
    # Build each classifier and pass its classification report to the template
    from .utility import PreprocessedData
    nb_report = PreprocessedData.build_naive_bayes()
    knn_report = PreprocessedData.build_knn()
    dt_report = PreprocessedData.build_decsionTree()
    rf_report = PreprocessedData.build_randomForest()
    svm_report = PreprocessedData.build_svm()
    mlp_report = PreprocessedData.build_mlp()
    return render(request, 'users/ml_reports.html',
                  {'nb': nb_report, 'knn': knn_report, 'dt': dt_report,
                   'rf': rf_report, 'svm': svm_report, 'mlp': mlp_report})


def user_input_prediction(request):
    if request.method == 'POST':
        from .utility import PreprocessedData
        joninfo = request.POST.get('joninfo')
        result = PreprocessedData.predict_userInput(joninfo)
        print(request)
        return render(request, 'users/testform.html', {'result': result})
    else:
        return render(request, 'users/testform.html', {})
base.html:
{%load static%}
<!DOCTYPE html>
<html>
<head>
<link href='http://fonts.googleapis.com/css?family=Alegreya+Sans:100,300,400,700'
rel='stylesheet' type='text/css'>
<div id="top"></div>
{%block contents%}
{%endblock%}
<!-- /.footer -->
<footer id="footer">
<div class="container">
<div class="col-sm-4 col-sm-offset-4">
<!-- /.social links -->
<div class="overlay">
<div class="container">
<div class="row">
<div class="col-md-6">
</p>
</div>
</div>
</div>
</div>
{%endblock%}
Admin side views:
from django.shortcuts import render, HttpResponse
from django.contrib import messages
from users.models import UserRegistrationModel
CHAPTER - 9
SYSTEM STUDY
9.1 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out to ensure that the proposed system is
not a burden to the company. For feasibility analysis, some understanding of the major
requirements of the system is essential.
The three key considerations involved in the feasibility analysis are:
• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY
9.1.1 ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditures must be justified. The developed
system is well within the budget, and this was achieved because most of the technologies
used are freely available; only the customized products had to be purchased.
9.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements
of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed
system must have modest requirements, so that only minimal or no changes are required for
implementing it.
9.1.3 SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity. The level of acceptance by
the users solely depends on the methods that are employed to educate the user about the
system and to make him familiar with it. His level of confidence must be raised so that he is
also able to make constructive criticism, which is welcomed, as he is the final user of the
system.
CHAPTER - 10
SYSTEM TEST
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, subassemblies, assemblies, and/or a finished product. It is the process of
exercising software with the intent of ensuring that the software system meets its requirements
and user expectations and does not fail in an unacceptable manner. There are various types of
tests, and each test type addresses a specific testing requirement.
10.1 TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid outputs. All
decision branches and internal code flow should be validated. It is the testing of individual
software units of the application and is done after the completion of an individual unit before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at the component level and test a specific business
process, application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
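As an illustration, a small unit test for this project's login view (shown in the source code chapter) might look as follows, assuming Django's built-in test framework; the URL mapping and the exact model fields are assumptions based on the code above.

from django.test import TestCase
from users.models import UserRegistrationModel

class UserLoginTest(TestCase):
    def setUp(self):
        # Create an already-activated user so the login branch under test is reachable
        UserRegistrationModel.objects.create(
            name='Test', loginid='tester', password='secret',
            email='[email protected]', status='activated')

    def test_activated_user_can_log_in(self):
        response = self.client.post('/UserLoginCheck/',  # assumed URL mapping
                                    {'loginid': 'tester', 'pswd': 'secret'})
        self.assertEqual(response.status_code, 200)
        self.assertTemplateUsed(response, 'users/UserHomePage.html')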
Integration testing
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more concerned
with the basic outcome of screens or fields. Integration tests demonstrate that although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is correct and consistent. Integration testing is specifically aimed
at exposing the problems that arise from the combination of components.
Functional testing
Functional testing provides systematic demonstrations that the functions tested are
available as specified by the business and technical requirements, system documentation, and
user manuals.
Functional testing is centered on the following items:
System Testing
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An example of
system testing is the configuration oriented system integration test. System testing is based on
process descriptions and flows, emphasizing pre-driven process links and integration points.
Unit testing is usually conducted as part of a combined code and unit test phase
of the software lifecycle, although it is not uncommon for coding and unit testing to be
conducted as two distinct phases.
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
Features to be tested
The task of the integration test is to check that components or software applications, e.g.,
components in a software system or, one step up, software applications at the company level,
interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
10.2 SAMPLE TEST CASES

S.No | Test Case | Expected Result | Result | Remarks (if fails)
1 | User Register | User registration is successful. | Pass | Fails if the user email already exists.
2 | User Login | If the username and password are correct, the user gets the valid page. | Pass | Unregistered users will not be logged in.
3 | Random forest and SVM | The request will be accepted by the random forest and SVM. | Pass | Otherwise it fails.
4 | Decision tree and multilayer perceptron | The request will be accepted by the decision tree and multilayer perceptron. | Pass | Otherwise it fails.
5 | Naive Bayes and k-nearest neighbour | The request will be accepted by the Naive Bayes and k-nearest neighbour. | Pass | Otherwise it fails.
6 | View dataset by user | The dataset will be displayed to the user. | Pass | Fails if the results are not true.
CHAPTER - 11
OUTPUT SCREENS
CHAPTER - 12
CONCLUSION
In this paper, a hybrid encoder-decoder based model is proposed to generate effective captions
for images using the Flickr8k dataset. During the encoding phase, the proposed model uses
transfer learning-based models, VGG16 and ResNet50, along with YOLO, for extracting the
image features. A concatenate function is used to combine the features and remove duplicates.
For decoding, BiGRU and LSTM are used to generate the complete caption of the image. The
BLEU value is then evaluated for both captions generated by BiGRU and LSTM, and the final
caption is the one whose METEOR value is higher. The proposed model is also evaluated with
METEOR and ROUGE, achieving BLEU-1: 0.67, METEOR: 0.54, and ROUGE: 0.31 on the
Flickr8k dataset. The experimental results show better BLEU, METEOR, and ROUGE scores
when compared to other state-of-the-art models. The model is also helpful in generating
captions in real time.
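For reference, the BLEU-1 scoring step can be sketched with NLTK as below; the captions shown are illustrative, and in the actual evaluation the references come from the Flickr8k annotations while the candidate is the generated caption.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [r.split() for r in [
    'two dogs are playing with a toy in the water',
    'two dogs run through the water with a rope',
]]
candidate = 'two dogs play with a toy in the water'.split()

smooth = SmoothingFunction().method1
# weights=(1, 0, 0, 0) gives BLEU-1, i.e. unigram precision
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)
print(round(bleu1, 2))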
CHAPTER - 13
FURTHER ENHANCEMENTS
Potential future enhancements for image caption generation using efficient deep learning-
based hybrid models include:
1. Improved Attention Mechanisms: Enhance the attention mechanisms within the model to
better focus on relevant regions of the image when generating captions. Exploring variants of
attention, such as self-attention or multi-head attention, could lead to more accurate and
contextually relevant captions.
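As a sketch of this direction, assuming Keras, multi-head attention would let the decoder attend over a grid of CNN region features instead of a single pooled vector; the shapes below are illustrative.

from tensorflow.keras import layers

regions = layers.Input(shape=(49, 512))  # e.g. a 7x7 CNN feature map flattened to 49 regions
query = layers.Input(shape=(1, 512))     # the current decoder state acts as the query

attn = layers.MultiHeadAttention(num_heads=8, key_dim=64)
context = attn(query=query, value=regions, key=regions)  # attended image context vector
# `context` would then be combined with the word embedding at each decoding step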
CHAPTER - 14
REFERENCES
[1] J. Gu, G. Wang, J. Cai, and T. Chen, “An Empirical Study of Language CNN for Image
Captioning,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017-October, pp. 1231–1240, 2017,
doi: 10.1109/ICCV.2017.138.
[2] J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional Image Captioning,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 5561–5570, 2018, doi:
10.1109/CVPR.2018.00583.
[3] K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention.” Available: http://proceedings.mlr.press/v37/xuc15.
[4] K. Xu, H. Wang, and P. Tang, “Image Captioning with Deep LSTM Based on Sequential
Residual,” pp. 361–366, Jul. 2017.
[5] S. Liu, L. Bai, Y. Hu, and H. Wang, “Image Captioning Based on Deep Neural Networks,”
MATEC Web Conf., vol. 232, pp. 1–7, 2018, doi: 10.1051/matecconf/201823201052.
[6] R. Subash, R. Jebakumar, Y. Kamdar, and N. Bhatt, “Automatic image captioning using
convolution neural networks and LSTM,” J. Phys. Conf. Ser., vol. 1362, no. 1, 2019, doi:
10.1088/1742-6596/1362/1/012096.
[7] C. Wang, H. Yang, and C. Meinel, “Image Captioning with Deep Bidirectional LSTMs and
Multi-Task Learning,” ACM Trans. Multimed. Comput. Commun. Appl., vol. 14, no. 2s, 2018,
doi: 10.1145/3115432.
[8] M. Han, W. Chen, and A. D. Moges, “Fast image captioning using LSTM,” Cluster
Comput., vol. 22, pp. 6143–6155, May 2019, doi: 10.1007/s10586-018-1885-9.
[9] H. Dong, J. Zhang, D. Mcilwraith, and Y. Guo, “I2T2I: Learning Text To Image Synthesis
With Textual Data Augmentation.”
[10] Y. Xian and Y. Tian, “Self-Guiding Multimodal LSTM - When We Do Not Have a Perfect
Training Dataset for Image Captioning,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5241–
5252, 2019, doi: 10.1109/TIP.2019.2917229.