Introduction to Docs and Image-Based Voice Chatbots
The project focuses on creating a voice chatbot that can read and understand
documents, such as PDFs and images, and respond to voice queries.
SUBMITTED BY :
RAKESH H R 1BM21EC413
SHIVANI S NAIK 1BM20EC142
VANSH JAIN 1BM20EC183
JAIDEEP A HEGWAD 1BM20EC059
Problem Definition
Integrating voice interaction
The project confronts the significant challenge of integrating voice interaction
with the ability to process and interpret both textual and visual data from PDF files
and images.
Documents and images
Traditional voice chatbots are adept at handling spoken or written queries but fall
short when users need to extract and discuss content from documents and images.
Accessibility Challenges
This limitation is particularly acute in sectors where information is conveyed through a
combination of text and visuals, such as academic research, technical manuals, and medical
imaging.
Proposed Solution
To address the problem of inefficient and time-consuming document and image
retrieval during voice-based interactions, we propose a system that uses natural
language processing and computer vision techniques to integrate textual and visual
information into a voice chatbot interface.
Voice Chatbot: Develop a sophisticated voice chatbot that can read, understand, and
interact based on the content of uploaded PDFs and other documents.
Broad Applicability: This solution has the potential to revolutionize industries such as
education, customer support, and accessibility by providing a more natural
communication interface.
Roadmap
1 Landing Page & Navigation Page
2 Authentication
3 Functionality
Text Chunking: The extracted text is split into manageable chunks using
LangChain’s RecursiveCharacterTextSplitter.
Vector Store Creation: Text chunks are converted into embeddings and indexed
using FAISS for quick retrieval.
User Interaction: Users can interact with the chatbot via a Streamlit interface,
asking questions that the chatbot answers based on the PDF content.
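The text chunking step can be illustrated with a minimal sketch in plain Python. This is not LangChain's implementation: the function name `split_recursive` and the separator list `SEPARATORS` are hypothetical, and the sketch omits the chunk overlap that a production splitter would normally support. It only shows the general idea of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
from typing import List, Optional

# Separators tried in order, from coarsest to finest (an assumed default list).
SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursive(text: str, chunk_size: int,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if separators is None:
        separators = SEPARATORS
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Try to grow the current chunk with the next piece.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush what has been collected so far.
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon\nzeta eta theta iota kappa"
chunks = split_recursive(sample, chunk_size=20)
```

With a `chunk_size` of 20, the sample splits at the paragraph break first, then at the line break, and finally at spaces for the last line, which is too long to keep whole.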
Flow Chart of Functionality
Project Flow
1. User Interface (UI): This is where users interact with the chatbot through voice commands or text
input. It’s designed to be intuitive and user-friendly.
2. Voice Recognition: When a user speaks, this component converts the spoken words into text using
speech-to-text technology.
3. Text Processing: This core part uses natural language processing (NLP) to understand the user’s
intent and context from the text.
4. Document Processing:
• PDF Processing: Extracts text from PDF files using OCR technology.
• Image Processing: Analyzes images to understand content like charts or graphs.
5. Dialogue Management: Manages the conversation flow, deciding how the chatbot should respond
based on the user’s queries and the information extracted from documents and images.
6. Response Generation: Uses NLP to create a natural and relevant response, which is then converted
from text to speech if needed.
7. Learning Component: Gathers data from interactions to improve the chatbot’s performance over
time.
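The flow above can be sketched as a small Python pipeline in which each stage is a pluggable function. This is only an illustrative skeleton under stated assumptions: the class name `ChatPipeline`, its field names, and the stub stages are hypothetical, and a real system would replace the stubs with actual speech-to-text, retrieval, language-model, and text-to-speech components.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class ChatPipeline:
    transcribe: Callable[[bytes], str]           # step 2: speech-to-text
    retrieve: Callable[[str], List[str]]         # step 4: document and image lookup
    generate: Callable[[str, List[str]], str]    # steps 3, 5, 6: understand and answer
    synthesize: Optional[Callable[[str], bytes]] = None       # step 6: text-to-speech
    log: List[Tuple[str, str]] = field(default_factory=list)  # step 7: interaction data

    def handle(self, user_input: Union[str, bytes]) -> Union[str, bytes]:
        # Step 1: the UI passes in either typed text or recorded audio bytes.
        if isinstance(user_input, bytes):
            question = self.transcribe(user_input)
        else:
            question = user_input
        context = self.retrieve(question)
        answer = self.generate(question, context)
        # Keep the exchange so it can be reviewed or used for improvement later.
        self.log.append((question, answer))
        return self.synthesize(answer) if self.synthesize else answer


# Stub stages with made-up content, standing in for real components.
pipeline = ChatPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    retrieve=lambda question: ["The report has three sections."],
    generate=lambda question, context: f"Based on the document: {context[0]}",
)
reply = pipeline.handle("How many sections does the report have?")
```

Keeping each stage behind a plain callable makes it possible to swap one component, for example the speech recogniser, without touching the rest of the flow.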
Architecture of CHATBOT
Technologies Used
Streamlit: For creating the web application interface.
PyPDF2: To read PDF files and extract text.
LangChain: For text splitting and managing conversational chains.
Google Generative AI: For generating embeddings and responses.
FAISS: For efficient similarity search and indexing of text chunks.
Dotenv: For managing environment variables.
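To make the role of embeddings and similarity search concrete, here is a minimal sketch in plain Python. It does not call FAISS or Google Generative AI: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `TinyVectorStore` is a hypothetical brute-force index that only mimics what a vector store does conceptually (store one vector per chunk, return the chunks closest to the query).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalised word counts (a stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class TinyVectorStore:
    """Brute-force index: compares the query against every stored chunk."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = list(chunks)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def search(self, query: str, k: int = 2) -> List[Tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, v), c) for v, c in zip(self.vectors, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore([
    "FAISS indexes embeddings for similarity search",
    "Streamlit renders the web interface",
    "PyPDF2 extracts text from PDF files",
])
best_score, best_chunk = store.search("similarity search", k=1)[0]
```

A dedicated library replaces the brute-force loop in `search` with an index structure, which is what keeps retrieval fast as the number of chunks grows.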
LITERATURE SURVEY
1. Author: M. A. Khadija, A. Aziz,
   Title: Designing a PDF-Driven Chatbot powered by OpenAI ChatGPT,
   Paper: 2023 International Conference on Computer
   Outcome:
     1. Development of a PDF-Driven Chatbot using Generative AI.
     2. Utilization of LangChain Framework, Chat-GPT (GPT3.5 Turbo), and Pinecone for
        response generation.
     3. Successful demonstration of the chatbot's ability to provide coherent responses
        aligned with the content of PDF documents.
   Drawback:
     1. E-books are perceived as uncomfortable for prolonged reading sessions.
     2. Potential limitations in accessibility and readability for some users.

2. Author: Semmy Wellem Taju, Andria Kusuma Wahyudi, Green Ferry Mandias, Reymon Rotikan,
   Jimmy Herawan
   Title: AI-powered Chatbot for Information Service at Klabat University by Integrating
   OpenAI GPT-3 with Intent Recognition and Semantic Search.
   Paper: 2023 5th International Conference on Cybernetics and Intelligent System
   Outcome:
     1. Introduction of Unklabot 1.0, showcasing innovative integration of advanced AI
        technologies for information services within Klabat University.
     2. Improved accuracy and efficiency in question answering capabilities through the
        integration of intent recognition and semantic search techniques.
   Drawback:
     1. Dependency on external API (OpenAI GPT-3) might lead to potential limitations or
        disruptions in service if the API becomes unavailable or undergoes changes.
     2. Lack of discussion on potential privacy or security concerns associated with
        using an external AI model for handling

3. Author: Max Dean, Michael F. McTear, Raymond R. Bond and Maurice D. Mulvenna
   Title: An AI Chatbot for Interacting with Academic Research,
   Paper: 2023 31st Irish Conference on Artificial Intelligence
   Outcome:
     1. Development of a large language model (LLM) augmentation chatbot tailored for
        computer science research queries.
     2. Embedding of around 200,000 computer science research papers from arXiv,
        resulting in ~11 million vectors.
   Drawback:
     1. Dependency on arXiv restricts the diversity of papers and may limit the
        applicability of the chatbot to broader research domains.
     2. Limited testing scope with only 30 sample questions may not fully capture the
        breadth of inquiries in computer science.

4. Author: T.-H. Kim, S. Cho, S. Choi, S. Park and S.-Y. Lee
   Title: Emotional Voice Conversion Using Multitask Learning with Text-To-Speech
   Paper: 2020 IEEE International Conference
   Outcome:
     1. Introduction of a voice converter using multitask learning with text-to-speech
        (TTS).
     2. Multitask learning aids in capturing linguistic information and maintaining
        training stability.
   Drawback:
     1. Previous VC methods based on seq2seq models risk losing linguistic information.
     2. Textual supervision attempted to address this but required explicit alignment,
        nullifying the benefits of seq2seq models.
Efficient Indexing: Pinecone uses advanced indexing techniques optimized for
high-dimensional vector embeddings, enabling fast similarity search.
Scalability: Pinecone is built to handle large-scale deployments, allowing you to store
and search billions of vectors with low latency.
API Integration: Pinecone provides easy-to-use APIs for inserting vectors, querying for
nearest neighbors, and managing indexes.
5. Author: T. N. Thi, T.-H. Do and M. Yoo
   Title: Implementation of OCR system on extracting information from Vietnamese book
   cover images
   Paper: 2023 International Conference
   Outcome:
     1. Successful development of an OCR system tailored for Vietnamese book cover
        images.
     2. Demonstrated effectiveness of EAST and SAST for text detection, and CRNN, SVTR,
        Transformer OCR for text recognition.
   Drawback:
     1. Limited exploration of alternative OCR models beyond those mentioned.
     2. Lack of comparative analysis between different combinations of text detection
        and recognition models.
By integrating voice commands with the ability to process and understand content
from both images and PDF files, this chatbot transcends traditional text-based
systems.
It offers a versatile and dynamic tool that caters to a wide range of applications,
from educational resources to technical support and beyond.
The project’s success lies in its innovative approach to combining OCR and image
recognition with NLP, providing users with an intuitive and efficient way to access
and interact with information.
As we look to the future, the potential for further development and integration into
various industries holds the promise of transforming how we engage with digital
content.