
Name :- Divyansh Tripathi

Contact Information :-
1) Phone number :- 7018753780
2) Email ID :- [email protected]

Current Occupation :- Student

Education Details :- B.Tech in Electronics and Communication Engineering,
NIT Hamirpur
Final Year (starting mid-July)

Relevant Links :-
Portfolio: https://divyansh7.netlify.app/
Resume: Divyansh_Resume.pdf
Github: https://github.com/divyansh-tripathi7
Linkedin: https://www.linkedin.com/in/divyansh-tripathi-4bb66a205/

Technical Skills:-
- Python: Expert
- NLP: Expert
- Deep Learning: Expert
- Tableau: Expert
- PowerBI: Expert
- Git: Expert
- Computer Vision: Intermediate
- Machine Learning: Intermediate
- Embedded ML: Intermediate
- PyTorch: Intermediate
- TensorFlow: Intermediate
- HuggingFace: Intermediate
- AWS/GCP: Intermediate
- Flask: Intermediate
- Django: Intermediate
- C++: Intermediate
- Arduino: Intermediate
- Front-End Web Dev: Novice
- Back-End Web Dev: Novice
TITLE: LLMs for Question Answering

Summary :-

My approach towards this project would be to first select the correct vector DB for our
requirements from open-source options like Weaviate, Milvus, Qdrant, etc.
Next comes the creation of a data ingestion pipeline to extract data from the sources: initially I
will build it for PDFs, then move on to multimodal media like images, video, and audio. We will
then have to create context-rich paragraphs out of this data using tokenization and neural
coreference resolution, and finally feed these paragraphs into our vector database.

Project Detail

Project Overview:-

The project comprises leveraging the power of LLMs to answer questions based on context,
and the main problem we are looking to solve is providing the LLM with concise and relevant
context. The context is to be chosen using a vector database (to be selected) by picking the
top n documents with the highest similarity to the user query. To make this work, we first
need to ingest all the relevant data into the selected database, which requires a data
ingestion pipeline.
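To make the retrieval step concrete, here is a minimal sketch of top-n selection by cosine similarity, assuming document and query embeddings are already computed; all names and sizes here are illustrative, not taken from the proposal.

```python
# Minimal top-n retrieval sketch; embeddings assumed precomputed.
import numpy as np

def top_n_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, n: int = 5):
    """Return indices of the n documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity with every document
    return np.argsort(scores)[::-1][:n]  # highest-scoring n first

# Example: 1,000 documents with 768-dim embeddings, one query.
docs = np.random.rand(1000, 768)
query = np.random.rand(768)
print(top_n_similar(query, docs))
```

In production, the vector database performs this search with an approximate index instead of a brute-force scan.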

Problems:-

1) Dealing with multimodal data

2) Enhancing paragraph context

Solutions:-

1) The data ingestion pipeline is to be built such that it can extract data from different
source types like PDFs, audio, images, etc. This can be done using different tools for
different media types: PyPDF2 for PDF files, OpenCV for images and videos, and Librosa for audio.
2) Context enhancement can be achieved with the neural coreference resolution technique,
which replaces pronouns and noun phrases with their antecedents; this can be done using
the NeuralCoref library in Python.
The milestones to be achieved in the fulfillment of this project are:-
Milestone 1 : Selection of scalable VectorDB
Milestone 2 : Creation of Data Ingestion Pipeline
Milestone 3 : Creation of embeddings and context-rich paragraphs
Milestone 4 : Feeding these paragraphs into the database.

Implementation details:-

Milestone 1
Selection of scalable VectorDB

"A house is only as strong as its foundation" is the quote that suits this milestone, as our
project revolves around the VectorDB, and this step needs extensive investigation and research
to select the correct database. The steps involved in this milestone are:-

1. Define Project Requirements:
- Gather and document the specific requirements of the project, including the expected
dataset size, query volume, scalability needs, and deployment considerations.

2. Evaluate Features:
- Assess the feature sets of vector databases such as Elasticsearch, Milvus, Qdrant, and
Weaviate.
- Pay attention to additional features like support for distributed setups, data sharding, data
replication, and fault tolerance mechanisms.

3. Perform Performance Evaluation:
- Conduct comprehensive performance testing to measure the indexing and querying speed,
search accuracy, and resource utilization of the vector databases (see the benchmarking
sketch after this list).
- Evaluate the scalability of the databases by testing them with increasing dataset sizes and
query loads.

4. Assess Community & Support:
- Evaluate the size and activity of the communities around Elasticsearch, Milvus, Qdrant, and
Weaviate.
- Explore the availability of official documentation, tutorials, and forums where you can seek
help and learn from the experiences of others.

5. Evaluate Integration & Language Support:
- Determine the compatibility of Elasticsearch, Milvus, Qdrant, and Weaviate with your
existing technology stack and programming languages.
- Evaluate the availability of client libraries or APIs in your preferred programming languages,
as well as the ease of integration with your existing infrastructure and workflows.

6. Review Licensing:
- Carefully review the licensing terms and restrictions of Elasticsearch, Milvus, Qdrant, and
Weaviate to ensure they align with your project's licensing and legal requirements.

7. Explore Case Studies & Real-world Adoption:
- Research and study case studies or real-world implementations of Elasticsearch, Milvus,
Qdrant, and Weaviate in domains similar to your project.
- Analyze how other organizations have utilized these databases and the challenges they
faced during implementation and deployment.
- Look for success stories and use cases that align with your project's goals to gain insights
into the practical benefits and limitations of each database.

8. Prototype and Testing:
- Set up small-scale prototypes to test Elasticsearch, Milvus, Qdrant, and Weaviate in your
project environment.
- Evaluate the ease of installation, configuration, and data ingestion processes.
- Perform extensive testing with representative data and complex queries to validate the
performance, accuracy, and reliability of each database.

9. Make a Decision:
- Based on the evaluation of the above steps, compare the strengths and weaknesses of
Elasticsearch, Milvus, Qdrant, and Weaviate, and select the database that best fits the project.
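As a starting point for step 3, a small harness like the following can time bulk inserts and queries uniformly across candidates. The `insert_fn`/`query_fn` wrappers are hypothetical adapters each DB's client would implement; dataset sizes are illustrative.

```python
# Hedged benchmarking sketch: wrap each candidate DB behind two functions
# and compare them on the same synthetic workload.
import time
import numpy as np

def benchmark(insert_fn, query_fn, dim=768, n_docs=10_000, n_queries=100):
    """insert_fn(vectors) bulk-loads vectors; query_fn(vector, k) returns top-k ids."""
    docs = np.random.rand(n_docs, dim).astype("float32")
    queries = np.random.rand(n_queries, dim).astype("float32")

    t0 = time.perf_counter()
    insert_fn(docs)
    insert_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in queries:
        query_fn(q, 10)
    query_ms = (time.perf_counter() - t0) / n_queries * 1000

    print(f"bulk insert: {insert_s:.2f}s | mean query: {query_ms:.1f} ms")
```

Running the same harness against each candidate at several dataset sizes gives directly comparable numbers for the scalability test in step 3.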

Approximate Time Required :- 1 week


Milestone 1 Process Diagram

Milestone 2
Creation of Data Ingestion Pipeline

Here we will start working with textual data in PDFs and move up to other media formats; a combined extraction sketch follows the list below.

1. Text Data:
- Plain Text: Extracting data from plain text files is straightforward. We can read the text
content directly using file I/O operations.
- PDFs: For PDF files, we can use libraries like PyPDF2, pdftotext, or PDFMiner to extract text
content from PDF documents. These libraries provide methods to parse and extract text from
PDF files.

2. Image Data:
- Image Files: Reading image data from image files (e.g., JPEG, PNG, BMP) can be done
using image processing libraries like OpenCV or Pillow in various programming languages.
These libraries provide functions to read and process images.

3. Audio Data:
- Audio Files: Reading audio data from common audio file formats (e.g., WAV, MP3, FLAC)
can be achieved using libraries like Librosa or Pydub. These libraries provide functions to load
and process audio files.
- Speech-to-Text: To extract text from audio, we can utilize Automatic Speech Recognition
(ASR) systems. Services like Google Cloud Speech-to-Text, Mozilla DeepSpeech, or
CMUSphinx can be employed for transcribing speech to text.

4. Video Data:
- Video Files: Extracting data from video files involves processing individual frames. Libraries
like OpenCV provide functions to read video files and extract frames. We can then apply image
processing techniques or pre-trained CNN models to analyze each frame.
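Putting the pieces above together, here is a hedged sketch of a single entry point that dispatches on file type. The library calls follow standard PyPDF2, Pillow, Librosa, and OpenCV usage as named above; the function names and parameters themselves are illustrative choices.

```python
# Ingestion dispatcher sketch: one extraction path per media type.
from pathlib import Path

import cv2                     # video frame extraction
import librosa                 # audio loading
from PIL import Image          # image loading
from PyPDF2 import PdfReader   # PDF text extraction

def extract_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def load_image(path: str) -> Image.Image:
    """Return an RGB Pillow image for downstream vision models."""
    return Image.open(path).convert("RGB")

def load_audio(path: str, sr: int = 16000):
    """Return a mono waveform resampled to `sr`, ready for an ASR model."""
    waveform, _ = librosa.load(path, sr=sr, mono=True)
    return waveform

def sample_frames(path: str, every_n: int = 30):
    """Yield every n-th frame of a video as a BGR numpy array."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield frame
        idx += 1
    cap.release()

def ingest(path: str):
    """Dispatch on file extension; unsupported types raise an error."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return extract_pdf_text(path)
    if suffix in {".jpg", ".jpeg", ".png", ".bmp"}:
        return load_image(path)
    if suffix in {".wav", ".mp3", ".flac"}:
        return load_audio(path)
    if suffix in {".mp4", ".avi", ".mkv"}:
        return list(sample_frames(path))
    raise ValueError(f"Unsupported media type: {suffix}")
```

Audio would additionally pass through an ASR system (Google Cloud Speech-to-Text, DeepSpeech, or CMUSphinx, as noted above) to become text before embedding.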

Approximate Time Required :- 2 weeks


Milestone 2 + Milestone 3 (embeddings) Work Breakdown Structure
Milestone 3
Creation of embeddings and context-rich paragraphs

To increase the context in paragraphs created from the collected data using neural coreference,
you can follow these steps:

1. Collect and Preprocess Data: Gather the relevant data from different sources and
preprocess it to extract paragraphs or textual content that you want to enhance with coreference
resolution. Ensure the data is cleaned and in a format suitable for further processing.

2. Neural Coreference Resolution: Neural coreference resolution aims to identify and replace
pronouns and noun phrases with their appropriate antecedents to improve context. You can use
pre-trained models such as the NeuralCoref library in Python, which is built on top of spaCy and
uses a neural network approach for coreference resolution. Apply the neural coreference model
to your paragraphs to resolve references and expand the context (a minimal sketch follows
these steps).

3. Context Expansion: After performing coreference resolution, replace pronouns and noun
phrases with their resolved antecedents. This provides more context and improves the
overall readability of the paragraphs. Ensure the text remains coherent while the context
is expanded.

4. Vectorization: Once you have expanded the context in your paragraphs, you can use
techniques like BERT or other text embedding methods to convert the text into fixed-length
vector representations, applying one consistent vectorization approach to encode the
expanded paragraphs as vectors.
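A minimal sketch of steps 2 and 3 with the NeuralCoref library (note it pairs with spaCy 2.x; the sample sentence is illustrative):

```python
# Coreference resolution sketch with NeuralCoref (built on spaCy 2.x).
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")   # small English pipeline
neuralcoref.add_to_pipe(nlp)         # adds the coreference component

doc = nlp("Divyansh submitted the proposal. He hopes it gets accepted.")
print(doc._.has_coref)        # True if any coreference clusters were found
print(doc._.coref_resolved)   # text with pronouns replaced by their antecedents
```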

For creating embeddings of different data types we can use:-
- BERT-like models for text
- CNNs like ResNet or VGG for images and video
- MFCCs or RNNs for audio data
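For the text case, a hedged sketch of paragraph embedding with a BERT model via HuggingFace Transformers; mean pooling over the attention mask is one common choice, not something the proposal prescribes:

```python
# Paragraph-to-vector sketch: BERT token embeddings, mean-pooled with the
# attention mask so padding tokens do not dilute the average.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(paragraphs: list[str]) -> torch.Tensor:
    inputs = tokenizer(paragraphs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (batch, seq, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (batch, 768)

vectors = embed(["A context-rich paragraph after coreference resolution."])
print(vectors.shape)   # torch.Size([1, 768])
```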

By incorporating neural coreference resolution, you can enhance the context and readability of
the paragraphs, making them more informative and coherent.

Approximate Time Required :- 2 weeks


Milestone 4
Storing into Vector Databases

Store in Vector Database: Choose a suitable vector database like Elasticsearch, Faiss, or
Apache Cassandra, and store the expanded paragraphs along with their associated metadata in
the database. Be sure to index the vectors appropriately for efficient retrieval.

Retrieval: When you want to retrieve the expanded paragraphs, you can input a query or search
terms to the vector database. The database can then use its indexing and search capabilities to
retrieve the most relevant paragraphs based on vector similarity.
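As one concrete possibility, here is a hedged sketch of both steps using Qdrant (one of the shortlisted DBs) via its Python client; the collection name, vector size, and payload fields are illustrative assumptions:

```python
# Store-and-retrieve sketch with Qdrant's Python client (qdrant-client).
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-process instance; use a server URL in production

client.recreate_collection(
    collection_name="paragraphs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Milestone 4: feed expanded paragraphs (as vectors) plus metadata payloads.
client.upsert(
    collection_name="paragraphs",
    points=[
        PointStruct(
            id=i,
            vector=np.random.rand(768).tolist(),  # stand-in for a real embedding
            payload={"text": f"paragraph {i}", "source": "doc.pdf"},
        )
        for i in range(100)
    ],
)

# Retrieval: embed the user query the same way, then ask for the top n hits.
hits = client.search(
    collection_name="paragraphs",
    query_vector=np.random.rand(768).tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```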

Approximate Time Required :- 1 week

Fig.: Interaction between the database and our data


Proposed Timeline:-

Milestone 1 is Selection of Vector DB

Milestone 2 is Creation of Data Ingestion Pipeline
Review Period is to share updates with mentors and the community and work on the received
feedback.

Milestone 3 is Creation of embeddings and enhancing the context of paragraphs

Milestone 4 is Ingestion of data into the Database
Testing Period is to test the functioning of various parts and complete the documentation.
Availability:-

No. of hours/week?
I can commit about 5 hrs/day, which is around 30 hrs/week, for this project.

Do you have any other engagements during this period?

No, currently I do not have any engagements during the tenure of this project.

Share any other details about your availability clearly here

I should make it clear that I will be starting my college's 7th semester in the 3rd week of July,
which is why I am committing 5 hrs/day, as that sounds reasonable to me. My semester exams
will be around mid-September, so they won't hinder my project work.

Personal Information and Motivation :-

I am deeply excited about the field of artificial intelligence, specifically Natural Language
Processing (NLP). The recent surge in NLP has presented new and exciting opportunities to work
on diverse problems and develop innovative solutions.

Joining this program would allow me to ride the wave of advancement in the world of Large
Language Models (LLMs) and make a significant contribution. With the guidance of experienced
mentors and the supportive Karmayogi community, I am confident that this project is the perfect
platform for me to embark on my journey. It provides the ideal environment to apply my knowledge
and skills, keeping me at the forefront of AI developments.

By participating in this program, I aim to make a meaningful impact in the field of LLMs and leverage the
power of artificial intelligence to address real-world challenges. This opportunity will serve as a launchpad
for me to pursue my ambitions, learn from experienced professionals, and actively contribute to the
ever-evolving world of AI.
Relevant Projects:-

1) EduBot:-
Description:- An all-in-one learning platform for all kinds of users.
Link:- https://github.com/swastkk/eduBOT

2) Text Enhancer:-
Description:- A Streamlit application that summarizes documents using LaMini-LM.
Link:- https://github.com/divyansh-tripathi7/TextEnhancer

3) Plagiarism Detector:-
Description:- Detects plagiarism in text using burstiness and perplexity metrics.
Link:- https://github.com/divyansh-tripathi7/PlagiarismDetector

4) VisualQnA:-
Description:- Visual Question Answering API and App using ViLT, FastAPI, and Streamlit.
Link:- https://github.com/divyansh-tripathi7/VisualQnA

Thank You Note :-

I would like to express my deepest gratitude for the opportunity to submit my proposal and for
the guidance and support provided throughout this process. Your expertise and mentorship
have been invaluable in shaping my understanding of the project and its implementation. I am
truly grateful for the platform provided by Karmayogi Organization, and I look forward to the
journey ahead, working together to leverage the power of LLMs for question answering. Thank
you for believing in my potential and for the opportunity to contribute to this meaningful project.

Sincerely,
Divyansh Tripathi
