LLM For QnA Proposal
Relevant Links :-
Portfolio: https://divyansh7.netlify.app/
Resume: Divyansh_Resume.pdf
GitHub: https://github.com/divyansh-tripathi7
LinkedIn: https://www.linkedin.com/in/divyansh-tripathi-4bb66a205/
Technical Skills:-
- Python: Expert
- NLP: Expert
- Deep Learning: Expert
- Tableau: Expert
- PowerBI: Expert
- Git: Expert
- Computer Vision: Intermediate
- Machine Learning: Intermediate
- Embedded ML: Intermediate
- PyTorch: Intermediate
- TensorFlow: Intermediate
- HuggingFace: Intermediate
- AWS/GCP: Intermediate
- Flask: Intermediate
- Django: Intermediate
- C++: Intermediate
- Arduino: Intermediate
- Front-End Web Dev: Novice
- Back-End Web Dev: Novice
TITLE: LLMs for Question Answering 1
Summary :-
My approach to this project is to first select the right vector DB for our
requirements from open-source options like Weaviate, Milvus, Qdrant, etc.
Next comes the creation of a data ingestion pipeline to extract data from each source. Initially it will
handle PDFs, then move on to multimodal media like images, video, and audio. We will then create
context-rich paragraphs out of this data using tokenization and neural coreference resolution, and finally we
feed these paragraphs into our vector database.
Project Detail
Project Overview:-
The project involves leveraging the power of LLMs to answer questions based on
context, and the main problem we are looking to solve is providing the LLM with concise and
relevant context for each question. The context is to be chosen using a vector database (to
be selected) by retrieving the top n documents with the highest similarity to the user query.
To make this work, we first need to ingest all the relevant data into the selected
database, which requires a data ingestion pipeline.
Problems and Solutions:-
1) The data ingestion pipeline must be able to extract data from different
source types like PDFs, audio, images, etc. This can be done using different tools for different
media types: PyPDF2 for PDF files, OpenCV for images and videos, and Librosa for audio.
2) Paragraphs pulled out of their documents often lose context because pronouns refer to
antecedents that are no longer present. This context enhancement can be achieved with neural
coreference resolution, which replaces pronouns and ambiguous noun phrases with their
antecedents; this can be done using the NeuralCoref library in Python.
The milestones to be achieved in the fulfillment of this project are:-
Milestone 1: Selection of a scalable vector DB
Milestone 2: Creation of the data ingestion pipeline
Milestone 3: Creation of embeddings and context-rich paragraphs
Milestone 4: Feeding these paragraphs into the database
Implementation details:-
Milestone 1
Selection of scalable VectorDB
"A house is only as strong as its foundation" is the quote that suits this milestone, as
our project revolves around the vector DB, and this step needs extensive investigation and research
to select the correct database. The steps involved in this milestone are:-
1. Evaluate Features:
- Assess the feature sets of vector databases such as Elasticsearch, Milvus, Qdrant, and
Weaviate.
- Pay attention to additional features like support for distributed setups, data sharding, data
replication, and fault tolerance mechanisms.
2. Review Licensing:
- Carefully review the licensing terms and restrictions of Elasticsearch, Milvus, Qdrant, and
Weaviate to ensure they align with the project's licensing and legal requirements.
3. Make a Decision:
- Based on the evaluation in the steps above, compare the strengths and weaknesses of
Elasticsearch, Milvus, Qdrant, and Weaviate, and select one.
Milestone 2
Creation of Data Ingestion Pipeline
Here we will start by working with textual data in PDFs and move up to other media formats. A
combined extraction sketch follows the list below.
1. Text Data:
- Plain Text: Extracting data from plain text files is straightforward. We can read the text
content directly using file I/O operations.
- PDFs: For PDF files, we can use libraries like PyPDF2, pdftotext, or PDFMiner to extract text
content from PDF documents. These libraries provide methods to parse and extract text from
PDF files.
2. Image Data:
- Image Files: Reading image data from image files (e.g., JPEG, PNG, BMP) can be done
using image processing libraries like OpenCV or Pillow. These libraries provide functions to read
and process images.
3. Audio Data:
- Audio Files: Reading audio data from common audio file formats (e.g., WAV, MP3, FLAC)
can be achieved using libraries like Librosa or Pydub. These libraries provide functions to load
and process audio files.
- Speech-to-Text: To extract text from audio, we can utilize Automatic Speech Recognition
(ASR) systems. Services like Google Cloud Speech-to-Text, Mozilla DeepSpeech, or
CMUSphinx can be employed for transcribing speech to text.
4. Video Data:
- Video Files: Extracting data from video files involves processing individual frames. Libraries
like OpenCV provide functions to read video files and extract frames. We can then apply image
processing techniques or pre-trained CNN models to analyze each frame.
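A minimal sketch of these four extraction paths is given below. The file names are placeholders, and it assumes PyPDF2 3.x, OpenCV, and Librosa are installed; as noted above, actually transcribing the loaded audio would be handed off to a separate ASR system.

import cv2                    # images and video frames
import librosa                # audio loading
from PyPDF2 import PdfReader  # PDF text extraction

# 1. Text: pull the text layer out of a PDF, page by page.
reader = PdfReader("sample.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Image: load an image file into a NumPy array (BGR channel order).
image = cv2.imread("photo.jpg")
if image is None:
    raise IOError("could not read photo.jpg")

# 3. Audio: load a waveform resampled to 16 kHz mono; an ASR system
# would then transcribe it to text.
waveform, sample_rate = librosa.load("clip.wav", sr=16000)

# 4. Video: sample roughly one frame per second for image analysis.
capture = cv2.VideoCapture("talk.mp4")
fps = int(capture.get(cv2.CAP_PROP_FPS)) or 1
frames = []
index = 0
ok, frame = capture.read()
while ok:
    if index % fps == 0:
        frames.append(frame)
    ok, frame = capture.read()
    index += 1
capture.release()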
Milestone 3
Creation of Embeddings and Context-Rich Paragraphs
To increase the context in paragraphs created from the collected data using neural coreference
resolution, we can follow these steps:
1. Collect and Preprocess Data: Gather the relevant data from the different sources and
preprocess it to extract the paragraphs or textual content we want to enhance with coreference
resolution. Ensure the data is cleaned and in a format suitable for further processing.
2. Neural Coreference Resolution: Neural coreference resolution aims to identify and replace
pronouns and noun phrases with their appropriate antecedents to improve context. We can use the
NeuralCoref library in Python, which is built on top of spaCy and uses a neural-network approach
to coreference resolution. Applying the model to our paragraphs resolves references and expands
the context (a sketch follows this list).
3. Context Expansion: After performing coreference resolution, replace pronouns and noun
phrases with their resolved antecedents. This will provide more context and improve the
overall readability of the paragraphs. Take care to maintain the coherence of the
text while expanding the context.
4. Vectorization: Once the context in the paragraphs has been expanded, we can use
techniques like BERT or other text-embedding methods to convert the text into fixed-length
vector representations, encoding each expanded paragraph as a vector (see the embedding
sketch below).
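As a sketch of step 2, the snippet below resolves coreferences with NeuralCoref. Note that NeuralCoref requires spaCy 2.x and a downloaded English model such as en_core_web_sm; the sentence is a toy example.

import spacy
import neuralcoref

# Load a spaCy 2.x English pipeline and attach the NeuralCoref component.
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister adopted a dog last year. She loves him very much.")
print(doc._.has_coref)       # True if any coreference clusters were found
print(doc._.coref_resolved)  # pronouns replaced by their antecedents, e.g.
                             # "My sister adopted a dog last year.
                             #  My sister loves a dog very much."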
By incorporating neural coreference resolution, we can enhance the context and readability of
the paragraphs, making them more informative and coherent.
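As a sketch of step 4, the expanded paragraphs can then be turned into fixed-length vectors with a BERT-based sentence encoder; the sentence-transformers model named below is one common choice, not a project requirement.

from sentence_transformers import SentenceTransformer

# Any BERT-family sentence encoder works here; this one yields 384-d vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

paragraphs = [
    "My sister adopted a dog last year. My sister loves a dog very much.",
    "The ingestion pipeline converts PDFs, audio, and video into text.",
]
vectors = model.encode(paragraphs)  # NumPy array of shape (2, 384)
print(vectors.shape)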
Milestone 4
Feeding the Paragraphs into the Database
Store in Vector Database: Take the vector database selected in Milestone 1 (e.g., Milvus, Qdrant,
Weaviate, or Elasticsearch) and store the expanded paragraphs along with their associated metadata in
the database. Ensure the vectors are indexed appropriately for efficient retrieval.
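A minimal sketch of this step, assuming Qdrant is the database chosen in Milestone 1 (the other candidates expose analogous clients) and reusing the model, paragraphs, and vectors from the sketches above:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance for illustration; a real deployment would connect to a server.
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="paragraphs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store each expanded paragraph with its vector and source metadata.
client.upsert(
    collection_name="paragraphs",
    points=[
        PointStruct(id=i, vector=vec.tolist(),
                    payload={"text": text, "source": "sample.pdf"})
        for i, (vec, text) in enumerate(zip(vectors, paragraphs))
    ],
)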
Retrieval: To retrieve the relevant paragraphs, embed the user query with the same model and
pass it to the vector database. The database then uses its indexing and search capabilities to
return the most relevant paragraphs by vector similarity.
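Continuing the Qdrant sketch above, retrieval looks like this:

query = "What does the ingestion pipeline do?"
hits = client.search(
    collection_name="paragraphs",
    query_vector=model.encode(query).tolist(),
    limit=3,  # the top-n paragraphs handed to the LLM as context
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])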
No. of Hours/week?
I can commit about 5 hours/day, which is around 30 hours/week, for this project.
I am deeply excited about the field of artificial intelligence, specifically Natural Language Processing. The
recent surge in NLP has presented new and exciting opportunities to work on diverse
problems and develop innovative solutions.
Joining this program would allow me to ride the wave of advancement in the world of Large Language Models
(LLMs) and make a significant contribution. With the guidance of experienced mentors and the supportive
Karmayogi community, I am confident that this project is the perfect platform for me to embark on my
journey. It provides the ideal environment to apply my knowledge and skills, keeping me at the forefront of
AI developments.
By participating in this program, I aim to make a meaningful impact in the field of LLMs and leverage the
power of artificial intelligence to address real-world challenges. This opportunity will serve as a launchpad
for me to pursue my ambitions, learn from experienced professionals, and actively contribute to the
ever-evolving world of AI.
Relevant Projects:-
1) EduBot:-
Description:- An all-in-one learning platform for all kinds of users.
Link:- https://github.com/swastkk/eduBOT
2) Text Enhancer:-
Description:- A Streamlit application that summarizes documents using LaMini-LM.
Link:- https://github.com/divyansh-tripathi7/TextEnhancer
3) Plagiarism Detector:-
Description:- Detects plagiarism in text using burstiness and perplexity metrics.
Link:- https://github.com/divyansh-tripathi7/PlagiarismDetector
4) VisualQnA:-
Description:- A Visual Question Answering API and app using ViLT, FastAPI, and
Streamlit.
Link:- https://github.com/divyansh-tripathi7/VisualQnA
I would like to express my deepest gratitude for the opportunity to submit my proposal and for
the guidance and support provided throughout this process. Your expertise and mentorship
have been invaluable in shaping my understanding of the project and its implementation. I am
truly grateful for the platform provided by Karmayogi Organization, and I look forward to the
journey ahead, working together to leverage the power of LLMs for question answering. Thank
you for believing in my potential and for the opportunity to contribute to this meaningful project.
Sincerely,
Divyansh Tripathi