Providing Accurate Data: How Does It Work?
Oral squamous cell carcinoma (OSCC) is a type of cancer that develops in the
squamous cells lining the inside of your mouth and throat. Squamous cells are thin,
flat cells that form the surface layer of the skin and many other tissues in the body.
Key risk factors for OSCC include smoking, other forms of tobacco use, and alcohol, all of which damage the DNA of the squamous cells lining the mouth and throat.
SLIDE 11
First, what is an LLM?
Large Language Model (LLM)
Large Language Models (LLMs) are a type of artificial intelligence model that excels at understanding and generating human-like text.
How does it work?
It learns from tons of information on the internet, figuring out how
words and sentences fit together.
Large Language Models (LLMs) work by learning from massive
amounts of text data, using a neural network architecture called
transformers. They understand language patterns, context, and
relationships during pre-training. Afterward, they can be fine-tuned for
specific tasks. During inference, LLMs generate text based on learned
patterns, making them versatile for tasks like translation,
summarization, and question-answering.
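To make the inference step concrete, here is a minimal sketch using the Hugging Face transformers library (the model "gpt2" and the prompt are illustrative choices, not part of our project):

    from transformers import pipeline

    # Load a small pre-trained language model for text generation.
    generator = pipeline("text-generation", model="gpt2")

    # At inference time, the model continues the prompt using patterns
    # it learned during pre-training.
    result = generator("Oral cancer risk factors include", max_new_tokens=30)
    print(result[0]["generated_text"])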
Examples:
GPT-3 is a famous one. It's like having a smart friend who's really
good with words and can help with all sorts of language-related tasks.
LangChain:
LangChain is a framework that makes it easier to create applications using large
language models (LLMs). It provides:
Tools: These are resources that help in building and customizing applications.
Abstractions: These are simplified models or templates that hide complex
details, making development easier.
Developers use LangChain to create and tailor applications for various tasks, such as chatbots, question answering over documents, and summarization.
In essence, LangChain helps streamline and simplify the development process for
applications that leverage the power of large language models.
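As a minimal illustration (assuming a recent LangChain release; the model choice is a placeholder, and any chat model LangChain supports would work), a prompt template and an LLM can be composed into a small application:

    from langchain_core.prompts import PromptTemplate
    from langchain_openai import ChatOpenAI

    # An abstraction: the template hides the details of prompt construction.
    prompt = PromptTemplate.from_template("Explain {topic} in one sentence.")
    llm = ChatOpenAI(model="gpt-3.5-turbo")

    # Compose the pieces into a chain and run it.
    chain = prompt | llm
    print(chain.invoke({"topic": "oral squamous cell carcinoma"}).content)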
SLIDE 12
1. Load: This step involves loading different types of data sources such as JSON
files, URLs, documents, and other file formats.
2. Split: The loaded data is then split into smaller, manageable chunks. This is
often done to ensure the data can be processed more efficiently and to fit
within the constraints of downstream models.
3. Embed: Each of these chunks is then transformed into a numerical
representation (embedding). This involves converting the textual data into
vectors that capture the semantic meaning of the text.
4. Store: The embeddings are stored in a database or some storage system. This
allows for efficient retrieval of relevant chunks based on a query.
5. Retrieve: When a question is posed to the system, it retrieves the most
relevant chunks of text from the stored embeddings. This retrieval step is
based on the similarity between the query and the stored embeddings.
6. Prompt: The retrieved chunks are then used to create a prompt for the
language model.
7. LLM (Large Language Model): The prompt is fed into a large language
model, which processes the information and generates an answer.
8. Answer: The final step is delivering the answer generated by the language
model to the user.
This flow enables the system to answer questions by leveraging stored embeddings
and a large language model, ensuring efficient and relevant retrieval of information
for accurate responses.
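A minimal sketch of this whole flow in LangChain might look like the following (the file name, model choices, and chunking numbers are illustrative, and exact import paths vary between LangChain versions):

    from langchain_community.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_openai import ChatOpenAI
    from langchain.chains import RetrievalQA

    docs = TextLoader("oscc_notes.txt").load()                      # 1. Load
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                              chunk_overlap=150)
    chunks = splitter.split_documents(docs)                         # 2. Split
    embeddings = HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-large")                       # 3. Embed
    vectordb = Chroma.from_documents(chunks, embeddings)            # 4. Store
    retriever = vectordb.as_retriever()                             # 5. Retrieve
    qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(),
                                     retriever=retriever)           # 6-7. Prompt + LLM
    print(qa.invoke("What are the risk factors for OSCC?"))         # 8. Answer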
SLIDE 15
The Text Splitting pre-processing step is crucial for managing large texts and making
them more suitable for processing by downstream components. The
RecursiveCharacterTextSplitter class is designed to handle this task efficiently. It
operates by recursively splitting text based on a list of user-defined characters,
ensuring that related pieces of text remain adjacent to each other, thus preserving their
semantic relationship.
Text Splitting with RecursiveCharacterTextSplitter
Purpose
To break large texts into smaller, manageable pieces that make sense on their
own.
How It Works
1. Start with Large Text:
o Begin with a big text, like a long article or document.
2. Initial Split:
o First, divide the text into big chunks, like paragraphs.
3. Check Chunk Size:
o If any chunk (paragraph) is too big, split it further.
4. Recursive Splitting:
o Split the big chunks into smaller pieces, like sentences.
o If a sentence is too long, split it into smaller parts, like phrases.
5. Keep Going:
o Repeat this process until all pieces are a manageable size.
Benefits
Easier to Handle: Smaller pieces are easier to work with.
Makes Sense: Each piece still makes sense on its own.
Flexible: Adjusts to different text sizes automatically.
Example
You have a long document.
Split it into paragraphs.
If a paragraph is too long, split it into sentences.
If a sentence is too long, split it into shorter parts.
This way, you end up with small, meaningful chunks of text that are easy to process
further.
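A sketch of this splitter in LangChain (the separator list below makes the default strategy explicit, and chunk_overlap=150 is an illustrative value):

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    long_document = open("oscc_notes.txt").read()  # illustrative source file

    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", " ", ""],  # paragraphs, then lines, words, characters
        chunk_size=1000,      # target upper limit per chunk, in characters
        chunk_overlap=150,    # characters shared between neighbouring chunks
    )
    chunks = splitter.split_text(long_document)
    print(len(chunks), max(len(c) for c in chunks))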
SLIDE 16
Chunk Size:
Refers to the maximum desired length of each individual chunk in
characters.
It sets a target upper limit for the number of characters in each segment.
In our code, chunk_size is set to 1000, indicating that, ideally, each chunk
should contain no more than 1000 characters.
Chunk Overlap:
Refers to the number of characters that consecutive chunks share.
Overlapping the end of one chunk with the start of the next preserves context
that would otherwise be cut at a chunk boundary.
In the sketch above, chunk_overlap is set to 150, so neighbouring chunks share
up to 150 characters.
SLIDE 17
Embeddings
• Dimensionality Reduction: Embeddings represent text as dense numerical
vectors of fixed, relatively low dimension, which are far easier to store and
compare than raw text.
• Semantic Similarity: Words or phrases with similar meanings are represented
by vectors that are close to each other in the embedding space. For instance,
the words "king" and "queen" would have vectors that are closer together than
"king" and "car."
SLIDE 18
INSTRUCTOR Embeddings
SLIDE 19
2. Training Objective:
Contrastive loss: This loss function pushes the model to create embeddings
(numerical representations) that group similar texts together and separate
dissimilar texts. This helps the model distinguish between relevant and
irrelevant information during information search.
Variety is Key: MEDI, the training collection used for INSTRUCTOR, contains
300 datasets from Super-NaturalInstructions. These datasets cover many
different tasks, like classification, summarization, and question answering.
Each task comes with specific instructions, guiding INSTRUCTOR on how to
handle it.
Extra Practice: MEDI also includes 30 additional datasets, giving
INSTRUCTOR even more examples to learn from and improve its text
understanding skills.
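In use, INSTRUCTOR pairs each text with a task instruction; here is a minimal sketch with the InstructorEmbedding package (the instruction wording is an illustrative example):

    from InstructorEmbedding import INSTRUCTOR

    model = INSTRUCTOR("hkunlp/instructor-large")
    # Each input is an [instruction, text] pair; the instruction steers the embedding.
    vectors = model.encode([
        ["Represent the Medicine document for retrieval:",
         "Oral squamous cell carcinoma develops in the squamous cells of the mouth."],
    ])
    print(vectors.shape)  # one embedding vector per pair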
SLIDE 20
Retriever Overview
A retriever's primary function is to fetch the most relevant documents or data points
from a vector store based on a user's query. Here's how it accomplishes this:
1. Vector Store Backbone: The vector store serves as the repository where
documents or data points are stored as vectors (numeric representations).
These vectors capture essential features or characteristics of each document or
data point.
2. Query Processing: When a user submits a query to the retriever, it takes this
query and uses it to identify which documents or data points in the vector store
are most relevant to the query.
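In LangChain, a vector store can be wrapped as a retriever in one line; a minimal sketch (the value of k is illustrative, vectordb stands for the vector store built earlier, and retriever.invoke assumes a recent LangChain release):

    # Wrap the vector store as a retriever that returns the 3 most similar chunks.
    retriever = vectordb.as_retriever(search_kwargs={"k": 3})
    docs = retriever.invoke("What are the risk factors for OSCC?")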
Search Methods
SLIDE 24
This slide's code snippet uses a retriever with the maximal marginal
relevance (MMR) search strategy on the vector database (vectordb). This
approach aims to balance the relevance and diversity of the retrieved documents.
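A sketch of what such a retriever setup typically looks like in LangChain (the k and fetch_k values are illustrative, not taken from the project code):

    retriever = vectordb.as_retriever(
        search_type="mmr",                      # maximal marginal relevance
        search_kwargs={"k": 3, "fetch_k": 10},  # pick 3 diverse results from 10 candidates
    )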
SLIDE 25
Large Language Models (LLMs) are powerful AI models trained on massive amounts
of text data to understand and generate human language. Their key features make
them versatile tools for natural language processing tasks across different
domains and applications.
SLIDE 26