
Slide 1:

Oral squamous cell carcinoma (OSCC) is a type of cancer that develops in the
squamous cells lining the inside of your mouth and throat. Squamous cells are thin,
flat cells that form the surface layer of the skin and many other tissues in the body.
Key risk factors for OSCC include smoking, smokeless tobacco, and alcohol use, which damage
the DNA of the squamous cells lining the mouth and throat.

We have designed a Clinical Decision Support System (CDSS) in the
context of OSCC. A CDSS is a computer program designed to assist healthcare
professionals, and patients themselves, in answering queries related to OSCC.
Here's how a CDSS might be helpful for OSCC:

 Providing accurate data: A CDSS can integrate information from various
sources, including symptoms, prognosis, and impacts, to provide a comprehensive
picture of the patient's condition.
 Aligning data with user needs: It can tailor the information presented to the specific
needs of the user, highlighting relevant data points and potential treatment options for
OSCC based on the specific case.
 Timely decision making: By offering quick access to relevant information, a CDSS
can help a user make timely decisions about the best course of treatment for OSCC
patients.

SLIDE 11:
First, what is an LLM?
Large Language Model (LLM)
Large Language Models (LLMs) are artificial intelligence models that excel
at understanding and generating human-like text.
How does it work?
 It learns from tons of information on the internet, figuring out how
words and sentences fit together.
 Large Language Models (LLMs) work by learning from massive
amounts of text data, using a neural network architecture called
transformers. They understand language patterns, context, and
relationships during pre-training. Afterward, they can be fine-tuned for
specific tasks. During inference, LLMs generate text based on learned
patterns, making them versatile for tasks like translation,
summarization, and question-answering.
Examples:
 GPT-3 is a famous one. It's like having a smart friend who's really
good with words and can help with all sorts of language-related tasks.
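
As a small illustration of what "generating text" looks like in code, the sketch below uses the Hugging Face transformers library with the small open gpt2 model purely as an example (it is not the model used in our system):

# Minimal text-generation sketch with the Hugging Face transformers pipeline.
# The "gpt2" model and the prompt are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Oral squamous cell carcinoma is", max_new_tokens=40)
print(result[0]["generated_text"])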
LangChain:
LangChain is a framework that makes it easier to create applications using large
language models (LLMs). It provides:

 Tools: These are resources that help in building and customizing applications.
 Abstractions: These are simplified models or templates that hide complex
details, making development easier.

Developers use LangChain to create and tailor applications for various tasks, such as:

 Document Analysis: Understanding and extracting information from
documents.
 Summarization: Condensing long texts into shorter summaries.
 Chatbots: Building conversational agents that interact with users.
 Code Analysis: Analysing and understanding programming code.

In essence, LangChain helps streamline and simplify the development process for
applications that leverage the power of large language models.

SLIDE 12

This diagram depicts the LangChain framework for a question-answering system.

Here's a step-by-step explanation of each part of the framework:

1. Load: This step involves loading different types of data sources such as JSON
files, URLs, documents, and other file formats.
2. Split: The loaded data is then split into smaller, manageable chunks. This is
often done to ensure the data can be processed more efficiently and to fit
within the constraints of downstream models.
3. Embed: Each of these chunks is then transformed into a numerical
representation (embedding). This involves converting the textual data into
vectors that capture the semantic meaning of the text.
4. Store: The embeddings are stored in a database or some storage system. This
allows for efficient retrieval of relevant chunks based on a query.
5. Retrieve: When a question is posed to the system, it retrieves the most
relevant chunks of text from the stored embeddings. This retrieval step is
based on the similarity between the query and the stored embeddings.
6. Prompt: The retrieved chunks are then used to create a prompt for the
language model.
7. LLM (Large Language Model): The prompt is fed into a large language
model, which processes the information and generates an answer.
8. Answer: The final step is delivering the answer generated by the language
model to the user.

This flow enables the system to answer questions by leveraging stored embeddings
and a large language model, ensuring efficient and relevant retrieval of information
for accurate responses.
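
The steps above can be sketched in a few lines of LangChain code. This is only an illustrative outline: exact import paths differ between LangChain versions, and the file name, embedding model, and question below are placeholders rather than our actual configuration.

# Sketch of the Load -> Split -> Embed -> Store -> Retrieve -> Prompt flow.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

docs = PyPDFLoader("oscc_guidelines.pdf").load()              # 1. Load (placeholder file)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)                       # 2. Split
embeddings = HuggingFaceInstructEmbeddings()                  # 3. Embed
vectordb = Chroma.from_documents(chunks, embeddings)          # 4. Store
retriever = vectordb.as_retriever(search_kwargs={"k": 3})     # 5. Retrieve
relevant = retriever.get_relevant_documents("What are the main risk factors for OSCC?")
context = "\n\n".join(doc.page_content for doc in relevant)   # 6. Prompt context
# 7-8. The prompt built from this context is passed to the LLM, which generates the answer.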

SLIDE 15
The Text Splitting pre-processing step is crucial for managing large texts and making
them more suitable for processing by downstream components. The
RecursiveCharacterTextSplitter class is designed to handle this task efficiently. It
operates by recursively splitting text based on a list of user-defined characters,
ensuring that related pieces of text remain adjacent to each other, thus preserving their
semantic relationship.
Text Splitting with RecursiveCharacterTextSplitter
Purpose
 To break large texts into smaller, manageable pieces that make sense on their
own.
How It Works
1. Start with Large Text:
o Begin with a big text, like a long article or document.
2. Initial Split:
o First, divide the text into big chunks, like paragraphs.
3. Check Chunk Size:
o If any chunk (paragraph) is too big, split it further.
4. Recursive Splitting:
o Split the big chunks into smaller pieces, like sentences.
o If a sentence is too long, split it into smaller parts, like phrases.
5. Keep Going:
o Repeat this process until all pieces are a manageable size.
Benefits
 Easier to Handle: Smaller pieces are easier to work with.
 Makes Sense: Each piece still makes sense on its own.
 Flexible: Adjusts to different text sizes automatically.
Example
 You have a long document.
 Split it into paragraphs.
 If a paragraph is too long, split it into sentences.
 If a sentence is too long, split it into shorter parts.
This way, you end up with small, meaningful chunks of text that are easy to process
further.
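
A minimal sketch of this recursive behaviour, using LangChain's RecursiveCharacterTextSplitter with a deliberately small chunk size so the splitting is visible (the sample text and sizes are illustrative only):

# The splitter first tries to cut on paragraph breaks ("\n\n"), then on line
# breaks, then on spaces, only falling back to mid-word cuts as a last resort.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "OSCC develops in the squamous cells of the mouth and throat.\n\n"
    "Key risk factors include tobacco and alcohol use. Early detection "
    "improves outcomes."
)
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)
for chunk in splitter.split_text(text):
    print(repr(chunk))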
Slide 16

Chunk Size:
 Refers to the maximum desired length of each individual chunk in
characters.
 It sets a target upper limit for the number of characters in each segment.
 In our code, chunk_size is set to 1000, indicating that ideally, each chunk
should contain no more than 1000 characters.

Chunk Overlap:

 Defines the number of characters by which consecutive chunks will overlap.
 This helps to maintain context when processing the split text later.
 It ensures that adjacent chunks share some content, especially at the end of the
first chunk and the beginning of the next.
 In our code, chunk_overlap is set to 200. This means the last 200 characters
of one chunk will be included at the beginning of the next chunk.
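
A short sketch of how these two parameters are passed to the splitter; the sample text is a placeholder standing in for the loaded documents:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target upper limit of roughly 1000 characters per chunk
    chunk_overlap=200,  # consecutive chunks share up to 200 characters of context
)
sample_text = "Oral squamous cell carcinoma develops in the lining of the mouth. " * 100
chunks = text_splitter.split_text(sample_text)
print(len(chunks), "chunks produced")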

SLIDE 17

Embeddings

In Natural Language Processing (NLP), an embedding is a representation of text
where words, phrases, or entire documents are mapped to vectors of real numbers.
These vectors capture the semantic meaning of the text in a way that is understandable
to machine learning models.

• Dimensionality Reduction: Embeddings reduce the high-dimensional space of text
(e.g., vocabulary size) into a lower-dimensional continuous vector space.

• Semantic Similarity: Words or phrases with similar meanings are represented by
vectors that are close to each other in the embedding space. For instance, the words
"king" and "queen" would have vectors that are closer together than "king" and "car".

Slide 18:

INSTRUCTOR Embeddings

 Task Specificity: By incorporating task instructions, INSTRUCTOR
embeddings are more accurately tailored to the specific needs of the task,
leading to better performance.
 One Model, Many Tasks: A single INSTRUCTOR model can handle various
tasks effectively, reducing the need for multiple specialized models.
 Robustness: INSTRUCTOR embeddings are robust to changes in
instructions, meaning they can adapt to slightly different task descriptions
without significant performance loss.
Aspect-by-aspect comparison of Traditional Embeddings and INSTRUCTOR Embeddings:

 Design:
o Traditional Embeddings: Create a single, fixed representation of text.
o INSTRUCTOR Embeddings: Combine text input with task instructions to create task-specific embeddings.
 Functionality:
o Traditional: Task-agnostic; the same embedding is used for different tasks.
o INSTRUCTOR: Task-aware; embeddings vary based on provided task instructions.
 Task Adaptation:
o Traditional: Requires additional fine-tuning or separate models for different tasks.
o INSTRUCTOR: One model can adapt to multiple tasks using instructions.
 Versatility:
o Traditional: General-purpose, but not optimized for specific tasks without extra steps.
o INSTRUCTOR: Highly versatile, optimized for multiple tasks without needing separate models.
 Performance:
o Traditional: Good performance, but may require task-specific tuning.
o INSTRUCTOR: State-of-the-art performance across a wide range of tasks.
 Robustness:
o Traditional: Fixed embeddings may not handle changes in task requirements well.
o INSTRUCTOR: Robust to changes in task instructions, adaptable to slightly different descriptions.
 Example Usage:
o Traditional: The embedding for "Apple is a great company." is the same regardless of the task.
o INSTRUCTOR: The embedding changes based on whether the task is sentiment analysis or entity recognition.
 Implementation Complexity:
o Traditional: Simpler, often requiring task-specific adaptations.
o INSTRUCTOR: More complex, integrating task instructions directly into the embedding process.
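
A small sketch of this task-awareness, assuming the InstructorEmbedding package and the hkunlp/instructor-large checkpoint; the instruction wording is only an example:

# The same sentence paired with different instructions yields different vectors.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")
sentence = "Apple is a great company."

retrieval_vec = model.encode([["Represent the sentence for retrieval:", sentence]])
sentiment_vec = model.encode([["Represent the sentence for sentiment classification:", sentence]])
# The two vectors differ because the task instruction is part of the input.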

Slide 19

1. Architecture: INSTRUCTOR uses GTR (General Text Representation) models,
initialized from T5 (Text-to-Text Transfer Transformer) models, and fine-tuned
on information search datasets.

Summary of GTR Model in INSTRUCTOR

• Base Architecture: Uses the T5 model as the foundational architecture.
• Fine-Tuning: Further trained on specific datasets to improve task
performance.
• General Text Representation: Aims to create high-quality embeddings
usable across multiple tasks.
• Task-Specific Embeddings: Produces embeddings that are tailored to the task
described by the instructions provided with the text.
• Versatility: Capable of handling diverse tasks without needing separate
models.

In essence, the GTR model in the INSTRUCTOR framework is a specialized version
of the T5 model that has been fine-tuned to generate embeddings that are both
general-purpose and task-aware, providing high performance and adaptability for a
wide range of text processing tasks.
T5 stands for Text-To-Text Transfer Transformer. It's a type of transformer-based
model developed by Google Research that is trained to perform various text-related
tasks by converting both the input and output into text-to-text format. This approach
allows T5 to handle a wide range of tasks with a unified architecture, including tasks
like translation, summarization, question answering, and more, by framing them all as
text generation tasks.

2. Training Objective:

 Contrastive loss: This loss function pushes the model to create embeddings
(numerical representations) that group similar texts together and separate
dissimilar texts. This helps the model distinguish between relevant and
irrelevant information during information search.
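
As an illustrative sketch (not the exact training code of INSTRUCTOR), a contrastive objective of this kind can be written as an in-batch loss where each query embedding must score highest against its own paired text; the batch size, embedding dimension, and temperature below are arbitrary:

# InfoNCE-style contrastive loss: the matching (query, positive) pair sits on
# the diagonal of the similarity matrix, so it is treated as the correct class.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, positive_emb, temperature=0.05):
    query_emb = F.normalize(query_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    logits = query_emb @ positive_emb.T / temperature   # batch x batch similarities
    labels = torch.arange(len(query_emb))                # diagonal = correct pair
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))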

3. MEDI Dataset: The Training Ground

MEDI (Multitask Embedding Data with Instructions) is like INSTRUCTOR's
training gym.

 Variety is Key: MEDI contains 300 datasets from a source called Super-
Natural Instructions. These datasets cover many different tasks, like
classification, summarization, and question answering. Each task comes with
specific instructions, guiding INSTRUCTOR on how to handle it.
 Extra Practice: MEDI also includes 30 additional datasets, giving
INSTRUCTOR even more examples to learn from and improve its text
understanding skills.

Slide 20

Retriever Overview

A retriever's primary function is to fetch the most relevant documents or data points
from a vector store based on a user's query. Here's how it accomplishes this:

1. Vector Store Backbone: The vector store serves as the repository where
documents or data points are stored as vectors (numeric representations).
These vectors capture essential features or characteristics of each document or
data point.
2. Query Processing: When a user submits a query to the retriever, it takes this
query and uses it to identify which documents or data points in the vector store
are most relevant to the query.

Search Methods

The retriever employs several methods to determine relevance:

1. Similarity Search Method (Traditional Approach):
o Description: This method retrieves documents or data points that are
most similar to the query vector in terms of their vector
representations.
o Process:
 Computes similarity scores between the query vector and
vectors of all documents or data points in the vector store.
 Ranks the documents or data points based on these similarity
scores.
 Returns the top-ranked documents or data points as the most
relevant results.
2. Maximum Marginal Relevance (MMR):
o Description: MMR is used to balance relevance and diversity in search
results.
o Process:
 Initially retrieves the most relevant document or data point
based on similarity.
 Subsequent retrievals prioritize diversity by selecting
documents or data points that are dissimilar to those already
retrieved, ensuring a broader range of information is presented.
 This method helps prevent redundancy and provides a more
comprehensive view of the topic.
3. Specifying Top k:
o Description: Allows users to specify the number of top-ranked
documents or data points they want to retrieve.
o Process:
 After retrieving relevant documents or data points based on
similarity or MMR, the retriever limits the results to the
specified number, such as the top 5 or top 10 most relevant
items.
 This customization allows users to control the depth and
breadth of the retrieved information according to their needs.

Implementation and Use

 Efficiency: These methods are designed to efficiently retrieve relevant
information from the vector store without storing all documents internally.
 Flexibility: The retriever can adapt to different search requirements, from
focusing purely on similarity to incorporating diversity through MMR,
providing users with tailored and meaningful results.

In summary, a retriever leveraging vector stores as its backbone employs various
search methods like similarity search, MMR, and top-k specifications to efficiently
find and present the most relevant documents or data points to users based on their
queries. These methods enhance usability and effectiveness in information retrieval
tasks.
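
As a brief sketch, these search methods can also be invoked directly on a vector store built as in the earlier pipeline outline (vectordb is assumed to exist already); the query text is a placeholder:

# Assuming vectordb was built as in the Slide 12 sketch.
query = "What are the treatment options for OSCC?"

# Similarity search: top-k chunks ranked purely by closeness to the query vector.
similar_docs = vectordb.similarity_search(query, k=3)

# MMR: still relevant, but later picks are chosen to differ from earlier ones.
diverse_docs = vectordb.max_marginal_relevance_search(query, k=3)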

SLIDE 24

This code snippet configures a retriever with the maximal marginal
relevance (MMR) search strategy on a vector database (vectordb). This approach
aims to balance the relevance and diversity of the retrieved documents.

Here's a brief explanation and a possible setup in Python:

# Assuming vectordb is already set up and connected
retriever = vectordb.as_retriever(search_type="mmr",
                                  search_kwargs={"k": 3})

 search_type="mmr": Specifies that the retriever should use Maximal


Marginal Relevance. This helps in retrieving results that are both relevant and
diverse.
 search_kwargs={"k": 3}: Limits the number of retrieved documents to 3.

To provide more context or an example of how this can be integrated into a larger
system, please let me know more about your specific use case or the surrounding
code.
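
Once configured, the retriever can be queried directly. A minimal usage sketch (the question is only an example):

# Fetch the three most relevant (and diverse) chunks for a query.
docs = retriever.get_relevant_documents("What are the early symptoms of OSCC?")
for doc in docs:
    print(doc.page_content[:200])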

SLIDE 25

Large Language Models (LLMs) are powerful AI models trained on massive amounts
of text data to understand and generate human language. Here are some key features:

1. Understanding Text: LLMs analyze the meaning of sentences, identify
relationships between words, and grasp the overall sentiment of a piece of text.
2. Generating Text: They can generate creative text in various formats,
including articles, stories, poems, and more, based on the input and desired
style.
3. Adaptability: LLMs demonstrate adaptability by being able to perform
different tasks such as text summarization, translation, question answering,
and more. They can also adapt to different contexts, learning from new data to
improve performance.

These features make LLMs versatile tools for natural language processing tasks
across different domains and applications.

SLIDE 26

TheBloke/wizardLM-7B-HF is an enhanced version of LLaMA (Large Language
Model Meta AI), packaged and published on the Hugging Face hub by TheBloke, who is
known for making large language models widely available. Here are its key features:

 Understanding and Generating Text: Like other large language models,
wizardLM-7B-HF can understand the meaning of text and generate human-like
responses across various formats and styles.
 Actionability: This model is designed to not only understand and generate
text but also to act upon instructions in a more comprehensive and flexible
manner. This could involve tasks such as executing commands based on
textual input or interacting with systems through natural language interfaces.
 Scale: It boasts a size of 7 billion parameters (7B), indicating its capacity to
handle complex language tasks and generate nuanced responses.
 Hugging Face (HF): Hosted on the Hugging Face model hub, it benefits from
the ecosystem and community support provided by Hugging Face.
Overall, TheBloke/wizardLM-7B-HF represents an advanced iteration of large
language models, emphasizing both language understanding and actionable responses,
suitable for a wide range of applications requiring sophisticated AI capabilities.
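
As a hedged sketch, the model can be loaded from the Hugging Face hub with the transformers library; note that a 7-billion-parameter model requires substantial GPU memory, and the prompt and generation settings below are illustrative only:

# Load TheBloke/wizardLM-7B-HF and generate a response.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/wizardLM-7B-HF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generate("What lifestyle factors increase the risk of oral cancer?",
               max_new_tokens=128)[0]["generated_text"])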
