
SOFTWARE ENGINEERING LAB

DOCTOR’S MULTI-MODAL RAG AI-ASSISTANT

Name: Kshitij Shah


Roll No: 221070060
Batch: B
Branch: Computer Science

Phase 1: Foundational Details and Problem Analysis

1.1 Problem Statement

Title: Multi-Modal Retrieval-Augmented Generation (RAG) for Healthcare
Description:
Modern medical professionals are overwhelmed by the sheer volume of
data they must manage daily—clinical notes, research papers, medical
images, case studies, and even video resources like surgeries or patient
consultations. This project aims to develop a comprehensive AI
Assistant for Doctors, leveraging Multi-Modal Retrieval-Augmented
Generation (RAG) to synthesise information across various data
formats, including PDFs, images, websites, and YouTube videos.

Key Challenges:

● Effective data scraping and preprocessing from diverse formats.
● Ensuring data privacy and compliance with medical regulations.
● Providing accurate and context-aware insights without biases.

Objectives:

1. Streamline data retrieval and analysis for doctors.
2. Support informed decision-making with multi-modal data synthesis.
3. Enhance productivity by automating tedious information aggregation tasks.
4. Ensure secure and ethical usage of sensitive medical data.

1.2 Software Scope

The software will cover the following functionalities:

1. Multi-Modal Data Input: Accepts PDFs, medical images, website content, and YouTube videos.

Scope of the Multi-Modal Retrieval-Augmented Generation (RAG) for Doctors

The scope of the AI-powered assistant for doctors is vast, focusing on data ingestion, processing, and intelligent retrieval from diverse multimodal sources. The project aims to enhance clinical decision-making by providing real-time insights from uploaded PDFs, images, websites, and YouTube videos.

1. Purpose

● To provide doctors with a consolidated AI platform capable of handling diverse data formats and extracting actionable insights.
● To reduce the time spent on data analysis by enabling faster, more accurate, and context-aware responses.

2. Functional Scope

The platform will enable:

1. Multimodal Data Integration:
○ Process PDFs, medical images, and website content to extract relevant information.
○ Utilize APIs like YouTubeTranscriptApi for automatic transcription and summarization of medical video content.
2. Contextual Query Handling:
○ Use an advanced RAG architecture with Llama 3.2 to retrieve and synthesize content relevant to user queries.
3. Scalable Deployment:
○ Build the application to scale across clinics and hospitals with robust deployment strategies, ensuring reliability and performance.

3. Features
Feature | Description
PDF Parsing | Extract textual and tabular data using libraries like PyMuPDF.
Image Data Extraction | Utilize transformers and models for text/image analysis.
Web Content Retrieval | Scrape websites using BeautifulSoup for contextual data extraction.
Video Summarization | Convert video transcripts into meaningful summaries for doctors.
Interactive UI | Build an intuitive UI using Streamlit for easy data uploads and query handling.

4. Non-Functional Scope
Aspect | Description
Performance | Ensure response time < 2 seconds for most queries.
Scalability | Handle concurrent requests from multiple users.
Security | Encrypt sensitive data using modern cryptographic libraries.
Compatibility | Support various platforms, including desktop and mobile devices.
5. Limitations

● Excludes processing of dynamic or large multimedia files due to initial storage constraints.
● YouTube videos exceeding 30 minutes will not be processed in the current version.

Visualisations

1. System Workflow Diagram:
○ A flowchart showing data flow from multimodal input to RAG processing and query resolution.
○ Key Components:
■ Input Sources (PDFs, Images, Websites, Videos)
■ Preprocessing Layers (Tokenization, OCR, API Calls)
■ RAG Pipeline (Data Indexing, Query Matching, Output Synthesis)
2. Modular Architecture:
○ Illustrates subsystems such as the multimodal input processor, core Llama 3 integration, and deployment layer.

Out of Scope:

● Diagnosing patient conditions directly.
● Data processing of non-medical content.
● Handling physical or handwritten medical documents.

Assumptions:

● Input data is structured and related to healthcare.
● Users have basic technical knowledge to upload files and provide URLs.
Phase 1B: Estimation & Risk Analysis

This phase addresses resource estimation, time and cost estimation, and risk planning.

1.3 Estimated Resources

1. Technical Resources:
○ Development Tools: Streamlit, FAISS, PyTorch, NLTK, Groq
API, Transformers library.
○ Hardware:
■ Development machine: NVIDIA GPUs with 16GB VRAM
or higher.
■ Storage: At least 1TB SSD for model weights, datasets,
and logs.
○ APIs/Frameworks:
■ Groq API for Llama model integration.
■ PyMuPDF for PDF parsing.
■ YouTube Transcript API for video transcription.
2. Human Resources:
○ Project Manager (1): To oversee timelines, deliverables, and
team coordination.
○ Developers (2-3): Skilled in Python, machine learning, and
web development.
○ QA Engineers (1-2): For software quality and validation.
○ UI/UX Designer (1): To ensure user-friendly interaction.
3. Time Resources:
○ Development Period: 12-16 weeks.
○ Testing and Quality Assurance: Final 2-3 weeks.
○ Project Timeline Breakdown:
1. Phase 1 (Weeks 1-3): Requirements gathering, research, and
prototype design.
2. Phase 2 (Weeks 4-7): Core development—RAG implementation
with multimodal data processing.
3. Phase 3 (Weeks 8-10): Testing, optimization, and integration.
4. Phase 4 (Weeks 11-16): Deployment, quality assurance, and user
feedback incorporation.

1.4 Time and Cost Estimation

Time estimation uses the COCOMO II model, considering factors such as system complexity, team expertise, and resources. A worked sketch of the effort formula follows.
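For reference, here is a minimal sketch of the COCOMO II effort equation; the ~4 KSLOC size and the nominal scale-factor and effort-multiplier values are illustrative assumptions, not measured project data.

# Minimal COCOMO II effort sketch. A = 2.94 and B = 0.91 are the standard
# calibration constants; 18.97 is the sum of the five scale factors at their
# nominal ratings. The 4 KSLOC size is an assumed figure for this project.

def cocomo_ii_effort(ksloc, scale_factor_sum=18.97, effort_multiplier=1.0):
    """Return estimated effort in person-months."""
    A, B = 2.94, 0.91
    E = B + 0.01 * scale_factor_sum        # exponent derived from scale factors
    return A * (ksloc ** E) * effort_multiplier

effort_pm = cocomo_ii_effort(4.0)          # ~13.5 person-months for 4 KSLOC
print(f"Estimated effort: {effort_pm:.1f} person-months")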

Time Estimation Table

Activity | Effort (Person-Weeks) | Timeline (Weeks) | Team Involved
Requirements Analysis | 4 PW | Weeks 1-3 | PM, Developers, Domain Experts
Prototype Design | 2 PW | Weeks 2-3 | Developers, UI/UX Designer
Core Development | 16 PW | Weeks 4-7 | Developers, API Integrators
Testing and Optimization | 10 PW | Weeks 8-10 | QA Engineers
Deployment and QA | 8 PW | Weeks 11-16 | All Teams

Total Effort: 40 Person-Weeks
Team Size: 5 people (1 PM, 3 developers, 1 QA/Designer)
Cost Estimation (assuming the following hourly rates):

● Project Manager: ₹1,200/hour
● Developer: ₹800/hour
● QA Engineer: ₹700/hour
● UI/UX Designer: ₹600/hour

Activity | Hours | Resource Cost (₹) | Team Cost (₹)
Requirements Analysis | 60 | (1×₹1,200) + (3×₹800) | ₹3,600 + ₹14,400 = ₹18,000
Prototype Design | 40 | (3×₹800) + (1×₹600) | ₹2,400 + ₹600 = ₹3,000
Core Development | 200 | (3×₹800) | ₹48,000
Testing and Optimization | 100 | (1×₹700) + (2×₹800) | ₹23,000
Deployment and QA | 120 | (1×₹1,200) + (2×₹700) | ₹26,800

Total Cost: ₹1,18,800

2. Flowchart: Multi-Modal RAG Workflow

A simplified data flow for the Multi-Modal RAG system:

1. Input Sources: PDFs, Images, Websites, YouTube Videos.
2. Preprocessing: Data extraction (PyMuPDF, BeautifulSoup, YouTubeTranscriptApi).
3. Data Indexing: FAISS similarity search for efficient retrieval.
4. Model Processing:
○ Text: Processed via Llama 3.
○ Images: Processed via LLaVA.
5. Output: Contextual insights and assistance delivered via the Streamlit UI.
1.5 Risk Analysis and RMMM plan:

Risk analysis is the process of identifying potential issues that may affect the project, assessing their impact, and preparing mitigation strategies. For a multi-modal RAG system for doctors, potential risks can arise from several areas, such as technology, data sources, user adoption, and integration challenges. The risks are structured as follows:

Risk Identification

1. Data Integrity Risks:
○ Description: Inconsistent or corrupt data from the PDF, image, and video sources might lead to incorrect insights.
○ Impact: Could compromise the accuracy of the AI assistant's
responses, especially when assisting doctors in clinical
decision-making.
○ Likelihood: Medium
2. System Integration Risks:
○ Description: Difficulty integrating the multi-modal data
sources (PDFs, images, YouTube videos) effectively.
○ Impact: Could delay project milestones, hinder data
processing, or affect system performance.
○ Likelihood: High
3. Performance Risks:
○ Description: Slow processing due to large video or image
files and deep learning models.
○ Impact: Poor user experience, especially for doctors
needing quick responses.
○ Likelihood: High
4. Model Inaccuracy Risks:
○ Description: Risk of the model misunderstanding or
misinterpreting clinical data, which could lead to errors in
assisting doctors.
○ Impact: Could result in incorrect diagnosis or medical
advice, posing a serious risk.
○ Likelihood: Medium
5. Compliance Risks:
○ Description: The AI model may fail to comply with medical
regulations (e.g., HIPAA) or data privacy laws (GDPR).
○ Impact: Could lead to legal liabilities and loss of trust.
○ Likelihood: Low

Risk Mitigation Strategies:

1. Data Integrity:
○ Implement data validation checks for the PDFs, images, and
video files before processing.
○ Use OCR libraries for image-based text extraction (e.g.,
Tesseract).
○ Preprocess and clean the data sources to avoid corrupt or
incomplete data.
2. System Integration:
○ Break down the integration into smaller modules and test
each data type (PDF, image, video) before full integration.
○ Use modular APIs to scrape and process data in separate
layers.
3. Performance:
○ Optimize deep learning models by fine-tuning the
parameters for lower computational overhead (using
libraries like PyTorch and Groq).
○ Use GPU acceleration for processing large files and videos.
4. Model Inaccuracy:
○ Regularly train the AI model with accurate and up-to-date
medical datasets.
○ Implement a feedback loop where doctors can verify the
AI's answers and correct it when necessary.
5. Compliance:
○ Ensure the system stores and processes data according to
relevant medical and data protection regulations.
○ Encrypt sensitive patient data and implement strong access control mechanisms (a minimal encryption sketch follows this list).
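As one concrete option for the encryption point above, here is a minimal sketch using the Fernet recipe from the cryptography library; key management (secure storage, rotation) is out of scope for this sketch.

# Hedged sketch: symmetric encryption of sensitive patient data with Fernet.
# In production the key must come from a secure secret store, not be
# generated ad hoc as shown here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # store securely; never hard-code
cipher = Fernet(key)

record = b"Patient: Jane Doe; Dx: hypertension"    # illustrative data only
token = cipher.encrypt(record)         # ciphertext safe to persist
assert cipher.decrypt(token) == record # decrypt raises InvalidToken if tampered with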

RMMM Plan (Risk Mitigation, Monitoring, and Management Plan)


The RMMM plan defines how risks will be managed, mitigated, and
monitored throughout the project.

Risk Mitigation Strategies:

● Pre-emptive Data Scraping Checks: Implement a monitoring system for data quality and integrity before processing inputs like PDFs, images, and videos.
● Tech Stack Backup Plan: Use alternative tools in case one
technology fails (e.g., switch between different OCR tools or
scraping libraries).
● Model Retraining: Schedule periodic retraining of the model to
improve accuracy based on real-world usage.

Risk Monitoring Strategies:

● Periodic Testing: Regularly test the system for bugs and performance bottlenecks.
● Continuous Monitoring Tools: Implement a logging system to
track errors and performance in real time.

Risk Management Strategies:

● Project Reviews: Hold weekly reviews to assess the risk status.


● Dedicated Risk Management Team: Assign roles to specific team
members for addressing and escalating risks.

1.6 Project Schedule Plan

1. Objectives

● Break down the project into manageable tasks.


● Define dependencies and allocate resources effectively.
● Identify milestones to track progress and ensure timely delivery.

2. Tools Used for Scheduling

● Gantt Chart for task visualization and dependencies.


● PERT/CPM Charts for time estimation and critical path
identification.
● Task Management Tools: Trello or Jira for real-time tracking.

Schedule Details

Week-wise Tasks and Milestones

Week | Task | Deliverables | Dependencies
1-3 | Requirements Gathering and Research | Problem Statement, Scope Document | Stakeholder Meetings
4-7 | Core Development | Functional RAG Pipeline, Multimodal Data Handling | Design Documents, Resource Availability
8-10 | Testing and Optimization | Bug-Free System, Performance Metrics | Core Development Completion
11-13 | Deployment Preparation | Deployment Plan, Final SRS Document | Successful Testing
14-16 | Deployment, QA, Documentation | Deployed System, Project Report | Deployment Readiness
Visual Timeline: Gantt Chart Breakdown

Task Dependencies

1. Requirements Gathering: Must be completed before development begins.
2. Core Development: Critical path task with multiple
sub-dependencies (e.g., RAG pipeline, multimodal processing).
3. Testing: Cannot start until core features are implemented.
4. Deployment: Final phase, dependent on testing and integration.

Critical Path Analysis

Using a PERT/CPM chart, the critical path is identified as follows:

1. Requirements Gathering → Design → Core Development → Testing → Deployment.
2. Total estimated time for the critical path: 12-14 weeks, with buffer time for contingencies.

Gantt Chart: Project Schedule

A detailed schedule that allocates a timeframe for each task and milestone.

Task | Start Week | End Week | Duration (Weeks)
Requirements Analysis | 1 | 3 | 3
Prototype Design | 2 | 3 | 2
Core Development | 4 | 7 | 4
Testing and Optimization | 8 | 10 | 3
Deployment and QA | 11 | 16 | 6
Challenges and Mitigation

Challenge | Impact | Mitigation Plan
Resource Unavailability | Project Delays | Early procurement and backups
Task Overlap | Inefficiency | Dependency management in tools
Phase 2: Core Development

1.7 Software Quality Assurance (SQA) Plan

Introduction

The SQA plan defines the activities and processes that will ensure the software meets the required quality standards. It outlines the objectives, strategies, and tools used to assess and enhance the quality of the multi-modal RAG system.

Objectives of SQA

● Ensure that the software meets the specified requirements.


● Identify and resolve defects in the software early in the
development process.
● Ensure compliance with industry standards and best practices for
software development.
● Conduct testing across multiple modalities (PDFs, images,
websites, YouTube videos) to ensure data scraping accuracy and
system functionality.

SQA Activities

● Requirement Verification: Ensure the requirements defined in the SRS are feasible and complete.
● Design Verification: Verify that the software design meets the
requirements.
● Code Quality Assurance: Conduct code reviews to ensure coding
standards and best practices are followed.
● Testing: Perform unit testing, integration testing, system testing,
and acceptance testing.
● Performance Monitoring: Ensure the system performs efficiently
with respect to data input from PDFs, images, websites, and
YouTube videos.

Standards and Best Practices


● Code Quality: Use static analysis tools such as SonarQube to
assess code quality.
● Testing Frameworks: Use PyTest for Python testing and integrate
with CI/CD pipelines for automated testing.
● Documentation: Ensure all code is properly documented, and
technical documentation is maintained for future reference.

Test Strategy

● Unit Testing: Each module (e.g., PDF data extraction, image scraping) will be tested independently.
● Integration Testing: Test the interaction between different
components, such as the integration of the data extraction
modules with the RAG system.
● System Testing: Test the entire system’s ability to handle
real-world data and provide correct answers.
● User Acceptance Testing (UAT): Involve a group of doctors to
validate the system's usefulness and accuracy.

Risk Management

● Risks Identified:
○ Performance degradation when processing large PDFs or
images.
○ Accuracy of data extraction from websites or videos.
● Mitigation Plan:
○ Implement efficient algorithms and caching mechanisms to
handle large data inputs.
○ Use robust scraping and transcription techniques to ensure
data accuracy.

Tools and Libraries for SQA

● Testing Tools: PyTest, Unittest


1.8 Project Plan

The project is divided into four major phases as follows:

Phase 1: Foundation Details and Problem Analysis

● Understand the requirements for processing medical data (PDFs, images, websites, YouTube videos).
● Analyze how data from these sources can be integrated into a
unified system.
● Implement initial functionality to process and extract content
from these sources (images, PDFs, webpages, and videos).
Phase 2: Core Development

● Develop the chunking and embedding mechanisms for processing the text extracted from various inputs.
● Set up the FAISS index for efficient retrieval of relevant data
chunks based on a user's query.
● Integrate Groq's API for generating responses based on the
processed data.
● Ensure that the system can handle different input formats (PDFs,
images, URLs, and YouTube videos) smoothly.
Phase 3: Testing and Deployment

● Develop and test the user interface using Streamlit.


● Implement user authentication (login/signup) to secure access to
the system.
● Allow users to upload files and provide URLs for webpage and
YouTube video processing.
● Test the system with sample medical queries to ensure that
responses are accurate and relevant.
Phase 4: QA and Final Testing

● Conduct thorough testing to ensure the system works as expected with different data inputs.
● Test the integration of different data sources (image, PDF,
webpage, and video) to ensure the system generates correct and
coherent responses.
● Perform user acceptance testing to ensure the system meets the
needs of medical professionals.
● Deploy the application for use by doctors or healthcare
professionals, with the ability to handle a variety of medical
queries.

Technical Project Plan

1. Data Ingestion and Processing

The first phase of the project involves gathering and preprocessing different types of data inputs (text, images, PDFs, websites, and YouTube videos). The functionality is implemented in the helper functions below; a consolidated sketch follows the list.

● Image Processing:
○ encode_to_64: Converts images to base64 strings to send
them for processing.
○ image_to_text: Uses Groq's API to process an image and
extract textual information (e.g., description).
○ further_query: Allows users to ask further questions based
on the image's description.
○ complete_image_func: Combines the above functions to
process the image and generate responses to further
queries.
● PDF Processing:
○ extract_text_and_images_from_pdf: Extracts both textual
content and images from PDF files using the PyMuPDF
library.
● Web Scraping:
○ scrape_page: Scrapes text and images from a given
webpage. It uses BeautifulSoup to parse the HTML content,
and also saves any images to the local file system.
● YouTube Video Processing:
○ extract_video_id: Extracts the YouTube video ID from a
URL.
○ YouTubeTranscriptApi: Retrieves the transcript of the
YouTube video and extracts the text.
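The sketch below consolidates these helpers. Function names follow the document; the bodies are minimal, hedged reconstructions (text-only scraping, no image saving), and get_video_text is a hypothetical wrapper introduced here for illustration.

# Hedged reconstruction of the ingestion helpers described above.
import base64
import requests
import fitz  # PyMuPDF
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
from youtube_transcript_api import YouTubeTranscriptApi

def encode_to_64(image_path):
    """Read an image file and return its base64 string for the Groq API."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def extract_text_and_images_from_pdf(pdf_path):
    """Concatenate the text of every page; images could be saved similarly."""
    doc = fitz.open(pdf_path)
    return "\n".join(page.get_text() for page in doc)

def scrape_page(url):
    """Fetch a page and return its visible text via BeautifulSoup."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

def extract_video_id(url):
    """Handle both youtube.com/watch?v=... and youtu.be/... URLs."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]

def get_video_text(url):
    """Join the transcript segments of a YouTube video into one string."""
    segments = YouTubeTranscriptApi.get_transcript(extract_video_id(url))
    return " ".join(seg["text"] for seg in segments)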

2. Data Chunking and Embedding

Once the data is collected, the system performs chunking and embedding to prepare the data for RAG-based response generation (a sketch follows the list below).

● chunk_content_by_sentence: Tokenizes the extracted content into sentences for better chunking and later processing.
● generate_rag_response: Uses the query and model to generate
responses. It retrieves relevant chunks from the embedded
content using FAISS (a library for efficient similarity search).
Then, it combines these chunks to form the context for the
query.

The embeddings for each chunk are stored in a FAISS index, which is
used to retrieve the most relevant content based on the user's query.
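The following hedged sketch shows this chunk-embed-retrieve loop; the embedding model name matches the document, while build_index and retrieve_context are hypothetical helper names introduced for illustration.

# Chunking, embedding, and FAISS retrieval as described above.
import faiss
from nltk.tokenize import sent_tokenize          # requires nltk.download("punkt")
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def chunk_content_by_sentence(content):
    return sent_tokenize(content)

def build_index(chunks):
    embeddings = model.encode(chunks).astype("float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])   # exact L2 search
    index.add(embeddings)
    return index

def retrieve_context(query, chunks, index, k=5):
    query_vec = model.encode([query]).astype("float32")
    _, ids = index.search(query_vec, k)              # indices of top-k chunks
    return " ".join(chunks[i] for i in ids[0])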

3. User Interface using streamlit

The user interface is built with Streamlit, where users (medical professionals) interact with the system. The core features of the UI include the following (a minimal sketch follows the list):

● Login and Signup Pages:


○ login_page: Allows users to log in with their credentials.
○ signup_page: Allows users to create an account. User
credentials are saved in a pickle file for simplicity.
● Main Page:
○ After successful login, users can interact with the medical
AI assistant by uploading PDFs, images, or providing URLs of
webpages and YouTube videos.
○ The system combines the various types of data (image
descriptions, extracted text from PDFs, webpage content,
and video transcripts) to provide a comprehensive answer
to user queries.
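Below is a minimal, hedged sketch of this flow; the session-state gating and pickle-backed user store mirror the document's description, while the exact widget layout is an assumption.

# Simplified Streamlit login flow with session-state gating.
import pickle
import streamlit as st

def login_page():
    st.title("Doctor's AI Assistant - Login")
    username = st.text_input("Username")
    password = st.text_input("Password", type="password")
    if st.button("Log in"):
        try:
            with open("user_credentials.pkl", "rb") as f:
                user_db = pickle.load(f)
        except FileNotFoundError:
            user_db = {}
        if user_db.get(username) == password:
            st.session_state.logged_in = True
        else:
            st.error("Invalid credentials")

def main_page():
    st.sidebar.header("Upload data")
    pdf = st.sidebar.file_uploader("PDF", type="pdf")
    url = st.sidebar.text_input("Webpage or YouTube URL")
    query = st.chat_input("Ask a medical question")
    if query:
        st.write(f"Processing query: {query}")   # final_func(...) would go here

if not st.session_state.get("logged_in", False):
    login_page()
else:
    main_page()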
4. Integration and Query Handling

This phase involves integrating all the functions and ensuring smooth
interaction between the components. The query handling is done by
the following:

● final_func: This is the central function where all the data sources (image, PDF, webpage, YouTube video) are processed and the most relevant content is retrieved. The function then generates a response using the Groq API; a skeleton is sketched below this list.
● The system uses a multi-modal RAG model to merge various types
of input data, allowing the medical AI assistant to generate
comprehensive answers based on the user's query.
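A hedged skeleton of final_func is sketched below, reusing the helpers from the earlier sketches; the parameter names and the Groq model id are assumptions, not taken from the actual implementation.

# Skeleton of the central query-handling function.
def final_func(query, pdf_path=None, image_path=None,
               page_url=None, video_url=None):
    parts = []
    if pdf_path:
        parts.append(extract_text_and_images_from_pdf(pdf_path))
    if image_path:
        parts.append(image_to_text(image_path, client))  # Groq image description
    if page_url:
        parts.append(scrape_page(page_url))
    if video_url:
        parts.append(get_video_text(video_url))

    chunks = chunk_content_by_sentence(" ".join(parts))
    index = build_index(chunks)
    context = retrieve_context(query, chunks, index, k=5)

    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",                    # assumed Groq model id
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return completion.choices[0].message.content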
1.9 Requirement Analysis Modeling

1.10 Software Requirements Specification (SRS)

1. beautifulsoup4

● Purpose: Web scraping and HTML parsing.


● Use Case: Medical websites often contain useful information in HTML format.
BeautifulSoup4 allows you to extract specific content from these websites, such as
articles, research papers, or medical guidelines. This content can be processed and
fed into your system for further analysis and response generation.
● Details: It parses HTML and XML documents, making it easy to search for tags,
extract text, and navigate through HTML trees. This is critical for web scraping when
the input source is a website.

2. faiss_cpu

● Purpose: Efficient similarity search and clustering.


● Use Case: In the context of your multi-modal RAG system, FAISS (Facebook AI
Similarity Search) helps speed up the retrieval of the most relevant information from
the database, such as extracting the best answers or pieces of text that match a
doctor’s query. It allows your system to efficiently perform similarity searches across
large datasets (like medical PDFs, images, or websites).
● Details: FAISS uses advanced indexing algorithms to enable fast and
memory-efficient similarity searches, which is critical for large-scale machine learning
tasks where fast retrieval is necessary. It can handle millions of vectors, making it
perfect for AI applications dealing with large volumes of data.

3. groq

● Purpose: Hardware acceleration, specifically for AI and machine learning workloads.


● Use Case: Groq accelerators are specialized hardware used to speed up machine
learning computations. In your system, Groq can help with the real-time generation of
medical information responses by offloading and speeding up deep learning
computations, especially for models handling complex inputs like images, videos,
and large text data.
● Details: It can reduce the latency and increase the speed of the model inference,
making it essential for systems that require fast, real-time performance, like your
RAG-based medical assistant.
4. langchain_ollama

● Purpose: Integration of language models with retrieval-augmented generation.


● Use Case: Langchain facilitates the integration of large language models (LLMs) like
those from Ollama with a retrieval system. This allows you to retrieve relevant
documents or chunks of text from medical sources and combine them with
generative responses from the LLM to provide accurate and context-aware answers
to medical queries.
● Details: This integration is crucial for your RAG system, where you use both retrieval
from a knowledge base (medical documents, research papers) and generation via a
language model to provide highly relevant, informative answers.

5. nltk

● Purpose: Natural Language Processing (NLP) tasks.


● Use Case: The Natural Language Toolkit (NLTK) provides tools for processing and
analyzing human language data. In your system, NLTK will help preprocess text data
from various sources (like PDFs, websites, or YouTube transcripts) by tokenizing,
stemming, removing stop words, and normalizing text.
● Details: It’s essential for text cleaning and preprocessing tasks, ensuring that the
input data fed into your RAG model is clean and ready for further analysis, which
enhances the accuracy and relevance of the responses generated.

6. numpy

● Purpose: Scientific computing and numerical operations.


● Use Case: Numpy is fundamental for numerical operations, especially in AI models
dealing with embeddings (numeric representations of text) and large datasets. For
example, your RAG system might generate embeddings of documents, and Numpy
will be used to manipulate and process these embeddings.
● Details: It’s used for matrix operations, handling large datasets, and performing
efficient numerical calculations, making it crucial for machine learning and deep
learning workflows.

7. opencv_python

● Purpose: Image processing and computer vision.


● Use Case: Medical images (e.g., scans, X-rays, and images from medical journals)
will be input into the system. OpenCV is a powerful library used to process and
manipulate these images, including resizing, transformations, and enhancing
features for further analysis.
● Details: OpenCV allows your system to apply various image processing techniques,
such as object detection, edge detection, and image segmentation, which are
necessary when dealing with medical image data.

8. Requests

● Purpose: HTTP requests and API interaction.


● Use Case: The Requests library allows your system to make HTTP requests to
external APIs or web servers. For example, it could be used to fetch YouTube video
transcripts via the youtube_transcript_api, or interact with other medical data
sources.
● Details: It’s one of the most popular Python libraries for making HTTP requests,
providing easy-to-use methods to interact with web services, allowing you to pull in
real-time data into your system.

9. streamlit

● Purpose: Creating interactive web applications.


● Use Case: Streamlit will serve as the front-end interface of your multi-modal RAG
system. Doctors will be able to upload PDFs, images, and videos, and interact with
the AI system to get responses to their queries. Streamlit simplifies the process of
building user-friendly, responsive web apps.
● Details: It provides an easy way to build real-time interactive applications, and in
your case, it will allow doctors to upload medical data, view AI-generated responses,
and engage with the system seamlessly.

10. torch

● Purpose: Deep learning and neural networks.


● Use Case: PyTorch is the backbone for deep learning operations in your RAG
system. It will be used for training and inference of the models, especially the
large-scale language models and the retrieval-augmented generation components.
● Details: PyTorch allows for flexible and efficient model building, offering great support
for tensor operations, dynamic computation graphs, and GPU acceleration. It’s one of
the most popular libraries for machine learning research and production.

11. transformers

● Purpose: Pretrained language models for NLP.


● Use Case: The transformers library by Hugging Face will be used to load
pre-trained large language models (LLMs) like GPT, T5, or BERT, which will process
and generate the responses to medical queries based on the inputs from PDFs,
images, websites, or YouTube videos.
● Details: It simplifies the process of working with transformer-based models, allowing
you to fine-tune models, load pre-trained weights, and use state-of-the-art NLP
techniques in your medical assistant system.

12. youtube_transcript_api

● Purpose: Fetching YouTube video transcripts.


● Use Case: Medical video content is often hosted on platforms like YouTube. This API
allows your system to extract transcripts from YouTube videos, turning spoken
medical content into text, which can then be processed and used to generate
responses or answers.
● Details: This library simplifies the process of extracting transcripts from videos,
allowing your system to convert spoken medical content into a textual format, which
is essential for understanding and answering questions related to medical video
content.

13. pymupdf

● Purpose: PDF text and image extraction.


● Use Case: Many medical documents (journals, research papers, reports) come in
PDF format. PyMuPDF will be used to extract text and images from these PDFs,
ensuring that doctors can input important documents into the system and retrieve
answers based on that content.
● Details: PyMuPDF (also known as Fitz) is efficient in extracting both text and images
from PDF files, making it ideal for medical reports, research papers, and scientific
articles.
1.11 Software Design

Software Design Overview

1. Data Design

Data design ensures that the data structures and formats used are optimal for the application's needs. The key components of data design in the system are listed below (a persistence sketch follows the list):

● User Credentials Storage:


○ User credentials (username and password) are stored in a
dictionary (user_db) for simplicity. In a production
environment, this should be replaced with a secure
database.
○ For persistence, user data is saved in a file
(user_credentials.pkl) using pickle to serialize the data.
● Content Storage and Chunking:
○ Text extracted from various sources (PDF, images, web
pages, YouTube videos) is combined into one large text
block and then chunked into sentences.
○ These chunks are embedded using a sentence transformer
(sentence-transformers/all-mpnet-base-v2), which creates
vectors for each chunk.
● Embedding and Indexing:
○ FAISS (Facebook AI Similarity Search) is used to store and
search for relevant chunks based on a query. Each chunk's
embedding is indexed to enable fast retrieval of the top k
most relevant contexts.
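A short, hedged sketch of the pickle-based persistence described above follows; plaintext passwords are kept only to mirror the prototype, and production code should hash passwords and use a real database.

# Pickle-backed user credential store (prototype only).
import os
import pickle

CRED_FILE = "user_credentials.pkl"

def load_user_db():
    if os.path.exists(CRED_FILE):
        with open(CRED_FILE, "rb") as f:
            return pickle.load(f)
    return {}

def save_user(username, password):
    user_db = load_user_db()
    user_db[username] = password        # stored as-is in the prototype
    with open(CRED_FILE, "wb") as f:
        pickle.dump(user_db, f)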

2. Architecture Design

The architecture can be categorized into several modules based on functionality, such as:

● Frontend (Streamlit UI):


1. The frontend is developed using Streamlit, which allows for
rapid prototyping and deployment of machine learning
models with a clean and interactive interface.
2. The user is presented with an authentication system (login
and signup), followed by an interface to upload various
data types (PDFs, images, etc.) and interact with the
medical assistant.
● Backend (Model and Data Processing):
1. The backend handles the logic for processing inputs,
extracting information, and generating responses. This is
implemented using a combination of:
■ Text Extraction: Extracts text and images from PDFs,
web scraping, and YouTube transcripts.
■ Groq API Integration: The image processing and
further question answering based on images is
handled via the Groq API.
■ RAG-based Query Response: The model-based system
retrieves relevant context from the indexed chunks
and combines it with the user query to generate a
relevant answer.
● Data Flow:
1. Input: The user uploads a file, image, or provides a URL.
2. Processing: Text extraction and embedding generation
occur.
3. Search: FAISS is used to retrieve the most relevant chunks
based on the user query.
4. Output: The response, generated by the Groq API, is
presented back to the user.

3. Interface Design

Interface design focuses on how the system's components interact with each other and with external services:

● User Interface (UI):


○ The UI allows users to login or sign up, upload PDFs or
images, and input URLs for web scraping or YouTube video
transcripts.
○ The sidebar is used for file and URL input, while the main
area is reserved for displaying responses to user queries.
○ Chat input and output are integrated, allowing users to
interact with the AI assistant.
● Groq API:
○ The Groq client interacts with the Groq API to process images and generate descriptions or further query responses. Communication between the backend (Python code) and Groq's API is done over HTTP requests and responses.
● External Libraries:
○ NLTK: For sentence tokenization.
○ FAISS: For fast similarity search in the embeddings.
○ Transformers (Huggingface): For generating embeddings of
text chunks and integrating large models (like llama).
○ YouTube Transcript API: To extract video transcripts from
YouTube.

4. Component Level Design

Component-level design defines the individual components of the system and their interactions. The main components are:

● Authentication:
○ The authentication system (login and signup) ensures that
users can securely access their personalized assistant.
● Text and Image Processing:
○ Image Processing: Uses the Groq API to process images.
Images are first encoded to base64 and then sent to Groq
for text extraction.
○ PDF Text and Image Extraction: Extracts both text and
images from PDFs, storing them locally for further
processing.
● Web Scraping:
○ Scrapes text and images from webpages using BeautifulSoup
and requests, storing the images in a local directory for
later use.
● YouTube Transcript Extraction:
○ Extracts YouTube video transcripts using the
YouTubeTranscriptApi, which is helpful for converting
spoken content into text that can be processed.
● RAG Response Generation:
○ This is the core function, where the system combines
various types of content (PDF, web text, video transcripts,
image descriptions) into a knowledge base, and the query is
matched with the most relevant content using FAISS. The
combined context is then fed into the Groq model to
generate an answer.

1.12 Coding and implementing Software

1. Required Libraries and Setup

The implementation imports several libraries for NLP, image processing, web scraping, PDF handling, and interaction with external APIs (Groq and YouTube transcripts), covering all supported input types and use cases.

For clarity, the imports can be modularized into separate sections for better readability.

2. Groq API Setup

The Groq API setup is shown below. Ensure that the API key in use is valid and that the Groq-related functions have been tested.

import os
from groq import Groq

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around duplicate OpenMP runtime errors
GROQ_API_KEY = "your_api_key"  # placeholder; load from an environment variable
client = Groq(api_key=GROQ_API_KEY)

Note: Avoid exposing API keys publicly. Keep them out of source control and store them securely (e.g., in environment variables or a secrets manager) in the production environment.

3. Helper Functions

The helper functions are well-defined but could use some improvements for clarity and robustness.

● encode_to_64: Converts an image to a base64 encoded string.


● image_to_text: Processes an image using Groq's API. The structure is solid but can be made more robust by handling possible errors from the API (see the sketch below this list).
● further_query: Handles further queries based on the description. This is clear and functional.
● extract_text_and_images_from_pdf: Extracts both text and images from PDFs, looping through the pages, extracting text, and storing images. A try-except block here would handle potential errors such as file corruption.
● scrape_page: The web scraping function is efficient but should ideally handle more exceptions, such as network errors or issues with image retrieval.
● extract_video_id: Correctly extracts video IDs from both YouTube URL formats (youtube.com/watch and youtu.be). This function works as expected.
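The sketch below illustrates the error handling suggested for image_to_text; the vision message format and model id follow Groq's OpenAI-style chat API but are assumptions here, not copied from the project code.

# Hedged sketch: image description via Groq with basic error handling.
from groq import Groq, GroqError

def image_to_text(image_path, client):
    try:
        b64 = encode_to_64(image_path)
        completion = client.chat.completions.create(
            model="llama-3.2-11b-vision-preview",    # assumed vision model id
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this medical image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return completion.choices[0].message.content
    except GroqError as exc:
        return f"Image analysis failed: {exc}"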

4. NLP Functions

These functions process text, chunk it, and compute embeddings using the sentence-transformers model.

● chunk_content_by_sentence: Correctly splits the content into sentences.
● generate_rag_response: Generates a response using the Retrieval-Augmented Generation (RAG) approach, using FAISS to retrieve relevant chunks, an excellent choice for fast similarity search.

Note: Ensure the model is properly initialized and that the embeddings and FAISS index are correctly computed.

5. Main Function (final_func)

This function is the core of the system. It integrates all data sources (PDFs, images, YouTube videos, and web pages) and processes them to generate a response based on user queries.

Notes for Improvement:

1. Handling missing or empty inputs: Before processing, add checks for None or empty files.
2. Error handling in YouTubeTranscriptApi.get_transcript: Add exception handling here, as the API may fail if a transcript is not available for a video (see the sketch below).
3. Refactor embedding generation: Streamline the process of generating embeddings by separating it into its own function to enhance readability.
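For point 2, a hedged sketch of the suggested exception handling is shown below; safe_get_transcript is a hypothetical wrapper name.

# Graceful handling of unavailable YouTube transcripts.
from youtube_transcript_api import (
    YouTubeTranscriptApi,
    TranscriptsDisabled,
    NoTranscriptFound,
)

def safe_get_transcript(video_id):
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(seg["text"] for seg in segments)
    except (TranscriptsDisabled, NoTranscriptFound):
        return None     # caller can skip the video source gracefully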

6. Streamlit Interface

The Streamlit interface works but can be improved for better user experience and modularity.

Functions such as login_page(), signup_page(), and main_page() handle different parts of the UI. This separation is good practice, as it keeps the code maintainable.

Here are some points to improve:

● Ensure session state management: Before jumping to the main page, ensure that the user session (st.session_state.logged_in) is correctly set after a successful login.
● File Upload Handling: Files are saved correctly, but adding user feedback or progress bars for PDF and image uploads would improve the UX.
● Styling: Consider adding more interactive components, such as a
file preview (for uploaded PDFs/images).

Phase 2: Testing
1.13 Developing test cases for the software
Test_cases.xlsx

Developing comprehensive test cases is a critical step in ensuring the robustness, reliability, and functionality of the multi-modal Retrieval-Augmented Generation (RAG) system. The test cases are designed to evaluate various modules of the software, including data ingestion, processing pipelines, and response generation across multiple modalities (text, image, video, and web scraping).

Objectives of Test Cases

1. Functional Verification: Ensure each module functions as intended, handling standard and edge cases effectively.
2. Error Handling: Validate the system's ability to gracefully handle
invalid inputs and unexpected scenarios.
3. Performance Testing: Assess the response time and accuracy of
RAG results for various inputs.
4. Integration Testing: Verify the seamless integration of
sub-systems such as NLP models, APIs, and external libraries.

Test Case Categories

1. Image Processing: Evaluate the encoding of images to base64, image description generation using the Groq API, and further query handling based on image content.
2. Text Extraction: Test the extraction of text and images from
PDFs and web pages, ensuring accuracy and completeness.
3. Video Analysis: Validate YouTube transcript retrieval and
processing for query-based responses.
4. Data Chunking: Check the chunking of text data into manageable
sentences for efficient RAG indexing.
5. Response Generation: Confirm the relevance and contextual
accuracy of responses generated using the RAG approach.
6. Authentication: Ensure robust user authentication and
registration processes.
7. Error Scenarios: Test the system's response to invalid inputs,
network errors, and corrupted files.
Example Test Case Highlights

● Image to Text Description (TC01): Input a valid JPEG image and verify that the base64 encoding and Groq API output match the expected description. (A PyTest sketch for two of these cases follows this list.)
● PDF Text Extraction (TC05): Input a well-structured PDF file and
check if the extracted text and images align with the file's
contents.
● YouTube Transcript (TC07): Test the system's ability to extract a
video ID and retrieve a complete transcript for a given YouTube
URL.
● Final Query Integration (TC12): Provide inputs from all
modalities (image, text, video, web) and verify if the system
generates a cohesive and relevant response.
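As referenced above, here is a hedged PyTest sketch for two of these cases; the module name app and the sample values are illustrative assumptions.

# PyTest sketch covering TC07-style video-ID extraction and a base64
# round-trip check in the spirit of TC01.
import base64
import pytest

from app import encode_to_64, extract_video_id   # assumed module name

@pytest.mark.parametrize("url,expected", [
    ("https://www.youtube.com/watch?v=dQw4w9WgXcQ", "dQw4w9WgXcQ"),
    ("https://youtu.be/dQw4w9WgXcQ", "dQw4w9WgXcQ"),
])
def test_extract_video_id(url, expected):
    assert extract_video_id(url) == expected

def test_encode_to_64_roundtrip(tmp_path):
    img = tmp_path / "sample.jpg"
    img.write_bytes(b"\xff\xd8\xff\xe0fake-jpeg")         # minimal stand-in bytes
    encoded = encode_to_64(str(img))
    assert base64.b64decode(encoded) == img.read_bytes()  # lossless round trip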

Challenges Addressed by Test Cases

● Handling multi-modal inputs with varying formats and quality.


● Managing dependencies on external APIs (e.g., Groq, YouTube
Transcript API).
● Ensuring accurate chunking, indexing, and retrieval of context
from large datasets.
● Validating error resilience and graceful degradation under
adverse conditions.
