BDIA Fall2024 Assignment2 3

Uploaded by

saivivek reddy

Assignment 2

Due: Oct 11th 03:59 pm

Additional notes:
1. Required attestation and contribution declaration on the GitHub page:
WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR
ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK
Contribution:
a. member1: 33%
b. member2: 33%
c. member3: 33%

2. Make sure you do not push anything to your GitHub repository after the due date.
3. Create a Codelab document describing everything you did. Your GitHub repository
should have a readme.md file that explains what the repository contains.
4. Keep your repository private until the deadline. In case of plagiarism, both
teams will be held equally responsible.

Instructions:
Automating Text Extraction and Client-Facing Application Development

This assignment consists of two parts. You are required to design, implement, and deploy
both components, with a focus on automation, user interaction, and security.

Part 1: Automating Text Extraction and Database Population

Airflow Pipelines:

1. Create Airflow pipelines to automate the data acquisition process for PDF files in
the GAIA dataset.
2. Implement the pipeline so that it processes a list of files from the GAIA benchmark
validation and test datasets.
3. Integrate the chosen text extraction options (PyPDF, AWS Textract, Adobe PDF
Extract, or Azure AI Document Intelligence) into the pipeline for efficient text
extraction. You should have one open-source option and one API/enterprise
option.

The goal of this part is to streamline the process of retrieving and processing documents,
ensuring the extracted text is accurately populated into the data store (S3, etc.).
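
The extraction step described above can be sketched in plain Python before it is wired into Airflow. This is a minimal sketch under stated assumptions, not a required design: the storage layout `extracts/<backend>/<name>.txt`, the function names, and the `save` callback are all hypothetical. In an Airflow DAG, `run_extraction` could be called from a task (for example via `PythonOperator`), and `save` could wrap an S3 upload. Only the open-source backend is shown; the API/enterprise backend would be another function with the same signature.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, Iterable, List

# An extractor takes a local PDF path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_pypdf(pdf_path: str) -> str:
    """Open-source option: concatenate the text of every page using pypdf."""
    from pypdf import PdfReader  # imported lazily so this module loads without pypdf

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_key(pdf_path: str, backend: str) -> str:
    """Hypothetical storage layout: extracts/<backend>/<pdf name>.txt"""
    return f"extracts/{backend}/{PurePosixPath(pdf_path).stem}.txt"


def run_extraction(
    pdf_paths: Iterable[str],
    extractors: Dict[str, Extractor],
    save: Callable[[str, str], None],
) -> List[str]:
    """Run every extractor on every PDF and hand each result to `save`.

    `save(key, text)` is an assumption: it stands in for whatever writes
    to the data store (for example an S3 upload).
    """
    written = []
    for pdf_path in pdf_paths:
        for backend, extractor in extractors.items():
            key = extract_key(pdf_path, backend)
            save(key, extractor(pdf_path))
            written.append(key)
    return written
```

Keeping the extractors behind one function signature means the open-source and API-based options can be swapped or run side by side without changing the pipeline code.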

Part 2: Client-Facing Application using Streamlit and FastAPI

FastAPI:

1. Implement user registration and login functionality within your application.


2. Secure the application using JWT (JSON Web Token) authentication. All API
endpoints, except for registration and login, must be protected by JWT.
3. Upon successful authentication, users will receive a JWT token, which they must
include in subsequent requests to access protected endpoints.
4. Ensure the protected endpoints are visually indicated in the Swagger UI with a
padlock icon.
5. Utilize a SQL database for storing user login credentials, including hashed
passwords for security.
6. Move all the business logic to a backend service hosted on FastAPI.
7. Create a list of the services you plan to invoke through Streamlit.
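
The registration, password hashing, and JWT requirements above can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions: the `users` table schema, the function names, the PBKDF2 iteration count, and the hard-coded `SECRET_KEY` are all hypothetical, and a real service would more likely use a maintained JWT library and load the key from configuration. The token format is standard HS256 (header, payload, and signature, each base64url-encoded). In FastAPI, declaring a security dependency such as `OAuth2PasswordBearer` on the protected routes is what typically produces the padlock icon in Swagger UI.

```python
import base64
import hashlib
import hmac
import json
import os
import sqlite3
import time
from typing import Optional

SECRET_KEY = b"change-me"  # assumption: a real deployment loads this from the environment


def hash_password(password: str, salt: Optional[bytes] = None) -> str:
    """Return 'salt_hex$hash_hex' using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    salt_hex, _ = stored.split("$")
    return hmac.compare_digest(hash_password(password, bytes.fromhex(salt_hex)), stored)


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _sign(message: str) -> str:
    return _b64url(hmac.new(SECRET_KEY, message.encode(), hashlib.sha256).digest())


def create_token(username: str, ttl_seconds: int = 1800) -> str:
    """Build an HS256 JWT with 'sub' and 'exp' claims."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": username, "exp": int(time.time()) + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(f'{header}.{payload}')}"


def decode_token(token: str) -> dict:
    """Return the claims, or raise ValueError if the token is invalid or expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, payload, signature = parts
    if not hmac.compare_digest(_sign(f"{header}.{payload}"), signature):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: only the password hash is stored, never the password.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(username TEXT PRIMARY KEY, password_hash TEXT NOT NULL)"
    )


def register(conn: sqlite3.Connection, username: str, password: str) -> None:
    conn.execute(
        "INSERT INTO users (username, password_hash) VALUES (?, ?)",
        (username, hash_password(password)),
    )
    conn.commit()


def login(conn: sqlite3.Connection, username: str, password: str) -> Optional[str]:
    """Return a JWT on success, or None if the credentials do not match."""
    row = conn.execute(
        "SELECT password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None or not verify_password(password, row[0]):
        return None
    return create_token(username)
```

A protected endpoint would call `decode_token` on the value from the `Authorization: Bearer <token>` header and reject the request when it raises.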

Streamlit:

1. Develop a user-friendly registration and login page that allows users to create
accounts and log in securely.
2. After successful login, provide access to the Question Answering interface.
3. This interface will enable users to ask questions or submit queries.
4. Implement functionality allowing users to select from a variety of preprocessed
PDF extracts (either the open-source or the API-based extract).
5. If the user selects a specific PDF, the system should query only that specific file
from the store.
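
The selection and query steps above can be sketched as two small helpers that the Streamlit front end might call. This is a minimal sketch under stated assumptions: the key layout `extracts/<backend>/<name>.txt`, the `/ask` endpoint, and the JSON field names are hypothetical and not part of the assignment. The first helper groups stored extracts by PDF name so they can populate a widget such as `st.selectbox`; the second builds an authenticated request that names exactly one PDF, so the backend queries only that file.

```python
import json
import urllib.request
from typing import Dict, List


def list_extracts(keys: List[str]) -> Dict[str, List[str]]:
    """Group keys like 'extracts/<backend>/<name>.txt' by PDF name."""
    options: Dict[str, List[str]] = {}
    for key in keys:
        _, backend, filename = key.split("/")
        pdf_name = filename.rsplit(".", 1)[0]
        options.setdefault(pdf_name, []).append(backend)
    return options


def build_question_request(
    base_url: str, token: str, pdf_name: str, backend: str, question: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the hypothetical /ask endpoint.

    The JWT goes in the Authorization header; the body names the single
    PDF extract that the backend should query.
    """
    body = json.dumps(
        {"pdf_name": pdf_name, "backend": backend, "question": question}
    ).encode()
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; any HTTP client that sets the same header and body would work equally well.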

Deployment:
1. Ensure that the FastAPI and Streamlit applications are fully containerized and
deployed to a public cloud platform using Docker Compose.
2. The deployed applications should be publicly accessible, providing seamless
interaction for users.

Submission:

1. Your submission must include the fully functional Airflow pipelines, Streamlit
application, and FastAPI backend.
2. Ensure all services are deployed and publicly accessible, with proper
documentation for users to interact with your application.
3. Ensure you start with GitHub Issues (https://github.com/features/issues): assign
specific tasks and sequence them so you can optimize your implementation.
4. GitHub Repo Link with
a. Project summary, research, PoC and other information
b. Your GitHub project and issues (https://github.com/features/issues)
c. Diagrams
d. A fully documented Codelab
e. Video of the submission (5 minutes)
f. Link to hosted applications, backend and data processing services
g. GitHub project
