BDIA Fall2024 Assignment2 3
Additional notes:
1. Required attestation and contribution declaration on the GitHub page:
WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR
ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK
Contribution:
a. member1: 33%
b. member2: 33%
c. member3: 33%
2. Make sure you do not push anything to your GitHub repository after the due date.
3. Create a Codelab document describing everything you did. Your GitHub repository
should include a README.md file that describes the contents of the
repository.
4. Keep your repository private until the deadline. In case of plagiarism,
both teams will be held equally responsible.
Instructions:
Automating Text Extraction and Client-Facing Application Development
This assignment consists of two parts. You are required to design, implement, and deploy
both components, with a focus on automation, user interaction, and security.
Airflow Pipelines:
1. Create Airflow pipelines to automate the data acquisition process for PDF files in
the GAIA dataset.
2. Implement the pipeline for processing a list of files from the GAIA benchmark
validation and test datasets.
3. Integrate the chosen text extraction options (PyPDF, AWS Textract, Adobe PDF
Extract, or Azure AI Document Intelligence) into the pipeline for efficient text
extraction. You should have one open-source option and one API/enterprise
option.
The goal of this part is to streamline the process of retrieving and processing documents,
ensuring the extracted information is accurately populated into the data storage (S3, etc.).
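As a rough illustration of the flow described above, the sketch below shows one way the per-file step could be organised before it is wrapped in Airflow tasks. Everything named here is an assumption made for illustration: the two stub extractors, the `EXTRACTORS` registry, the key layout in `build_s3_key`, and the plain dict that stands in for S3. None of it is prescribed by the assignment.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict

# Hypothetical extractors: each takes raw PDF bytes and returns plain text.
# In a real pipeline these would wrap an open-source library and an
# API/enterprise service; here they are stubs so the sketch runs anywhere.
def extract_open_source(pdf_bytes: bytes) -> str:
    return f"[open-source extract of {len(pdf_bytes)} bytes]"

def extract_api(pdf_bytes: bytes) -> str:
    return f"[API extract of {len(pdf_bytes)} bytes]"

# Registry so a pipeline task can pick the extractor by name.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "opensource": extract_open_source,
    "api": extract_api,
}

def build_s3_key(split: str, pdf_name: str, extractor: str) -> str:
    """Assumed key layout: extracts/<extractor>/<split>/<pdf stem>.txt"""
    stem = PurePosixPath(pdf_name).stem
    return f"extracts/{extractor}/{split}/{stem}.txt"

def process_file(split: str, pdf_name: str, pdf_bytes: bytes,
                 extractor: str, store: Dict[str, str]) -> str:
    """Extract text from one PDF and write it to the store (a dict standing in for S3)."""
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor}")
    text = EXTRACTORS[extractor](pdf_bytes)
    key = build_s3_key(split, pdf_name, extractor)
    store[key] = text
    return key
```

Keeping the extractor name in the key means both extracts of the same PDF can sit side by side in storage, which makes it straightforward for a client application to offer either one.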
FastAPI:
Streamlit:
1. Develop a user-friendly registration and login page that allows users to create
accounts and log in securely.
2. After successful login, provide access to the Question Answering interface.
3. This interface will enable users to ask questions or submit queries.
4. Implement functionality allowing users to select from a variety of preprocessed
PDF extracts (either open-source or API-based extracts).
5. If the user selects a specific PDF, the system should query only that file
from the store.
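For the secure registration and login requirement, one common approach is to store only a salted password hash and compare hashes in constant time. The following is a minimal sketch using the Python standard library; the function names `hash_password` and `verify_password` and the `ITERATIONS` value are assumptions chosen for illustration, and a team may well prefer a dedicated password-hashing library instead.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor for PBKDF2; tune it for your deployment.
ITERATIONS = 200_000

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt is generated at registration."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the login attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

At registration the backend would persist the salt and digest for the user, never the plain password; at login it would call `verify_password` with the stored values before granting access to the Question Answering interface.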
Deployment:
1. Ensure that the FastAPI and Streamlit applications are fully containerized and
deployed to a public cloud platform using Docker Compose.
2. The deployed applications should be publicly accessible, providing seamless
interaction for users.
Submission:
1. Your submission must include the fully functional Airflow pipelines, Streamlit
application, and FastAPI backend.
2. Ensure all services are deployed and publicly accessible, with proper
documentation for users to interact with your application.
3. Start by using GitHub Issues (https://github.com/features/issues) to assign specific
tasks and sequence them so you can optimize your implementation.
4. GitHub Repo Link with
a. Project summary, research, PoC and other information
b. Your GitHub project and issues (https://github.com/features/issues)
c. Diagrams
d. A fully documented Codelab
e. Video of the submission (5 minutes)
f. Link to hosted applications, backend and data processing services
g. GitHub project