DL-9
DATE : 5.11.24
Problem Statement:
In today's digital age, the ability to efficiently manage documents is crucial for organizations.
However, a significant challenge arises when dealing with non-machine-readable documents such as
PDFs or Word documents. These formats hinder automation and make it difficult to extract meaningful
insights from the data they contain. Therefore, there is a pressing need for a solution that can both
restrict the ingestion of non-machine-readable documents and facilitate the creation of
machine-readable documents seamlessly.
Description:
The above problem statement envisages:
1. To develop an application that can restrict software applications from ingesting any non-machine-readable
document format such as PDFs, DOCs, or any other document types.
2. To create a mechanism within the application to generate machine-readable documents automatically whenever
a new document is created, regardless of its source.
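The first requirement, restricting ingestion, can be sketched as a simple gate in front of the application. This is a minimal illustration only: the allow-list of extensions and the names `MACHINE_READABLE_EXTENSIONS`, `is_machine_readable`, and `route_document` are assumptions made for this example, not part of the project code. A production check would likely also inspect file contents rather than trusting the extension.

```python
from pathlib import Path

# Hypothetical allow-list; the report does not name the accepted formats.
MACHINE_READABLE_EXTENSIONS = {".json", ".csv", ".xml"}


def is_machine_readable(filename: str) -> bool:
    """Return True when the file extension is on the allow-list."""
    return Path(filename).suffix.lower() in MACHINE_READABLE_EXTENSIONS


def route_document(filename: str) -> str:
    """Decide what the application does with an incoming file."""
    if is_machine_readable(filename):
        return "ingest"
    # Anything else (PDF, DOC, scans) goes to the conversion pipeline first.
    return "convert"
```

With this gate, `route_document("invoice.pdf")` returns `"convert"`, so the file is never ingested directly.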
Objective:
The primary objective of this project is to develop an automated document understanding pipeline that
accurately transforms unstructured, non-machine-readable inputs, such as handwritten and printed scanned
documents, into structured, machine-readable formats such as JSON, CSV, and text-embedded PDF. The system relies on
Optical Character Recognition (OCR) and layout-aware language models, and comprises the following components:
○ Frontend Technologies:
■ HTML, CSS, and JavaScript for creating an intuitive web interface for document upload
and preview.
○ Backend Support:
■ Secure transfer of uploaded files to the server for processing.
○ Input Preprocessing Techniques:
■ Robust preprocessing to enhance text extraction and layout analysis.
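The report does not say which preprocessing steps were used. As one common example, the sketch below binarizes a grayscale page with Otsu's method, which picks the threshold that best separates dark ink from a light background. The page is represented as a plain list of rows of 0-255 values, and the function names `otsu_threshold` and `binarize` are chosen for this illustration only.

```python
def otsu_threshold(pixels: list[list[int]]) -> int:
    """Return the grey level (0-255) that best separates ink from background."""
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance; Otsu's method maximises this quantity.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: list[list[int]]) -> list[list[int]]:
    """Map every pixel to 0 (ink) or 255 (background) using Otsu's threshold."""
    threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]
```

In practice an image library would do this far faster; the pure-Python version is only meant to make the step explicit.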
➢ Document Detection Algorithm:
○ Feature Extraction:
■ Visual and textual features, including text density, font variations, and spatial distribution.
○ Deep Learning Classifier:
■ Classifies documents into categories (handwritten, printed, scanned bills) using pre-trained
models.
○ Datasets Used for Model Training:
■ IAM Handwriting Dataset: For handwritten text.
■ SynthText Dataset: For printed text detection.
■ RVL-CDIP Dataset: For recognizing scanned financial documents (e.g., invoices, bills).
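Two of the features named above, text density and spatial distribution, can be computed from word bounding boxes. The sketch below is an illustration under assumptions: it presumes word boxes in pixel coordinates are already available from a text detector, and the `WordBox` type and both function names are hypothetical rather than taken from the project.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Axis-aligned box of one detected word, in pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def text_density(boxes: list[WordBox], page_width: int, page_height: int) -> float:
    """Fraction of the page area covered by word boxes (overlaps counted twice)."""
    covered = sum((b.right - b.left) * (b.bottom - b.top) for b in boxes)
    return covered / (page_width * page_height)


def vertical_distribution(boxes: list[WordBox], page_height: int, bands: int = 4) -> list[float]:
    """Share of words whose centre falls in each horizontal band of the page."""
    counts = [0] * bands
    for b in boxes:
        centre_y = (b.top + b.bottom) / 2
        band = min(int(centre_y * bands / page_height), bands - 1)
        counts[band] += 1
    total = len(boxes) or 1  # avoid division by zero on empty pages
    return [count / total for count in counts]
```

Feature vectors like these could be fed to the classifier alongside visual features; how the project actually combines them is not stated in the report.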
➢ Optical Character Recognition (OCR):
○ Output Format Conversion Algorithms:
■ PDF Generation: Reconstructs the layout and embeds extracted text into a PDF.
■ JSON Structuring: Stores text and layout metadata in a structured, machine-readable
format.
■ CSV Conversion: Extracted tabular data is transformed into a format suitable for
spreadsheet tools.
○ Frontend Integration for Output Delivery:
■ Allows users to download the desired output format (PDF, JSON, or CSV).
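The JSON and CSV conversions described above can be sketched with the Python standard library. The record layout (a `words` list with `text` and `bbox` keys) and the function names `to_json` and `to_csv` are assumptions for illustration; the report does not specify the actual schema.

```python
import csv
import io
import json


def to_json(words: list[dict]) -> str:
    """Serialise recognised words and their boxes as a JSON document."""
    return json.dumps({"words": words}, indent=2)


def to_csv(rows: list[list[str]]) -> str:
    """Serialise an extracted table (a list of rows) as CSV text."""
    buffer = io.StringIO()
    # The csv module quotes fields that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand keeps cells such as `1,50` intact when the file is opened in a spreadsheet tool.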
➢ Advanced Deep Learning Techniques:
Code
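from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import LayoutLMTokenizer
# FunsdDataset ships with the LayoutLM authors' `layoutlm` package (import path assumed)
from layoutlm.data.funsd import FunsdDataset

# `labels` and `pad_token_label_id` are assumed to be defined earlier (not shown here)
args = {'local_rank': -1,  # assumed: read by FunsdDataset; -1 means no distributed training
        'overwrite_cache': True,
        'data_dir': '/content/data',
        'model_name_or_path': 'microsoft/layoutlm-base-uncased',
        'max_seq_length': 512,
        'model_type': 'layoutlm'}

# expose the dict keys as attributes (args.data_dir, args.max_seq_length, ...)
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__dict__ = self

args = AttrDict(args)
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# the LayoutLM authors already defined a specific FunsdDataset, so we are going to use this here
train_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="train")
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=2)

eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="test")
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=2)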
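import torch
from transformers import LayoutLMForTokenClassification
# `num_labels` is assumed to be defined earlier (not shown here)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased",
                                                       num_labels=num_labels)
model.to(device)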
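# optimizer settings are not shown in the source; AdamW with lr=5e-5 is an assumed choice
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 5

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        input_ids = batch[0].to(device)
        bbox = batch[4].to(device)
        attention_mask = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss

        # print the loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        # backward pass to get the gradients
        loss.backward()
        #print(model.classifier.weight.grad[6,:].sum())

        # update
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1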
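import numpy as np
from seqeval.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# `label_map` (label id -> label name) is assumed to be defined earlier (not shown here)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None

# put the model in evaluation mode
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        input_ids = batch[0].to(device)
        bbox = batch[4].to(device)
        attention_mask = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)
        tmp_eval_loss = outputs.loss
        logits = outputs.logits
        eval_loss += tmp_eval_loss.item()
        nb_eval_steps += 1

        # accumulate logits and gold labels across batches
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = labels.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(
                out_label_ids, labels.detach().cpu().numpy(), axis=0
            )

# average the loss and turn logits into predicted label ids
eval_loss = eval_loss / nb_eval_steps
preds = np.argmax(preds, axis=2)

out_label_list = [[] for _ in range(out_label_ids.shape[0])]
preds_list = [[] for _ in range(out_label_ids.shape[0])]

# keep only real tokens (skip padding) and map ids back to label names
for i in range(out_label_ids.shape[0]):
    for j in range(out_label_ids.shape[1]):
        if out_label_ids[i, j] != pad_token_label_id:
            out_label_list[i].append(label_map[out_label_ids[i][j]])
            preds_list[i].append(label_map[preds[i][j]])

results = {
    "loss": eval_loss,
    "precision": precision_score(out_label_list, preds_list),
    "recall": recall_score(out_label_list, preds_list),
    "f1": f1_score(out_label_list, preds_list),
}
print(results)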
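# save the fine-tuned weights
PATH = './layoutlm.pt'
torch.save(model.state_dict(), PATH)

from PIL import Image
import pytesseract

# load a FUNSD test image for inference
#image = Image.open('/content/form_example.jpg')
image = Image.open("/content/data/testing_data/images/83443897.png")
image = image.convert("RGB")
image  # displays the page when run in a notebook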
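The listing stops after loading the test image, so the inference step itself is not shown. One detail that any such step needs is box normalization: LayoutLM expects bounding boxes on a 0 to 1000 scale rather than in pixels. The helper below is a minimal sketch of that conversion, assuming OCR word boxes are available as (left, top, right, bottom) pixel tuples; the name `normalize_box` is chosen for this illustration.

```python
def normalize_box(box: tuple[int, int, int, int], width: int, height: int) -> list[int]:
    """Scale a pixel box (left, top, right, bottom) to LayoutLM's 0-1000 grid."""
    left, top, right, bottom = box
    return [
        int(1000 * left / width),
        int(1000 * top / height),
        int(1000 * right / width),
        int(1000 * bottom / height),
    ]
```

For a 500 by 1000 pixel page, `normalize_box((50, 100, 250, 200), 500, 1000)` gives `[100, 100, 500, 200]`. The page size would come from `image.size` in the code above.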
Results
1. Dataset Performance:
○ IAM Handwriting Dataset: Achieved 97.38% recognition accuracy for handwritten text.
○ CVL Dataset: Improved robustness across diverse handwriting styles.
2. Output Quality:
○ Average processing time: 1.5 seconds per page on a Tesla T4 GPU, suitable for real-time
applications.
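The report does not state how recognition accuracy was measured. One common convention for OCR, shown here purely as an assumption, is character-level accuracy: one minus the character error rate (CER), where CER is the edit distance between the recognized and reference text divided by the reference length. The function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER, clipped at 0; assumes a non-empty reference string."""
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Under this definition, a single wrong character in a four-character word yields an accuracy of 0.75. Word-level accuracy would give different numbers, so the metric should be named alongside any reported figure.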
Discussion
1. Effectiveness:
○ Fine-tuning DocTR and LayoutLM on datasets like IAM, CVL, and FUNSD significantly improved
text recognition and layout understanding.
2. Strengths:
Results achieved
Conclusion
The proposed document processing pipeline successfully transforms non-machine-readable documents into
structured, machine-readable formats with high accuracy. By integrating OCR models like DocTR and layout-aware
models like LayoutLM, the system efficiently handles complex layouts, preserving text structure and contextual
relationships. Its adaptability to diverse document types—handwritten, printed, or mixed—and support for multiple
output formats (JSON, PDF, CSV) make it versatile for various industries, including finance, healthcare, and legal
services. This automated workflow reduces manual effort, improves data accessibility, and supports seamless
integration with downstream applications, streamlining workflows and unlocking the potential of unstructured data.
Future expansions can further enhance its utility and scalability, ensuring its relevance in digitization and data
transformation efforts.
Future Work
1. Expand multilingual and domain-specific datasets to improve adaptability for diverse document types.
2. Enable real-time processing for video streams and live feeds, enhancing usability in dynamic applications.
3. Optimize for mobile and edge deployment, ensuring offline processing and accessibility.
4. Integrate advanced models for handling complex layouts and improving handwriting recognition accuracy.
5. Enhance security with privacy-preserving techniques and scalability for enterprise-level document
processing.