Final Project Report
Data Deduplication
by
Shantanu Dev
Submitted to the Department of Computer Science & Engineering, Thapar Institute of Engineering & Technology
June 2023
The author hereby grants to Thapar Institute of Engineering & Technology, permission to
reproduce and to distribute publicly paper and electronic copies of this report document in
whole or in part in any medium now known or hereafter created.
Data Deduplication
by Shantanu Dev
June 2023
ABSTRACT
The first project involved performing data deduplication on the company's database,
specifically eliminating duplicate questions. Extensive research was conducted on various
deduplication techniques, leading to the development of a three-step Python script. This script
utilized OCR extraction, data cleaning, and clustering techniques to identify and eliminate
duplicate questions. The optimized script was presented to the team and stakeholders,
resulting in improved data quality and system performance.
The second project focused on real-time question indexing and recommendation using
Elasticsearch and Golang. Through the study of real-time deduplication methods,
OpenSearch, and Elasticsearch, a robust architecture was designed. Golang was employed to
implement indexing and querying APIs, providing efficient and personalized question
recommendations. The codebase was deployed using Docker Compose and Amazon ECS,
ensuring scalability and seamless integration with the front-end.
Throughout the internship, valuable skills were gained in data deduplication, real-time
indexing, Python, Golang, Elasticsearch, and cloud deployment. The projects' successful
completion contributed to the enhancement of the BYJU'S learning app, improving user
experience and system performance.
The report concludes with limitations encountered during the projects and suggestions for
future work, including further optimization, scalability, and enhancements to the
recommendation system.
This internship report serves as a comprehensive record of the tasks accomplished, skills
acquired, and the impact of the projects on the BYJU'S learning app. It provides insights into
the challenges faced, methodologies employed, and recommendations for future
development.
CERTIFICATE
Certified that the project entitled “Data Deduplication”, which is being submitted by Shantanu
Dev (101917052) to the Department of Computer Science & Engineering, TIET, Patiala,
Punjab, is a record of project work carried out by him under the guidance and supervision of
Mr. Pratik Jain. The matter presented in this project report does not incorporate without
acknowledgment any material previously published or written by any other person except
where due reference is made in the text.
Industry Mentor
Mr. Pratik Jain
Principal Engineer
Faculty Mentor Signature
Verification of Employment
TABLE OF CONTENTS
Byju’s Profile
Introduction
Background
Objectives
Methodology
Limitations
Bibliography
1. Company Profile
BYJU'S is an Indian educational technology company that uses personalized learning techniques to cater to the individual
learning fun, engaging, and effective by leveraging technology and data-driven insights.
The Content Management Engineering team at BYJU'S plays a critical role in the efficient
management and organization of educational content on the BYJU'S learning app. The team
is responsible for developing and maintaining the infrastructure, systems, and processes that
enable the seamless delivery of content to millions of learners.
One of the key responsibilities of the Content Management Engineering team is to design and
maintain the content management systems and platforms. They develop robust and scalable
systems that allow for the efficient storage, retrieval, and organization of educational content.
These systems ensure that the vast amount of content available on the platform is easily
accessible, searchable, and properly categorized, enabling smooth navigation and a seamless
user experience.
The team also works closely with content creators, subject matter experts, and instructional
designers to establish content creation workflows and guidelines. They develop tools and
technologies that facilitate collaborative content development, version control, and content
review processes. By streamlining these workflows, the team ensures the timely and accurate
delivery of high-quality educational content to the app.
2. Introduction
During my internship, I worked on two main projects: data deduplication and real-time
duplicate detection on the questions database. For data deduplication, I developed a Python
script that utilized OCR extraction, data cleaning, and clustering techniques to eliminate
duplicate questions from the company's database. The script was optimized and fine-tuned to
deliver accurate results. The real-time duplicate detection project involved implementing a
recommendation system for newly created questions using Elasticsearch and Golang. I
developed indexing and querying APIs, ensuring efficient and personalized question
recommendations. The codebase was successfully deployed using Docker Compose and
Amazon ECS, and collaboration with a front-end developer facilitated web service
integration.
The report concludes by discussing the limitations encountered during the projects and
provides suggestions for future work, including optimization, scalability enhancements, and
improvements to the recommendation system's accuracy and personalization.
Overall, this internship report offers a concise summary of the tasks, methodologies, and
observations from my internship at BYJU'S, providing insights into the projects undertaken
and their impact on the company's backend systems.
3. Background
The motivation behind choosing the projects of data deduplication and real-time question
indexing during my internship at BYJU'S stemmed from a strong desire to contribute to the
improvement of the company's backend systems by removing redundant data in the form of
duplicates and enhancing the user experience of the BYJU'S learning app. The questions
database at BYJU'S is extensive, containing thousands of questions, and the presence of
duplicate questions creates redundancies in the database. The primary objective of the data
deduplication project was to tackle these redundancies and prevent the creation of further
duplicates.
Data deduplication plays a crucial role in maintaining data integrity and accuracy within any
database. By eliminating duplicate questions, the efficiency and effectiveness of the app's
content delivery can be significantly improved. This project provided an opportunity to delve
into the complexities of data deduplication techniques, such as pattern-matching algorithms,
and gain practical experience in handling large volumes of data. The challenges involved in
extracting OCR data, cleaning and clustering the data, and optimizing the deduplication
process added to the appeal of the project.
The second project, real-time question indexing and recommendation, offered an exciting
opportunity to explore cutting-edge technologies like Elasticsearch and Golang. The aim was
to develop a scalable and personalized recommendation system that could provide real-time
question recommendations to users. This capability would greatly enhance the app's
engagement and adaptability for individual users. The project involved challenges such as
designing the project architecture, implementing the indexing and recommendation
algorithms, and deploying the codebase on the cloud. Collaboration with a front-end
developer was also necessary to integrate the backend system with the app's user interface.
Moreover, both projects aligned perfectly with my interest in backend development, data
management, and optimization. The opportunity to work on projects that directly impact the
app's performance, data quality, and user satisfaction was highly appealing. Additionally, the
internship provided a chance to expand my knowledge and skills in Python, Golang,
Elasticsearch, and cloud deployment, which are highly sought-after in the industry.
Overall, the motivation for choosing these projects was driven by the desire to make a
meaningful contribution to BYJU'S by tackling real-world challenges in data management,
optimizing system performance, and implementing cutting-edge technologies to enhance the
user experience of the BYJU'S learning app.
4. Objectives
The first objective of the internship was to perform data deduplication on the
company's database. The main focus was on eliminating duplicate questions to enhance
the overall quality and efficiency of the database. This involved
implementing a multi-step process, including OCR extraction with parallel
processing, data cleaning, and clustering techniques. The aim was to develop a robust
and accurate system that would identify and remove redundant questions, thereby
improving the usability and reliability of the database.
The second objective of the internship was to develop a real-time question indexing
and recommendation system. This task involved utilizing Elasticsearch, a powerful
search engine technology, and Golang, a programming language known for its
efficiency and scalability. The aim was to create a system that could efficiently index
new questions in real-time and provide relevant recommendations based on user
queries. This objective focused on enhancing the user experience and ensuring quick
and accurate access to relevant information.
5. Methodology
The methodology employed during the internship projects involved a systematic approach to
accomplish the objectives set for the internship period. The following steps were followed:
The first step was to thoroughly research and understand data deduplication
techniques. Comprehensive surveys of data deduplication techniques were consulted,
offering insights into the various methods used to eliminate duplicate data [1].
This included studying pattern-matching algorithms and delving
into the intricacies of database relationships. This research phase helped in
establishing a strong foundation and acquiring the necessary knowledge to tackle the
deduplication task effectively.
Data collection techniques were used to gather the required data; a comprehensive
guide on the subject was consulted for reference [2].
5.2.1: OCR Extraction with Parallel Processing
The first step in the Python script development was to extract OCR text from a
large volume of data. This involved processing various documents and
converting them into machine-readable text. To optimize the extraction
process, parallel processing techniques were utilized, allowing for faster and
more efficient extraction of text from multiple documents simultaneously.
Python's BeautifulSoup library was used to identify image tags, and the images
were fetched from their URLs using the PIL library.
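The extraction step described above can be sketched as follows. The report names BeautifulSoup and PIL; to keep this sketch dependency-free it uses the standard library's html.parser for the image-tag scan and leaves the OCR call itself (e.g. pytesseract over a PIL image) as a labeled placeholder. Function names and the worker count are illustrative, not the production code.

```python
import concurrent.futures
from html.parser import HTMLParser

class ImageTagParser(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)

def extract_image_urls(html: str) -> list:
    """Find the URLs of all question images embedded in an HTML document."""
    parser = ImageTagParser()
    parser.feed(html)
    return parser.urls

def ocr_image(url: str) -> str:
    # Placeholder for the actual OCR step: fetch the image (e.g. with PIL)
    # and run it through an OCR engine such as pytesseract.
    raise NotImplementedError

def extract_ocr_parallel(documents, ocr=ocr_image, workers=8):
    """Run OCR over the image URLs of many documents concurrently;
    threads overlap the I/O-bound fetch + OCR calls."""
    urls = [u for doc in documents for u in extract_image_urls(doc)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, urls))
```

`ThreadPoolExecutor.map` preserves input order, so results line up with the extracted URLs even though the calls run concurrently.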
Figure 2: OCR Process Flow
5.2.2: Data Cleaning with Regular Expressions
Once the OCR data was extracted, it underwent a cleaning process to remove
any unwanted characters, symbols, or formatting inconsistencies. This was
achieved by leveraging the regular expression library in Python, which
provided powerful pattern-matching and manipulation capabilities. By
applying a set of predefined rules and patterns, the script was able to cleanse
the data, ensuring uniformity and consistency throughout.
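A minimal sketch of such a regex-based cleaning pass, assuming typical OCR noise (leftover HTML tags, stray symbols, irregular whitespace); the exact rules in the production script are not known, so these patterns are illustrative:

```python
import re

def clean_text(text: str) -> str:
    """Normalise OCR output: strip HTML remnants, drop unwanted symbols,
    and collapse inconsistent whitespace (illustrative rule set)."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop leftover HTML tags
    text = re.sub(r"[^\w\s.,?%+\-=()]", " ", text)  # keep characters common in questions
    text = re.sub(r"\s+", " ", text)                # collapse runs of whitespace
    return text.strip().lower()
```

Lower-casing at the end makes the later similarity comparison case-insensitive, so "What is X?" and "what is x?" vectorize identically.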
5.2.3: Clustering based on TF-IDF and Cosine Similarity
After cleaning the data, the next step involved clustering the questions to
identify duplicates. The script utilized the TF-IDF technique, which assigns
weights to each term based on its frequency in a document and its importance
in the overall corpus. By vectorizing the questions using TF-IDF, similarities
between questions could be computed using cosine similarity, a measure of the
angle between two vectors. This allowed the script to group similar questions
together, effectively identifying and clustering duplicate questions in the
database.
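The TF-IDF and cosine-similarity mechanics can be sketched in pure Python. The production script likely used a library such as scikit-learn; this dependency-free version, with a greedy single-pass clustering loop and an illustrative 0.8 threshold, is only meant to show how the pieces fit together.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector (dict: term -> weight) per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) + 1 for t in df}           # +1 keeps shared terms non-zero
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * idf[t] for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster_duplicates(questions, threshold=0.8):
    """Greedy clustering: a question joins the first cluster whose
    representative it matches above the similarity threshold."""
    vecs = tfidf_vectors([q.split() for q in questions])
    clusters = []  # list of (representative_vector, member_indices)
    for i, v in enumerate(vecs):
        for rep, members in clusters:
            if cosine(rep, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

Identical questions score a cosine similarity of 1.0 and collapse into one cluster, while questions sharing no terms score 0.0 and stay apart.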
To prepare for the real-time deduplication task, extensive research was conducted on
various methods and technologies available. Focus was given to studying OpenSearch
and Elasticsearch, popular search engine technologies built on top of the Lucene
search library [3]. This phase involved understanding the functionalities and
capabilities of these technologies, along with exploring their feasibility and potential
applicability to the project.
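As an illustration of the capability studied here, Elasticsearch's query DSL offers a `more_like_this` query that retrieves documents by textual similarity, a natural fit for duplicate detection. The index name `questions` and field `question_text` below are hypothetical, not the project's actual mapping; the body would be sent to `POST /questions/_search`:

```json
{
  "query": {
    "more_like_this": {
      "fields": ["question_text"],
      "like": "What is the derivative of x^2?",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

Each hit comes back with a relevance score, which a service can normalize into the percentage-similarity figure surfaced to content creators.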
5.4: Learning Golang and Designing the Project Architecture
5.5: Implementation, Validation, and Optimization
Once the code was implemented, rigorous testing and validation procedures
were conducted to verify its functionality and correctness. Various test cases
were designed to cover different scenarios and edge cases, ensuring that the
code handled all possible inputs and produced the expected outputs. The
testing process involved both unit testing, which focused on individual
components or functions, and integration testing, which examined the
interactions between different modules. By thoroughly testing the code, any
bugs, errors, or inconsistencies were identified and addressed, ensuring the
reliability and stability of the system.
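A sketch of what such unit tests might look like. Since the service's actual similarity check is not reproduced in this report, `is_duplicate` below is a hypothetical stand-in (word-set Jaccard overlap) used only to illustrate the edge cases a test suite would cover:

```python
import unittest

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for the service's similarity check:
    Jaccard overlap of word sets against a threshold."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return True  # two empty questions are trivially duplicates
    return len(wa & wb) / len(wa | wb) >= threshold

class DuplicateDetectionTests(unittest.TestCase):
    def test_exact_duplicate(self):
        self.assertTrue(is_duplicate("What is gravity?", "what is gravity?"))

    def test_unrelated_questions(self):
        self.assertFalse(is_duplicate("What is gravity?", "Define photosynthesis"))

    def test_empty_input(self):
        self.assertTrue(is_duplicate("", ""))
```

Integration tests would then exercise the same scenarios through the deployed API rather than the function directly.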
utilization. By optimizing the code, the system's responsiveness and scalability
were improved, allowing it to handle a larger volume of requests and provide
faster results.
5.6: Deployment and Collaboration
To make the project operational, the codebase was deployed using Docker Compose
[5], facilitating the creation of a containerized environment. The code was then
uploaded to Amazon ECS (Elastic Container Service), ensuring scalability and
efficient deployment on the cloud. Collaboration with a front-end developer was
necessary to integrate the code into a web service where the questions were accessed
and utilized. Throughout this phase, collaboration and effective communication were
maintained to ensure seamless integration and deployment.
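A minimal docker-compose sketch of such a setup; service names, image tags, and ports below are illustrative assumptions, not the project's actual configuration:

```yaml
# Hypothetical docker-compose.yml: a single-node Elasticsearch container
# alongside the Go service that exposes the indexing and query endpoints.
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  question-api:
    build: .                 # the Golang indexing/recommendation service
    ports:
      - "8080:8080"
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
```

The same container images can then be pushed to a registry and run on Amazon ECS, which keeps the local and cloud environments consistent.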
6. Roles and Responsibilities
6.3 Observations and Findings
● Identifying and documenting the key observations and findings made during
the projects, such as the successful completion of the data deduplication task
and the efficient querying of deduplicated data.
● Analyzing the impact of the implemented projects on the overall system
performance and user experience.
● Providing insights into the effectiveness of the methodologies and
technologies employed, and their contribution to the improvement of the
BYJU'S learning app.
7. Tools and Technologies
● Python: Used for developing the deduplication script and performing data
cleaning operations.
● OCR (Optical Character Recognition): Utilized for extracting text from
images and scanned documents.
● Regular Expressions: Employed for data cleaning and pattern matching.
● TF-IDF (Term Frequency-Inverse Document Frequency): Utilized for
clustering and similarity calculations.
● Cosine Similarity: Used for measuring the similarity between questions.
● Parallel Processing: Employed to enhance the efficiency of the deduplication
process.
These tools and technologies were instrumental in carrying out the tasks and achieving the
objectives of the internship projects, enabling the data deduplication, real-time question
indexing, and recommendation functionalities, and enhancing the overall performance and
user experience of the BYJU'S learning app.
8. Observations and Findings
The querying process of the deduplicated data was found to be efficient and effective.
By leveraging joins and subqueries, relevant and non-duplicate data was extracted and
presented for validation. The optimized script ensured seamless retrieval of
information without compromising the system's performance. This observation
highlighted the successful integration of data deduplication techniques with the
existing database structure, enabling smoother data operations and enhancing the
overall functionality of the app.
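The join/subquery pattern described here can be illustrated with a hypothetical schema in which each question carries the cluster id assigned during deduplication; a correlated subquery then keeps one representative row per cluster. The schema, data, and column names are invented for illustration only.

```python
import sqlite3

# Hypothetical schema: each question row records the cluster it was assigned
# to during deduplication; one representative per cluster is retained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (id INTEGER PRIMARY KEY, text TEXT, cluster_id INTEGER)")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?)",
    [
        (1, "What is gravity?", 10),
        (2, "what is gravity ?", 10),   # duplicate of question 1
        (3, "Define photosynthesis", 11),
    ],
)

# Correlated subquery: keep only the lowest id in every cluster,
# i.e. one representative question per group of duplicates.
rows = conn.execute(
    """
    SELECT q.id, q.text
    FROM questions AS q
    WHERE q.id = (SELECT MIN(id) FROM questions WHERE cluster_id = q.cluster_id)
    ORDER BY q.id
    """
).fetchall()
print(rows)  # -> [(1, 'What is gravity?'), (3, 'Define photosynthesis')]
```

The same shape of query works against the production database engine; only the representative-selection rule (lowest id here) would differ.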
8.3: Real-time Question Indexing and Recommendation
An example of the output received from the real-time duplicate detection API, along
with the percentage similarity and all the required information, is shown in Figure 9.
Figure 9: Similar Existing Questions While Creating Question
In conclusion, the observations and findings made during the internship projects
highlighted the positive impact of data deduplication and real-time question indexing
on the overall system performance and user experience. The successful completion of
the tasks, accurate querying of deduplicated data, and implementation of a
personalized recommendation system showcased the effectiveness of the
methodologies and technologies employed. These insights contribute to the
continuous improvement of the BYJU'S learning app, ensuring that users receive
high-quality and tailored content for an enhanced learning journey.
9. Limitations
One of the limitations faced during the internship projects was related to the quality of
the data. The deduplication task relied on the consistency and accuracy of the existing
database. However, inconsistencies in question formatting, spelling errors, and
inconsistent labeling posed challenges to the deduplication process. Despite efforts to
clean the data using automated techniques, certain cases required manual intervention,
which added complexity and increased the time required for deduplication.
Addressing data quality issues was crucial to ensure accurate and reliable
deduplication results.
Another limitation was the algorithmic complexity associated with the deduplication
process. As the volume of data increased, the computational complexity of the
clustering algorithm used for identifying duplicate questions also grew. The time
complexity of the algorithm increased with the number of questions in the database,
leading to longer processing times and resource-intensive computations. Although
optimization techniques were employed to improve performance, there were practical
limitations to achieving real-time duplicate detection on extremely large datasets
within the given project timeframe.
In conclusion, the limitations encountered during the internship projects encompassed
challenges related to data quality, algorithmic complexity, and resource constraints.
Overcoming these limitations required careful consideration, trade-offs, and the application
of appropriate techniques and strategies. Despite these challenges, efforts were made to
mitigate the limitations and achieve satisfactory results within the given constraints,
contributing to the successful completion of the internship projects.
10. Conclusion and Future Work
The internship at BYJU'S as a backend software developer has been an invaluable experience
that has provided me with significant learning opportunities and insights. Throughout the
projects of data deduplication and real-time question indexing, I have gained practical
knowledge and skills in various areas, including data management, algorithm
implementation, optimization techniques, and the utilization of cutting-edge technologies like
Elasticsearch and Golang. These experiences have not only deepened my understanding of
backend development but also enhanced my problem-solving abilities and critical thinking
skills.
One of the key takeaways from the internship is the importance of data quality and its impact
on system performance. The data deduplication project highlighted the significance of
consistent and clean data, as well as the challenges associated with ensuring data accuracy in
a large-scale database. I have learned the importance of data preprocessing and the need for
robust algorithms to handle real-world data with variations and inconsistencies. Additionally,
the implementation of real-time question indexing and recommendation has emphasized the
importance of scalability and resource optimization, especially when dealing with high
volumes of data and user queries.
Moving forward, there are several ways in which the projects can be taken further to enhance
their outcomes. Firstly, for the data deduplication task, further research and experimentation
can be conducted to explore more advanced clustering algorithms and machine learning
techniques to improve accuracy and efficiency. Additionally, investing in data quality
improvement processes, such as automated data cleaning and validation techniques, can help
overcome the challenges posed by inconsistent data sources. For the real-time question
indexing and recommendation system, ongoing optimization efforts can be made to enhance
response times and personalize recommendations, and additional features such as
sentiment analysis or topic modeling can be explored to improve content relevance.
improvement in delivering a high-quality user experience. By further exploring advanced
algorithms, refining data quality processes, and optimizing the recommendation system, the
projects can be taken to the next level, turning the deliverables into a more polished and
impactful end product.
11. Bibliography
[1] Gupta, S., & Goyal, R. (2017). Data Deduplication Techniques: A Survey. International
Journal of Computer Applications, 174(4), 1-5.
[2] Garcia-Molina, H., Ullman, J. D., & Widom, J. (2008). Database Systems: The Complete Book (2nd ed.). Pearson Prentice Hall.
12. Peer Review Form
Mention three strengths of the work done: Effective Data Deduplication, Real-time Question
Indexing and Recommendation, Practical Application of Skills.
Provide some useful recommendations: Explore Advanced Machine Learning Techniques, Scale
Infrastructure for Real-time Processing, Implement User Feedback Mechanism
Signature of the Evaluator:
Signature of the Student: