
Final Project Report

This internship report summarizes Shantanu Dev's internship projects at BYJU'S, an Indian educational technology company. The report describes two main projects: 1) Developing a Python script to perform data deduplication on BYJU'S question database by identifying and removing duplicate questions using OCR extraction, data cleaning, and clustering techniques. 2) Implementing a real-time question recommendation system using Elasticsearch and Golang to provide efficient search and personalized recommendations of newly created questions. Through these projects, valuable skills were gained in areas like data deduplication, real-time indexing, Python, Golang, Elasticsearch, and cloud deployment. The successful completion of the projects helped improve BYJU'S learning app


INTERNSHIP SEMESTER REPORT

Data Deduplication

by

Shantanu Dev

Roll No. 101917052

Under the Guidance of

Mr. Pratik Jain, Principal Engineer

Byju’s (Think & Learn Pvt. Ltd.)

Dr. Saif Nalband

Assistant Professor (CSED)

Submitted to the

Computer Science & Engineering Department


Thapar Institute of Engineering & Technology, Patiala

In Partial Fulfilment of the Requirements for the Degree of

Bachelor of Engineering in Computer Engineering at

Thapar Institute of Engineering & Technology, Patiala

June 2023
The author hereby grants to Thapar Institute of Engineering & Technology, permission to
reproduce and to distribute publicly paper and electronic copies of this report document in
whole and in part in any medium now known or hereafter created.

Data Deduplication

by Shantanu Dev

Place of work: Think & Learn Pvt. Ltd. (Byju’s)

Submitted to the Computer Science & Engineering Department,

Thapar Institute of Engineering & Technology

June 2023

In Partial Fulfilment of the Requirements for the

Degree of Bachelor of Engineering in Computer Engineering

ABSTRACT

This internship report provides a comprehensive overview of my experience as a backend
software developer at BYJU'S, focusing on two main projects: data deduplication and
real-time question indexing. The report highlights the tasks undertaken, methodologies
employed, observations made, and conclusions drawn during the internship.

The first project involved performing data deduplication on the company's database,
specifically eliminating duplicate questions. Extensive research was conducted on various
deduplication techniques, leading to the development of a three-step Python script. This script
utilized OCR extraction, data cleaning, and clustering techniques to identify and eliminate
duplicate questions. The optimized script was presented to the team and stakeholders,
resulting in improved data quality and system performance.

The second project focused on real-time question indexing and recommendation using
Elasticsearch and Golang. Through the study of real-time deduplication methods,
OpenSearch, and Elasticsearch, a robust architecture was designed. Golang was employed to
implement indexing and querying APIs, providing efficient and personalized question
recommendations. The codebase was deployed using Docker Compose and Amazon ECS,
ensuring scalability and seamless integration with the front-end.

Throughout the internship, valuable skills were gained in data deduplication, real-time
indexing, Python, Golang, Elasticsearch, and cloud deployment. The projects' successful
completion contributed to the enhancement of the BYJU'S learning app, improving user
experience and system performance.

The report concludes with limitations encountered during the projects and suggestions for
future work, including further optimization, scalability, and enhancements to the
recommendation system.

This internship report serves as a comprehensive record of the tasks accomplished, skills
acquired, and the impact of the projects on the BYJU'S learning app. It provides insights into
the challenges faced, methodologies employed, and recommendations for future
development.

Author: Shantanu Dev

CERTIFICATE

Certified that the project entitled “Data Deduplication”, which is being submitted by Shantanu
Dev (101917052) to the Department of Computer Science & Engineering, TIET, Patiala,
Punjab, is a record of project work carried out by him/her under the guidance and supervision of
Mr. Pratik Jain. The matter presented in this project report does not incorporate without
acknowledgment any material previously published or written by any other person except
where due reference is made in the text.

Shantanu Dev 101917052

Industry Mentor
Mr. Pratik Jain

Principal Engineer

Byju’s (Think & Learn Pvt. Ltd.)

Faculty Mentor Signature

Verification of Employment

TABLE OF CONTENTS

Chapter

Byju’s Profile

Introduction

Background

Objectives

Methodology

Roles and Responsibilities

Tools and Technologies Used

Observations and Findings

Limitations

Conclusion and Future Work

Bibliography

Peer Review Form

1. Company Profile

BYJU'S is a leading educational technology company based in India. It offers a
comprehensive learning app that provides personalized and engaging learning experiences to
students. The app covers a wide range of subjects, including math, science, and coding, and is
used by millions of students across India and other countries.

Figure 1: BYJU’S Logo


The BYJU'S learning app incorporates interactive videos, animations, quizzes, and adaptive
learning techniques to cater to the individual learning needs of students. It aims to make
learning fun, engaging, and effective by leveraging technology and data-driven insights.

During my internship at BYJU'S, I worked as a backend software developer, focusing on
tasks related to data deduplication and real-time duplicate detection by question indexing
using technologies such as Python, Elasticsearch, and Golang. My role involved working on
the optimization and scalability of backend systems to ensure smooth and efficient operations
of the BYJU'S learning app.

The Content Management Engineering team at BYJU'S plays a critical role in the efficient
management and organization of educational content on the BYJU'S learning app. The team
is responsible for developing and maintaining the infrastructure, systems, and processes that
enable the seamless delivery of content to millions of learners.

One of the key responsibilities of the Content Management Engineering team is to design and
maintain the content management systems and platforms. They develop robust and scalable
systems that allow for the efficient storage, retrieval, and organization of educational content.
These systems ensure that the vast amount of content available on the platform is easily
accessible, searchable, and properly categorized, enabling smooth navigation and a seamless
user experience.

The team also works closely with content creators, subject matter experts, and instructional
designers to establish content creation workflows and guidelines. They develop tools and
technologies that facilitate collaborative content development, version control, and content
review processes. By streamlining these workflows, the team ensures the timely and accurate
delivery of high-quality educational content to the app.

2. Introduction

This internship report provides an overview of my internship experience as a backend
software developer at BYJU'S. The report highlights the tasks, projects, methodologies,
technologies, observations, and limitations encountered during the internship. It concludes
with suggestions for future work.

During my internship, I worked on two main projects: data deduplication and real-time
duplicate detection on the question database. For data deduplication, I developed a Python
script that utilized OCR extraction, data cleaning, and clustering techniques to eliminate
duplicate questions from the company's database. The script was optimized and fine-tuned to
deliver accurate results. The real-time duplicate detection project involved implementing a
recommendation system using Elasticsearch and Golang for the newly created questions. I
developed indexing and querying APIs, ensuring efficient and personalized question
recommendations. The codebase was successfully deployed using Docker Compose and
Amazon ECS, and collaboration with a front-end developer facilitated web service
integration.

The internship provided valuable experience in data deduplication, real-time indexing,
Python, Golang, Elasticsearch, and cloud deployment. The projects contributed to improving
the BYJU'S learning app's performance and user experience.

The report concludes by discussing the limitations encountered during the projects and
provides suggestions for future work, including optimization, scalability enhancements, and
improvements to the recommendation system's accuracy and personalization.

Overall, this internship report offers a concise summary of the tasks, methodologies, and
observations from my internship at BYJU'S, providing insights into the projects undertaken
and their impact on the company's backend systems.

3. Background

The motivation behind choosing the projects of data deduplication and real-time question
indexing during my internship at BYJU'S stemmed from a strong desire to contribute to the
improvement of the company's backend systems by removing redundant data in the form of
duplicates and enhancing the user experience of the BYJU'S learning app. The questions
database at BYJU'S is extensive, containing thousands of questions, and the presence of
duplicate questions creates redundancies in the database. The primary objective of the data
deduplication project was to tackle these redundancies and prevent the creation of further
duplicates.

Data deduplication plays a crucial role in maintaining data integrity and accuracy within any
database. By eliminating duplicate questions, the efficiency and effectiveness of the app's
content delivery can be significantly improved. This project provided an opportunity to delve
into the complexities of data deduplication techniques, such as pattern-matching algorithms,
and gain practical experience in handling large volumes of data. The challenges involved in
extracting OCR data, cleaning and clustering the data, and optimizing the deduplication
process added to the appeal of the project.

The second project, real-time question indexing and recommendation, offered an exciting
opportunity to explore cutting-edge technologies like Elasticsearch and Golang. The aim was
to develop a scalable and personalized recommendation system that could provide real-time
question recommendations to users. This capability would greatly enhance the app's
engagement and adaptability for individual users. The project involved challenges such as
designing the project architecture, implementing the indexing and recommendation
algorithms, and deploying the codebase on the cloud. Collaboration with a front-end
developer was also necessary to integrate the backend system with the app's user interface.

Moreover, both projects aligned perfectly with my interest in backend development, data
management, and optimization. The opportunity to work on projects that directly impact the
app's performance, data quality, and user satisfaction was highly appealing. Additionally, the
internship provided a chance to expand my knowledge and skills in Python, Golang,
Elasticsearch, and cloud deployment, which are highly sought-after in the industry.

Overall, the motivation for choosing these projects was driven by the desire to make a
meaningful contribution to BYJU'S by tackling real-world challenges in data management,
optimizing system performance, and implementing cutting-edge technologies to enhance the
user experience of the BYJU'S learning app.

4. Objectives

4.1: Data Deduplication

The first objective of the internship was to perform data deduplication on the
company's database. The main focus was on eliminating duplicate questions to
enhance the overall quality and efficiency of the database. This involved
implementing a multi-step process, including OCR extraction with parallel
processing, data cleaning, and clustering techniques. The aim was to develop a robust
and accurate system that would identify and remove redundant questions, thereby
improving the usability and reliability of the database.

4.2: Real-Time Question Indexing and Recommendation

The second objective of the internship was to develop a real-time question indexing
and recommendation system. This task involved utilizing Elasticsearch, a powerful
search engine technology, and Golang, a programming language known for its
efficiency and scalability. The aim was to create a system that could efficiently index
new questions in real-time and provide relevant recommendations based on user
queries. This objective focused on enhancing the user experience and ensuring quick
and accurate access to relevant information.

By accomplishing these objectives, the internship aimed to contribute to the
improvement of the company's database management system and provide users with a
seamless and efficient experience while accessing and interacting with the question
database.

5. Methodology

The methodology employed during the internship projects involved a systematic approach to
accomplish the objectives set for the internship period. The following steps were followed:

5.1: Research and Understanding

The first step was to research data deduplication techniques thoroughly. Published
surveys of deduplication techniques were consulted for insight into the various
methods used to eliminate duplicate data [1]. This included studying pattern-matching
algorithms and examining the relationships within the database. This research phase
established the foundation needed to tackle the deduplication task effectively.

Data collection techniques were used to obtain the required data, with a
comprehensive reference consulted for guidance [2].

5.2: Development of Python Script for Existing Content Deduplication

5.2.1: Extraction of OCR (Optical Character Recognition) from Data

The first step in the Python script development was to extract text from a
large volume of data using OCR. This involved processing various documents and
converting them into machine-readable text. To optimize the extraction
process, parallel processing techniques were utilized, allowing for faster and
more efficient extraction of text from multiple documents simultaneously.
The BeautifulSoup library was used to identify image tags, and the image at
each tag's URL was loaded using the PIL library.
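
The extraction step can be sketched as follows. This is a minimal illustration, not the script used in the project: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages, and `ocr_fn` is a hypothetical callable standing in for the download, PIL, and OCR-engine steps, which the report does not specify in detail. A thread pool is assumed as the form of parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(question_html):
    """Return the image URLs referenced by one question's HTML."""
    parser = ImageSrcParser()
    parser.feed(question_html)
    return parser.sources


def extract_text_parallel(urls, ocr_fn, max_workers=8):
    """Run ocr_fn over many image URLs concurrently, preserving input order.

    ocr_fn is a hypothetical callable (URL -> recognized text); in practice it
    would download the image, open it with PIL, and pass it to an OCR engine.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_fn, urls))


html = '<p>Solve:</p><img src="https://example.com/q1.png"><img src="https://example.com/q2.png">'
urls = image_urls(html)


def fake_ocr(url):
    # Stand-in OCR function so the sketch runs without network access.
    return "text from " + url.rsplit("/", 1)[-1]


print(extract_text_parallel(urls, fake_ocr))
```

Threads suit this step when most of the time is spent waiting on image downloads; if the OCR itself dominates, a process pool may be the better fit.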

Figure 2: OCR Process Flow

5.2.2: Data Cleaning using Regular Expressions

Once the OCR data was extracted, it underwent a cleaning process to remove
any unwanted characters, symbols, or formatting inconsistencies. This was
achieved by leveraging the regular expression library in Python, which
provided powerful pattern-matching and manipulation capabilities. By
applying a set of predefined rules and patterns, the script was able to cleanse
the data, ensuring uniformity and consistency throughout.
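
A cleaning pass of this kind might look like the sketch below. The three rules shown (strip leftover HTML tags, drop non-alphanumeric characters, collapse whitespace) are assumptions chosen for illustration; the report does not list the actual rules and patterns that were applied.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # leftover HTML tags
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")   # anything other than letters, digits, whitespace
SPACE_RE = re.compile(r"\s+")              # runs of whitespace


def clean_question(text):
    """Normalize one question's text so that trivially different copies compare equal."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_TEXT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_question("<p>What is  2 + 2 ?</p>"))   # what is 2 2
```

Note the trade-off visible in the example: stripping every symbol also removes mathematical operators, so "2 + 2" and "2 - 2" would clean to the same string. A rule set for math content would likely need to keep such characters.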

Figure 3: Data Preprocessing

5.2.3: Clustering based on TF-IDF and Cosine Similarity

After cleaning the data, the next step involved clustering the questions to
identify duplicates. The script utilized the TF-IDF technique, which weights
each term by how often it occurs in a document and how rare it is across the
overall corpus. By vectorizing the questions using TF-IDF, similarities
between questions could be computed using cosine similarity, the cosine of the
angle between two vectors. This allowed the script to group similar questions
together, effectively identifying and clustering duplicate questions in the
database.
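
To make the two measures concrete, the sketch below computes TF-IDF vectors and cosine similarity using only the standard library. It is an illustration under stated assumptions rather than the project's code: it uses whitespace tokenization, raw term counts, and the idf variant ln(N / df) + 1, which is one common choice among several.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document.

    Uses tf = raw term count and idf = ln(N / df) + 1, one common variant.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


questions = [
    "what is the capital of france",
    "what is the capital of france",
    "define the term photosynthesis",
]
vectors = tfidf_vectors(questions)
print(round(cosine_similarity(vectors[0], vectors[1]), 3))  # identical text
print(round(cosine_similarity(vectors[0], vectors[2]), 3))  # only "the" in common
```

Identical questions point in the same direction and score at or extremely close to 1, while questions sharing only common words score near 0, which is what makes a similarity threshold usable for flagging duplicates.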

By implementing this Python script for existing content deduplication, the
internship project successfully addressed the challenge of eliminating duplicate
questions. The script's multi-step approach, including OCR extraction, data
cleaning, and clustering based on TF-IDF and cosine similarity, provided an
efficient and accurate solution for deduplicating the company's database.

Figure 4: Deduplication Script Flow

5.3: Study of Real-Time Duplicate Detection Methods and Technologies

To prepare for the real-time deduplication task, extensive research was conducted on
various methods and technologies available. Focus was given to studying OpenSearch
and Elasticsearch, popular search engine technologies built on top of the Lucene
search library [3]. This phase involved understanding the functionalities and
capabilities of these technologies, along with exploring their feasibility and potential
applicability to the project.

5.4: Learning Golang and Designing the Project Architecture

In order to implement real-time question indexing and recommendation, Golang was
chosen as the programming language. Time was dedicated to learning the basics of
Golang [4] and becoming familiar with its syntax and core concepts. Subsequently,
the project architecture was designed, which included defining the structure of the
core types (structs and interfaces, since Go has no classes) and outlining low-level
implementation details. Additionally, research was conducted on techniques for
optimizing and improving the performance of the code, ensuring efficiency and
scalability.

Figure 5: Elasticsearch Component Relation

5.5: Implementation, Validation, and Optimization

5.5.1: Code Implementation

The implementation phase involved translating the design and architecture of
the real-time question indexing and recommendation system into actual code.
The code was developed using Golang, a programming language known for its
efficiency and scalability. Throughout the implementation process, best
practices and coding standards were followed to ensure the code's
maintainability and readability. Careful consideration was given to
modularizing the code, using appropriate data structures and algorithms, and
integrating with the required APIs and libraries.
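
For orientation, the sketch below shows the shape of the two Elasticsearch requests such a system typically issues: one to index a new question and one to search for similar existing questions. It is written in Python for consistency with the deduplication script, whereas the project itself was implemented in Golang. The index name `questions`, the field name `text`, and the choice of the `more_like_this` query are all assumptions; the report does not state which query type was used.

```python
import json


def index_request(index, question_id, text):
    """Build (method, path, body) for indexing one question document."""
    return "PUT", f"/{index}/_doc/{question_id}", {"text": text}


def similar_questions_request(index, text, size=5):
    """Build (method, path, body) for a more_like_this search on the text field."""
    body = {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": ["text"],
                "like": text,
                "min_term_freq": 1,   # consider terms that occur only once in the input
                "min_doc_freq": 1,    # consider terms that occur in only one document
            }
        },
    }
    return "POST", f"/{index}/_search", body


method, path, body = similar_questions_request("questions", "what is the capital of france")
print(method, path)
print(json.dumps(body, indent=2))
```

The functions only build the requests, so the sketch runs without a cluster. Elasticsearch makes newly indexed documents searchable after an index refresh (one second by default), which is why this style of indexing is usually described as near real-time.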

5.5.2: Testing and Validation

Once the code was implemented, rigorous testing and validation procedures
were conducted to verify its functionality and correctness. Various test cases
were designed to cover different scenarios and edge cases, to check that the
code handled the anticipated inputs and produced the expected outputs. The
testing process involved both unit testing, which focused on individual
components or functions, and integration testing, which examined the
interactions between different modules. By thoroughly testing the code, any
bugs, errors, or inconsistencies were identified and addressed, ensuring the
reliability and stability of the system.

5.5.3: Optimization Techniques

To improve the performance and efficiency of the implemented code,
optimization techniques were employed. This involved analyzing the code and
identifying any bottlenecks or areas that could be optimized. Techniques such
as algorithmic improvements, caching, parallel processing, and memory
management were explored to enhance the system's speed and resource
utilization. By optimizing the code, the system's responsiveness and scalability
were improved, allowing it to handle a larger volume of requests and provide
faster results.

The combination of careful code implementation, rigorous testing, and
optimization techniques ensured that the real-time question indexing and
recommendation system was robust, reliable, and efficient. Through the
implementation, validation, and optimization phases, the internship project
successfully delivered a scalable and high-performing system that could index
new questions in real-time and provide accurate recommendations to users.

Figure 6: Real-time Duplicate Detection Workflow

5.6: Deployment and Collaboration

To make the project operational, the codebase was deployed using Docker Compose
[5], facilitating the creation of a containerized environment. The code was then
uploaded to Amazon ECS (Elastic Container Service), ensuring scalability and
efficient deployment on the cloud. Collaboration with a front-end developer was
necessary to integrate the code into a web service where the questions were accessed
and utilized. Throughout this phase, collaboration and effective communication were
maintained to ensure seamless integration and deployment.

By following this methodology, the internship projects were executed efficiently,
ensuring a systematic approach to achieve the set objectives. The combination of
research, development, implementation, validation, optimization, and deployment
enabled the successful completion of the tasks and provided valuable hands-on
experience with relevant technologies and techniques.

6. Roles and Responsibilities

6.1 Data Deduplication


● Researching and understanding data deduplication techniques, including
pattern matching algorithms and database relationships.
● Developing a Python script for existing content deduplication, involving OCR
extraction, data cleaning, and clustering based on TF-IDF and cosine
similarity.
● Implementing parallel processing techniques to enhance the efficiency of the
deduplication process.
● Fine-tuning the script by adjusting threshold parameters to achieve accurate
results.
● Collaborating with the content engineering team to ensure the successful
integration of the deduplication process into the existing database structure.

6.2 Real-Time Question Indexing and Recommendation

● Studying real-time duplicate detection methods and technologies, such as
OpenSearch and Elasticsearch.
● Learning Golang and designing the project architecture for real-time question
indexing and recommendation.
● Implementing the code for real-time indexing of questions using Elasticsearch
and Golang.
● Validating the functionality of the implemented system by conducting rigorous
testing and ensuring accurate question recommendations.
● Optimizing the performance of the system by following best practices and
employing optimization techniques.
● Collaborating with a front-end developer to integrate the recommendation
system into the BYJU'S learning app.

6.3 Observations and Findings

● Identifying and documenting the key observations and findings made during
the projects, such as the successful completion of the data deduplication task
and the efficient querying of deduplicated data.
● Analyzing the impact of the implemented projects on the overall system
performance and user experience.
● Providing insights into the effectiveness of the methodologies and
technologies employed, and their contribution to the improvement of the
BYJU'S learning app.

7. Tools and Technologies

The following tools and technologies were used in the internship projects:

7.1 Data Deduplication

● Python: Used for developing the deduplication script and performing data
cleaning operations.
● OCR (Optical Character Recognition): Utilized for extracting text from
images and scanned documents.
● Regular Expressions: Employed for data cleaning and pattern matching.
● TF-IDF (Term Frequency-Inverse Document Frequency): Utilized for
clustering and similarity calculations.
● Cosine Similarity: Used for measuring the similarity between questions.
● Parallel Processing: Employed to enhance the efficiency of the deduplication
process.

7.2 Real-Time Question Indexing and Recommendation

● Elasticsearch: Used for real-time indexing and searching of questions.
● Golang: Used for developing the codebase and implementing the real-time
question indexing and recommendation system.
● API Integration: Collaborated with a front-end developer to integrate the
recommendation system into the BYJU'S learning app.

7.3 Deployment and Collaboration

● Docker Compose: Used for containerizing the codebase and facilitating
deployment.
● Amazon ECS (Elastic Container Service): Utilized for deploying the codebase
on the cloud.
● Collaboration Tools: Version control and project-tracking tools (potentially Git,
Jira, or similar platforms) for collaborating with team members and tracking progress.

These tools and technologies were instrumental in carrying out the tasks and achieving the
objectives of the internship projects, enabling data deduplication, real-time question
indexing, and recommendation functionalities, and enhancing the overall performance and
user experience of the BYJU'S learning app.

8. Observations and Findings

8.1: Data Deduplication

The implementation of data deduplication techniques proved to be successful during
the internship. Through the utilization of OCR extraction, data cleaning, and
clustering methods, duplicate questions were effectively identified and eliminated
from the company's database. The optimized script, combined with fine-tuned
threshold parameters, resulted in accurate deduplication outcomes, significantly
enhancing the data quality of the BYJU'S learning app. This observation
demonstrated the value of employing these techniques to provide users with unique
and relevant content, ultimately improving their learning experience.
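
The role of the threshold parameter can be illustrated with a small grouping routine. This is a sketch under assumptions, not the project's clustering code: it assumes the threshold is a cutoff on pairwise cosine similarity and that groups are formed transitively (connected components), and the similarity scores below are hypothetical.

```python
def group_duplicates(similarity, threshold):
    """Group item indices whose pairwise similarity is >= threshold.

    similarity is a symmetric matrix (list of lists) of scores in [0, 1].
    Groups are connected components, so membership is transitive.
    """
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical similarity scores for four questions.
scores = [
    [1.0, 1.0, 0.8, 0.1],
    [1.0, 1.0, 0.8, 0.1],
    [0.8, 0.8, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
print(group_duplicates(scores, 1.0))   # [[0, 1]]
print(group_duplicates(scores, 0.75))  # [[0, 1, 2]]
```

A threshold of 1 groups only questions whose cleaned text is effectively identical, while lower values also pull in near-duplicates at the cost of more false positives. With computed floating-point scores, identical questions can land a hair below 1.0, so a small tolerance may be needed when the threshold is exactly 1.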

Figure 7: Deduplication results for Threshold = 1

8.2: Querying Deduplicated Data

The querying process of the deduplicated data was found to be efficient and effective.
By leveraging joins and subqueries, relevant and non-duplicate data was extracted and
presented for validation. The optimized script ensured seamless retrieval of
information without compromising the system's performance. This observation
highlighted the successful integration of data deduplication techniques with the
existing database structure, enabling smoother data operations and enhancing the
overall functionality of the app.
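
A query of this kind could look like the sketch below, which uses an in-memory SQLite database so that it is self-contained. The schema is hypothetical: the table names `questions` and `duplicate_groups` and their columns are invented for illustration, and the report does not say which database engine or schema was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE duplicate_groups (question_id INTEGER, group_id INTEGER);
    INSERT INTO questions VALUES
        (1, 'what is the capital of france'),
        (2, 'what is the capital of france'),
        (3, 'define photosynthesis');
    INSERT INTO duplicate_groups VALUES (1, 100), (2, 100);
""")

# Keep questions that are in no duplicate group, plus the lowest-id member
# of each group as that group's representative.
rows = conn.execute("""
    SELECT q.id, q.text
    FROM questions AS q
    LEFT JOIN duplicate_groups AS d ON d.question_id = q.id
    WHERE d.group_id IS NULL
       OR q.id = (SELECT MIN(question_id)
                  FROM duplicate_groups
                  WHERE group_id = d.group_id)
    ORDER BY q.id
""").fetchall()
print(rows)
```

Keeping the lowest id as the representative is an arbitrary but deterministic choice; a real pipeline might instead prefer the most recently reviewed or most complete copy.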

8.3: Real-time Question Indexing and Recommendation

The implementation of the real-time question indexing and recommendation system
provided valuable insights. By utilizing technologies like Elasticsearch and Golang, a
scalable and personalized recommendation system was developed. The real-time
indexing of questions and the subsequent API exposure facilitated efficient and
accurate question recommendations for users. This finding emphasized the
importance of leveraging advanced technologies and techniques to enhance the user
experience and adaptability of the app, ultimately improving engagement and
knowledge retention.

An example of the output returned by the real-time duplicate detection API, including
the percentage similarity and the associated question information, is shown below.

Figure 8: Search Question Result with Similarity

Figure 9: Similar Existing Questions While Creating Question

In conclusion, the observations and findings made during the internship projects
highlighted the positive impact of data deduplication and real-time question indexing
on the overall system performance and user experience. The successful completion of
the tasks, accurate querying of deduplicated data, and implementation of a
personalized recommendation system showcased the effectiveness of the
methodologies and technologies employed. These insights contribute to the
continuous improvement of the BYJU'S learning app, ensuring that users receive
high-quality and tailored content for an enhanced learning journey.

9. Limitations

9.1: Data Quality

One of the limitations faced during the internship projects was related to the quality of
the data. The deduplication task relied on the consistency and accuracy of the existing
database. However, inconsistencies in question formatting, spelling errors, and
inconsistent labeling posed challenges to the deduplication process. Despite efforts to
clean the data using automated techniques, certain cases required manual intervention,
which added complexity and increased the time required for deduplication.
Addressing data quality issues was crucial to ensure accurate and reliable
deduplication results.

9.2: Algorithmic Complexity

Another limitation was the algorithmic complexity associated with the deduplication
process. As the volume of data increased, the computational complexity of the
clustering algorithm used for identifying duplicate questions also grew. The time
complexity of the algorithm increased with the number of questions in the database,
leading to longer processing times and resource-intensive computations. Although
optimization techniques were employed to improve performance, there were practical
limitations to achieving real-time duplicate detection on extremely large datasets
within the given project timeframe.
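
The growth can be made concrete with a short calculation. It assumes the deduplication step scores every pair of questions against each other, which is the straightforward way to apply cosine similarity; the report does not state the exact complexity of the algorithm used, and the dataset sizes below are illustrative only.

```python
def pairwise_comparisons(n_questions):
    """Number of unordered question pairs an all-pairs comparison must score."""
    return n_questions * (n_questions - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} questions -> {pairwise_comparisons(n):>13,} comparisons")
```

Under this assumption, ten times as many questions means roughly a hundred times as many comparisons, which is why batch clustering becomes impractical for real-time use as the database grows.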

9.3: Resource Constraints

Resource constraints presented another limitation during the projects. The
implementation of real-time question indexing and recommendation required
substantial computational resources to handle the indexing process and respond to
user queries promptly. Limited resources, such as computing power or memory, could
potentially impact the system's responsiveness and scalability, particularly during
peak usage periods. Mitigating resource constraints involved optimizing code
efficiency, leveraging parallel processing techniques, and ensuring effective resource
management to achieve optimal system performance.

In conclusion, the limitations encountered during the internship projects encompassed
challenges related to data quality, algorithmic complexity, and resource constraints.
Overcoming these limitations required careful consideration, trade-offs, and the application
of appropriate techniques and strategies. Despite these challenges, efforts were made to
mitigate the limitations and achieve satisfactory results within the given constraints,
contributing to the successful completion of the internship projects.

10. Conclusion and Future Work

The internship at BYJU'S as a backend software developer has been an invaluable experience
that has provided me with significant learning opportunities and insights. Throughout the
projects of data deduplication and real-time question indexing, I have gained practical
knowledge and skills in various areas, including data management, algorithm
implementation, optimization techniques, and the utilization of cutting-edge technologies like
Elasticsearch and Golang. These experiences have not only deepened my understanding of
backend development but also enhanced my problem-solving abilities and critical thinking
skills.

One of the key takeaways from the internship is the importance of data quality and its impact
on system performance. The data deduplication project highlighted the significance of
consistent and clean data, as well as the challenges associated with ensuring data accuracy in
a large-scale database. I have learned the importance of data preprocessing and the need for
robust algorithms to handle real-world data with variations and inconsistencies. Additionally,
the implementation of real-time question indexing and recommendation has emphasized the
importance of scalability and resource optimization, especially when dealing with high
volumes of data and user queries.

Moving forward, there are several ways in which the projects can be taken further to enhance
their outcomes. Firstly, for the data deduplication task, further research and experimentation
can be conducted to explore more advanced clustering algorithms and machine learning
techniques to improve accuracy and efficiency. Additionally, investing in data quality
improvement processes, such as automated data cleaning and validation techniques, can help
overcome the challenges posed by inconsistent data sources. For the real-time question
indexing and recommendation system, ongoing optimization efforts can be made to enhance
response times, personalized recommendations, and explore additional features such as
sentiment analysis or topic modeling to improve content relevance.

In conclusion, the internship at BYJU'S as a backend software developer has provided me
with a wealth of learning experiences and practical skills. The projects of data deduplication
and real-time question indexing have expanded my knowledge in data management,
optimization techniques, and the utilization of advanced technologies. The conclusion of
these projects highlights the importance of data quality, scalability, and continuous
improvement in delivering a high-quality user experience. By further exploring advanced
algorithms, refining data quality processes, and optimizing the recommendation system, the
projects can be taken to the next level, turning the deliverables into a more polished and
impactful end product.

11. Bibliography

[1] Gupta, S., & Goyal, R. (2017). Data Deduplication Techniques: A Survey. International
Journal of Computer Applications, 174(4), 1-5.

[2] Garcia-Molina, H., Ullman, J. D., & Widom, J. Database Systems: The Complete Book.

[3] Elasticsearch: The Definitive Guide. (n.d.). Retrieved from
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

[4] Golang Documentation. (n.d.). Retrieved from https://golang.org/doc/

[5] Docker Documentation. (n.d.). Retrieved from https://docs.docker.com/

12. Peer Review Form

Name of the student (to be reviewed): Shantanu Dev

Roll no. of the student: 101917052

Title of the project: Data Deduplication

Name of the company: Byju’s (Think & Learn Pvt. Ltd.)

Project report: Excellent / Good / Average

Project poster: Excellent / Good / Average

Rate the work done, 0 – 10 points (Provide rating here) → 10

Overall performance, 0 – 5 marks (Provide marks here) → 5 marks

Abstract of the project:


The first project involved performing data deduplication on the company's database, specifically
eliminating duplicate questions. The second project focused on real-time question indexing and
recommendation using Elasticsearch and Golang. Throughout the internship, valuable skills were
gained in data deduplication, real-time indexing, Python, Golang, Elasticsearch, and cloud deployment.

Mention three strengths of the work done: Effective Data Deduplication, Real-time Question
Indexing and Recommendation, Practical Application of Skills.

Provide some useful recommendations: Explore Advanced Machine Learning Techniques, Scale
Infrastructure for Real-time Processing, Implement User Feedback Mechanism

Name of evaluator student: Vibhuti Gupta

Roll no. of the evaluator student: 101917061

Signature of the evaluator student:
