Internship Report
On
Data Internship
At
eXtenso Data
Submitted To
Swastik College
Chardobato, Bhaktapur
Submitted By
Kebal Khadka
May, 2025
Supervisor’s Recommendation
I hereby recommend that this report, prepared under my supervision by Kebal Khadka (TU
Roll No. 26861/077), be accepted in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science and Information Technology. To the best of my
knowledge, this is an original work in Computer Science and Information Technology.
........................................
Swastik College
Letter of Approval
This is to certify that this report, prepared by Kebal Khadka (TU Roll No. 26861/077) in
partial fulfillment of the requirements for the degree of Bachelor of Science in Computer
Science and Information Technology, has been read and examined. In our opinion, it is
satisfactory in scope and quality as a project for the required degree.
Acknowledgement
I want to sincerely thank eXtenso Data for giving me the chance to work as a data intern.
I deeply appreciate the invaluable influence this experience has had on both my personal
and professional development; it has been a cornerstone of my career path.
I am incredibly grateful to Mr. Suresh Gautam, CEO of eXtenso Data, for giving me this
internship opportunity and allowing me to learn a great deal about a variety of industries.
His mentorship and continuous support during my internship have been instrumental in
shaping my professional development, and I deeply value his guidance and
encouragement.
I am sincerely thankful to my supervisor, Ms. Sristi Khatiwada, for her exceptional
guidance, unwavering support, and inspiring encouragement throughout my internship.
Her insightful criticism and guidance have greatly improved my abilities, particularly in
report writing.
Lastly, I would like to extend my sincerest regards and heartfelt gratitude to all of my
esteemed colleagues, fellow workers, and any other individuals who have provided me
with unwavering support throughout the entirety of this period.
Abstract
Data engineering involves designing and building robust systems that facilitate the
collection, transformation, and management of large-scale data to support strategic
decision-making. This report summarizes my internship as a Data Engineering Intern at
eXtenso Data, a Big Data Analytics company dedicated to enhancing operational
efficiency, optimizing costs, and uncovering new business opportunities through
data-driven insights. During my time at eXtenso Data, I was actively involved in developing
and maintaining data pipelines using Python and SQL, and I worked extensively with
MySQL for data storage and querying. Additionally, I gained hands-on experience with
big data tools such as Apache Spark for large-scale data processing and Apache Airflow
for orchestrating complex data workflows.
This internship deepened my understanding of the complete data engineering lifecycle—
from data ingestion and transformation to scheduling and automation—and provided me
with valuable experience in building scalable data solutions in a real-world business
setting.
Keywords: Data Engineering, Big Data, Python, SQL, MySQL, Apache Spark, Apache
Airflow, Data Pipelines, Data Ingestion, Data Transformation.
Table of Contents
Supervisor’s Recommendation.............................................................................................. i
Acknowledgement .................................................................................................................iii
Abstract ................................................................................................................................... iv
4.2 Learning Outcome ........................................................................................................ 13
References .............................................................................................................................. 15
Annex ...................................................................................................................................... 16
List of Tables
Chapter 1: Introduction
1.1 Introduction
Data engineering is a crucial field that focuses on designing, building, and managing the
infrastructure and tools needed to collect, store, process, and analyze large volumes of
data. It plays a vital role in enabling organizations to make data-driven decisions and gain
valuable insights from their data. During my ongoing internship, I am building a strong
foundation in data engineering by working on data collection, transformation, and
pipeline development. I am actively involved in creating scalable data workflows,
managing databases, and ensuring data quality across various stages of the pipeline.
During this internship, I have focused on building end-to-end data pipelines to support
reliable and scalable data workflows. I began by developing ETL scripts in Python to
collect and transform data from various sources, using tools like SeleniumBase for web
automation and Pandas for data cleaning and transformation.
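To make this concrete, below is a minimal sketch in Python of such an extract-and-clean step. It uses requests rather than SeleniumBase for brevity, and the source URL, file format, and cleaning rules are hypothetical placeholders rather than the actual internship code.

import requests
import pandas as pd
from io import StringIO

# Hypothetical source URL -- a real pipeline would target an official list.
SOURCE_URL = "https://example.org/sanctions.csv"

def extract(url: str) -> pd.DataFrame:
    """Download a CSV source and load it into a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.read_csv(StringIO(response.text))

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: normalize column names, trim whitespace, drop duplicates."""
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df.drop_duplicates()

if __name__ == "__main__":
    cleaned = transform(extract(SOURCE_URL))
    cleaned.to_csv("cleaned_output.csv", index=False)  # load step simplified to a file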
I gained hands-on experience with SQL, which I used extensively for querying and
transforming data from structured databases. This laid a strong foundation in data
wrangling, joins, aggregations, and subqueries—essential operations in any data
engineering role.
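As a self-contained illustration of these operations, the sketch below builds two toy tables and runs a query combining a join, an aggregation, and a subquery. It uses the SQLite module from Python's standard library so the example runs anywhere; the internship work itself was done against MySQL, and the table and column names here are invented for the demo.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy schema and data, invented for this demo.
cur.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE listings (id INTEGER PRIMARY KEY, entity_id INTEGER, source TEXT);
    INSERT INTO entities VALUES (1, 'ACME CORP', 'XX'), (2, 'GLOBEX', 'YY');
    INSERT INTO listings VALUES (1, 1, 'UN'), (2, 1, 'EU'), (3, 2, 'UN');
""")

# Join + aggregation: count listings per entity; the subquery keeps only
# entities that appear on more than the minimum number of lists.
query = """
    SELECT e.name, COUNT(l.id) AS n_listings
    FROM entities e
    JOIN listings l ON l.entity_id = e.id
    GROUP BY e.name
    HAVING COUNT(l.id) > (
        SELECT MIN(cnt) FROM (
            SELECT COUNT(*) AS cnt FROM listings GROUP BY entity_id
        ) AS per_entity
    )
"""
for name, n in cur.execute(query):
    print(name, n)  # prints: ACME CORP 2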
As the internship progressed, I was introduced to modern big data tools such as Apache
Airflow for scheduling and orchestrating complex data workflows, and Apache Spark
for distributed processing of large datasets. These technologies allowed me to scale data
processing tasks beyond traditional scripting and move toward production-ready
pipelines.
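Below is a minimal sketch of what an orchestrated workflow looks like in Airflow (assuming the Airflow 2.x API). The DAG name and the task bodies are hypothetical stubs, not the production pipelines; in practice the transform step could equally hand off to a Spark job for large datasets.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("download the source files")  # stub

def transform():
    print("clean and standardize the records")  # stub

# A daily two-step pipeline: extract runs first, then transform.
with DAG(
    dag_id="sanctions_etl",  # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # set the dependency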
In essence, the pipelines built during this internship had to:
• Extract data from multiple sources,
• Clean and transform it into a structured format,
• Load it into storage systems or databases for further analysis.
1.3 Objectives
• To develop and implement automated ETL (Extract, Transform, Load) pipelines using
Python and SQL to efficiently ingest and process structured and unstructured data
from multiple sources.
• To gain practical experience with modern data engineering tools and frameworks,
including Apache Airflow for workflow orchestration and Apache Spark for
distributed big data processing.
• To ensure data quality and integrity through effective data cleaning, transformation,
and validation processes, enabling reliable storage and downstream use by analytics
or reporting systems.
1.4 Scope
• To build and manage ETL pipelines using Python and SQL for transforming raw
sanction-list data into structured, analyzable formats.
• To work with tools like Apache Airflow and Spark to understand scalable data
processing and workflow automation in a big data environment.
• To ensure data consistency and integrity by applying data cleaning techniques,
handling missing values, and standardizing formats using Pandas.
1.5 Limitations
1.6 Report Organization
Chapter 1: Introduction
This chapter gives a summary of the project’s objectives, limitations, and scope.
Chapter 2: Organization Details and Literature Review
This chapter provides an overview of the organization, including descriptions of the intern
department/unit, its functional areas, and the organizational structure. It also covers the
key theories, concepts, and terminology related to the internship project to provide context
for the background study, along with a literature review and an evaluation of projects,
theories, and outcomes comparable to the internship work.
Chapter 3: Internship Activities
This chapter covers the complete internship program. It includes details about the project
completed during the internship, the decisions made, the roles and duties assumed, and
the weekly logs kept.
Chapter 4: Conclusion and Learning Outcomes
This chapter presents the report’s conclusion and discusses the internship’s learning
outcomes.
Chapter 2: Organization Details and Literature Review
eXtensoData, a prominent business vertical of F1Soft Group, was founded in 2018 and is
led by CEO Suresh Gautam. It is a Big Data Analytics company focused on helping
businesses harness the power of their data to improve operational efficiency, optimize
costs, and uncover new opportunities. With a mission to turn raw data into actionable
intelligence, eXtensoData provides a broad suite of advanced data services tailored to
modern business needs.
The company’s key areas of expertise include Data Engineering, Process Automation,
Business Analysis, Forecasting, Process Optimization, and Big Data Consulting. Its data
engineering services are designed to transform complex organizational data into
intelligent, timely insights, enabling data-driven decisions. Through process automation,
eXtensoData streamlines repetitive business tasks and eliminates inefficiencies by
leveraging enterprise data and building robust automation platforms.
In addition, the company offers business analysis support at both operational and strategic
levels, enhancing daily performance and delivering insights aligned with emerging
business trends. Its forecasting solutions empower clients with technology-driven
financial foresight, seamlessly integrating predictive models with operational strategies.
Email: [email protected]
The organizational structure consists of multiple divisions that work closely to support
business operations. eXtenso Data is organized in a top-down hierarchy of authority,
which ensures clear lines of command and responsibility and allows for effective
management and oversight.
i. Data Engineering:
We offer data engineering services that transform organizational data into meaningful,
intelligent insights. Our comprehensive data solutions are designed to address diverse
business challenges, enabling our clients to make timely and informed decisions.
iv. Forecasting:
Forecasting is a key component of effective business planning. Our technologies
automate the forecasting process, making it easier for organizations to align financial
projections with operational strategies for sustained success.
2.1.4 Description of Intern Department/Unit
2.2 Literature Review
Although literature specifically tailored to the projects undertaken during the internship
at eXtenso Data was limited, relevant studies in ETL processes, Big Data Analytics, data
visualization, and automation testing provided critical insights that guided the work and
ensured best practices in data engineering.
The study titled “Study of ETL Process and Its Testing Techniques” by Mr. Sujit Prajapati
and Mrs. Sarala Mary (2022) explores the fundamental role of the ETL (Extract,
Transform, Load) process in the data warehousing lifecycle. The ETL process forms the
backbone of data integration by extracting data from multiple sources, transforming it in a
staging area, and finally loading it into the data warehouse. The study further delves into
ETL testing techniques, which are essential for validating data accuracy and integrity
post-transformation.
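To illustrate what such post-transformation validation can look like in practice, here is a small sketch of typical completeness, validity, and uniqueness checks. The column names are hypothetical, and the checks are generic examples of the techniques the paper discusses rather than code taken from it.

import pandas as pd

def validate_load(source_df: pd.DataFrame, loaded_df: pd.DataFrame) -> list:
    """Collect human-readable error messages; an empty list means all checks pass."""
    errors = []
    # Completeness: no records lost between the (deduplicated) source and the target.
    if len(loaded_df) != len(source_df.drop_duplicates()):
        errors.append("row count mismatch between source and target")
    # Validity: required columns must not contain nulls.
    for col in ("name", "source"):  # hypothetical required fields
        if col in loaded_df.columns and loaded_df[col].isna().any():
            errors.append(f"null values in required column '{col}'")
    # Uniqueness: the key column must not repeat.
    if "id" in loaded_df.columns and loaded_df["id"].duplicated().any():
        errors.append("duplicate keys in target")
    return errors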
In another relevant paper titled “Big Data Analytics: A Literature Review Paper”
presented at ENCADEMS 2020, the authors Nikhil Madaan, Umang Kumar, and Suman
Kr. Jha (2020) address the challenges posed by the three Vs of Big Data: Volume, Velocity,
and Variety. The paper highlights the limitations of traditional data handling tools in
managing such complex data sets and explores how Big Data Analytics enables
organizations to derive valuable insights from rapidly growing and dynamic data.
Further insight was gained from the article “Evolving Paradigms of Data Engineering in
the Modern Era: Challenges, Innovations, and Strategies” by Alekhya Achanta and Roja
Boina (2023). The paper explores the shift from traditional batch data pipelines to real-
time streaming architectures, driven by the need for speed and scalability. Innovations
such as cloud computing, data lakes, machine learning automation, and self-service
platforms are presented as solutions to modern data engineering challenges.
Chapter 3: Internship Activities
3.2 Weekly Log
Table 3.1: Weekly Log
Week 1:
1. Introduction to SQL and relational database concepts.
2. Learning basic to advanced SQL queries (Joins, Subqueries, Window Functions).
3. Hands-on practice with SQL on sample datasets.
3. Presentation of the entire ETL workflow and documentation of the
process.
2. Data Collection: Developing scripts to extract sanctions data from at least five
official international sources, each available in different formats (CSV, XML,
HTML, JSON) and structures.
3. Data Cleaning and Processing: Parsing, standardizing, and transforming the data
into a unified tabular format while resolving inconsistencies, missing fields, and
schema mismatches.
4. Database Integration: Storing the cleaned and structured data into a MySQL
relational database designed for easy querying, analysis, and compliance checks.
5. Reporting: Exporting the entire consolidated dataset using mysqldump into a .sql
file for backup, archival, and integration into compliance systems.
• Developed Python scripts to download and extract data from at least five different
official sanctions sources.
• Handled different data structures and formats using libraries such as requests,
xml.etree, and json.
• Used Python libraries such as Pandas to parse and standardize data fields across all
sanctions lists.
• Resolved inconsistencies in naming conventions, removed duplicates, and structured
the data into a uniform format.
• Ensured all records followed a unified schema to allow smooth integration into the
database, as sketched below.
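The sketch below shows roughly how such standardization can be done with Pandas. The unified schema and the per-source column mappings are simplified placeholders for illustration, not the actual mappings used in the project.

import pandas as pd

# Hypothetical unified schema and per-source column mappings.
UNIFIED_COLUMNS = ["name", "alias", "country", "listed_on", "source"]
COLUMN_MAPS = {
    "source_a": {"Full Name": "name", "AKA": "alias",
                 "Nationality": "country", "Date Listed": "listed_on"},
    "source_b": {"entity_name": "name", "aka": "alias",
                 "ctry": "country", "listing_date": "listed_on"},
}

def standardize(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Map one source's columns onto the unified schema and deduplicate."""
    df = df.rename(columns=COLUMN_MAPS[source])
    df["source"] = source
    for col in UNIFIED_COLUMNS:          # add any columns this source lacks
        if col not in df.columns:
            df[col] = pd.NA
    df = df[UNIFIED_COLUMNS]             # enforce a consistent column order
    # Normalize casing and whitespace so duplicates match across sources.
    df["name"] = df["name"].str.strip().str.upper()
    return df.drop_duplicates(subset=["name", "source"])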
4. Database Integration:
• Used mysql.connector in Python to insert processed data into the MySQL database,
as sketched below.
• Verified referential integrity and ensured that all data could be queried efficiently.
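A rough sketch of this step is given below: inserting standardized rows with mysql.connector and then exporting the database with mysqldump, as in the reporting step described earlier. The connection parameters, table name, and rows are placeholders.

import subprocess
import mysql.connector

# Placeholder credentials and database name.
conn = mysql.connector.connect(
    host="localhost", user="etl_user", password="secret", database="sanctions_db"
)
cur = conn.cursor()

rows = [("ACME CORP", "XX", "UN"), ("GLOBEX", "YY", "EU")]  # toy records
cur.executemany(
    "INSERT INTO sanctioned_entities (name, country, source) VALUES (%s, %s, %s)",
    rows,
)
conn.commit()
cur.close()
conn.close()

# Export the consolidated dataset to a .sql file for backup and archival.
with open("sanctions_backup.sql", "w") as out:
    subprocess.run(
        ["mysqldump", "-u", "etl_user", "--password=secret", "sanctions_db"],
        stdout=out, check=True,
    )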
Chapter 4: Conclusion and Learning Outcomes
4.1 Conclusion
My time as an intern at eXtenso Data has been an ongoing journey of growth and
learning. Working on a challenging project that involves data extraction, transformation,
and loading from global sanctions sources has allowed me to enhance my technical skills
in Python and MySQL while gaining valuable insights into real-world data engineering
workflows.
Although the internship is still in progress, I have already gained hands-on experience in
addressing real business needs through designing an ETL pipeline and dealing with
diverse data formats. Collaborating with the technical team and receiving mentorship has
improved my communication and problem-solving skills, while also deepening my
interest in the fields of data engineering and compliance analytics.
I look forward to completing the internship and continuing to apply what I’ve learned to
the remaining phases of the project. This experience is shaping a strong foundation for my
future academic and professional aspirations, and I’m grateful for the opportunity to
contribute meaningfully while continuing to learn.
• Initiated work on integrating cleaned data into a MySQL database.
References
Achanta, A., & Boina, R. (2023). Evolving Paradigms of Data Engineering in the Modern
Era: Challenges, Innovations, and Strategies. International Journal of Science and
Research (IJSR), 12(10), 606–610. https://doi.org/10.21275/SR231007071729
eXtenso Data. (n.d.). Services - eXtensoData. Retrieved May 13, 2025, from
https://www.extensodata.com/services
Madaan, N., Kumar, U., & Jha, S. K. (2020). Big Data Analytics: A Literature Review
Paper. International Journal of Engineering Research & Technology, 8(10).
https://doi.org/10.17577/IJERTCONV8IS10003
Prajapati, S., & Mary, S. (2022). Study of ETL Process and Its Testing
Techniques. International Journal for Research in Applied Science and Engineering
Technology, 10(6), 871–877. https://doi.org/10.22214/IJRASET.2022.43931
Annex
i) Snapshot of code used for performing ETL
ii) Snapshot of code used for inserting Data into the Database
iii) Snapshot of parsing data from the data source