By
Mr. ARIF
Associate Professor
(NAAC Accredited with ‘B’ Grade and affiliated to Bengaluru North University)
Academic Year 2021 – 24
CERTIFICATE OF INTERNSHIP
This is to certify that SHAIK SABIR PASHA bearing Registration Number: U19ZE21S0164, a
student of Smt. Danamma Channabasavaiah College of Arts, Commerce, Science and Management
Studies has successfully completed an internship course from 10/06/2024 to 10/07/2024 at our
institution. During his internship, SHAIK SABIR PASHA worked in the Data Science Department
and gained experience in the following areas:
➢ Preparing Documentation
His conduct during his stay with us was satisfactory. We wish him all the best for his future
endeavours.
Date:
CERTIFICATE
This is to certify that SHAIK SABIR PASHA bearing Registered No. U19ZE21S0164, is a
student of VI Semester Bachelor of Computer Application of our College.
He has prepared an internship report entitled “Data Science with Python” at SkillForge E-Learning
Solutions Pvt. Ltd. from 10/06/2024 to 10/07/2024 towards the partial fulfilment of the requirements
of the Bachelor of Computer Application degree of Bengaluru North University.
Principal
I, SHAIK SABIR PASHA, Register Number: U19ZE21S0164, hereby declare that this report
entitled “Data Science with Python” was prepared by me during the internship period from
10/06/2024 to 10/07/2024 at SkillForge E-Learning Solutions Pvt. Ltd. under the supervision and
guidance of Mr. Krishnamurthy, Associate Professor of Computer Science Department, Smt. Danamma
Channabasavaiah College of Arts, Commerce, Science and Management Studies, Kolar.
Date: Signature
U19ZE21S0164
ACKNOWLEDGEMENT
The successful completion of this internship report required significant guidance and assistance
from many individuals, and I am truly grateful for their support throughout this journey.
Firstly, I would like to express my sincere appreciation to Mr. Bharath Kumar, Academic Head,
SkillForge E-Learning Solutions Pvt Ltd., for providing me with the opportunity to intern at their
esteemed organization.
I am also deeply grateful to our principal, Prof. Pushpalatha K, for their unwavering support and
for granting me the valuable opportunity to undertake this internship. I also express my sincere
thanks to my guide, Mr. Krishnamurthy, for his valuable guidance and timely suggestions at every
stage of this project.
I would like to extend my heartfelt thanks to my parents for their permission and constant
encouragement throughout this internship. Additionally, I am thankful to my friends for their support
whenever I needed their assistance during this project. Lastly, I would like to express my profound
gratitude to all individuals who directly or indirectly contributed to the completion of this report.
TABLE OF CONTENTS
1 Executive Summary
2 Introduction
3 Company Description
4 Experiential Learning
5 Tools used
6 Internship Outcomes
7 Conclusion
8 Bibliography
EXECUTIVE SUMMARY
This internship report delves into the application of Python for data science, exploring its
capabilities in extracting meaningful insights from complex datasets.
Through rigorous data exploration and analysis, key findings were uncovered regarding the
distribution of customer demographics, purchase patterns, and product preferences. These insights
were instrumental in understanding the underlying customer segments and informing subsequent
modelling efforts.
The outcomes of the study demonstrated the effectiveness of Python in addressing the research
objectives. The developed models achieved an accuracy in predicting customer churn and
provided valuable insights into customer behaviour. Additionally, the visualizations created
offered clear and compelling representations of customer segmentation and churn patterns,
facilitating insights and communication of results.
In conclusion, the internship successfully explored the potential of Python for data science. The
findings and outcomes contribute to the field by providing specific prediction
INTRODUCTION
Data Science:
Data Science, a multidisciplinary field, involves extracting valuable insights from data. Python, due
to its readability, versatility, and a rich ecosystem of libraries, has emerged as a preferred language
for data scientists. Its capabilities span data manipulation, analysis, visualization, and machine
learning.
The synergy between Python and data science has revolutionized various sectors. From finance and
healthcare to marketing and e-commerce, organizations are leveraging Python to make data-driven
decisions, optimize processes, and uncover new opportunities.
Data science is an interdisciplinary field that harnesses the power of data to extract meaningful
insights and drive informed decision-making.
By combining principles from mathematics, statistics, computer science, and domain expertise,
data scientists uncover hidden patterns, trends, and correlations within vast datasets.
Processes:
Big Data:
Big data refers to massive datasets that are complex and diverse, often generated at high speed,
making traditional data processing tools inadequate. It requires specialized techniques to extract
value and uncover hidden patterns.
Classification:
Classification is a supervised machine learning technique used to categorize data into predefined
classes or labels. It involves training a model on labeled data to learn patterns and predict the class
of new, unseen data points.
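The idea of training on labeled data and then predicting the class of an unseen point can be sketched with a very small nearest-centroid classifier. This is only an illustration using the Python standard library, not the method used during the internship; the customer data, the labels "low" and "high", and the helper names fit_centroids and predict are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Compute the mean point (centroid) of each class from labeled data."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, point):
    """Assign a new point to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Hypothetical toy data: (monthly visits, monthly spend) per customer.
train_points = [(1, 10), (2, 12), (1, 8), (9, 95), (10, 110), (8, 90)]
train_labels = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(train_points, train_labels)
print(predict(centroids, (2, 15)))   # closest to the "low" centroid
print(predict(centroids, (9, 100)))  # closest to the "high" centroid
```

The training step only summarises each class by its average point; prediction then compares a new point against those summaries. Real projects would normally rely on a library implementation and a proper train/test split.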
Analyse:
Analysis in data science involves exploring and investigating data to uncover insights, trends, and
relationships. It encompasses various statistical and exploratory techniques to understand data
characteristics and inform further analysis or modelling.
Statistics:
Statistics is the mathematical science concerned with collecting, organizing, analyzing, interpreting,
and presenting data. It provides the foundation for data-driven decision making and is essential for
drawing reliable conclusions from data.
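As a small illustration of summarising data, the sketch below computes three common descriptive statistics with Python's built-in statistics module. The order values are made up for the example.

```python
import statistics

# Hypothetical sample: order values for ten customers.
order_values = [12, 15, 11, 14, 13, 15, 40, 12, 14, 14]

mean_value = statistics.mean(order_values)
median_value = statistics.median(order_values)
spread = statistics.stdev(order_values)  # sample standard deviation

print(f"mean={mean_value}, median={median_value}, stdev={spread:.2f}")
# The single large order (40) pulls the mean above the median,
# which is why both are usually reported together.
```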
Solving:
Solving problems with data science involves applying statistical and computational methods to
extract meaningful insights from data. This includes tasks like data cleaning, exploration, modeling,
and evaluation to address specific business questions or challenges.
Decision Making:
Decision-making in data science is driven by the insights derived from data analysis. By
understanding patterns, trends, and relationships within the data, organizations can make informed
choices, optimize processes, and identify new opportunities.
Knowledge:
Knowledge in data science encompasses the theoretical foundations, practical skills, and domain
expertise necessary to effectively work with data. It involves understanding statistical concepts,
programming languages, machine learning algorithms, and the ability to communicate findings to
both technical and non-technical audiences.
Python:
Python is a high-level, interpreted programming language renowned for its simplicity and
readability. It offers a vast standard library and supports multiple programming paradigms, making
it versatile for various applications. Python's emphasis on code clarity and efficiency has contributed
to its widespread adoption in fields such as data science, web development, and automation.
Python has emerged as the go-to language for data scientists due to its simplicity, readability, and
powerful ecosystem of libraries. It's a versatile language that can handle everything from data
cleaning and exploration to complex machine learning models.
Readability: Python's clean syntax makes it easy to understand and write code, even for
those without a strong programming background.
Extensive Libraries: Python boasts a rich collection of libraries specifically designed for
data science, such as NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn.
Community Support: A large and active community ensures constant development and
support for Python's data science tools.
Versatility: Beyond data science, Python can be used for web development, automation, and
more, making it a valuable skill to have.
Data Cleaning and Preprocessing: Handling missing values, outliers, and inconsistencies.
Exploratory Data Analysis (EDA): Summarizing data, finding patterns, and visualizing
relationships.
Data Visualization: Creating informative and visually appealing charts and graphs.
In essence, Python's combination of ease of use, powerful libraries, and strong community
support has made it the preferred language for data scientists worldwide.
COMPANY DESCRIPTION
Company Overview:
SkillForge E-learning Solutions Pvt Ltd is a dynamic edutech company specializing in mentor-
led skilling programs for emerging technologies. With a strong focus on practical learning, they
offer hands-on bootcamps designed to bridge the gap between academia and industry demands.
Core Business:
Adapting curriculum to align with industry trends and job market requirements.
Target Audience:
SkillForge is driven by a mission to empower individuals with the necessary skills to excel in the
digital age. Their vision is to become a leading provider of high-quality, industry-aligned training
programs, enabling learners to achieve their career goals.
Services Offered
Flexible Learning: Online format allows students to learn at their own pace and
convenience.
Career Support: Guidance and assistance in job placement and career advancement.
Recognised By
SkillForge has experienced significant growth since its inception, expanding its course offerings and
student base. The company aims to further expand its reach by partnering with educational
institutions, corporations, and government organizations.
Like any startup, SkillForge faces challenges such as intense competition, maintaining course
quality, and scaling operations. However, the growing demand for skilled professionals in emerging
technologies presents significant opportunities for growth and expansion.
Certification Partners
Overall Assessment
SkillForge E-learning Solutions Pvt Ltd is a promising edutech company with a strong focus on
providing practical, industry-relevant training. Their commitment to student success and
adaptability to industry trends positions them well for future growth.
About SkillForge
SkillForge is an edutech company specializing in mentor-led skilling programs for emerging
technologies. Our hands-on bootcamps adapt to industry changes, providing practical skills in ML,
AI, Data Science, Cyber Security, Cloud Computing and more. We bridge education and industry
requirements, aiding students in upskilling and securing career opportunities for real-world success.
Our belief in upskilling for a brighter future resonates with the importance of continuous learning
in today's evolving tech world. We are on a mission to Change How India Learns!
Key Milestones:
Founded in 2023, SkillForge has successfully upskilled more than 10,000 students to date.
Launched 15 new programs over the last 6 months in response to the dynamic shifts in the
job market.
Partnered with new corporate certification providers to strengthen our vision of creating an
industry-ready workforce.
Launched Career Assistance Program (CAP) to partner with institutions, aiding their
placement cells in preparing students for the placement season through Interview
Preparation, Resume Building, Project Showcase & Mock Interviews
Website
www.skillforge.in
Founder Profiles
Vamsi Krishna P, Founder & CEO
● 16+ years of experience in building & scaling brands across agencies, corporates & startups
● Past Companies- Manipal UNext, Teabox, Licious, ChargeBee, Payback, Cognizant
● LinkedIn profile- https://www.linkedin.com/in/vamsikrishnap/
Silpa DV, Co-Founder & COO
Programs Domains
SkillForge will offer programs in the following domains:
Tech Domains- Data Science, Amazon Web Services, Cyber Security, AutoCAD, Artificial
Intelligence, Web Development, Machine Learning, Embedded Systems Using Proteus Software,
MongoDB With Django, MongoDB With NodeJS, MySQL with Spring Boot, ReactJS, Microsoft
Azure Cloud Computing, VLSI, Genetic Engineering
Non-Tech Domains- Digital Marketing, Human Resource Management, Machine Learning, Stock
Marketing, Finance, Hybrid & Electric Vehicle, Car Design, Construction Planning And Structural
Analysis, IC Engines, Internet Of Things, Robotics, Marketing Management, Nanoscience And
Nanotechnology, UI/UX Design, Business Analytics, Graphic Designing
Program USPs
● Acquire foundational skills with our 2-month bootcamps.
● Curriculum is designed to align with industry standards and help you launch your career.
● Gain insights and guidance from industry mentors who are experts and working professionals
in their fields. Enjoy the flexibility of learning online from the comfort of your home.
Program Details
● Up to 30 hours of learning (varies based on the domain)
● 6 months of extended LMS access
Program Benefits
● Course Completion Certificate
● Corporate Certification- Microsoft / Adobe / Autocad / Pearson VUE (exam cost is additional)
● Letter of Recommendation (LOR)- Based on merit for job & internship opportunities
Microsoft
AutoDesk
Adobe
Pearson Vue
Company Mission
SkillForge E-Learning Solutions Pvt Ltd, a company dedicated to "Changing How India
Learns!", embodies a mission to revolutionize the educational landscape in India. They believe that
learning should be accessible, innovative, and impactful for all. Through their unique approach,
SkillForge aims to break down traditional learning barriers and empower every individual to reach
their full potential.
SkillForge E-Learning Solutions Pvt Ltd, guided by its motto "Changing How India Learns!",
aims to disrupt the traditional education system in India. They envision a future where
learning is not a rigid, one-size-fits-all approach, but rather an accessible, innovative, and impactful
experience for every learner.
This translates into a commitment to developing and delivering cutting-edge e-learning solutions
that cater to the diverse needs of the Indian population. SkillForge recognizes the critical role
education plays in individual and national development. By making learning accessible and
engaging, they aim to empower learners across the country to unlock their full potential.
Through their innovative e-learning platform and focus on in-demand skills, SkillForge E-Learning
Solutions Pvt Ltd is actively working towards its mission and vision. They are committed to playing
a transformative role in shaping the future of education in India, ensuring that every individual has
the opportunity to learn, grow, and achieve their career aspirations.
Technology Infrastructure
SkillForge E-learning Solutions Pvt Ltd: A Focus on Emerging Technologies
SkillForge E-learning Solutions Pvt Ltd's core focus is on bridging the gap between academia and
industry, ensuring learners are equipped with the practical skills required for real-world success.
Customized Curriculum: Recognizing the dynamic nature of the tech industry, SkillForge
adapts its curriculum to align with industry trends and demands.
Career Support: Beyond technical skills, SkillForge provides career guidance, job
placement assistance, and networking opportunities to help learners transition into their
desired roles.
Online Learning Platform: A robust online platform supports the delivery of courses,
provides access to learning materials, and facilitates interaction between learners and
mentors.
Flexible Learning Options: SkillForge offers both online and offline learning modes to
cater to different learner preferences.
Career Support Services: Comprehensive career guidance and placement assistance set
SkillForge apart from traditional e-learning platforms.
EXPERIENTIAL LEARNING
Internship Overview
As a Data Science Intern at SkillForge, I gained valuable experience in the field of data analysis
and Python programming. My role involved a combination of data research, coding, and
documentation, providing a comprehensive understanding of the data science pipeline.
My internship at SkillForge as a Data Science Intern with a focus on Python was an invaluable
experience that provided a solid foundation for my career. My role primarily involved conducting
in-depth data research and analysis to extract meaningful insights. I honed my Python programming
skills by developing various scripts for data cleaning, manipulation, and visualization. From
exploratory data analysis to building predictive models, I had the opportunity to work on a diverse
range of projects. Additionally, I gained proficiency in preparing comprehensive documentation to
effectively communicate findings and methodologies to both technical and non-technical
stakeholders.
One of the most significant challenges I encountered was handling large and complex datasets.
Learning to efficiently process and clean such data required meticulous attention to detail and
problem-solving abilities. Moreover, understanding the business context of the data was crucial for
drawing relevant conclusions. Collaborating with domain experts to gain insights into the data's
nuances helped me overcome this challenge.
During my internship, I acquired a strong foundation in data science methodologies and tools. I
became proficient in using Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn for
data manipulation, analysis, and visualization. Furthermore, I developed critical thinking and
problem-solving skills as I tackled various data-related challenges. The experience of working in a
dynamic team environment also enhanced my collaboration and communication skills.
o Conducted in-depth research to identify relevant data sources for specific projects.
o Extracted and cleaned data from various formats (CSV, Excel, databases) to ensure
data quality.
o Explored and analyzed datasets to uncover insights and trends using statistical
methods and visualization techniques.
Python Programming:
o Implemented data cleaning and preprocessing pipelines using Python libraries like
Pandas and NumPy.
o Built predictive models using machine learning algorithms (e.g., linear regression,
decision trees, random forests).
o Utilized Python libraries like Scikit-learn for model development and evaluation.
Documentation:
o Created clear and concise documentation for data pipelines, analysis steps, and
model development processes.
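To make the model-building step above more concrete, here is a minimal sketch of fitting a linear regression by ordinary least squares. It uses NumPy directly as a stand-in for Scikit-learn so that the arithmetic stays visible; the study-hours data is hypothetical and is not taken from the internship projects.

```python
import numpy as np

# Hypothetical data: hours of study vs. test score, generated from
# score = 5 * hours + 50 with no noise so the fit is easy to check.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = 5.0 * hours + 50.0

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([hours, np.ones_like(hours)])

# Ordinary least squares: find coefficients minimising ||X @ coef - scores||.
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
slope, intercept = coef

predicted = slope * 6.0 + intercept  # prediction for an unseen input
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

Because the data was generated without noise, the fitted slope and intercept recover the generating values; with real data the fit would only approximate the underlying trend.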
Data Research: Skill in conducting thorough data research and gathering relevant
information.
Documentation: Proficiency in creating clear and concise documentation for projects and
processes.
Teamwork: Ability to collaborate effectively with team members on data science projects.
Big Data Technologies: Exposure to tools like Hadoop, Spark, or cloud-based platforms.
Data Engineering: Experience with data pipelines, ETL processes, and database
management.
Natural Language Processing (NLP): Skills in text analysis and natural language
understanding.
Challenges Faced and Lessons Learned
Challenges
Data Quality and Consistency: Dealing with missing values, outliers, and inconsistencies
in data was a common challenge. This required significant data cleaning and preprocessing
efforts.
Exploratory Data Analysis (EDA): Extracting meaningful insights from complex datasets
can be time-consuming and requires a deep understanding of statistical methods.
Model Selection and Tuning: Choosing the right algorithm and optimizing its parameters
for a specific problem can be challenging.
Computational Resources: Handling large datasets often required efficient code and
potentially access to high-performance computing resources.
Time Management: Balancing multiple tasks, such as data exploration, model building,
and documentation, within tight deadlines can be stressful.
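One common way to approach the model selection challenge listed above is to hold back part of the data and compare candidate models on it. The sketch below assumes a simple setting, choosing a polynomial degree with NumPy; the data is synthetic and the helper name validation_mse is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: a quadratic trend with a little noise.
x = np.linspace(-3.0, 3.0, 40)
y = x**2 + rng.normal(scale=0.2, size=x.size)

# Hold out every second point for validation; train on the rest.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]


def validation_mse(degree):
    """Fit a polynomial of the given degree and score it on held-out data."""
    coefficients = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coefficients, x_val)
    return float(np.mean((predictions - y_val) ** 2))


# Compare candidate degrees and keep the one with the lowest validation error.
errors = {degree: validation_mse(degree) for degree in (1, 2, 3, 4)}
best_degree = min(errors, key=errors.get)
print(errors, best_degree)
```

A straight line (degree 1) cannot follow the curved trend, so its validation error is much larger than that of the higher degrees. Scoring on held-out points rather than on the training points is what guards against picking an over-fitted model.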
Lessons Learned
Domain Knowledge: Understanding the underlying business context helps in asking the
right questions and deriving valuable insights.
Iterative Process: Data science is an iterative process. Experimentation and refinement are
key to building effective models.
Version Control: Using tools like Git for code management is essential for collaboration
and tracking changes.
Continuous Learning: The field of data science is rapidly evolving, so staying updated with
the latest trends and techniques is important.
TOOLS USED
Jupyter Notebook
Jupyter Notebook is a powerful open-source web application that allows you to create and share
documents containing live code, equations, visualizations, and narrative text. It's widely used in data
science, machine learning, scientific computing, and education.
Key Features
Interactive Code Execution: You can write and run code directly in the notebook, and the
output is displayed immediately below the code cell.
Rich Text Format: Combines code with explanatory text, images, and mathematical
equations using Markdown.
Data Visualization: Easily create various types of plots and charts using libraries like
Matplotlib, Seaborn, and Plotly.
Kernel Support: Jupyter supports multiple programming languages (kernels) like Python,
R, Julia, and more.
Shareability: Notebooks can be shared as static HTML files or interactive web applications.
Collaboration: Multiple users can collaborate on the same notebook.
Jupyter Notebook provides a flexible and interactive environment for data scientists and researchers
to explore data, develop models, and communicate results effectively.
Numpy
NumPy (Numerical Python) is a fundamental Python library for numerical computing. It provides
high-performance multidimensional array objects and tools for working with these arrays. It's the
cornerstone for many scientific computing packages in Python.
Key Features
Multidimensional Arrays: NumPy's core data structure is the ndarray, which represents a
multidimensional array of homogeneous data types. This allows for efficient storage and
manipulation of large datasets.
Performance: NumPy is optimized for performance, often outperforming pure Python code
by several orders of magnitude, especially for numerical computations.
Integration: It seamlessly integrates with other scientific Python libraries like SciPy,
Pandas, and Matplotlib.
NumPy is essential for anyone working with numerical data in Python. Its efficiency, versatility,
and integration capabilities make it a powerful tool for data scientists, engineers, and researchers.
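A short sketch of the ndarray in use is given below. The spend figures are hypothetical; the calls shown (array creation, element-wise arithmetic, axis-wise aggregation, and broadcasting) are standard NumPy operations.

```python
import numpy as np

# A 2-D ndarray: rows are (hypothetical) customers, columns are monthly spend.
spend = np.array([
    [120.0, 135.0, 150.0],
    [80.0, 75.0, 90.0],
    [200.0, 210.0, 190.0],
])

print(spend.shape, spend.dtype)  # all elements share one data type

# Vectorised arithmetic applies to every element without a Python loop.
with_tax = spend * 1.18

# Aggregation along an axis: axis=1 collapses the columns (per-customer total).
totals = spend.sum(axis=1)

# Broadcasting: subtract each column's mean from every row.
centered = spend - spend.mean(axis=0)

print(totals)
print(centered.mean(axis=0))  # approximately zero in every column
```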
Pandas
Pandas is a Python library designed for data manipulation and analysis. It provides high-
performance, easy-to-use data structures and data analysis tools. Built on top of NumPy, Pandas
offers a flexible and efficient way to work with structured data.
Key Features
Data Import/Export: Pandas can read data from various file formats like CSV, Excel,
JSON, SQL databases, and more. It can also export data to these formats.
Data Cleaning and Preparation: Handles missing values, duplicates, outliers, and data
normalization effectively.
Data Manipulation: Offers functions for filtering, sorting, grouping, merging, and
reshaping data.
Time Series Analysis: Provides tools for working with time series data, including frequency
conversion, date range creation, and time-based calculations.
Data Visualization: While not as comprehensive as dedicated visualization libraries like
Matplotlib or Seaborn, Pandas provides basic plotting capabilities.
Performance: Built on NumPy, Pandas is optimized for performance, making it suitable for
large datasets.
Integration: Works seamlessly with other Python libraries like NumPy, Matplotlib, and
Scikit-learn.
Pandas is an essential tool for data scientists and analysts who work with structured data. Its versatility,
performance, and ease of use make it a popular choice for data manipulation and analysis tasks.
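The cleaning and manipulation features listed above can be sketched on a tiny DataFrame. The records below are hypothetical, and filling missing values with the median is just one possible choice, used here for illustration.

```python
import pandas as pd

# Hypothetical raw records with one missing value and one duplicated row.
raw = pd.DataFrame({
    "customer": ["A", "B", "C", "C", "D"],
    "segment": ["retail", "retail", "corporate", "corporate", "corporate"],
    "spend": [120.0, None, 300.0, 300.0, 500.0],
})

# Remove the exact duplicate row, then work on an independent copy.
clean = raw.drop_duplicates().copy()

# Fill the missing spend with the median of the observed values.
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# Group and aggregate: average spend per segment.
summary = clean.groupby("segment")["spend"].mean().sort_values(ascending=False)
print(clean)
print(summary)
```

In practice the same steps would usually start from pd.read_csv or pd.read_excel rather than a hand-built DataFrame.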
Matplotlib
Matplotlib is a powerful and versatile Python library primarily used for creating static, animated,
and interactive visualizations. It offers a wide range of plotting functionalities, making it a go-to
tool for data scientists, engineers, and researchers to explore and understand their data.
Core Concepts
Figure: Represents the overall canvas or window where plots are displayed.
Axes: Defines the plotting area within a figure. Each plot has its own axes.
Plot: The actual visualization of data on the axes, such as lines, bars, scatter points, etc.
Key Features
Diverse Plot Types: Matplotlib supports a vast array of plot types, including:
Line plots, Scatter plots, Bar charts, Histograms, Pie charts, Box plots, Contour plots, 3D
plots
Customization: Offers extensive customization options to control every aspect of a plot,
including:
Line styles, colors, and markers, Axis labels, titles, and legends, Grids and ticks, Text and
annotations, Figure size and layout
Integration with NumPy: Seamlessly works with NumPy arrays for efficient data handling
and plotting.
Object-Oriented Approach: Provides both a stateful (pyplot) and object-oriented interface
for creating plots.
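The Figure, Axes, and Plot concepts described above map directly onto the object-oriented interface. The following is a minimal sketch with made-up sales figures; it selects the non-interactive "Agg" backend so that it can run without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [10, 14, 9, 17]  # hypothetical values for illustration

# Object-oriented interface: create the Figure and its Axes explicitly.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, sales, marker="o", linestyle="--", color="tab:blue", label="Sales")
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.grid(True)
ax.legend()

# Render to an in-memory PNG; fig.savefig("sales.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Inside a Jupyter Notebook the backend line is normally unnecessary, since the figure is displayed inline below the cell.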
Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level
interface for creating attractive and informative statistical graphics. Designed to work seamlessly
with Pandas data structures, Seaborn makes it easy to explore and understand your data through
visualization.
Key Features
High-level interface: Seaborn simplifies the process of creating complex visualizations with
just a few lines of code.
Attractive default styles: It comes with built-in themes and color palettes that enhance the
visual appeal of your plots.
Integration with Pandas: Seamlessly works with Pandas DataFrames, making data
exploration and visualization efficient.
Statistical graphics: Offers a wide range of statistical plot types, including scatter plots,
regression plots, histograms, heatmaps, and more.
Customization: Provides options for customizing plot elements like colors, labels, and
styles.
Core Concepts
Statistical Visualization: Seaborn excels at creating visualizations that reveal underlying statistical
relationships in your data.
Data Structures: Seaborn works directly with Pandas data structures, which makes it easy to create
visualizations from your data.
Themes and Styles: Seaborn provides a consistent visual style for your plots through built-in
themes and color palettes.
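The short sketch below shows how a themed statistical plot can be produced straight from a Pandas DataFrame. The data is hypothetical, and the sketch assumes a Seaborn version that provides set_theme (0.11 or later).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "visits": [1, 2, 3, 4, 5, 6, 7, 8],
    "spend": [12, 18, 25, 31, 38, 47, 52, 60],
    "segment": ["new", "new", "new", "new",
                "returning", "returning", "returning", "returning"],
})

sns.set_theme(style="whitegrid")  # built-in theme applied to later plots

fig, ax = plt.subplots(figsize=(6, 4))
# Column names are passed directly; hue colours the points by segment.
sns.scatterplot(data=df, x="visits", y="spend", hue="segment", ax=ax)
ax.set_title("Spend against visits by segment")
plt.close(fig)
```

Because Seaborn draws onto a Matplotlib Axes, the usual Matplotlib methods such as set_title remain available for further customization.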
Power BI
Power BI is a robust business intelligence (BI) and data visualization toolset developed by
Microsoft. It empowers users to transform raw data into compelling, interactive insights that drive
informed decision-making.
Key Features
Data Connectivity: Power BI can connect to a wide range of data sources, including Excel
spreadsheets, databases (SQL Server, Oracle, etc.), cloud-based data (Azure, Salesforce),
and online services (Google Analytics, etc.).
Data Modeling: Users can create complex data models by combining data from multiple
sources, defining relationships, and creating calculated columns and measures.
Data Transformation: Power Query, a powerful data integration tool, allows users to clean,
transform, and shape data before analysis.
Data Visualization: Power BI offers a rich set of visualizations, including charts, graphs,
maps, and custom visuals to represent data effectively.
Interactive Dashboards: Users can create dynamic and interactive dashboards that bring
together multiple visualizations to tell a story.
Natural Language Queries: With Power BI, users can ask questions in natural language to
get insights from data.
Collaboration: Power BI supports collaboration among teams, enabling sharing and
commenting on reports and dashboards.
AI and Machine Learning Integration: Power BI integrates with AI and machine learning
services to provide advanced analytics capabilities.
INTERNSHIP OUTCOMES
Working within a dynamic data science team at SkillForge exposed me to industry best
practices and collaborative work environments. I learned the importance of effective
communication and teamwork in delivering data-driven solutions. I also gained insights into
the business implications of data science projects, understanding how data can inform
strategic decision-making. This experience broadened my perspective on the role of data
science in driving business growth.
My internship at SkillForge has solidified my passion for data science and equipped
me with the necessary skills to pursue a successful career in this field. I am eager to apply
my knowledge to tackle more complex and impactful projects. I aspire to become a
proficient data scientist who can leverage data to solve real-world problems and contribute
to innovative solutions. The experience gained during this internship has provided a strong
foundation for my future endeavors.
CONCLUSION
Overall, my internship has been a transformative experience. I have gained practical experience in
data science, developed a strong foundation in Python programming, and cultivated essential soft
skills such as problem-solving, critical thinking, and teamwork. The knowledge and skills acquired
during this internship will undoubtedly be invaluable as I pursue a career in data science. I am eager
to apply my learnings to future endeavours and contribute to innovative data-driven solutions.
BIBLIOGRAPHY
Python for Data Analysis by Wes McKinney: A foundational text for data manipulation
and analysis using Pandas.
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien
Géron: Comprehensive coverage of machine learning techniques and their implementation
in Python.
Data Science from Scratch by Joel Grus: Provides a deep dive into the underlying
algorithms and techniques used in data science, implemented from scratch in Python.
Data Visualization with Python and Plotly by Nicholas McQuown: Focuses on creating
interactive and visually appealing data visualizations using the Plotly library.