INTERNSHIP REPORT
A report submitted in partial fulfilment of the requirements for the Award of
Degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE)
By
N SWETHA REDDY
Regd.No.21781A3294
Under supervision of
Mr. Sarvesh Agrawal (Founder & CEO)
(Duration: 11/06/2023 to 22/07/2023)
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
R.V.S.NAGAR, CHITTOOR – 517 127. (A.P)
(Approved by AICTE, New Delhi, Affiliated to
JNTUA, Anantapur)
(Accredited by NBA, New Delhi & NAAC, Bangalore)
(An ISO 9001:2000 Certified Institution)
2021-2022
CERTIFICATE
This is to certify that the “Internship Report” submitted by N SWETHA
REDDY (Regd. No.: 21781A3294) is work done by her and submitted
during the 2022-2023 academic year, in partial fulfilment of the
requirements for the award of the Degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE), at Internshala Trainings.
DR.M.LAVANYA
ORGANISATION INFORMATION:
Internshala is an internship and online training
platform, based in Gurgaon, India. Founded by
Sarvesh Agrawal, an IIT Madras alumnus, in 2010,
the website helps students find internships with
organizations in India. The platform started in 2010 as a
WordPress blog that aggregated internships across India and
published articles on education, technology and the skill gap.
The website was launched in 2013, and Internshala launched
its online trainings in 2014.
ABOUT TRAINING:
1ST WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
11/06/2023 Sunday Data science overview
12/06/2023 Monday Introduction to python
13/06/2023 Tuesday Understanding the operators
14/06/2023 Wednesday Variables and data types
15/06/2023 Thursday Conditional statements
16/06/2023 Friday Looping statements
17/06/2023 Saturday Functions
2ND WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
18/06/2023 Sunday Data structure
19/06/2023 Monday Lists and Dictionaries
20/06/2023 Tuesday Understanding standard libraries in python
21/06/2023 Wednesday Reading a CSV file in python
22/06/2023 Thursday Data frames and basic operators with data frames
23/06/2023 Friday Indexing data frame
24/06/2023 Saturday Introduction to statistics
3RD WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
25/06/2023 Sunday Measures of central tendency
26/06/2023 Monday Understanding the spread of data
27/06/2023 Tuesday Data distribution
28/06/2023 Wednesday Introduction to probability
29/06/2023 Thursday Probabilities of discrete and continuous variable
30/06/2023 Friday Central limit theorem and Normal distribution
01/07/2023 Saturday Introduction to inferential statistics
4TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
02/07/2023 Sunday Understanding the confidence interval and margin of error
03/07/2023 Monday Hypothesis testing
04/07/2023 Tuesday T tests and Chi squared tests
05/07/2023 Wednesday Understanding the concept of correlation
06/07/2023 Thursday Introduction to predictive modelling
07/07/2023 Friday Understanding the types of predictive models
08/07/2023 Saturday Stages of predictive models
5TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
09/07/2023 Sunday Hypothesis generation
10/07/2023 Monday Data extraction and Data exploration
11/07/2023 Tuesday Reading the data into python
12/07/2023 Wednesday Variable identification
13/07/2023 Thursday Univariate analysis for continuous variables
14/07/2023 Friday Univariate analysis for categorical variables
15/07/2023 Saturday Bivariate analysis and treating missing values
6TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
16/07/2023 Sunday Treating missing values and how to treat outliers
17/07/2023 Monday Transforming the variables
18/07/2023 Tuesday Basics of model building
19/07/2023 Wednesday Linear Regression
20/07/2023 Thursday Logistic Regression
21/07/2023 Friday Decision Trees and K-Means
22/07/2023 Saturday Final Project
MODULE-1: INTRODUCTION TO DATA
SCIENCE
DATA SCIENCE OVERVIEW:
Data science is the study of data, just as the biological sciences
are the study of living organisms and the physical sciences are
the study of physical phenomena. Data is real, data has real
properties, and we need to study those properties if we are going
to work with data. Data science is a process, not an event: it is
the process of using data to understand many different aspects
of the world.
Predictive modelling:
Predictive modelling is a form of artificial intelligence that
uses data mining and probability to forecast or estimate more
granular, specific outcomes. For example, predictive
modelling could help identify customers who are likely to
purchase our new One AI software over the next 90 days.
Machine Learning: Machine learning is a branch of artificial
intelligence (AI) in which computers learn to act on and adapt to
new data without being explicitly programmed to do so. The
computer is able to act independently of human interaction.
Forecasting:
Forecasting is a process of predicting or estimating future
events based on past and present data and most commonly by
analysis of trends. "Guessing" doesn't cut it.
Types of Statistics:
3.2. Measures of Central Tendency
Mean: the average of all the values in a sample set.
Median: the central value of a sample set; the data set is ordered
from lowest to highest value and the exact middle value is taken.
Mode: the value that occurs most frequently in the sample set.
Range: a measure of how spread apart the values in a sample
or data set are.
Range = Maximum value - Minimum value
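These measures can be computed directly with Python's standard statistics module; the sample values below are made up for illustration:

```python
# Mean, median, mode and range of a small sample, using the standard library.
from statistics import mean, median, mode

data = [3, 7, 7, 2, 9, 7, 4]

print(mean(data))             # average of all values
print(median(data))           # middle value of the sorted data → 7
print(mode(data))             # most frequent value → 7
print(max(data) - min(data))  # range = maximum - minimum → 7
```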
3.3. Understanding the spread of data
The spread in data is the measure of how far the
numbers in a data set are away from the mean or
the median. The spread in data can show us how
much variation there is in the values of the data
set. It is useful for identifying if the values in the
data set are relatively close together or spread
apart.
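The most common numeric measures of spread are the variance and the standard deviation; a quick sketch with the standard statistics module (sample values invented):

```python
# Population variance and standard deviation: average squared distance from the mean.
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is exactly 5

variance = pvariance(data)   # average squared deviation from the mean
spread = pstdev(data)        # standard deviation: square root of the variance
print(variance, spread)
```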
3.4. Data Distribution
Data distribution is a function that specifies all possible
values for a variable and also quantifies the relative
frequency (probability of how often they occur).
Distributions are considered to be any population that
has a scattering of data. It’s important to determine the
population’s distribution so we can apply the correct
statistical methods when analysing it.
Boxplot:
It is based on the percentiles of the data, as shown in the figure below.
The top and bottom of the box are the 75th and 25th percentiles of
the data. The extended lines, known as whiskers, cover the range of
the rest of the data.
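The quartiles that define the box can be computed with the standard library's statistics.quantiles; note that the exact values depend on the interpolation method (the default "exclusive" method is used here):

```python
# Quartiles and interquartile range, the quantities a boxplot is drawn from.
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

q1, q2, q3 = quantiles(data, n=4)  # 25th, 50th and 75th percentiles
iqr = q3 - q1                      # height of the box in a boxplot
print(q1, q2, q3, iqr)
```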
Frequency Table:
It is a tool to distribute the data into equally spaced segments
and tells us how many values fall into each segment.
Histogram:
It is a way of visualizing the data distribution from a frequency
table, with bins on the x-axis and the data count on the y-axis.
Density Plot:
It is related to the histogram, as it shows data values distributed
as a continuous line; it is a smoothed version of a histogram.
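A frequency table of equal-width bins, the table a histogram is drawn from, can be sketched with collections.Counter (data values invented):

```python
# Frequency table with equal-width bins (width 5): the basis of a histogram.
from collections import Counter

data = [1, 3, 4, 6, 7, 7, 8, 12, 14, 15]
width = 5

# Map each value to the lower edge of its bin: 0-4, 5-9, 10-14, 15-19.
bins = Counter((value // width) * width for value in data)
for edge in sorted(bins):
    print(f"{edge}-{edge + width - 1}: {bins[edge]}")
```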
3.5. Introduction to Probability
Probability refers to the extent to which an event is likely to occur.
When an event occurs, like throwing a ball or picking a card from a
deck, there must be some probability associated with that event.
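As a concrete example, the probability of drawing a king from a standard 52-card deck is the number of favourable outcomes over the total number of outcomes:

```python
# Probability = favourable outcomes / total outcomes.
from fractions import Fraction

kings, cards = 4, 52
p_king = Fraction(kings, cards)  # reduces to 1/13
print(p_king)                    # → 1/13
```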
3.11. T tests
A t test is a statistical test that is used to compare the means
of two groups. It is often used in hypothesis testing to
determine whether a process or treatment actually has an
effect on the population of interest, or whether two groups are
different from one another.
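In practice one would call a library routine such as scipy.stats.ttest_ind; as an illustration of what it computes, here is the pooled (equal-variance) two-sample t statistic from first principles, on two made-up groups:

```python
# Pooled two-sample t statistic: difference of means scaled by the pooled spread.
from math import sqrt
from statistics import mean, variance

group_a = [1, 2, 3]
group_b = [2, 3, 4]
na, nb = len(group_a), len(group_b)

# Pooled sample variance combines both groups' sample variances.
sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
t = (mean(group_a) - mean(group_b)) / sqrt(sp2 * (1 / na + 1 / nb))
print(round(t, 4))  # → -1.2247
```

A statistic this close to zero would not be significant here; with real data, the statistic is compared against the t distribution to get a p-value.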
Unsupervised learning:
Unsupervised learning is the training of a machine using
information that is neither classified nor labelled, allowing
the algorithm to act on that information without guidance.
Data Cleaning is the process in which we refine our data sets. In the
process of data cleaning, we remove unnecessary and erroneous data.
It involves removing the redundant data and duplicate data from our
data sets.
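A minimal sketch of data cleaning with pandas (pandas assumed available; the tiny DataFrame is made up for illustration):

```python
# Data cleaning: drop duplicate rows, then rows with missing values.
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cara"],
    "age":  [23, 31, 31, None],   # Ben is duplicated; Cara's age is missing
})

cleaned = df.drop_duplicates().dropna()
print(cleaned)   # only Ann and Ben remain
```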
4.Build Predictive Model:
4.6.Data Exploration
Data exploration is the first step in data analysis involving the use
of data visualization tools and statistical techniques to uncover data
set characteristics and initial patterns.
First, identify the predictor (input) and target (output) variables.
Next, identify the data type and category of the variables.
Example: suppose we want to predict whether students will play
cricket or not (refer to the data set below).
4.9. Univariate Analysis for Continuous Variables:
In case of continuous variables, we need to understand the central
tendency and spread of the variable. These are measured using
various statistical metrics visualization methods as shown below:
Note:
Univariate analysis is also used to highlight missing and outlier
values. In the upcoming part of this series, we will look at methods to
handle missing and outlier values.
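With pandas, most of these metrics come from a single describe() call (pandas assumed available; the values below are invented):

```python
# Central tendency and spread of a continuous variable in one call.
import pandas as pd

ages = pd.Series([22, 25, 25, 28, 30, 35, 60])  # 60 looks like an outlier

summary = ages.describe()  # count, mean, std, min, quartiles, max
print(summary)
```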
4.10. Univariate Analysis for Categorical
Variables:
For categorical variables, we'll use a frequency table to understand the
distribution of each category. We can also read it as the percentage of values
under each category. It can be measured using two metrics,
Count and Count%.
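With pandas, both metrics come from value_counts (pandas assumed; the categories below are invented):

```python
# Count and Count% for a categorical variable.
import pandas as pd

marital = pd.Series(["single", "married", "married", "single", "divorced", "married"])

count = marital.value_counts()                      # Count per category
share = marital.value_counts(normalize=True) * 100  # Count% per category
print(count)
print(share)
```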
4.11.Bivariate Analysis:
Bi-variate Analysis finds out the relationship between two variables. Here, we
look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of
categorical and continuous variables.
Continuous & Continuous: while doing bivariate analysis between two continuous
variables, we look at a scatter plot to see the relationship, and the correlation
coefficient quantifies its strength.
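The strength of a linear relationship between two continuous variables is measured by the Pearson correlation coefficient, computed here from its definition on made-up data:

```python
# Pearson correlation: covariance scaled by the two standard deviations.
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x, a perfect linear relationship

mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(r)  # → 1.0
```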
Stages in building the model include data preprocessing, data visualization,
model evaluation and model prediction.
EXAMPLE:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# xtest, ytest are the held-out features and labels from the earlier train/test split.
X_set, y_set = xtest, ytest

# Build a fine grid covering the feature space for the decision-region plot.
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Scatter the test points, coloured by their true class.
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.legend()
plt.show()
OUTPUT:
4.18. Decision Trees:
Decision Trees (DTs) are a non-parametric supervised learning
method used for classification and regression. The goal is to create a
model that predicts the value of a target variable by learning simple
decision rules inferred from the data features.
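A full decision tree (e.g. scikit-learn's DecisionTreeClassifier) recursively picks such rules; the core idea can be sketched as a single-split "stump" that scans for the threshold with the fewest misclassifications (data invented):

```python
# A one-level decision tree: find the threshold that best separates two classes.
def best_stump(xs, ys):
    """Return (threshold, errors) minimising misclassification of binary labels."""
    best = (None, len(ys) + 1)
    for t in sorted(set(xs)):
        # Rule: predict class 1 when x >= t, class 0 otherwise.
        errors = sum(int(x >= t) != y for x, y in zip(xs, ys))
        if errors < best[1]:
            best = (t, errors)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
threshold, errors = best_stump(xs, ys)
print(threshold, errors)  # → 10 0
```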
4.19. K-Means:
K-means is an unsupervised learning method for clustering
data points. The algorithm iteratively divides data points into
K clusters by minimizing the variance in each cluster.
Figure 1: shows the representation of data of different items.
Figure 2: The items are grouped together.
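The grouping shown in the figures can be reproduced by a minimal one-dimensional K-means (Lloyd's algorithm), sketched here on made-up points:

```python
# Minimal 1-D K-means: assign each point to its nearest centroid, then move each
# centroid to the mean of its cluster, and repeat until the assignments settle.
from statistics import mean

def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 10, 11], centroids=[1, 2])
print(centroids, clusters)  # two clusters: [1, 2] and [10, 11]
```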
FINAL PROJECT
Problem Statement:
Your client is a retail banking institution. Term deposits are a major
source of income for a bank.
A term deposit is a cash investment held at a financial institution.
Your money is invested for an agreed rate of interest over a fixed
amount of time, or term.
The bank has various outreach plans to sell term deposits to their
customers such as email marketing, advertisements, telephonic
marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most
effective ways to reach out to people. However, they require huge
investment as large call centers are hired to actually execute these
campaigns. Hence, it is crucial to identify the customers most likely to
convert beforehand so that they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their
job type, their marital status, etc. Along with the client data, you are
also provided with the information of the call such as the duration of
the call, day and month of the call, etc. Given this information, your
task is to predict whether the client will subscribe to a term deposit.
DATA:
1.train.csv:
Use this dataset to train the model. This file contains all the client and
call details as well as the target variable “subscribed”. You have to
train your model using this file.
2.test.csv:
Use the trained model to predict whether a new set of clients will
subscribe to the term deposit.
DATA DICTIONARY
SOLUTION:
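The full solution is not reproduced here; the sketch below shows the shape of such a pipeline, assuming pandas and scikit-learn, with a tiny synthetic stand-in for train.csv (the column names age, job and subscribed come from the problem statement; all values are invented):

```python
# Illustrative term-deposit pipeline: one-hot encode categoricals, fit a classifier.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for train.csv; the real file has many more rows and columns.
train = pd.DataFrame({
    "age":        [25, 40, 35, 50, 28, 60],
    "job":        ["student", "admin", "admin", "retired", "student", "retired"],
    "subscribed": [0, 0, 0, 1, 0, 1],
})

X = pd.get_dummies(train[["age", "job"]])  # categorical "job" becomes 0/1 columns
y = train["subscribed"]

model = LogisticRegression().fit(X, y)
predictions = model.predict(X)  # in practice, predict on the test.csv clients
print(predictions)
```

The same get_dummies encoding would be applied to test.csv before calling predict, taking care that both files end up with the same columns.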
CONCLUSION
In conclusion, I can say that the internship was a great
experience. Thanks to this project, I acquired deeper
technical knowledge and developed the skills to build
and assess data-based models.
TRAINING CERTIFICATE
Data is regularly collected by businesses and companies through
transactions and website interactions. Many companies face a
common challenge: to analyze and categorize the data that is
collected and stored. A data scientist becomes the savior in a
situation of mayhem like this. Companies can progress a lot with
proper and efficient handling of data, which results in productivity.