
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY

(AUTONOMOUS)

R.V.S.Nagar, Chittoor – 517 127. (A.P)


(Approved by AICTE, New Delhi, Affiliated to JNTUA,
Anantapur)
(Accredited by NBA, New Delhi & NAAC, Bangalore)
(An ISO 9001:2000 Certified Institution)
2022-2023

INTERNSHIP REPORT
A report submitted in partial fulfilment of the requirements for the Award of
Degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE)
By
N SWETHA REDDY
Regd.No.21781A3294
Under supervision of
Mr. Sarvesh Agrawal (Founder & CEO)
(Duration: 11/06/2023 to 22/07/2023)
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
R.V.S.NAGAR, CHITTOOR – 517 127. (A.P)
(Approved by AICTE, New Delhi, Affiliated to
JNTUA, Anantapur)
(Accredited by NBA, New Delhi & NAAC, Bangalore)
(An ISO 9001:2000 Certified Institution)
2022-2023

CERTIFICATE
This is to certify that the “Internship Report” submitted by N SWETHA
REDDY (Regd. No.: 21781A3294) is work done by her and submitted
during the 2022-2023 academic year, in partial fulfilment of the
requirements for the award of the Degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE), at Internshala Trainings.
Internship Coordinator                              DR. M. LAVANYA
                                                    Head of the Department (DATA SCIENCE)
CERTIFICATE
ACKNOWLEDGEMENT

 Grateful thanks to Dr. R. Venkataswamy, Chairman of Sri
Venkateswara College of Engineering & Technology (Autonomous), for
providing education in this esteemed institution. I wish to record my
deep sense of gratitude and profound thanks to our beloved Vice
Chairman, Sri R. V. Srinivas, for his valuable support throughout the
course.
 I express my sincere thanks to Dr. M. MOHAN BABU, our beloved
Principal, for his encouragement and suggestions during the course of
study.
 With a deep sense of gratefulness, I acknowledge Dr. M. LAVANYA,
Head of the Department, Computer Science & Engineering (Data Science),
for giving us inspiring guidance in undertaking the internship.
 I express my sincere thanks to the internship coordinator, Mr.
RADHAKRISHNA, for his keen interest, stimulating guidance, and constant
encouragement at all stages of this work, which brought this report to
fruition.
 I wish to convey my gratitude and sincere thanks to all members for their
support and cooperation rendered towards the successful submission of this report.
 Finally, I would like to express my sincere thanks to all teaching and non-
teaching faculty members, my parents, and friends, and to all those who
have supported me in completing the internship successfully.
(NAME:N SWETHA REDDY)
(ROLL.NO. 21781A3294)

ORGANISATION INFORMATION:
Internshala is an internship and online training platform based in
Gurgaon, India. Founded in 2010 by Sarvesh Agrawal, an IIT Madras
alumnus, the website helps students find internships with organisations
in India. The platform started in 2010 as a WordPress blog that
aggregated internships across India along with articles on education,
technology, and the skill gap. The website was launched in 2013, and
Internshala launched its online trainings in 2014.
ABOUT TRAINING:

The Data Science Training by Internshala is a 6-week online training
program that aims to provide a comprehensive introduction to data
science. In this training program, you learn the basics of Python,
statistics, predictive modeling, and machine learning. The program is
delivered through video tutorials and is packed with assignments,
assessment tests, quizzes, and practice exercises to give you a
hands-on learning experience.
INTRODUCTION TO ORGANIZATION

ABOUT TRAINING

Module-1: Introduction to Data Science


1.1. Data Science Overview

Module-2: Python for Data Science


2.1. Introduction to Python
2.2. Understanding Operators
2.3. Variables and Data Types
2.4. Conditional Statements
2.5. Looping Constructs
2.6. Functions
2.7. Data Structure
2.8. Lists
2.9. Dictionaries
2.10. Understanding Standard Libraries in Python
2.11. Reading a CSV File in Python
2.12. Data Frames and basic operations with Data Frames
2.13. Indexing Data Frame

Module-3: Understanding the Statistics for Data Science


3.1. Introduction to Statistics
3.2. Measures of Central Tendency
3.3. Understanding the spread of data
3.4. Data Distribution
3.5. Introduction to Probability
3.6. Probabilities of Discrete and Continuous Variables
3.7. Central Limit Theorem and Normal Distribution
3.8. Introduction to Inferential Statistics
3.9. Understanding the Confidence Interval and margin of error
3.10. Hypothesis Testing
3.11. T tests
3.12. Chi Squared Tests
3.13. Understanding the concept of Correlation

Module-4: Predictive Modeling and Basics of Machine Learning


4.1. Introduction to Predictive Modeling
4.2. Understanding the types of Predictive Models
4.3. Stages of Predictive Models
4.4. Hypothesis Generation
4.5. Data Extraction
4.6. Data Exploration
4.7. Reading the data into Python
4.8. Variable Identification
4.9. Univariate Analysis for Continuous Variables
4.10. Univariate Analysis for Categorical Variables
4.11. Bivariate Analysis
4.12. Treating Missing Values
4.13. How to treat Outliers
4.14. Transforming the Variables
4.15. Basics of Model Building
4.16. Linear Regression
4.17. Logistic Regression
4.18. Decision Trees
4.19. K-means
WEEKLY OVERVIEW OF INTERNSHIP
ACTIVITIES

1ST WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
11/06/2023 Sunday Data science overview
12/06/2023 Monday Introduction to python
13/06/2023 Tuesday Understanding the operators
14/06/2023 Wednesday Variables and data types
15/06/2023 Thursday Conditional statements
16/06/2023 Friday Looping statements
17/06/2023 Saturday Functions

2ND WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
18/06/2023 Sunday Data structure
19/06/2023 Monday Lists and Dictionaries
20/06/2023 Tuesday Understanding standard libraries in python
21/06/2023 Wednesday Reading a CSV file in python
22/06/2023 Thursday Data frames and basic operations with data frames
23/06/2023 Friday Indexing data frame
24/06/2023 Saturday Introduction to statistics
3RD WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
25/06/2023 Sunday Measures of central tendency
26/06/2023 Monday Understanding the spread of data
27/06/2023 Tuesday Data distribution
28/06/2023 Wednesday Introduction to probability
29/06/2023 Thursday Probabilities of discrete and continuous variable
30/06/2023 Friday Central limit theorem and Normal distribution
01/07/2023 Saturday Introduction to inferential statistics

4TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
02/07/2023 Sunday Understanding the confidence interval and margin of error
03/07/2023 Monday Hypothesis testing
04/07/2023 Tuesday T tests and Chi squared tests
05/07/2023 Wednesday Understanding the concept of correlation
06/07/2023 Thursday Introduction to predictive modelling
07/07/2023 Friday Understanding the types of predictive models
08/07/2023 Saturday Stages of predictive models
5TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
09/07/2023 Sunday Hypothesis generation
10/07/2023 Monday Data extraction and Data exploration
11/07/2023 Tuesday Reading the data into python
12/07/2023 Wednesday Variable identification
13/07/2023 Thursday Univariate analysis for continuous variables
14/07/2023 Friday Univariate analysis for categorical variables
15/07/2023 Saturday Bivariate analysis and treating missing values

6TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
16/07/2023 Sunday Treating missing values and how to treat outliers
17/07/2023 Monday Transforming the variables
18/07/2023 Tuesday Basics of model building
19/07/2023 Wednesday Linear Regression
20/07/2023 Thursday Logistic Regression
21/07/2023 Friday Decision Trees and K-Means
22/07/2023 Saturday Final Project
MODULE-1: INTRODUCTION TO DATA SCIENCE
DATA SCIENCE OVERVIEW:
Data science is the study of data. Just as the biological sciences are the
study of biology and the physical sciences are the study of physical
phenomena, data science studies data itself. Data is real, data has real
properties, and we need to study them if we are going to work with them.
Data science involves data and some science. It is a process, not an
event: the process of using data to understand many different things,
to understand the world.

What is statistical modelling?


The statistical modelling process is a way of applying statistical
analysis to datasets in data science. The statistical model involves a
mathematical relationship between random and non-random variables.
A statistical model can provide intuitive visualizations that aid data
scientists in identifying relationships between variables and making
predictions by applying statistical models to raw data. Examples of
common data sets for statistical analysis include census data, public
health data, and social media data.
What is meant by statistical computing?
Computational statistics, or statistical computing, is the bond
between statistics and computer science. It means statistical
methods that are enabled by using computational methods. It
is the area of computational science (or scientific computing)
specific to the mathematical science of statistics.

Predictive modelling:
Predictive modelling is a form of artificial intelligence that
uses data mining and probability to forecast or estimate more
granular, specific outcomes. For example, predictive
modelling could help identify customers who are likely to
purchase our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) where
computers learn to act and adapt to new data without being explicitly
programmed to do so. The computer is able to act independently of
human interaction.

Forecasting:
Forecasting is a process of predicting or estimating future
events based on past and present data and most commonly by
analysis of trends. "Guessing" doesn't cut it.

Applications of Data Science:


Data science and big data are making an undeniable impact
on businesses, changing day-to-day operations, financial
analytics, and especially interactions with customers. It's clear
that businesses can gain enormous value from the insights
data science can provide, but sometimes it's hard to see
exactly how, so let's look at some examples. In this era of big
data, almost everyone generates masses of data every day,
often without being aware of it. This digital trace reveals the
patterns of our online lives. If you have ever searched for or
bought a product on a site like Amazon, you'll notice that it
starts making recommendations related to your search.

MODULE-2: PYTHON FOR DATA SCIENCE
2.1. Introduction to Python:
Python is a high-level, general-purpose, and very popular programming
language. Python (latest Python 3) is used in web development, machine
learning applications, and other cutting-edge areas of the software industry.
Below are some facts about the Python programming language:

• Python is currently the most widely used multi-purpose, high-level
programming language. Python allows programming in both object-oriented
and procedural paradigms.

• Python is used by almost all tech-giant companies like Google, Amazon,
Facebook, Instagram, Dropbox, Uber, etc.

• The biggest strength of Python is its huge collection of standard libraries,
which can be used for machine learning, GUI applications (like Kivy,
Tkinter, PyQt, etc.), and much more.

2.2. Understanding operators:


1. ARITHMETIC OPERATORS:

Arithmetic operators are used to perform mathematical operations


like addition, subtraction, multiplication and division.
2.RELATIONAL OPERATORS:

Relational operators compare values. They return either True or
False according to the condition.

3.LOGICAL OPERATORS:

Logical operators perform Logical AND, Logical OR, and Logical NOT operations.

OPERATOR  DESCRIPTION                                    SYNTAX
and       Logical AND: True if both operands are true    x and y
or        Logical OR: True if either operand is true     x or y
not       Logical NOT: True if the operand is false      not x
4.BITWISE OPERATORS:

Bitwise operators act on bits and perform bit-by-bit operations.

OPERATOR  DESCRIPTION          SYNTAX
&         Bitwise AND          x & y
|         Bitwise OR           x | y
~         Bitwise NOT          ~x
^         Bitwise XOR          x ^ y
>>        Bitwise right shift  x >> y
<<        Bitwise left shift   x << y
5.ASSIGNMENT OPERATORS:

Assignment operators are used to assign values to variables.

OPERATOR  DESCRIPTION                                                                        SYNTAX
=         Assign the value of the right-side expression to the left operand                 x = y + z
+=        Add AND: add the right operand to the left operand and assign to the left         a += b  (a = a + b)
-=        Subtract AND: subtract the right operand from the left operand and assign         a -= b  (a = a - b)
*=        Multiply AND: multiply the left operand by the right operand and assign           a *= b  (a = a * b)
//=       Divide (floor) AND: floor-divide the left operand by the right operand and assign a //= b  (a = a // b)
**=       Exponent AND: raise the left operand to the power of the right operand and assign a **= b  (a = a ** b)
&=        Perform bitwise AND on the operands and assign to the left operand                a &= b  (a = a & b)
|=        Perform bitwise OR on the operands and assign to the left operand                 a |= b  (a = a | b)
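
A minimal sketch illustrating these operator categories in Python (the values used are illustrative only):
EXAMPLE:
# Illustration of Python operator categories (illustrative values only)
a, b = 6, 4

# Arithmetic operators
print(a + b, a - b, a * b, a / b)      # 10 2 24 1.5

# Relational operators return True or False
print(a > b, a == b)                   # True False

# Logical operators
print(a > 0 and b > 0, not a > 0)      # True False

# Bitwise operators act on the binary representations
print(a & b, a | b, a ^ b, a << 1)     # 4 6 2 12

# Assignment operators update a variable in place
a += b                                 # same as a = a + b
print(a)                               # 10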

2.3. Variables and Data Types


Variables, as the name suggests, are values that can vary. In a
programming language, a variable is a memory location where you store
a value, and the value that you have stored may change in the future
according to the specifications.
2.4. Conditional Statements
Conditional statements (if, else, and elif) are fundamental
programming constructs that allow you to control the flow of
your program based on conditions that you specify. They
provide a way to make decisions in your program and execute
different code based on those decisions.
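
A short sketch of if, elif, and else in Python (the variable name marks and the thresholds are illustrative assumptions):
EXAMPLE:
marks = 72                      # illustrative value

if marks >= 60:                 # executed when the first condition is true
    print("First class")
elif marks >= 50:               # checked only if the first condition failed
    print("Second class")
else:                           # executed when no condition above is true
    print("Needs improvement")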

2.5. Looping Constructs

Two types of looping constructs exist in Python: the while loop
and the for loop. The 'while' statement allows one to perform
repeated execution of a block of statements as long as a
condition is true.
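
A minimal sketch of the two looping constructs (the numbers are illustrative):
EXAMPLE:
# for loop: iterate over a known sequence of values
for i in range(3):
    print("for iteration", i)

# while loop: repeat as long as the condition remains true
count = 0
while count < 3:
    print("while iteration", count)
    count += 1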

2.6. Functions

A function is a block of code which only runs when it is


called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
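
A minimal function sketch showing parameters and a return value (the function name and values are illustrative):
EXAMPLE:
def add(x, y):
    """Return the sum of two numbers."""
    return x + y

result = add(3, 5)    # data passed into the function as parameters
print(result)         # 8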

2.7. Data Structure

Data structures are a way of organizing data so that it can
be accessed more efficiently depending upon the situation.
Data structures are fundamental to any programming
language, around which a program is built. Python helps one
learn the fundamentals of these data structures in a simpler
way compared to other programming languages.

2.8. Lists

Lists are used to store multiple items in a single variable.

Lists are one of 4 built-in data types in Python used to store


collections of data, the other 3 are Tuple, Set, and Dictionary, all with
different qualities and usage.

Lists are created using square brackets

2.9. Dictionaries

Dictionaries are used to store data values in key:value pairs.


A dictionary is a collection which is ordered*, changeable, and
does not allow duplicates.
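
A small sketch of lists and dictionaries (the sample items are illustrative assumptions):
EXAMPLE:
# Lists store multiple items in a single variable and are created with square brackets
fruits = ["apple", "banana", "cherry"]
fruits.append("mango")            # lists are changeable
print(fruits[0], len(fruits))     # apple 4

# Dictionaries store data as key:value pairs and do not allow duplicate keys
student = {"name": "Swetha", "regd_no": "21781A3294"}
student["branch"] = "CSD"         # add or change a value by key
print(student["name"], list(student.keys()))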

2.10. Understanding Standard Libraries in Python

The Python Standard Library contains the exact syntax, semantics,
and tokens of Python. It contains built-in modules that provide access
to basic system functionality like I/O and some other core modules.
Most Python libraries are written in the C programming language.

2.11. Reading a CSV File in Python


There are various ways to read a CSV file that uses either the
CSV module or the pandas library.
 csv Module: The CSV module is one of the modules in
Python which provides classes for reading and writing
tabular information in CSV file format.
 pandas Library: The pandas library is one of the open-
source Python libraries that provide high-performance,
convenient data structures and data analysis tools and
techniques for Python programming.
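
A minimal sketch of both ways of reading a CSV file; the file name data.csv is an assumed placeholder:
EXAMPLE:
import csv
import pandas as pd

# csv module: each row is returned as a list of strings
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

# pandas library: the whole file is loaded into a DataFrame
df = pd.read_csv("data.csv")
print(df.head())      # first five rows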

2.12. Data Frames and basic operations with Data Frames

A data frame is a generic data object used to store tabular data.
Data frames are among the most popular data objects in data science
(in Python they are provided by the pandas library) because it is more
comfortable to analyse data in tabular form. A data frame can also be
thought of as a matrix-like structure where each column can be of a
different data type. A data frame is made up of three principal
components: the data, the rows, and the columns. A short pandas
sketch of these operations follows the list below.

Operations that can be performed on a Data Frame are:

 Creating a Data Frame


 Accessing rows and columns
 Selecting the subset of the data frame
 Editing data frames
 Adding extra rows and columns to the data frame
 Add new variables to data frame based on existing ones
 Delete rows and columns in a data frame
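
A minimal pandas sketch of these operations (the column names and values are illustrative assumptions):
EXAMPLE:
import pandas as pd

# Creating a Data Frame
df = pd.DataFrame({"name": ["A", "B", "C"], "marks": [78, 85, 62]})

# Accessing rows and columns
print(df["marks"])                    # a single column
print(df.iloc[0])                     # the first row

# Selecting a subset of the data frame
passed = df[df["marks"] >= 70]

# Editing values and adding rows / columns
df.loc[2, "marks"] = 65               # edit a value
df["grade"] = ["B", "A", "C"]         # new column based on existing data
df.loc[len(df)] = ["D", 90, "A"]      # add an extra row

# Deleting rows and columns
df = df.drop(columns=["grade"]).drop(index=0)
print(df)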

2.13. Indexing Data Frame


Indexing in pandas means simply selecting particular rows and
columns of data from a Data Frame. Indexing could mean selecting
all the rows and some of the columns, some of the rows and all of the
columns, or some of each of the rows and columns. Indexing can
also be known as Subset Selection.
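
A brief sketch of subset selection with pandas indexing (the DataFrame is an illustrative assumption):
EXAMPLE:
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "marks": [78, 85, 62]})

# Label-based indexing with .loc: rows by label, columns by name
print(df.loc[0:1, ["name"]])

# Position-based indexing with .iloc: rows and columns by integer position
print(df.iloc[:2, 1:])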

MODULE-3: UNDERSTANDING THE STATISTICS FOR DATA SCIENCE
3.1. Introduction to Statistics
Statistics simply means numerical data, and it is the field of mathematics
that deals with the collection, tabulation, and interpretation of numerical
data. It is a form of mathematical analysis that uses different quantitative
models to produce a set of results from experimental data or real-life
studies. Basic terminology of statistics:
Population – a collection or set of individuals, objects, or events whose
properties are to be analysed.
Sample – a subset of the population.

Types of Statistics: descriptive statistics and inferential statistics.
3.2. Measures of Central Tendency
Mean: the average of all values in a sample set.
Median: the central value of a sample set. The data set is ordered from
lowest to highest value and the exact middle value is taken.
Mode: the value that occurs most frequently in the sample set.
Range: a measure of how spread apart the values in a sample set or
data set are.
Range = Maximum value - Minimum value
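
A small sketch computing these measures with Python's statistics module (the sample values are illustrative):
EXAMPLE:
import statistics

data = [4, 8, 6, 5, 3, 8, 9]

print("Mean:", statistics.mean(data))       # about 6.14
print("Median:", statistics.median(data))   # 6
print("Mode:", statistics.mode(data))       # 8
print("Range:", max(data) - min(data))      # 6
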
3.3. Understanding the spread of data
The spread in data is the measure of how far the
numbers in a data set are away from the mean or
the median. The spread in data can show us how
much variation there is in the values of the data
set. It is useful for identifying if the values in the
data set are relatively close together or spread
apart.
3.4. Data Distribution
Data distribution is a function that specifies all possible
values for a variable and also quantifies the relative
frequency (probability of how often they occur).
Distributions are considered to be any population that
has a scattering of data. It’s important to determine the
population’s distribution so we can apply the correct
statistical methods when analysing it.
Boxplot:
A boxplot is based on the percentiles of the data. The top and bottom of
the box are the 75th and 25th percentiles of the data, and the extended
lines, known as whiskers, cover the range of the rest of the data.

Frequency Table:
A frequency table is a tool to distribute the data into equally spaced
ranges (segments) and tells us how many values fall in each segment.

Histogram:
A histogram is a way of visualising a data distribution through a
frequency table, with bins on the x-axis and the data count on the y-axis.
Density Plot:
A density plot is related to the histogram, as it shows data values
distributed as a continuous line. It is a smoothed version of the
histogram.
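
A minimal sketch of these visualisations with pandas and matplotlib (the randomly generated sample is an illustrative assumption; the density plot additionally needs scipy installed):
EXAMPLE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.Series(np.random.normal(loc=50, scale=10, size=500))

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(data)                   # boxplot: box between the 25th and 75th percentiles
axes[1].hist(data, bins=20)             # histogram: counts per bin
data.plot(kind="density", ax=axes[2])   # density plot: smoothed histogram
plt.show()

# Frequency table: counts of values falling in equally spaced ranges
print(pd.cut(data, bins=5).value_counts().sort_index())
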
3.5. Introduction to Probability
Probability refers to the extent of occurrence of events. When
an event occurs, like throwing a ball or picking a card from a deck,
then there must be some probability associated with that event.

3.6. Probabilities of Discrete and Continuous Variables
Discrete distribution is a probability distribution where
the random variable can only take on a finite or
countable number of values. In contrast, continuous
distribution refers to a probability distribution where the
random variable can take on any value within a certain
range or interval.
3.7. Central Limit Theorem and Normal
Distribution
The central limit theorem (CLT) states that the
distribution of sample means approximates a normal
distribution as the sample size gets larger, regardless of
the population's distribution.

3.8. Introduction to Inferential Statistics


In Inferential statistics, we make an inference from a sample
about the population. The main aim of inferential statistics is
to draw some conclusions from the sample and generalise
them for the population data.

3.9. Understanding the Confidence Interval and margin of error

Confidence Interval = x̄ ± z * (s / √n)

where x̄ is the sample mean, z is the critical value, s is the sample
standard deviation, and n is the sample size.

The margin of error can be calculated in two ways, depending on
whether you have parameters from a population or statistics from a
sample:

1. Margin of error (parameter) = Critical value x Standard deviation of the population.
2. Margin of error (statistic) = Critical value x Standard error of the sample.
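
A small sketch computing a 95% confidence interval and margin of error for a sample mean (the sample values and the 1.96 critical value for approximately 95% confidence are illustrative assumptions):
EXAMPLE:
import math
import statistics

sample = [52, 48, 55, 50, 47, 53, 49, 51]
n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)              # sample standard deviation

z = 1.96                                  # critical value for ~95% confidence
margin_of_error = z * (s / math.sqrt(n))

print("Mean:", mean)
print("Margin of error:", round(margin_of_error, 2))
print("95% CI:", (round(mean - margin_of_error, 2), round(mean + margin_of_error, 2)))
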
3.10. Hypothesis Testing

Hypothesis Testing is a type of statistical analysis in which


you put your assumptions about a population parameter to the
test. It is used to estimate the relationship between 2 statistical
variables.

3.11. T tests
A t test is a statistical test that is used to compare the means
of two groups. It is often used in hypothesis testing to
determine whether a process or treatment actually has an
effect on the population of interest, or whether two groups are
different from one another.

3.12. Chi Squared Tests


A chi-square test is a statistical test that is used to
compare observed and expected results. The goal of this
test is to identify whether a disparity between actual and
predicted data is due to chance or to a link between the
variables under consideration. As a result, the chi-
square test is an ideal choice for aiding in our
understanding and interpretation of the connection
between our two categorical variables.
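
A minimal sketch of a t test and a chi-square test using scipy.stats (the two sample groups and the contingency table are illustrative assumptions):
EXAMPLE:
from scipy import stats

# t test: compare the means of two independent groups
group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 29, 35, 32, 30, 33]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", round(t_stat, 3), "p value:", round(p_value, 4))

# chi-square test: compare observed counts of two categorical variables
observed = [[30, 10],     # e.g. rows = one category, columns = subscribed yes/no
            [20, 40]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("chi-square:", round(chi2, 3), "p value:", round(p, 4))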

3.13. Understanding the concept of


Correlation
Correlation is a statistical measure that expresses the
extent to which two variables are linearly related
(meaning they change together at a constant rate). It’s a
common tool for describing simple relationships
without making a statement about cause and effect.

MODULE-4: PREDICTIVE MODELING AND BASICS OF MACHINE LEARNING

4.1. Introduction to Predictive Modelling

Predictive analytics involves certain manipulations on data
from existing data sets with the goal of identifying some new
trends and patterns. These trends and patterns are then used to
predict future outcomes and trends. By performing predictive
analysis, we can predict future trends and performance. It is
also called prognostic analysis, and the word prognostic means
prediction.

4.2. Understanding the types of


Predictive Models
Supervised learning:
Supervised learning, as the name indicates, involves the presence of a
supervisor acting as a teacher. Basically, supervised learning is learning
in which we teach or train the machine using data that is well labelled,
which means some data is already tagged with the correct answer.

Unsupervised learning:
Unsupervised learning is the training of a machine using information
that is neither classified nor labelled, allowing the algorithm to act on
that information without guidance.

4.3. Stages of Predictive Models


Stages to perform predictive analysis:

Some basic steps should be performed in order to perform predictive


analysis.

1.Define Problem Statement:


Define the project outcomes, the scope of the effort, and the objectives,
and identify the datasets that are going to be used.
2.Data Collection:

Data collection involves gathering the necessary details required for


the analysis. It involves the historical or past data from an authorized
source over which predictive analysis is to be performed.
3.Data Cleaning:

Data Cleaning is the process in which we refine our data sets. In the
process of data cleaning, we remove un-necessary and erroneous data.
It involves removing the redundant data and duplicate data from our
data sets.
4.Build Predictive Model:

In this stage of predictive analysis, we use various algorithms to build
predictive models based on the patterns observed. It requires knowledge
of Python, statistics, MATLAB, and so on.
5.Model Monitoring:

Regularly monitor your models to check performance and ensure that


we have proper results. It is seeing how model predictions are
performing against actual data sets.

4.4. Hypothesis Generation


A hypothesis is a function that best describes the target in supervised
machine learning. The hypothesis that an algorithm comes up with
depends upon the data and also upon the restrictions and bias that we
have imposed on the data.

4.5. Data Extraction


In general terms, “mining” is the process of extracting some valuable
material from the earth, e.g., coal mining, diamond mining, etc. In the
context of computer science, “data mining” refers to the extraction of
useful information from a bulk of data or data warehouses. One can see
that the term itself is a little confusing. In the case of coal or diamond
mining, the result of the extraction process is coal or diamond, but in
the case of data mining, the result of the extraction process is not data!
Instead, the result of data mining is the patterns and knowledge that we
gain at the end of the extraction process. In that sense, data mining is
also known as knowledge discovery or knowledge extraction.

4.6.Data Exploration

Data exploration is the first step in data analysis involving the use
of data visualization tools and statistical techniques to uncover data
set characteristics and initial patterns.

During exploration, raw data is typically reviewed with a combination


of manual workflows and automated data-exploration techniques
to visually explore data sets, look for similarities, patterns and outliers
and to identify the relationships between different variables.

Steps of Data Exploration and Preparation:

Remember that the quality of your inputs decides the quality of your
output. So, once you have your business hypothesis ready, it makes
sense to spend a lot of time and effort here. By one estimate, data
exploration, cleaning, and preparation can take up to 70% of the total
project time. Below are the steps involved to understand, clean, and
prepare your data for building your predictive model:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Finally, we will need to iterate over steps 4-7 multiple times before we
come up with a refined model.

4.7. Reading the data into python


Python provides inbuilt functions for creating, writing and reading
files. There are two types of files that can be handled in python,
normal text files and binary files(written in binary language, 0s and
1s).
Text files:
In this type of file, each line of text is terminated with a special character called
EOL (End of Line), which is the new line character (‘\n’) in python by default.
Binary files:
In this type of file, there is no terminator for a line, and the data is stored after
converting it into machine-understandable binary language. Access modes govern
the type of operations possible in the opened file. They refer to how the file will be
used once it is opened. These modes also define the location of the file handle in
the file. The file handle is like a cursor, which defines from where the data has to
be read or written in the file. Different access modes for reading a file are:
1. Read Only ('r'):
Opens a text file for reading. The handle is positioned at the beginning of the file.
If the file does not exist, an I/O error is raised. This is also the default mode in
which a file is opened.
2. Read and Write ('r+'):
Opens the file for reading and writing. The handle is positioned at the beginning
of the file. An I/O error is raised if the file does not exist.
3. Append and Read ('a+'):
Opens the file for reading and writing. The file is created if it does not exist. The
handle is positioned at the end of the file, and data being written will be inserted
at the end, after the existing data.
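
A minimal sketch of these access modes; the file name notes.txt is an assumed placeholder:
EXAMPLE:
# Append-and-read mode ('a+'): creates the file if it does not exist and writes at the end
with open("notes.txt", "a+") as f:
    f.write("a new line added at the end\n")

# Read-only mode ('r'): the default mode; raises an I/O error if the file does not exist
with open("notes.txt", "r") as f:
    print(f.read())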

4.8. Variable Identification

First, identify the Predictor (input) and Target (output) variables. Next,
identify the data type and category of the variables. Example: suppose we
want to predict whether the students will play cricket or not (refer to the
data set below).
4.9. Univariate Analysis for Continuous Variables:
In the case of continuous variables, we need to understand the central
tendency and spread of the variable. These are measured using various
statistical metrics and visualization methods.
Note:
Univariate analysis is also used to highlight missing and outlier
values. In the upcoming part of this series, we will look at methods to
handle missing and outlier values.
4.10. Univariate Analysis for Categorical
Variables:
For categorical variables, we'll use a frequency table to understand the
distribution of each category. We can also read it as the percentage of values
under each category. It can be measured using two metrics, Count and Count%,
against each category. A bar chart can be used for visualization.

4.11.Bivariate Analysis:
Bi-variate Analysis finds out the relationship between two variables. Here, we
look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of
categorical and continuous variables.
Continuous & Continuous:

While doing bi-variate analysis between two continuous variables, we


should look at scatter plot. It is a nifty way to find out the relationship
between two variables. The pattern of scatter plot indicates the
relationship between variables.

To find the strength of the relationship, we use correlation.
Correlation varies between -1 and +1:
• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: no correlation
Correlation can be derived using the following formula:
Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))

Various tools have functions to identify the correlation between
variables. In Excel, the function CORREL() is used to return the
correlation between two variables, and SAS uses the procedure PROC
CORR to identify the correlation. These functions return the Pearson
correlation value to identify the relationship between two variables:

X: 65 72 78 65 72 70 65 68
Y: 72 69 79 69 84 75 60 73

Metric            Formula                   Value
Covariance(X,Y)   =COVAR(E6:L6,E7:L7)       18.77
Variance(X)       =VAR.P(E6:L6)             18.48
Variance(Y)       =VAR.P(E7:L7)             45.23
Correlation       =G10/SQRT(G11*G12)        0.65

In the above example, we have a good positive relationship (0.65) between
the two variables X and Y.
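
A minimal sketch reproducing this calculation in Python with numpy, using the same X and Y values:
EXAMPLE:
import numpy as np

X = np.array([65, 72, 78, 65, 72, 70, 65, 68])
Y = np.array([72, 69, 79, 69, 84, 75, 60, 73])

cov_xy = np.cov(X, Y, bias=True)[0, 1]             # population covariance, ~18.77
correlation = cov_xy / np.sqrt(X.var() * Y.var())  # Covariance / SQRT(Var(X) * Var(Y))

print(round(correlation, 2))                       # ~0.65
print(round(np.corrcoef(X, Y)[0, 1], 2))           # same value via np.corrcoef
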
Categorical & Categorical:
To find the relationship between two categorical variables, we can use
following methods:
• Two-way table:

We can start analyzing the relationship by creating a two-way table
of count and count%. The rows represent the categories of one variable
and the columns represent the categories of the other variable.
Stacked Column Chart:

This method is more of a visual form of Two-way table.


ANOVA:
It assesses whether the average of more than two groups is
statistically different.

4.12. Treating Missing Values

Missing values can be handled by deleting the rows or columns that
have null values. If a column has more than half of its rows as null,
then the entire column can be dropped. Rows that have one or more
column values as null can also be dropped.
Missing completely at random:
This is the case when the probability of a missing value is the same for
all observations. For example: respondents in a data collection process
decide that they will declare their earnings only after tossing a fair coin.
If a head occurs, the respondent declares his or her earnings and vice
versa. Here each observation has an equal chance of a missing value.
Missing at random:
This is the case when a variable is missing at random and the missing
ratio varies for different values or levels of other input variables. For
example: we are collecting data for age, and females have a higher
missing-value rate compared to males.
Missing that depends on unobserved predictors:
This is a case when the missing values are not random and are related
to the unobserved input variable. For example: In a medical study, if a
particular diagnostic causes discomfort, then there is higher chance of
drop out from the study. This missing value is not at random unless
we have included “discomfort” as an input variable for all patients.
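
A small pandas sketch of the basic treatments described above (the column names and values are illustrative assumptions):
EXAMPLE:
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 40, np.nan],
                   "income": [30000, 42000, np.nan, 52000, 39000]})

# Drop rows that contain one or more null values
dropped = df.dropna()

# Or fill the missing values, for example with the column mean
filled = df.fillna(df.mean())

print(dropped)
print(filled)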

4.13. How to treat Outliers:


An Outlier is an observation in a given dataset that lies far
from the rest of the observations. That means an outlier is
vastly larger or smaller than the remaining values in the set.
Outliers are treated by either deleting them or replacing the
outlier values with a logical value as per business and similar
data.
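
A brief sketch of one common way to detect and treat outliers, the IQR rule (the sample values are illustrative):
EXAMPLE:
import pandas as pd

values = pd.Series([12, 14, 13, 15, 16, 14, 13, 95])    # 95 lies far from the rest

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("Outliers:", outliers.tolist())

# Treatment: either drop the outliers or cap them at the boundary values
treated = values.clip(lower, upper)
print(treated.tolist())
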
4.14. Transforming the variables:

A variable transformation defines a transformation that is applied
to the values of a variable. In other words, for every object, the
transformation is applied to the value of the variable for that object.
4.15. Basics of Model Building:

The basic steps of model building are:

1. Loading the dataset

2. Understanding the dataset

3. Data preprocessing

4. Data visualization

5. Building a regression model

6. Model evaluation

7. Model prediction

4.16. Linear Regression

Linear regression is an approach for predicting a response using


a single feature. It is one of the most basic machine learning models
that a machine learning enthusiast gets to know about. In linear
regression, we assume that the two variables i.e. dependent and
independent variables are linearly related.
It is a machine learning algorithm based on supervised learning.
Regression models a target prediction value based on independent
variables, and it is mostly used for finding the relationship between
variables and forecasting. Different regression models differ based on
the kind of relationship they consider between the dependent and
independent variables and the number of independent variables being used.
EXAMPLE:
X 0 1 2 3 4 5 6 7 8
Y 1 3 2 5 7 8 8 9 10

For generality, we define:


x as feature vector, x = [x_1, x_2, …., x_n],
y as response vector, y = [y_1, y_2, …., y_n]
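
A minimal sketch fitting a simple linear regression to the example data above; scikit-learn is one common library choice and is assumed here:
EXAMPLE:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)   # feature vector
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10])                 # response vector

model = LinearRegression().fit(x, y)
print("Slope:", round(float(model.coef_[0]), 3))
print("Intercept:", round(float(model.intercept_), 3))
print("Prediction for x = 10:", round(float(model.predict([[10]])[0]), 2))
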
4.17. Logistic Regression:

Logistic regression aims to solve classification problems. It


does this by predicting categorical outcomes, unlike linear
regression that predicts a continuous outcome.
Logistic regression is basically a supervised classification
algorithm. In a classification problem, the target variable (or
output), y, can take only discrete values for a given set of
features (or inputs), X.

EXAMPLE:
# Visualising the decision boundary of a trained logistic regression classifier.
# Note: `classifier`, `xtest`, and `ytest` are assumed to come from earlier steps
# (a fitted classifier and the scaled test features / labels).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X_set, y_set = xtest, ytest

# Build a fine grid covering the feature space (Age vs Estimated Salary)
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

# Colour each grid point by the class the classifier predicts for it
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Overlay the actual test points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

OUTPUT: (plot of the classifier's decision regions on the test set, Age vs Estimated Salary)
4.18. Decision Trees:
Decision Trees (DTs) are a non-parametric supervised learning
method used for classification and regression. The goal is to create a
model that predicts the value of a target variable by learning simple
decision rules inferred from the data features.

Decision Tree Algorithm


In a decision tree, which resembles a flowchart, an inner node
represents a variable (or a feature) of the dataset, a tree branch
indicates a decision rule, and every leaf node indicates the
outcome of the specific decision. The first node from the top
of a decision tree diagram is the root node. We can split up
data based on the attribute values that correspond to the
independent characteristics. The recursive partitioning method
is used for the division of a tree into distinct elements. Making
decisions is aided by this decision tree's comprehensive
structure, which looks like a flowchart. It offers a
diagrammatic model that exactly mirrors how individuals
reason and choose. Because of this property of the flowchart,
decision trees are easy to understand and comprehend.
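
A minimal sketch of training a decision tree classifier with scikit-learn (the tiny dataset is an illustrative assumption):
EXAMPLE:
from sklearn.tree import DecisionTreeClassifier

# Features: [age, estimated salary]; target: purchased (0 = no, 1 = yes)
X = [[22, 20000], [25, 32000], [47, 65000], [52, 90000], [46, 41000], [56, 120000]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[30, 28000], [50, 80000]]))    # predicted classes for new clients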

4.19. K-Means:
K-means is an unsupervised learning method for clustering
data points. The algorithm iteratively divides data points into
K clusters by minimizing the variance in each cluster.
Figure 1: representation of the data points of different items.
Figure 2: the items grouped into clusters.
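
A short sketch of K-means clustering with scikit-learn (the 2-D points and the choice of K = 2 are illustrative assumptions):
EXAMPLE:
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)        # which cluster each point belongs to
print("Cluster centres:", kmeans.cluster_centers_)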

FINAL PROJECT
Problem Statement:
Your client is a retail banking institution. Term deposits are a major
source of income for a bank.
A term deposit is a cash investment held at a financial institution.
Your money is invested for an agreed rate of interest over a fixed
amount of time, or term.
The bank has various outreach plans to sell term deposits to their
customers such as email marketing, advertisements, telephonic
marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most
effective ways to reach out to people. However, they require huge
investment as large call centers are hired to actually execute these
campaigns. Hence, it is crucial to identify the customers most likely to
convert beforehand so that they can be specifically targeted via call.
You are provided with client data such as: age of the client, their
job type, their marital status, etc. Along with the client data, you are
also provided with information about the call, such as the duration of
the call, the day and month of the call, etc. Given this information, your
task is to predict whether the client will subscribe to a term deposit.

DATA:

You are provided with the following files:


1.train.csv:

Use this dataset to train the model. This file contains all the client and
call details as well as the target variable “subscribed”. You have to
train your model using this file.
2. test.csv:
Use the trained model to predict whether a new set of clients will
subscribe to the term deposit.
DATA DICTIONARY

Here is the description of all the variables.


Variable              Definition
ID                    Unique client ID
Age                   Age of the client
Job                   Type of job
Marital               Marital status of the client
Education             Education level
Default               Credit in default
Housing               Housing loan
Loan                  Personal loan
Contact               Type of communication
Month                 Contact month
Day of week           Day of week of contact
Duration              Contact duration
Campaign              Number of contacts performed during this campaign for this client
P days                Number of days that passed since the client was last contacted
Previous              Number of contacts performed before this campaign
Outcome               Outcome of the previous marketing campaign
Subscribed (target)   Has the client subscribed to a term deposit

SOLUTION:
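
The worked solution from the original report is not reproduced here; below is a minimal sketch of one possible approach, assuming train.csv and test.csv contain the columns described in the data dictionary with a "subscribed" target recorded as yes/no (the preprocessing and model choice are illustrative assumptions):
EXAMPLE:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the training data described in the data dictionary
train = pd.read_csv("train.csv")

# Simple illustrative preprocessing: encode categorical columns as dummy variables
X = pd.get_dummies(train.drop(columns=["ID", "subscribed"]))
y = train["subscribed"].map({"no": 0, "yes": 1})     # assumes a yes/no target

# Hold out part of the training data to check the model
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Predict for the new clients in test.csv, aligning its dummy columns with the training set
test = pd.read_csv("test.csv")
X_test = pd.get_dummies(test.drop(columns=["ID"])).reindex(columns=X.columns, fill_value=0)
test["subscribed"] = model.predict(X_test)
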
CONCLUSION
In conclusion, I can say that the internship was a great
experience. Thanks to this project, I acquired deeper
knowledge concerning my technical skills, and I am able to
develop the skill to build and assess data-based models.

A few factors that point to data science's future, demonstrating compelling
reasons why it is crucial to today's business needs, are:
• Companies' inability to handle data:

Data is being regularly collected by businesses and companies through
transactions and website interactions. Many companies face a common
challenge: to analyse and categorize the data that is collected and
stored. A data scientist becomes the saviour in a situation of mayhem
like this. Companies can progress a lot with proper and efficient
handling of data, which results in productivity.
• Data science is constantly evolving:

Career areas that do not carry any growth potential run the risk of
stagnating. This indicates that the respective fields need to constantly
evolve and undergo change for opportunities to arise and flourish in the
industry. Data science is a broad career path that is undergoing
development and thus promises abundant opportunities in the future.
Data science job roles are likely to get more specific, which in turn will
lead to specializations in the field. People inclined towards this stream
can exploit these opportunities and pursue what suits them best through
these specializations.

TRAINING CERTIFICATE
