Why Use Python
The following are the primary reasons to use Python in day-to-day work:
1. Python is object-oriented
Its structure supports such concepts as polymorphism, operator overloading, and multiple inheritance.
2. Indentation
Indentation is one of the greatest features in Python.
3. It’s free (open source)
Downloading and installing Python is free and easy.
4. It’s Powerful
Dynamic typing
Library utilities
No intermediate compile step
Python source code (m.py) is first compiled to byte code (m.pyc), which is then executed by the Python Virtual Machine (PVM). The byte code extension is .pyc (compiled Python code).
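As a minimal illustration of this step (a sketch using the standard py_compile module; the file name m.py is taken from the description above):
import py_compile
py_compile.compile('m.py')  # writes the byte code to __pycache__/m.cpython-XX.pyc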
Python can be run in two modes:
• Interactive Mode
• Script Mode
Running Python in interactive mode:
Instead of passing a script file to the interpreter, you execute code directly at the Python prompt. Once you are inside the Python interpreter, you can start typing statements immediately.
>>> print("hello world") hello world
# Relevant output is displayed on subsequent lines without the >>> symbol
>>> x=[0,1,2]
# Quantities stored in memory are not displayed by default.
>>> x
#If a quantity is stored in memory, typing its name will display it. [0, 1,
2]
>>> 2+3
5
The chevron >>> at the beginning of the line is the prompt the Python interpreter uses to indicate that it is ready. If the programmer types 2+6, the interpreter replies 8.
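For example:
>>> 2+6
8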
Running Python in script mode:
python MyFile.py
Working in interactive mode is better when Python programmers deal with small pieces of code, as you can type and execute them immediately; but when the code is more than a few lines, putting it in a script makes it easier to modify and reuse in the future.
Example:
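Save a one-line program in the file MyFile.py:
# MyFile.py
print("hello world")
Running python MyFile.py at the command line then prints hello world.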
Data types:
The data stored in memory can be of many types. For example, a student roll
number is stored as a numeric value and his or her address is stored as
alphanumeric characters. Python has various standard data types that are used
to define the operations possible on them and the storage method for each of
them.
Int:
Int, or integer, is a whole number, positive or negative, without
decimals, of unlimited length.
>>> print(24656354687654+2)
24656354687656
>>> print(20)
20
>>> print(0b10)
2
>>> print(0B10)
2
>>> print(0X20)
32
>>> 20
20
>>> 0b10
2
>>> a=10
>>> print(a)
10
# To verify the type of any object in Python, use the type() function:
>>> type(10)
<class 'int'>
>>> a=11
>>> print(type(a))
<class 'int'>
Float:
Float, or "floating point number," is a number, positive or negative, containing one or more decimals.
Floats can also be scientific numbers, with an "e" to indicate a power of 10.
>>> y=2.8
>>> y
2.8
>>> print(type(y))
<class 'float'>
>>> type(.4)
<class 'float'>
>>> 2.
2.0
Example:
x = 35e3
y = 12E4
z = -87.7e100
print(type(x))
print(type(y))
print(type(z))
Output:
<class 'float'>
<class 'float'>
<class 'float'>
Boolean:
Objects of Boolean type may have one of two values, True or False:
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
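Comparison expressions also evaluate to Boolean values:
>>> 10 > 9
True
>>> type(10 > 9)
<class 'bool'>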
String:
A string is a sequence of characters enclosed in single or double quotes. Inside a string literal, a backslash at the end of a line continues the string on the next line without adding a newline:
>>> print('a\
... b')
ab
>>> print('a\
... b\
... c')
abc
>>> print('a \n b')
a
 b
>>> print("mrcet \n college")
mrcet
 college
In Python (and almost all other common computer languages), a tab character
can be specified by the escape sequence \t:
>>> print("a\tb") a b
List:
A list is a general-purpose and the most widely used data structure in Python.
To create a list, use square brackets and separate the values with commas:
thislist = ["apple", "banana", "cherry"]
print(thislist)
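List items are indexed starting from 0, so they can be read or replaced by index:
>>> thislist = ["apple", "banana", "cherry"]
>>> thislist[0]
'apple'
>>> thislist[1] = "mango"
>>> thislist
['apple', 'mango', 'cherry']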
Dictionary
Dictionaries are used to store data values in key:value pairs.
A dictionary is a collection which is ordered (as of Python 3.7), changeable, and does not allow duplicate keys.
Dictionaries are written with curly brackets, and have keys and values:
Example
Create and print a dictionary:
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(thisdict)
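Values are looked up by key, for example:
print(thisdict["model"])  # prints Mustang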
Nested Dictionaries
A dictionary can contain other dictionaries; this is called a nested dictionary.
Example
Create a dictionary that contains three dictionaries:
myfamily = {
"child1" : {
"name" : "Emil",
"year" : 2004
},
"child2" : {
"name" : "Tobias",
"year" : 2007
},
"child3" : {
"name" : "Linus",
"year" : 2011
}
}
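To access an item in a nested dictionary, chain the keys from the outer dictionary inward:
print(myfamily["child2"]["name"])  # prints Tobias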
Feature Scaling
In data processing, we try to change the data in such a way that the model can process it without any problems, and feature scaling is one such process, in which we transform the data into a better version. Feature scaling is done to normalize the features in the dataset into a finite range.
There are several ways to do feature scaling. I will be discussing the five most commonly used feature scaling techniques:
1. Absolute Maximum Scaling
2. Min-Max Scaling
3. Normalization
4. Standardization
5. Robust Scaling
Absolute Maximum Scaling
In absolute maximum scaling, you divide every value in a column by the maximum absolute value of that column. If we do this for all the numerical columns, then all their values will lie between -1 and +1. The main disadvantage is that the technique is sensitive to outliers. For example, consider the feature square feet: if 99% of the houses have a square-foot area of less than 1,000, and even one house has an area of 20,000, then all the other values will be scaled down to less than 0.05.
I will be working with the sine and cosine functions throughout the article and show you how the scaling techniques affect their magnitude. sin() ranges between -1 and +1, and 50*cos() ranges between -50 and +50.
This is how they actually look: you will not even be able to tell that the red one is a sine graph; compared to the big blue graph it basically looks like a flat squiggly line.
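The snippets below assume the following setup (a minimal sketch; the names x, y1, and y2, the sample range, and the colours are assumptions inferred from the plotting calls that follow):
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.1)  # sample points
y1 = np.sin(x)             # ranges between -1 and +1
y2 = 50*np.cos(x)          # ranges between -50 and +50
plt.plot(x, y1, 'red')
plt.plot(x, y2, 'blue')
plt.show()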
y1_new = y1/max(abs(y1))  # divide by the maximum absolute value
y2_new = y2/max(abs(y2))
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
See from the graph that both datasets now range from -1 to +1 after the scaling.
However, if there is a single big outlier, the remaining values become significantly small, with many data points falling below even 0.01.
Min-Max Scaling
In min-max scaling you subtract the minimum value of the dataset from all the values and then divide by the range of the dataset (maximum - minimum). In this case your dataset will always lie between 0 and 1, whereas in the previous case it was between -1 and +1. Again, this technique is also prone to outliers.
y1_new = (y1-min(y1))/(max(y1)-min(y1))
y2_new = (y2-min(y2))/(max(y2)-min(y2))
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
Normalization
Instead of subtracting the minimum value as in the previous case, here we subtract the mean (average) value.
In scaling you are changing the range of your data, while in normalization you are changing the shape of the distribution of your data.
y1_new = (y1-np.mean(y1))/(max(y1)-min(y1))
y2_new = (y2-np.mean(y2))/(max(y2)-min(y2))
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
Standardization
In standardization we calculate the z-value for each data point and replace the data points with these values.
This makes sure that all the features are centred around zero with a standard deviation of 1. This is the best technique to use if your feature is normally distributed, like salary or age.
y1_new = (y1-np.mean(y1))/np.std(y1)
y2_new = (y2-np.mean(y2))/np.std(y2)
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
Robust Scaling
In this method you subtract the median value from all the data points and then divide by the interquartile range (IQR).
The IQR is the distance between the 25th percentile point and the 75th percentile point.
This method centres the median value at zero and is robust to outliers.
from scipy import stats
IQR1 = stats.iqr(y1, interpolation = 'midpoint')
y1_new = (y1-np.median(y1))/IQR1
IQR2 = stats.iqr(y2, interpolation = 'midpoint')
y2_new = (y2-np.median(y2))/IQR2
plt.plot(x,y1_new,'red')
plt.plot(x,y2_new,'blue')
Is Feature Scaling actually helpful?
Let’s look at an example of a College Admission dataset, in which your goal is to
predict the chance of admission for each student based on the other features given.
You can download the dataset from the link below.
https://www.kaggle.com/mohansacharya/graduate-admissions
import pandas as pd
df = pd.read_csv("Admission_Predict.csv")
df.head()
The dataset has a wide variety of features with different ranges. The first column, Serial No., is not important, so I am going to delete it. Then I separate out the target column and, further below, split the dataset into training and test sets.
df.drop("Serial No.",axis=1,inplace=True)
y = df['Chance of Admit ']
df.drop("Chance of Admit ",axis=1,inplace=True)
from sklearn.model_selection import train_test_split
Dealing with outliers using the Z-Score method
Outlier detection is one of the most widely used steps in any data science project, as the presence of outliers can lead to a bad machine learning model. Take a quick scenario from a linear regression problem where you have to predict a person's weight from their height. In general, someone with more height will also have more weight (a positive linear trend), but if we rarely have 3-4 people who have much less weight but comparatively more height, that data will be treated as bad data, or outliers, which in the end would not be a good fit for our regression model.
There are various ways to deal with outliers, though in this article we will focus entirely on the Z-Score method: its limitations, when to prefer other methods, and of course the complete implementation.
Data points with Z > +3 or Z < -3 are considered outliers and may be removed as follows using Python:
First load the dataset and look at the Height column:
import pandas as pd
import seaborn as sn
df=pd.read_csv('c:/users/lenovo/desktop/height.csv')
df.head()
Gender Height
0 Male 73.847017
1 Male 68.781904
2 Male 74.110105
3 Male 71.730978
4 Male 69.881796
df.Height.describe()
count 10000.000000
mean 66.367560
std 3.847528
min 54.263133
25% 63.505620
50% 66.318070
75% 69.174262
max 78.998742
Name: Height, dtype: float64
sn.histplot(df.Height, kde=True)
The histogram shows that the heights are roughly normally distributed, which is the setting where the Z-Score method works well. Next compute the mean and standard deviation:
mean=df.Height.mean()
mean
66.3675597548656
std_deviation=df.Height.std()
std_deviation
3.847528120795573
Now add a z-score column to the dataframe and filter on it:
df['zscore']=(df.Height-df.Height.mean())/df.Height.std()
df.head()
Gender Height zscore
0 Male 73.847017 1.943964
1 Male 68.781904 0.627505
2 Male 74.110105 2.012343
3 Male 71.730978 1.393991
4 Male 69.881796 0.913375
new_df=df[(df.zscore<3) & (df.zscore>-3)]
outlier=df[(df.zscore<-3)|(df.zscore>3)]
new_df
Gender Height zscore
0 Male 73.847017 1.943964
1 Male 68.781904 0.627505
2 Male 74.110105 2.012343
3 Male 71.730978 1.393991
4 Male 69.881796 0.913375
... ... ... ...
9995 Female 66.172652 -0.050658
9996 Female 67.067155 0.181830
9997 Female 63.867992 -0.649655
9998 Female 69.034243 0.693090
9999 Female 61.944246 -1.149651
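The removed rows are collected in outlier; with this data, 7 of the 10,000 heights fall outside the |z| < 3 band (the same rows that the explicit thresholds below pick out):
print(len(outlier))  # 7
print(len(new_df))   # 9993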
Alternatively, instead of adding a z-score column, the same rule can be applied as explicit cut-off values at mean ± 3 standard deviations, filtering the raw heights directly:
mean-3*std_deviation
54.824975392478876
mean+3*std_deviation
77.91014411725232
df[df.Height<54.82]
Gender Height
6624 Female 54.616858
9285 Female 54.263133
df[df.Height>77.91]
Gender Height
994 Male 78.095867
1317 Male 78.462053
2014 Male 78.998742
3285 Male 78.528210
3757 Male 78.621374
new_df=df[(df.Height>54.82) & (df.Height<77.91)]
new_df
Gender Height
0 Male 73.847017
1 Male 68.781904
2 Male 74.110105
3 Male 71.730978
4 Male 69.881796
... ... ...
9995 Female 66.172652
9996 Female 67.067155
9997 Female 63.867992
9998 Female 69.034243
9999 Female 61.944246
Returning to the College Admission dataset, split it into training and test sets:
x_train,x_test,y_train,y_test = train_test_split(df,y,test_size=0.2)
I am going to build a linear regression model, first without normalization and then with normalization, and check whether there is any improvement in the accuracy.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
import numpy as np
from sklearn import metrics
rmse = np.sqrt(metrics.mean_squared_error(y_test,pred))
rmse
0.06845052747026953
See that without normalization the root mean squared error comes out to be 0.0684, which looks small mainly because most of the values in y are less than 0.5.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(df)
df = sc.transform(df)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df,y,test_size=0.2)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
from sklearn import metrics
rmse = np.sqrt(metrics.mean_squared_error(y_test,pred))
rmse
0.05674870151306346
See that we get a significant reduction in the error when we use the standardization technique.
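One caveat on the snippet above: the scaler was fitted on the full dataset before splitting. In practice the scaler is usually fitted on the training split only and then applied to both splits, so that information from the test set does not leak into training. A minimal sketch:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # learn mean and std from the training data only
x_test = sc.transform(x_test)        # apply the same transformation to the test data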
Output Remark
1 Integer output
2 Addition output
3 Multiplication output
22 Output
50 Output
0 Remainder in output
4 Remainder in output
'cdefghijk' Output
'abcd' Output
'bcde' Output
[3, 4] Output
4 Output
'matlab' Output
Understanding if-else:
if 1==2:
    print('first')
else:
    print('last')
Output:
last
Understanding if-else:
if 1!=2:
    print('first')
else:
    print('last')
Output:
first
Understanding the if-elif-else ladder:
if 1==2:
    print('first')
elif 3==3:
    print('middle')
else:
    print('last')
Output:
middle
fruits = ["apple", "banana", "cherry"]
for x in fruits: Understanding for loop
print(x)
apple
banana Output
cherry
Defining a list and using a for loop:
seq=[1,2,3,4,5]
for i in seq:
    print(i)
Output:
1
2
3
4
5
fruits = ["apple", "banana", "cherry"]
for x in fruits:
if x == "banana": Printing using for loop (Use of break)
break
print(x)
apple Output
fruits = ["apple", "banana", "cherry"]
for x in fruits:
if x == "banana": Printing using for loop (Use of continue)
continue
print(x)
apple
Output
cherry
Printing a range (starts from 0):
for x in range(6):
    print(x)
Output:
0
1
2
3
4
5
A for loop over a range with two bounds:
for x in range(2, 6):
    print(x)
Output:
2
3
4
5
Printing using a while loop:
i=2
while i<5:
    print('i is: {}'.format(i))
    i=i+1
Output:
i is: 2
i is: 3
i is: 4
A nested list for practice:
nest=[1,2,[7,4],[5, ['Nisha', 'DSB'], [100,200,['Hello']], 23, 11],1,7]
Accessing 200 from the list:
nest[3][2][1]
Output:
200
Defining a dictionary:
d= {'k1':[1,2,3,{'tricky':['oh', 'man', 'inception', {'target':[1,2,3,'hello']}]}]}
Accessing 1 from the list:
d['k1'][3]['tricky'][3]['target'][0]
Output:
1