Computer School
Database
Earlier, data was stored in plain files, which had several limitations: data redundancy (duplication),
data inconsistency, data that could not be shared easily, lack of standardization, weak security,
incorrect data, and so on. A database is an organized collection of data, and a database management
system (DBMS) is the software that helps store and manage it while overcoming these limitations.
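Relation: a table
Domain: the set of values that an attribute (column) is allowed to take
Tuple: a row of a relation
Attribute: a column of a relation
Degree: the number of attributes in a relation
Cardinality: the number of tuples in a relation
Primary key: the attribute (or set of attributes) chosen to identify each tuple uniquely
Candidate keys: all attributes or minimal attribute combinations that can identify each tuple uniquely
Alternate keys: the candidate keys that were not chosen as the primary key
Foreign key: an attribute in one table that refers to the primary key of another table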
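The terms above can be tried out in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the tables class_info and student, their columns and their rows are hypothetical, invented only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

# class_id is the primary key of class_info and a foreign key in student.
conn.execute("CREATE TABLE class_info (class_id INTEGER PRIMARY KEY, section TEXT)")
conn.execute(
    "CREATE TABLE student ("
    " roll_no INTEGER PRIMARY KEY,"
    " name TEXT,"
    " class_id INTEGER REFERENCES class_info(class_id))"
)
conn.execute("INSERT INTO class_info VALUES (1, 'XII-A')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 1)")


def try_insert(sql):
    """Run an INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# Same roll_no as an existing tuple: breaks the primary key.
duplicate_key = try_insert("INSERT INTO student VALUES (101, 'Bina', 1)")
# class_id 9 does not exist in class_info: breaks the foreign key.
missing_parent = try_insert("INSERT INTO student VALUES (102, 'Chand', 9)")

cursor = conn.execute("SELECT * FROM student")
degree = len(cursor.description)      # number of attributes (columns)
cardinality = len(cursor.fetchall())  # number of tuples (rows)
print(duplicate_key, missing_parent, degree, cardinality)
```

Both extra inserts are rejected, so the student relation keeps degree 3 and cardinality 1.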
COMPUTER
SQL (Structured Query Language) is used to create, read, add to, or edit the data in a database. This is
done by writing SQL statements, and each statement begins with a command. MySQL is a free and
open-source relational database management system that uses SQL.
MySQL is fast, reliable and shareable.
A data type identifies the kind of data a column holds and the operations that can be performed on it.
Creating table
structure table
Inserting value
Values can be inserted in two ways: by giving a value for every column in table order, or by naming
the columns first and then giving the values for those columns.
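The steps so far (creating a table, viewing its structure, inserting values in two ways) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than MySQL; the student table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Creating a table: each column gets a name and a data type.
conn.execute(
    "CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name VARCHAR(20), marks INTEGER)"
)

# Structure of the table. MySQL uses DESCRIBE student; SQLite uses a PRAGMA instead.
columns = [row[1] for row in conn.execute("PRAGMA table_info(student)")]

# Way 1: give a value for every column, in table order.
conn.execute("INSERT INTO student VALUES (1, 'Asha', 82)")
# Way 2: name the columns first, then the values; columns left out become NULL.
conn.execute("INSERT INTO student (roll_no, name) VALUES (2, 'Bina')")

rows = conn.execute("SELECT * FROM student ORDER BY roll_no").fetchall()
print(columns)  # ['roll_no', 'name', 'marks']
print(rows)     # [(1, 'Asha', 82), (2, 'Bina', None)]
```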
Display table
Select <column names> from <table name> where <condition>;
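A short sketch of the Select statement, again with sqlite3 standing in for MySQL. The table, its rows and the condition marks > 80 are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Bina", 67), (3, "Chand", 91)],
)

# * selects every column; without WHERE every row is returned.
all_rows = conn.execute("SELECT * FROM student ORDER BY roll_no").fetchall()

# Only the listed columns, and only the rows that satisfy the condition.
toppers = conn.execute(
    "SELECT name, marks FROM student WHERE marks > 80 ORDER BY roll_no"
).fetchall()
print(toppers)  # [('Asha', 82), ('Chand', 91)]
```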
Add column
Modifying rows
Modifying column
Alter table <table name> modify <column name> <data type>(<new size>);
Delete rows
Delete column
Delete table
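The add, modify and delete operations listed above can be sketched as below. The sketch runs on sqlite3, so it only uses statements that SQLite shares with MySQL. The MySQL-only forms are left as comments, and the table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Bina", 67), (3, "Chand", 91)],
)

# Add column: existing rows get NULL in the new column.
conn.execute("ALTER TABLE student ADD COLUMN city TEXT")

# Modifying rows: UPDATE changes the rows that match the condition.
conn.execute("UPDATE student SET city = 'Delhi' WHERE roll_no = 1")

# Modifying column (MySQL syntax, not run here): ALTER TABLE student MODIFY name VARCHAR(30);
# Delete column (MySQL syntax, not run here):    ALTER TABLE student DROP COLUMN city;

# Delete rows: DELETE removes the rows that match the condition.
conn.execute("DELETE FROM student WHERE marks < 70")
rows = conn.execute("SELECT * FROM student ORDER BY roll_no").fetchall()
print(rows)  # [(1, 'Asha', 82, 'Delhi'), (3, 'Chand', 91, None)]

# Delete table: DROP TABLE removes the table and all its data.
conn.execute("DROP TABLE student")
remaining = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(remaining)  # []
```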
Functions
A function performs a particular task and returns zero or more values as a result. Functions are of
two types: single-row functions and aggregate functions.
Single-row functions, also known as scalar functions, are applied to a single value and return a single value.
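As a rough illustration of single-row functions, the sketch below applies UPPER, LENGTH and ROUND to every row of a hypothetical table. It runs on sqlite3; these three functions exist in both SQLite and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, marks REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 82.46), (2, "Bina", 67.0)],
)

# Each function is applied to one value at a time, so there is one result per row.
rows = conn.execute(
    "SELECT UPPER(name), LENGTH(name), ROUND(marks, 1) FROM student ORDER BY roll_no"
).fetchall()
for row in rows:
    print(row)
```

Two rows go in and two rows come out, which is what separates these functions from aggregate functions.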
Aggregate Function
Aggregate functions are also called multiple row functions. These functions work on a set of records as a
whole.
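A minimal sketch of aggregate functions, using sqlite3 and a hypothetical marks column. Each function reads the whole set of rows and returns one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Asha", 82), ("Bina", 67), ("Chand", 91)],
)

# Three rows go in, a single row of summary values comes out.
summary = conn.execute(
    "SELECT COUNT(*), SUM(marks), AVG(marks), MAX(marks), MIN(marks) FROM student"
).fetchone()
print(summary)  # (3, 240, 80.0, 91, 67)
```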
Group by
The GROUP BY clause groups together the rows that have the same value in a specified column, so that
an aggregate function can be applied to each group.
The HAVING clause specifies conditions on the groups formed by GROUP BY, just as WHERE specifies
conditions on individual rows.
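GROUP BY and HAVING can be sketched together as follows. The sketch uses sqlite3, and the section column and the marks are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (section TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("A", 82), ("A", 67), ("B", 91), ("B", 75), ("C", 58)],
)

# One output row per section, with aggregates computed inside each group.
per_section = conn.execute(
    "SELECT section, COUNT(*), AVG(marks) FROM student GROUP BY section ORDER BY section"
).fetchall()
print(per_section)  # [('A', 2, 74.5), ('B', 2, 83.0), ('C', 1, 58.0)]

# HAVING filters the groups: keep only sections with more than one student.
big_sections = conn.execute(
    "SELECT section, COUNT(*) FROM student GROUP BY section "
    "HAVING COUNT(*) > 1 ORDER BY section"
).fetchall()
print(big_sections)  # [('A', 2), ('B', 2)]
```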
Select * from <table 1>, <table 2>; gives the Cartesian product of the two tables, that is, every
combination of a row from the first table with a row from the second.
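A small sketch of the Cartesian product, with two hypothetical tables (colour and size) in sqlite3. Two rows combined with three rows give six rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE colour (colour_name TEXT)")
conn.execute("CREATE TABLE size (size_name TEXT)")
conn.executemany("INSERT INTO colour VALUES (?)", [("Red",), ("Blue",)])
conn.executemany("INSERT INTO size VALUES (?)", [("S",), ("M",), ("L",)])

# Listing two tables without a join condition pairs every row with every row.
pairs = conn.execute("SELECT * FROM colour, size").fetchall()
print(len(pairs))  # 6
```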
Ch 2 Data handling
NumPy, which stands for 'Numerical Python', is a library (package) used for numerical data analysis
and scientific computing.
Pandas (from PANel DAta) is a high-level data manipulation tool used for analysing data. Its rich set
of functions makes it very easy to import and export data.
The Matplotlib library in Python is used for plotting graphs and visualisation.
Differences between Pandas and NumPy:
1. A NumPy array requires homogeneous data, while a Pandas DataFrame can have different data
types (float, int, string, datetime, etc.) in different columns.
2. Pandas has a simpler interface for operations like file loading, plotting, selection, joining and
GROUP BY, which come in very handy in data-processing applications.
3. Pandas DataFrames (with column names) make it very easy to keep track of data.
4. Pandas is used when data is in tabular format, whereas NumPy is used for numeric array-based data
manipulation.
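The first difference can be seen in a short sketch. The values and column names are hypothetical; the point is only that NumPy settles on one data type for the whole array while each DataFrame column keeps its own.

```python
import numpy as np
import pandas as pd

# NumPy picks one data type for the whole array: the ints are converted to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Each DataFrame column has its own data type.
df = pd.DataFrame({
    "name": ["Asha", "Bina"],
    "marks": [82.0, 67.5],
    "passed": [True, True],
})
print(df.dtypes)
```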
A data structure is a collection of data values and the operations that can be applied to that data. It
enables efficient storage, retrieval and modification of the data.
Series
A Series is a one-dimensional array containing a sequence of values of any data type (int, float, list,
string, etc.), which by default has numeric data labels starting from zero. The data label associated with
a particular value is called its index.
Creation of series
(A) Creation of series from scalar values
import pandas as pd
series1 = pd.Series([10,20,30])
print(series1)
Output:
0 10
1 20
2 30
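# index labels given by the user; values and labels as shown in the output
series2 = pd.Series(["Kavi", "Shyam", "Ravi"], index=[3, 5, 1])
print(series2)
Output:
3 Kavi
5 Shyam
1 Ravi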
import numpy as np
import pandas as pd
array1 = np.array([1,2,3,4])
series3 = pd.Series(array1)
print(series3)
Output:
0 1
1 2
2 3
3 4
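# dictionary keys become the index; contents as shown in the output
dict1 = {'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
series8 = pd.Series(dict1)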
print(series8)
Output:
India NewDelhi
UK London
Japan Tokyo
seriesNum = pd.Series([10,20,30])
seriesNum[2]
Output:
30
(B) Slicing
Output:
USA WashingtonDC
UK London
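The slicing output above can be reproduced with a sketch like the following. The series name seriesCapCntry and the India and France entries are assumptions added for illustration; only the USA and UK rows appear in these notes.

```python
import pandas as pd

# Hypothetical series: only the USA and UK entries come from the output above.
seriesCapCntry = pd.Series(
    ['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
    index=['India', 'USA', 'UK', 'France'],
)

# Slicing by position: the end position (3) is excluded.
by_position = seriesCapCntry.iloc[1:3]
# Slicing by label: the end label ('UK') is included.
by_label = seriesCapCntry.loc['USA':'UK']
print(by_position)
```

Both slices return the USA and UK entries.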
Attributes of Series
Methods of Series
seriesTenTwenty=pd.Series(np.arange( 10, 20, 1 ))
print(seriesTenTwenty)
Output:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
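A few commonly used Series attributes and methods can be tried on seriesTenTwenty. This is a minimal sketch and not a full list; the label 'Numbers' is a hypothetical name chosen for the example.

```python
import numpy as np
import pandas as pd

seriesTenTwenty = pd.Series(np.arange(10, 20, 1))

# Attributes describe the series; they are written without brackets.
seriesTenTwenty.name = 'Numbers'   # hypothetical name for the series
size = seriesTenTwenty.size        # number of values
is_empty = seriesTenTwenty.empty   # True only if the series has no values
values = seriesTenTwenty.values    # the data as a NumPy array

# Methods are called with brackets.
first_two = seriesTenTwenty.head(2)    # first 2 values
last_two = seriesTenTwenty.tail(2)     # last 2 values
non_missing = seriesTenTwenty.count()  # number of non-NaN values
print(size, is_empty, non_missing)
```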
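When two series are added, values are matched on their index labels; a label that is present in only one
of the series gives NaN.
# seriesA and seriesB with the values implied by the outputs in this section
seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
seriesB = pd.Series([10, 20, -10, -50, 100], index=['z', 'y', 'a', 'c', 'e'])
seriesA + seriesB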
Output:
a -9.0
b NaN
c -47.0
d NaN
e 105.0
y NaN
z NaN
When we do not want NaN values in the output, we can use the Series method add() with the parameter
fill_value, which replaces a missing value with the specified value before adding.
seriesA.add(seriesB, fill_value=0)
Output:
a -9.0
b 2.0
c -47.0
d 4.0
e 105.0
y 20.0
z 10.0
seriesA - seriesB
Output:
a 11.0
b NaN
c 53.0
d NaN
e -95.0
y NaN
z NaN
Let us now replace the missing values with 1000 before subtracting seriesB from seriesA using explicit
subtraction method sub().
seriesA.sub(seriesB, fill_value=1000)
Output:
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
y 980.0
z 990.0
seriesA * seriesB
Output:
a -10.0
b NaN
c -150.0
d NaN
e 500.0
y NaN
z NaN
Let us now replace the missing values with 0 before multiplication of seriesB with seriesA using explicit
multiplication method mul().
seriesA.mul(seriesB, fill_value=0)
Output:
a -10.0
b 0.0
c -150.0
d 0.0
e 500.0
y 0.0
z 0.0
seriesA/seriesB
Output:
a -0.10
b NaN
c -0.06
d NaN
e 0.05
y NaN
z NaN
Dataframe
A DataFrame is a two-dimensional labelled data structure like a table of MySQL. It contains rows and
columns, and therefore has both a row and column index. Each column can have a different type of value
such as numeric, string, boolean, etc., as in tables of a database
Creation of a Dataframe
(A) Creation of Data Frame from NumPy ndarrays
import numpy as np
array1 = np.array([10,20,30])
array2 = np.array([100,200,300])
dFrame5
Output:
A B C D
0 10 20 30 NaN
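The statement that builds dFrame5 is not shown in these notes. As a minimal sketch of the idea, the two arrays defined above can be passed as a list, so that each array becomes one row. The name dFrameArrays and the column labels are assumptions; the NaN in the output above suggests the original example also included a longer array, which is not reproduced here.

```python
import numpy as np
import pandas as pd

array1 = np.array([10, 20, 30])
array2 = np.array([100, 200, 300])

# Each array in the list becomes one row; columns= supplies the column labels.
dFrameArrays = pd.DataFrame([array1, array2], columns=['A', 'B', 'C'])
print(dFrameArrays)
```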
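# each dictionary becomes one row; contents as shown in the output
listDict = [{'a': 10, 'b': 20}, {'a': 5, 'b': 10, 'c': 20}]
dFrameListDict = pd.DataFrame(listDict)
dFrameListDict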
Output:
a b c
0 10 20 NaN
1 5 10 20.0
dictForest = {'State': ['Assam', 'Delhi', 'Kerala'],
              'GArea': [78438, 1483, 38852],
              'VDF': [2797, 6.72, 1663]}
dFrameForest = pd.DataFrame(dictForest)
dFrameForest
Output:
    State  GArea      VDF
0   Assam  78438  2797.00
1   Delhi   1483     6.72
2  Kerala  38852  1663.00
dFrameForest1
Output:
dFrame6 = pd.DataFrame(seriesA)
dFrame6
Output:
   0
a  1
b  2
c  3
d  4
e  5
dFrame7
Output:
a b c d e
0 1 2 3 4 5
ResultDF = pd.DataFrame(ResultSheet)
ResultDF
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika
Maths       90     92        89    81       94
Science     91     81        91    71       95
Hindi       97     96        88    67       99
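The definition of ResultSheet is not shown in these notes. The sketch below is one way to build it, as a dictionary of Series; the student names and marks are taken from the outputs in this section, while the helper list subjects is a hypothetical name.

```python
import pandas as pd

subjects = ['Maths', 'Science', 'Hindi']  # hypothetical helper: the row labels

# One Series per student; the dictionary keys become the column labels.
ResultSheet = {
    'Arnab': pd.Series([90, 91, 97], index=subjects),
    'Ramit': pd.Series([92, 81, 96], index=subjects),
    'Samridhi': pd.Series([89, 91, 88], index=subjects),
    'Riya': pd.Series([81, 71, 67], index=subjects),
    'Mallika': pd.Series([94, 95, 99], index=subjects),
}
ResultDF = pd.DataFrame(ResultSheet)
print(ResultDF)
```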
ResultDF['Preeti']=[89,78,76]
ResultDF
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika  Preeti
Maths       90     92        89    81       94      89
Science     91     81        91    71       95      78
Hindi       97     96        88    67       99      76
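# a new row is added by assigning to a new row label with loc
ResultDF.loc['English'] = [85, 86, 83, 80, 90, 89]
ResultDF
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika  Preeti
Maths       90     92        89    81       94      89
Science     91     81        91    71       95      78
Hindi       97     96        88    67       99      76
English     85     86        83    80       90      89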
Rows and columns are deleted with the DataFrame.drop() method. To delete a row, the parameter axis is
assigned the value 0; to delete a column, it is assigned the value 1.
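A minimal sketch of drop() on a cut-down, two-student version of the result table. The names without_science and without_ramit are hypothetical.

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96]},
    index=['Maths', 'Science', 'Hindi'],
)

# axis=0: the label is looked up among the row labels.
without_science = ResultDF.drop('Science', axis=0)
# axis=1: the label is looked up among the column labels.
without_ramit = ResultDF.drop('Ramit', axis=1)

print(without_science)
print(without_ramit)
```

drop() returns a new DataFrame and leaves ResultDF unchanged, so the result has to be assigned to a name to be kept.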
Output:
Sub1 90 92 89 81 94
Sub2 91 81 91 71 95
Sub3 97 96 88 67 99
Sub4 97 89 78 60 45
The parameter axis='index' is used to specify that the row label is to be changed.
ResultDF = ResultDF.rename({'Arnab': 'Student1', 'Ramit': 'Student2',
                            'Samridhi': 'Student3', 'Mallika': 'Student4'}, axis='columns')
Output:
Maths 90 92 89 81 94
Science 91 81 91 71 95
English 97 96 88 67 99
Hindi 97 89 78 60 45
ResultDF
Maths 90 92 89 81 94
Science 91 81 91 71 95
Hindi 97 96 88 67 99
ResultDF.loc['Science']
Output:
Arnab 91
Ramit 81
Samridhi 91
Riya 71
Mallika 95
dFrame10Multiples = pd.DataFrame([10,20,30,40,50])
dFrame10Multiples.loc[2]
Output:
0 30
ResultDF.loc[:,'Arnab']
Output:
Maths 90
Science 91
Hindi 97
ResultDF.loc['Maths'] > 90
Output:
Arnab False
Ramit True
Samridhi False
Riya False
Mallika True
ResultDF.loc[:,'Arnab'] > 90
Output:
Maths False
Science True
Hindi True
Output:
Maths 90 92 89 81 94
Science 91 81 91 71 95
Output:
Maths 90
Science 91
Output:
Maths 90 92 89
Science 91 81 91
Output:
Arnab Samridhi
Maths 90 89
Science 91 91
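The statements that produced the four outputs above are not shown in these notes. The sketch below is one way to get the same results with loc slicing; the variable names on the left are hypothetical.

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96], 'Samridhi': [89, 91, 88],
     'Riya': [81, 71, 67], 'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'],
)

# Label slices include both end points, unlike position slices.
rows = ResultDF.loc['Maths':'Science']                            # two rows, all columns
one_col = ResultDF.loc['Maths':'Science', 'Arnab']                # two rows, one column
col_range = ResultDF.loc['Maths':'Science', 'Arnab':'Samridhi']   # a range of columns
col_list = ResultDF.loc['Maths':'Science', ['Arnab', 'Samridhi']] # a list of columns
print(col_list)
```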
Importing and Exporting Data between CSV Files and Data Frames
Importing a CSV file to a Data frame
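A rough sketch of importing and exporting with read_csv() and to_csv(). To keep the example self-contained, the CSV text is held in a string instead of a file on disk; the column names, the data and the file name in the comment are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file; with a real file, pass its path instead,
# for example pd.read_csv("marks.csv") (hypothetical file name).
csv_text = "RollNo,Name,Marks\n1,Asha,82\n2,Bina,67\n"

# Importing: the first line of the CSV becomes the column labels.
marks = pd.read_csv(io.StringIO(csv_text))
print(marks)

# Exporting: with no path, to_csv() returns the CSV text as a string.
# index=False leaves the row labels out of the exported data.
exported = marks.to_csv(index=False)
print(exported)
```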