DA Unit 4
Data Analytics Unit-4
UNIT-4: Exploratory Data Analysis (10 Periods)
Data Visualization, Reading and getting data (External Data): Using CSV
files, XML files, Web Data, JSON files, Databases, Excel files. Charts
and Graphs: Histograms, Boxplots, Bar Charts, Line Graphs, Scatter
plots, Pie Charts
Flow of Presentation
• Data Visualization
• Reading and getting data (External Data): CSV files, XML files,
Web Data, JSON files, Databases, Excel files
• Charts and Graphs: Histograms, Boxplots, Bar Charts, Line Graphs,
Scatter plots, Pie Charts
What Are Data Visualization Tools?
• Widely used data visualization tools include Google Charts,
Tableau, Grafana, Chartist, FusionCharts, Datawrapper, Infogram, and
ChartBlocks. A good tool supports a variety of visual styles, is
simple to use, and can handle a large volume of data.
• A data visualization tool is software used to present data
graphically. Features vary from tool to tool, but at their most basic
they let you load a dataset and manipulate how it is displayed. Most,
but not all, come with pre-built templates for creating simple
visualizations.
Reading and getting data
• Using CSV files
A CSV (Comma Separated Values) file is a plain-text file that stores
tabular data. It is a delimited text file that uses commas to
separate values. Every row in the file is a data record, and each
record consists of one or more fields separated by commas. It is one
of the most widely used formats for importing and exporting
spreadsheet and database data.
Reading a CSV File
• A CSV file can be read in several ways, for example with Python's
built-in csv module or with the pandas library.
Using csv.reader():
import csv
# opening the CSV file (the csv module docs recommend newline='')
with open('Giants.csv', mode='r', newline='') as file:
    # reading the CSV file
    csvFile = csv.reader(file)
    # displaying the contents of the CSV file, one list per row
    for lines in csvFile:
        print(lines)
Using pandas.read_csv() method:
import pandas
# reading the CSV file
csvFile = pandas.read_csv('Giants.csv')
# displaying the contents of the CSV file
print(csvFile)
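To see how records and fields map onto Python objects without needing Giants.csv, the sketch below parses a small CSV text held in memory. The column names and values are made up for illustration; csv.reader yields each record as a list of strings.

```python
import csv
import io

# Hypothetical CSV text standing in for a file on disk (made-up data)
csv_text = "name,position,weight\nAlice,QB,210\nBob,WR,185\n"

# io.StringIO lets csv.reader treat the string like an open file
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # the first record holds the column names
records = list(reader)  # remaining records, one list of strings per row

print(header)
for record in records:
    # pair each field with its column name
    print(dict(zip(header, record)))
```

Every field arrives as a string, so numeric columns such as weight need an explicit conversion (for example int(record[2])) before any arithmetic.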
Reading and Writing XML Files in Python
• Extensible Markup Language, commonly known as XML, is a markup
language designed to be readable by both humans and computers. It
defines a set of rules for encoding a document in a structured text
format. The following slides describe methods to read and write XML
files in Python.
• Note: Reading the data from an XML file and analyzing its logical
components is known as parsing. Therefore, when we refer to reading
an XML file, we mean parsing the XML document.
• Two libraries that can be used for XML parsing are:
• BeautifulSoup used alongside the lxml parser
• ElementTree library (xml.etree.ElementTree in the standard library)
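As a minimal sketch of the second option, the block below parses an XML string with the standard-library xml.etree.ElementTree module. The document structure (a family root with child and unique elements, and the name and test attributes) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document (structure assumed for illustration)
xml_text = """<family>
    <child name="Frank" test="0">Frank likes everyone</child>
    <child name="Acer" test="1">Acer likes reading</child>
    <unique>Only one of these</unique>
</family>"""

# fromstring() parses XML held in a string; ET.parse() does the same for a file
root = ET.fromstring(xml_text)

# findall() returns every direct child element with the given tag
for child in root.findall("child"):
    print(child.get("name"), child.get("test"), child.text)

# A limited XPath predicate selects an element by attribute value
frank = root.find("child[@name='Frank']")
print(frank.get("test"))
```

ElementTree needs no installation, which makes it a convenient choice when the XML is well-formed and only simple lookups are required.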
Using BeautifulSoup alongside the lxml parser
• Beautiful Soup supports the HTML parser included in Python’s
standard library, and it also supports a number of third-party Python
parsers. One of them is the lxml parser (used for parsing XML/HTML
documents). Install both packages with pip:
pip install beautifulsoup4
pip install lxml
from bs4 import BeautifulSoup
# Reading the data inside the XML file
# into a variable named data
with open('dict.xml', 'r') as f:
    data = f.read()
# Passing the stored data to the BeautifulSoup
# parser and storing the returned object
Bs_data = BeautifulSoup(data, "xml")
# Finding all instances of the tag `unique`
b_unique = Bs_data.find_all('unique')
print(b_unique)
# Using find() to get the first `child` tag
# whose `name` attribute is 'Frank'
b_name = Bs_data.find('child', {'name':'Frank'})
print(b_name)
# Extracting the value stored in the
# `test` attribute of that `child` tag
value = b_name.get('test')
print(value)
Writing an XML File
• Writing an XML file is a straightforward process because XML files
are plain text with no special encoding. Modifying part of an XML
document requires parsing it first. The code below modifies some
sections of the XML document read earlier.
from bs4 import BeautifulSoup
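A minimal sketch of the parse, modify, write cycle described above. It uses the standard-library ElementTree (one of the two libraries listed earlier) rather than BeautifulSoup, and the XML content, the added element, and the output file name are all assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document to modify (structure assumed for illustration)
root = ET.fromstring('<family><child name="Frank" test="0">Frank</child></family>')

# Change an attribute on an existing element
frank = root.find("child[@name='Frank']")
frank.set("test", "updated")

# Append a new element with its own attributes and text
new_child = ET.SubElement(root, "child", {"name": "Grace", "test": "1"})
new_child.text = "Grace"

# Write the modified tree to a temporary file, then parse it back
out_path = os.path.join(tempfile.mkdtemp(), "modified.xml")
ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

reloaded = ET.parse(out_path).getroot()
print([(c.get("name"), c.get("test")) for c in reloaded.findall("child")])
```

Parsing the written file back is a quick way to confirm that the output is still well-formed XML.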
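Reading and getting Web Data
# import the requests library
import requests
image_url = "https://www.python.org/static/community_logos/python-logo-master-v3-TM.png"
# send an HTTP GET request; the response body is available as r.content
r = requests.get(image_url)
# save the bytes to a local file (file name assumed for illustration)
with open("python_logo.png", "wb") as f:
    f.write(r.content)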
Download large files
• The HTTP response content (r.content) is a bytes object that holds
the entire file in memory. For large files it is impractical to load
all the data into memory at once. To overcome this problem, we make
some changes to our program:
• Since the whole file should not be held in memory at once, we use
the r.iter_content method to load the data in chunks, specifying the
chunk size.
r = requests.get(URL, stream = True)
Setting the stream parameter to True causes only the response headers
to be downloaded while the connection remains open. This avoids
reading the content into memory all at once for large responses. A
chunk of the given size is loaded each time r.iter_content is
iterated.
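import requests
file_url = "http://codex.cs.yale.edu/avi/db-book/db4/slide-dir/ch1-2.pdf"
r = requests.get(file_url, stream=True)
with open("python.pdf", "wb") as pdf:  # output file name assumed
    for chunk in r.iter_content(chunk_size=1024):
        pdf.write(chunk)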
Reading and getting JSON files
• JSON (JavaScript Object Notation) is a popular data format used for
representing structured data. It's common to transmit and receive
data between a server and web application in JSON format.
Import json Module
To work with JSON (a string, or a file containing a JSON object), you can use Python's json module. You need to import the module before you can use it.
import json
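For the string case, a minimal sketch: json.loads() parses a JSON string into Python objects and json.dumps() converts them back. The sample values are the ones used for the JSON file example in this unit.

```python
import json

# A JSON document held in a Python string
person_json = '{"name": "Bob", "languages": ["English", "French"]}'

# json.loads() parses a JSON string into Python objects (dict, list, str, ...)
person = json.loads(person_json)
print(person["name"])
print(person["languages"])

# json.dumps() goes the other way: Python object to JSON string
pretty = json.dumps(person, indent=2)
print(pretty)
```

A JSON object becomes a Python dict and a JSON array becomes a list, so the parsed result can be indexed like any other Python data structure.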
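Python read JSON file
Suppose a file named person.json (name assumed for illustration) contains the following JSON object:
{"name": "Bob",
"languages": ["English", "French"]
}
import json
# parse the file contents into a Python dictionary
with open('person.json') as f:
    data = json.load(f)
print(data)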
Writing JSON to a file
import json
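Writing works in the opposite direction: json.dump() serializes a Python object into an open file. The sketch below writes a dictionary to a temporary file and reads it back; the file name and location are assumptions chosen so that the example leaves nothing behind in the working directory.

```python
import json
import os
import tempfile

# Python dictionary to store (sample values reused from this unit)
person = {"name": "Bob", "languages": ["English", "French"]}

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "person.json")

# json.dump() writes the object to the file as JSON text
with open(path, "w") as f:
    json.dump(person, f, indent=4)

# Reading the file back should give an equal dictionary
with open(path) as f:
    restored = json.load(f)
print(restored)
```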
# Python program to read a JSON file
import json
# Opening the JSON file; the with block closes it automatically
with open('data.json') as f:
    # Parsing the file contents into a Python object
    data = json.load(f)
print(data)
Reading and getting databases
How to Connect to a SQL Database using Python
• Python has several libraries for connecting to SQL databases,
including pymysql, psycopg2, and sqlite3.
• First, we need to install the pymysql library using pip:
pip install pymysql
• Next, import the pymysql library and connect to the MySQL database
using the following code:
import pymysql
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    db='mydatabase',
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)
In the code above, we first import the pymysql library. Then, we use the connect() function to establish a connection to
the MySQL database.
How to Insert Data into a SQL Database using Python
try:
    with conn.cursor() as cursor:
        # Create a new record
        sql = "INSERT INTO `users` (`email`, `password`) VALUES (%s, %s)"
        cursor.execute(sql, ('user@example.com', 'mypassword'))
    # Commit changes
    conn.commit()
    print("Record inserted successfully")
finally:
    conn.close()
How to Update Data in a SQL Database using Python
try:
    with conn.cursor() as cursor:
        # Update a record
        sql = "UPDATE `users` SET `password`=%s WHERE `email`=%s"
        cursor.execute(sql, ('newpassword', 'user@example.com'))
    # Commit changes
    conn.commit()
    print("Record updated successfully")
finally:
    conn.close()
How to Delete Data from a SQL Database using Python
try:
    with conn.cursor() as cursor:
        # Delete a record
        sql = "DELETE FROM `users` WHERE `email`=%s"
        cursor.execute(sql, ('user@example.com',))
    # Commit changes
    conn.commit()
    print("Record deleted successfully")
finally:
    conn.close()
How to Read Data from a SQL Database using Python
try:
    with conn.cursor() as cursor:
        # Read data from database
        sql = "SELECT * FROM `users`"
        cursor.execute(sql)
        # Fetch all rows
        rows = cursor.fetchall()
        # Print results
        for row in rows:
            print(row)
finally:
    conn.close()
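The same insert, update, read, and delete cycle can be tried without a MySQL server by using sqlite3, one of the libraries listed earlier, with an in-memory database. This is a sketch under assumptions: the users table definition and the sample values are made up for illustration, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# In-memory database: nothing is written to disk and no server is needed
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()
    # Hypothetical table definition (assumed for illustration)
    cur.execute("CREATE TABLE users (email TEXT PRIMARY KEY, password TEXT)")

    # Insert a record
    cur.execute("INSERT INTO users (email, password) VALUES (?, ?)",
                ("user@example.com", "mypassword"))
    conn.commit()

    # Update the record
    cur.execute("UPDATE users SET password = ? WHERE email = ?",
                ("newpassword", "user@example.com"))
    conn.commit()

    # Read all rows
    cur.execute("SELECT email, password FROM users")
    rows_after_update = cur.fetchall()
    print(rows_after_update)

    # Delete the record and count what is left
    cur.execute("DELETE FROM users WHERE email = ?", ("user@example.com",))
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM users")
    remaining = cur.fetchone()[0]
    print(remaining)
finally:
    conn.close()
```

Here a single connection is kept open for all four operations and closed once at the end, since a closed connection cannot be reused for further queries.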
Reading and getting data from excel files
• An Excel spreadsheet document is called a workbook and is saved in
a file with the .xlsx extension. The first row of a sheet is usually
reserved for the header, while the first column identifies the
sampling unit. Each workbook can contain multiple sheets, also called
worksheets. The box at a particular column and row is called a cell,
and each cell can hold a number or a text value. The grid of cells
with data forms a sheet.
• The active sheet is the sheet the user is currently viewing, or the
one last viewed before closing Excel.
Reading from an Excel file
First, install a module that can read Excel files. pandas reads .xlsx
files through openpyxl; xlrd (version 2.0 and later) reads only the
older .xls format.
• pip install openpyxl
• pip install xlrd
Example - reading with pandas
# import pandas lib as pd
import pandas as pd
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel('Book2.xlsx')
print(dataframe1)
Example - reading with openpyxl
import openpyxl
# Define variable to load the workbook
dataframe = openpyxl.load_workbook("Book2.xlsx")
# Define variable to read the active sheet
dataframe1 = dataframe.active
# Iterate the loop to read the cell values
for row in range(0, dataframe1.max_row):
    for col in dataframe1.iter_cols(1, dataframe1.max_column):
        print(col[row].value)
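To experiment with workbooks, sheets, and cells without an existing Book2.xlsx, the sketch below builds a small workbook with openpyxl, saves it to a temporary file, and reads it back. It assumes openpyxl is installed; the file name, sheet title, and cell values are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Build a small workbook in memory (made-up data)
wb = Workbook()
ws = wb.active               # a new workbook starts with one active sheet
ws.title = "Sheet1"
ws.append(["id", "score"])   # header row
ws.append([1, 87])
ws.append([2, 92])

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb.save(path)

# Load the file back and read the cells row by row
sheet = load_workbook(path).active
table = [list(row) for row in sheet.iter_rows(values_only=True)]
print(sheet.title, sheet.max_row, sheet.max_column)
print(table)
print(sheet["B2"].value)     # one cell, addressed by column letter and row number
```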
Reading an Excel file in Python using xlwings
import xlwings as xw
# Specifying a sheet
ws = xw.Book("Book2.xlsx").sheets['Sheet1']
Histogram
• A histogram represents numerical data that has been grouped into
ranges. It shows the distribution of the data graphically. It is a
type of bar plot in which the X-axis represents the bin ranges and
the Y-axis gives the frequency of values in each bin.
Creating a Histogram
• To create a histogram, first divide the whole range of values into
a series of intervals (bins), then count the values that fall into
each interval. Bins are consecutive, non-overlapping intervals of a
variable. The matplotlib.pyplot.hist() function computes and draws
the histogram of x.
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
a = np.array([22, 87, 5, 43, 56,
73, 55, 54, 11,
20, 51, 5, 79, 31,
27])
# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])
# Show plot
plt.show()
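The counting step behind the plot can be checked separately. The sketch below uses numpy.histogram, which performs the binning without drawing anything, on the same dataset and bin edges.

```python
import numpy as np

# Same dataset and bin edges as in the plotting example
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
bins = [0, 25, 50, 75, 100]

# counts[i] is the number of values in the interval edges[i] to edges[i + 1]
counts, edges = np.histogram(a, bins=bins)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:>3} to {high:<3}: {n}")
```

For this data the four bins hold 5, 3, 5, and 2 values, which together account for all 15 observations. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.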
Box Plot
• A Box Plot, also known as a box-and-whisker plot, displays a
summary of a set of data values: minimum, first quartile, median,
third quartile, and maximum. A box is drawn from the first quartile
to the third quartile, and a line through the box marks the median.
Whiskers extend from the box toward the minimum and maximum. When a
single data set is plotted vertically, the y-axis shows the scale of
the data values and the x-axis identifies the data set being plotted.
Creating Box Plot
• The matplotlib.pyplot module of matplotlib library provides boxplot()
function with the help of which we can create box plots.
Syntax:
• matplotlib.pyplot.boxplot(data, notch=None, vert=None,
patch_artist=None, widths=None)
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
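The quantities a box plot summarizes can also be computed directly. The sketch below regenerates the same random data and derives the quartiles with numpy.percentile. It also applies the 1.5 × IQR rule that Matplotlib uses by default to decide how far the whiskers reach; points beyond that range are drawn individually as outliers.

```python
import numpy as np

# Same data as in the plotting example
np.random.seed(10)
data = np.random.normal(100, 20, 200)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Default whisker limits: the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"min={data.min():.1f} Q1={q1:.1f} median={median:.1f} "
      f"Q3={q3:.1f} max={data.max():.1f}")
print(f"whiskers: {inside.min():.1f} to {inside.max():.1f}, "
      f"outliers: {outliers.size}")
```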
Bar Plot
• A bar plot or bar chart is a graph that represents categories of
data with rectangular bars whose lengths (or heights) are
proportional to the values they represent. Bar plots can be drawn
horizontally or vertically. A bar chart shows comparisons between
discrete categories. One axis of the plot represents the specific
categories being compared, while the other axis represents the
measured values corresponding to those categories.
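import matplotlib.pyplot as plt
# Illustrative data (assumed): course name -> number of students
data = {'C': 20, 'C++': 15, 'Java': 30, 'Python': 35}
plt.bar(list(data.keys()), list(data.values()))
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()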
Bar plot variants: multiple (grouped) bar plots, stacked bar plot, horizontal bar plot
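A minimal sketch of the three variants named above, drawn side by side. The enrolment numbers, the two "years", and the output file name are made up for illustration. The Agg backend is selected only so that the sketch runs without a display; omit that line in a notebook or desktop session.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Made-up enrolment counts for two hypothetical years
courses = ["C", "C++", "Java", "Python"]
year1 = np.array([20, 15, 30, 35])
year2 = np.array([25, 18, 28, 40])
x = np.arange(len(courses))
width = 0.4

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

# Multiple (grouped) bars: shift each series by half a bar width
ax1.bar(x - width / 2, year1, width, label="Year 1")
ax1.bar(x + width / 2, year2, width, label="Year 2")
ax1.set_xticks(x)
ax1.set_xticklabels(courses)
ax1.set_title("Multiple bar plot")
ax1.legend()

# Stacked bars: the second series starts where the first one ends
ax2.bar(courses, year1, label="Year 1")
stacked = ax2.bar(courses, year2, bottom=year1, label="Year 2")
ax2.set_title("Stacked bar plot")
ax2.legend()

# Horizontal bars: barh() swaps the roles of the two axes
horizontal = ax3.barh(courses, year1)
ax3.set_title("Horizontal bar plot")

fig.tight_layout()
fig.savefig("bar_variants.png")  # hypothetical output file name
```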
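Line chart
# %matplotlib inline   (Jupyter notebooks only; not valid in a plain script)
import matplotlib.pyplot as plt
import numpy as np
# this style was named 'seaborn-whitegrid' before Matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid')
# illustrative curve (assumed): sin(x) sampled at 1000 points
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x))
plt.show()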
Scatter plots
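import matplotlib.pyplot as plt
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
# illustrative y values (assumed), one for each x value
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]
plt.scatter(x, y, c="blue")
plt.show()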
Pie Chart
• A Pie Chart is a circular statistical plot that can display only one series of
data. The whole circle represents 100% of the data, and the area of each slice
represents that part's percentage of the whole. The slices of a pie are called
wedges. The area of a wedge is determined by the length of its arc, which is
proportional to the value it represents. Pie charts are commonly used in
business presentations (sales, operations, survey results, resources, etc.)
because they provide a quick summary.
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
cars = ['AUDI', 'BMW', 'FORD',
        'TESLA', 'JAGUAR', 'MERCEDES']
# illustrative values (assumed), one for each label
data = [23, 17, 35, 29, 12, 41]
# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(data, labels = cars)
# show plot
plt.show()
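To connect the wedges to percentages, the sketch below labels each wedge with its share using the autopct parameter and prints the angle each wedge spans. The data values and the output file name are made up for illustration, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is required
import matplotlib.pyplot as plt

cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]   # made-up counts, one per label

fig, ax = plt.subplots(figsize=(10, 7))
# autopct formats each wedge's share of the total as a percentage label
wedges, texts, autotexts = ax.pie(data, labels=cars, autopct='%1.1f%%')

total = sum(data)
for car, value, wedge in zip(cars, data, wedges):
    share = 100 * value / total            # percentage of the whole
    angle = wedge.theta2 - wedge.theta1    # degrees spanned by the wedge
    print(f"{car:<9}{share:5.1f}%  {angle:6.1f} degrees")

fig.savefig("pie_chart.png")  # hypothetical output file name
```

Each wedge's angle is its share of 360 degrees, so the angles always add up to a full circle regardless of the units of the input values.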