PythonDASE_2025 Version1 (1)

The document discusses the integration of data analytics and machine learning into software engineering, highlighting their benefits such as improved decision-making, optimized efficiency, and enhanced user experience. It also covers essential Python libraries for data analytics, including NumPy, Pandas, and Matplotlib, along with practical examples of data visualization techniques. Additionally, it introduces social network analysis using the NetworkX library, emphasizing its applications in understanding relationships and structures within data.

DATA ANALYTICS WITH PYTHON

HTTPS://MORIOH.COM/P/0BC57432AB32

Mark Twain said that the secret of getting ahead is getting started.
Programming can seem daunting for beginners, but the best way to
get started is to dive right in and start writing code.
DATA ANALYTICS IN SE

• The integration of data analytics into software engineering practices is paving the way for the future of this field.
• Data analytics refers to the process of examining large sets of data to uncover patterns, correlations, and insights that can be used to make data-driven decisions. Incorporating data analytics into software engineering processes provides numerous benefits.
• Improved decision-making: By analyzing data, software engineers can make more informed decisions based on factual evidence
rather than relying on intuition alone. This leads to better software development strategies and more accurate predictions of user
needs.
• Optimized efficiency: Data analytics helps identify bottlenecks and inefficiencies in the software development lifecycle, allowing
engineers to optimize processes, reduce errors, and deliver high-quality software in a shorter time frame.
• Enhanced user experience: By analyzing user data, software engineers can gain insights into user behavior, preferences, and pain
points. This information can be used to improve user interfaces, personalize experiences, and create tailored solutions.
• Early detection of issues: Data analytics enables software engineers to proactively identify and address potential issues before they
escalate. By monitoring and analyzing software performance metrics, engineers can spot anomalies and implement preventive
measures.
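The "early detection" point above can be sketched in a few lines of NumPy. The response-time figures and the two-standard-deviation threshold below are illustrative assumptions, not a production monitoring rule.

```python
import numpy as np

# Hypothetical response-time metrics in milliseconds; the 900 ms spike
# plays the role of a performance anomaly.
response_times = np.array([120, 115, 130, 125, 118, 122, 900, 119, 121, 127])

# Flag anything more than two standard deviations above the mean.
mean = response_times.mean()
std = response_times.std()
anomalies = response_times[response_times > mean + 2 * std]
```

A real pipeline would typically use a rolling baseline or a robust statistic such as the median absolute deviation, since a single large outlier inflates both the mean and the standard deviation.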
DA AND ML

• One of the key drivers behind the future of software engineering is machine learning. Machine
learning algorithms can analyze vast amounts of data, recognize patterns, and make accurate
predictions or decisions based on this information. Incorporating machine learning into software
engineering processes brings numerous benefits.
• Automated testing: Machine learning algorithms can be trained to analyze code and
automatically detect bugs, reducing the need for manual testing and increasing efficiency.
• Intelligent debugging: By leveraging machine learning, software engineers can build models that identify recurring error patterns and suggest potential fixes, making the debugging process faster and more effective.
• Code optimization: Machine learning algorithms can analyze code repositories and identify
patterns of efficient code, helping software engineers optimize their codebase for improved
performance.
• Personalized software: Machine learning can analyze user data and preferences to create
personalized software experiences, tailored to individual needs.
DA AND SOFTWARE DEV

• What is Data Analytics in Software Development?


• Data analytics in software development involves the collection, analysis, and interpretation of data generated
during the development process. This data can come from various sources, such as user feedback, error logs,
performance metrics, and customer usage patterns. By leveraging data analytics, developers can identify trends,
detect potential issues, and optimize their software development processes.
• Leverage User Feedback: Collect and analyze feedback from users to identify common issues and areas for
improvement.
• Monitor Error Logs: Continuously monitor error logs to detect and resolve software defects, ensuring better
software quality.
• Utilize Performance Metrics: Track performance metrics, such as response times and throughput, to identify
bottlenecks and optimize software performance.
• Explore Customer Usage Patterns: Analyze customer usage patterns to understand how users interact with
the software and gain insights for user-centric design and development.
• Implement Continuous Integration and Delivery: Automate software testing and deployment processes
to ensure quicker feedback loops and faster software delivery.
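As a minimal sketch of the "monitor error logs" idea above, here is how pandas can summarize which modules generate the most defects. The log entries, module names, and error labels are made up for illustration.

```python
import pandas as pd

# Hypothetical parsed error-log entries; in practice these would be
# extracted from real log files.
log = pd.DataFrame({
    "module": ["auth", "payments", "auth", "search", "auth", "payments"],
    "error":  ["Timeout", "NullRef", "Timeout", "Timeout", "BadToken", "NullRef"],
})

# Count errors per module to see where defects cluster.
errors_by_module = log["module"].value_counts()
worst_module = errors_by_module.idxmax()
```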
PYTHON LIBRARIES AND PACKAGES FOR DATA SCIENTISTS
(THE 5 MOST IMPORTANT ONES)

• Numpy
• NumPy is a Python package; it stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.
• Pandas - Python's popular data analysis library, pandas, provides several different options for visualizing your data with .plot(). Even if you're at the beginning of your pandas journey, you'll soon be creating basic plots that will yield valuable insights into your data.
• Pandas arranges data into a 2-D table similar to a database table; however, the Pandas library calls it a DataFrame (df as a common abbreviation - to plot, df.plot()). It was created for data analysis, data cleaning, data handling and data discovery. This is one of the most common libraries used for data analytics and data visualisation, as well as for machine learning algorithms; the open-source community intended pandas to be the most powerful library for data analytics.
• import pandas as pd
• import pandas as pd
• Matplotlib
• This is another popular data visualisation library. Data visualization helps you to better understand your data, discover things that you wouldn't spot in raw format, and communicate your findings more effectively to others. Matplotlib is the most famous and commonly used plotting library in Python.
• import matplotlib.pyplot as plt
USING MATPLOTLIB

• A Quick plot
• from matplotlib import pyplot as plt
• x=[1,2,3,4]
• y=[1,4,9,16]
• plt.plot(x,y)
• plt.show() # needed in VS Code / script mode; optional in notebooks
DOING THE SAME THING WITH PANDAS

• import pandas as pd
• import matplotlib.pyplot as plt
• x=[1,2,3,4]
• y=[1,4,9,16]
• df=pd.DataFrame(y, index=x)
• df.plot()
• plt.show() # needed in VS Code / script mode
So what happened?

df.plot() draws each column against the DataFrame's index, which pandas uses as the x-axis. Passing x as the index therefore reproduces the Matplotlib plot of y against x; if no index is supplied, pandas falls back to an automatic 0, 1, 2, … index.
OR – YOU CAN GIVE NAMES TO THE
DATA SERIES
• import matplotlib.pyplot as plt
• import pandas as pd
• x1=[1,2,3,4]
• y1=[1,4,9,16]
• df=pd.DataFrame({'x_series':x1,'y_series':y1})
• df.plot(kind='line', x='x_series', y='y_series', color='red')
• plt.show()
MATPLOTLIB

• A graph or chart is simply a visual representation of numeric data. Matplotlib makes a large number of graph and chart types available to you.
SIMPLE LINES WITH MATPLOTLIB

• import matplotlib.pyplot as plt


• x1=[1,2,3,4]
• y1=[1,4,9,16]
• x2 = [1,2,3,4]
• y2 = [4,7,10,13]
• # plotting the line 2 points
• plt.plot(x1, y1, label = "line 1")
• plt.plot(x2, y2, label = "line 2")
• plt.legend()
• plt.show()
ON GOOGLE COLABS

• https://colab.research.google.com/notebooks/welcome.ipynb#scrollTo=JKYK1Qh-b66n
PLOTTING OPTIONS

• import matplotlib.pyplot as plt
• # x axis values
• x = [1,2,3,4,5,6]
• # corresponding y axis values
• y = [2,4,1,5,2,6]
• # setting x and y axis range
• plt.ylim(1,8)
• plt.xlim(1,8)
• # plotting the points
• plt.plot(x, y, color='green', linestyle='dashed', linewidth = 3,
• marker='o', markerfacecolor='blue', markersize=12)
• plt.show()
BAR PLOT

• x = [1, 2, 3, 4, 5]
• y = [10, 24, 36, 40, 5]
• plt.xlabel('Entries')
• plt.ylabel('Values')
• plt.bar(x, y, width = 0.8, color = ['red', 'green', 'blue']) # the three colors repeat across the five bars
PIE PLOT

The colors parameter lets you choose custom colors for each pie wedge. You use the labels parameter to identify each wedge. In many cases, you need to make one wedge stand out from the others, so you add the explode parameter with a list of explode values. A value of 0 keeps the wedge in place; any other value moves the wedge out from the center of the pie.

• import matplotlib.pyplot as plt
• values = [5, 8, 9, 10, 4, 7]
• colors = ['b', 'g', 'r', 'c', 'm', 'y']
• labels = ['A', 'B', 'C', 'D', 'E', 'F']
• explode = (0, 0.2, 0, 0, 0, 0)
• plt.pie(values, colors=colors, labels=labels,
• explode=explode, autopct='%1.1f%%',
• counterclock=False, shadow=True)
• plt.title('Values')
• plt.show()
HISTOGRAMS

• Histograms categorize data by breaking it into bins, where each bin contains a subset of the data range.
• A histogram then displays the number of items in each bin so that you can see the distribution of data and the progression of data from bin to bin.
• In most cases, you see a curve of some type, such as a bell curve.
• The problem with doing a demo for histograms is the lack of data, so you need to generate random data; the example that follows shows how to create a histogram with randomized data.
NUMPY

• NumPy offers the random module to work with random numbers
• Generate a random integer from 0 up to (but excluding) 100:
• import numpy as np
• x = np.random.randint(100)
• print(x)
• In NumPy we work with arrays; you can use the same randint method with a size argument to make random arrays
• Generate a 1-D array containing 5 random integers from 0 up to 100:
• import numpy as np
• x = np.random.randint(100, size=5)
• print(x)
RANDN

• The np.random.randn() is a NumPy library method that returns a sample (or samples) from the "standard normal" distribution.
• randn assumes 0 as the central point
• All data generated have a greater probability of being close to the central point (0)
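The claims above can be checked empirically. This sketch uses NumPy's newer Generator API (np.random.default_rng) rather than the legacy randn shown on the slides; both draw from the standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is reproducible
sample = rng.standard_normal(100_000)

sample_mean = sample.mean()             # close to the central point, 0
within_one_sd = (np.abs(sample) < 1).mean()  # roughly 68% of draws
```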
A HISTOGRAM WITH STANDARD DATA

• import numpy as np
• import matplotlib.pyplot as plt
• x = 20*np.random.randn(10000)
• plt.hist(x, 25, range=(-60, 60), histtype='stepfilled',
• align='mid', color='g', label='Test Data')
• plt.legend()
• plt.title('Step Filled Histogram')
• plt.show()
DATA MINING - NETWORK
RELATIONSHIPS
• A Network or Graph is a special representation of entities which have relationships among
themselves.
• It is made up of a collection of two generic objects: (1) node, which represents an entity, and (2) edge, which represents the connection between any two nodes. In a complex network, we also have attributes or features associated with each node and edge.
• For example, a person represented as a node may have attributes like age, gender, salary, etc. Similarly, an edge between two persons which represents a 'friend' connection may have attributes like friends_since, last_meeting, etc.
• Because of this complex nature, it becomes imperative that we present a network intuitively, such
that it showcases as much information as possible.
• To do so we first need to get acquainted with the different available tools, and that’s the topic of
this article i.e. to go through the different options which help us visualize a network
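The node and edge attributes described above map directly onto NetworkX keyword arguments; the people and values here are invented for illustration.

```python
import networkx as nx

G = nx.Graph()
# Node attributes describe the entity itself.
G.add_node("Asha", age=29, gender="F")
G.add_node("Ben", age=34, gender="M")
# Edge attributes describe the relationship between the two nodes.
G.add_edge("Asha", "Ben", relation="friend", friends_since=2015)
```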
A GRAPH WITH NODES

• import matplotlib.pyplot as plt


• import networkx as nx
• G=nx.Graph()
• G.add_node("Sanjay")
• G.add_node("Deepak")
• G.add_node("Mpho")
• G.add_node("Sue")
• nx.draw(G, with_labels=True)
• plt.show()
NODES AND EDGES

• Nodes represent the individuals in a network, while edges constitute the relationships between the individuals.
NODES AND EDGES

• Networks are described by two sets of items, which form a "network".
• Nodes
• Edges
• In mathematical terms, this is a graph.
• Edges can be added separately or as a list

import matplotlib.pyplot as plt
import networkx as nx
G=nx.Graph()
G.add_node("Sanjay")
G.add_node("Deepak")
G.add_node("Mpho")
G.add_node("Sue")
G.add_edge("Sanjay", "Deepak")
G.add_edge("Sanjay", "Mpho")
G.add_edge("Deepak", "Sue")
G.add_edge("Mpho", "Sue")
G.add_edge("Sanjay", "Sue")
G.add_edge("Deepak", "Mpho")
nx.draw(G, with_labels=True, node_color="red", node_size=2000)
plt.show()
SOCIAL NETWORK ANALYSIS

• Link analysis refers to the process of analyzing the links (or relationships) between
any kind of entity. This can include analyzing links between web pages, emails, financial
transactions, or any other data type where relationships between entities are
relevant.
• Social network analysis is a specific kind of link analysis that focuses exclusively on
people and groups and their relationships with each other.
• One aspect that makes SNA such a powerful tool is the ability to visualize these
relationships in a graph, using nodes to represent individuals and edges to represent
the connections between them.
• Visualizing individuals and relationships allows us to more easily intuit the dynamics of
social influence, the formation of social groups, and the flow of information between
groups and individuals.
NETWORKX

• NetworkX is a Python library for the creation, manipulation, and study of complex networks.
• It can handle networks with millions of nodes and edges, and
provides functions for generating random networks, calculating
network metrics, and visualizing network structures.
• It also has a wide range of algorithms for community
detection, link prediction, and network visualization.
• Although NetworkX has extensive capabilities, Python users will find it user-friendly and intuitive to use.
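As a small illustration of the network-metric functions mentioned above, this sketch rebuilds the four-person friendship graph from the earlier slides (where every pair is connected) and computes a few standard measures.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Sanjay", "Deepak"), ("Sanjay", "Mpho"), ("Sanjay", "Sue"),
                  ("Deepak", "Sue"), ("Deepak", "Mpho"), ("Mpho", "Sue")])

density = nx.density(G)                         # fraction of possible edges present
components = nx.number_connected_components(G)  # 1 means everyone is reachable
avg_path = nx.average_shortest_path_length(G)   # average hops between pairs
```

Because every pair is directly connected, the density and average path length both come out to 1.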
NETWORK ANALYSIS USING
NETWORKX

• NetworkX has its own drawing module which provides multiple options for plotting.
• Below we can find the visualization for some of the
draw modules in the package.
• Using any of them is fairly easy, as all you need to do
is call the module and pass the G graph variable and
the package does the rest.
A GRAPH OF NODES & EDGES

• Consider that this graph represents the places in a city that people generally
visit, and the path that was followed by a visitor of that city. Let us consider V
as the places and E as the path to travel from one place to another.

V = {v1, v2, v3, v4, v5}

E = {(v1,v2), (v2,v5), (v5, v5), (v4,v5), (v4,v4)}


The edge (u,v) is the same as the edge (v,u) - they are unordered pairs.

Concretely, graphs are mathematical structures used to study pairwise relationships between objects and entities. Graph theory is a branch of Discrete Mathematics and has found multiple applications in Computer Science, Chemistry, Linguistics, Operations Research, Sociology etc.
A NETWORK OF ASSOCIATIONS

• import pandas as pd
• import matplotlib.pyplot as plt
• import networkx as nx
• df = pd.DataFrame([
•     ("Dave", "Ntando"), ("Peter", "Shalulile"), ("John", "Jenny"),
•     ("Mohamad", "Jabulani"), ("Dave", "John"), ("Peter", "Sameera"),
•     ("Sameera", "Albert"), ("Peter", "John"), ("Peter", "Jabulani")
• ], columns=['from', 'to'])
• G = nx.from_pandas_edgelist(df, 'from', 'to')
• nx.draw(G, with_labels=True, node_color="red", node_size=2000)
• plt.show()
SOCIAL NETWORK ANALYSIS (SNA).

• The study of social structures using graph theory is called social network analysis (SNA).
USING AN EXTERNAL FILE FOR SNA

• import os
• os.chdir("c:\\SE_2025\\Python\\Data")
• import pandas as pd
• import matplotlib.pyplot as plt
• import networkx as nx
• df = pd.read_csv("us_president.csv")
• df.head()
• G = nx.from_pandas_edgelist(df, 'From', 'To')
• nx.draw(G, with_labels=True, node_color='lightblue')
• plt.show()
SNA ANALYSIS

• Analysis
• A node's degree is simply a count of how many social connections (i.e., edges) it has. A node with 10 social connections has a degree of 10.
• Degree
• nx.degree(G)
• Most Influential
• nx.degree_centrality(G) (note: NetworkX normalizes degree centrality by dividing each degree by the maximum possible degree, n - 1, so values fall between 0 and 1)
• …if you wanted a sorted list:

most_influential = nx.degree_centrality(G)
for w in sorted(most_influential, key=most_influential.get, reverse=True):
    print(w, most_influential[w])
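To see the normalization concretely, consider a small star network (one hypothetical hub with four contacts): nx.degree reports raw connection counts, while nx.degree_centrality divides each count by n - 1.

```python
import networkx as nx

G = nx.star_graph(4)   # node 0 is the hub, connected to nodes 1-4

raw_degree = dict(nx.degree(G))       # raw connection counts
centrality = nx.degree_centrality(G)  # each degree divided by (n - 1) = 4
```

The hub has a raw degree of 4 but a degree centrality of 1.0; each leaf has degree 1 and centrality 0.25.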
WE ALL NEED IMPORTANT
CONNECTIONS
• Most important connection
• Eigenvector centrality measures a node's importance based on the importance of its neighbors in a network. A node connected to many influential nodes will have a higher eigenvector centrality score than a node connected to many less influential nodes, even if the number of connections is the same. In essence, it's not just about the quantity of connections, but also the quality or influence of those connections.
• most_influential = nx.eigenvector_centrality(G)
• If you wanted a sorted version:

most_influential = nx.eigenvector_centrality(G)
for w in sorted(most_influential, key=most_influential.get, reverse=True):
    print(w, most_influential[w])
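A toy graph makes the "quality over quantity" point measurable: X and Y below each have exactly two connections, but X's neighbours sit in a tightly knit core, so X scores higher. The node names are arbitrary.

```python
import networkx as nx

G = nx.Graph()
# A tightly knit core: A, B and C all know each other.
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C")])
# X connects to two core members; Y connects to one core member
# and one otherwise-isolated node P. Both have degree 2.
G.add_edges_from([("X", "A"), ("X", "B"), ("Y", "C"), ("Y", "P")])

ec = nx.eigenvector_centrality(G)   # X outranks Y despite equal degree
```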
SHORTEST PATH

• Shortest path analysis can reveal the overall connectivity of a network and identify hubs or bridges that connect different parts of the network.
• Imagine a social network where individuals are nodes and friendships are edges. The shortest path between two friends might be a direct friendship (one hop) or a friendship through a mutual acquaintance (two hops).
• E.g. nx.shortest_path(G, "G Bush", "M Trump")
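Since the us_president.csv file isn't reproduced here, this sketch uses an invented friendship network to show the same calls.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Dan"),
                  ("Ann", "Eve"), ("Eve", "Dan")])

path = nx.shortest_path(G, "Ann", "Dan")         # via the mutual friend Eve
hops = nx.shortest_path_length(G, "Ann", "Dan")  # number of edges traversed
```

Ann and Dan are connected in two hops through Eve, even though a longer route through Bob and Cara also exists.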
PLAYING WITH SCIKIT-LEARN

• Scikit-learn is the package for machine learning and data science experimentation
favored by most data scientists. It contains a wide range of well-established learning
algorithms, error functions, and testing procedures.
• Data Science
• Classification problem: Guessing that a new observation is from a certain group
• Regression problem: Guessing the value of a new observation
• It works with the method fit(X, y), where X is the two-dimensional array of predictors (the set of observations to learn from) and y is the target outcome (another, one-dimensional array).
• The next slide shows an example of simple linear regression - the data is read from the csv file named Salary (found in Learn); the data file contains information on employees' years of experience and their salary amounts
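The fit(X, y) convention can be shown on invented data standing in for Salary.csv: X must be two-dimensional (one row per observation), y one-dimensional.

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]   # predictors: one row per observation
y = [30, 40, 50, 60, 70]        # target: lies exactly on y = 20 + 10x

model = LinearRegression().fit(X, y)
predicted = model.predict([[6]])[0]   # the fitted line gives 80
```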
GETTING AN INITIAL VIEW OF THE
DATA

• import os
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• os.chdir("c:\\SE_2025\\Python\\Data")
• df = pd.read_csv('Salary.csv')
• df.head(10)
THE SCATTER PLOT OF SALARY DATA

• import os
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• os.chdir("c:\\SE_2025\\Python\\Data")
• df = pd.read_csv('Salary.csv')
• df.head(10)
• x=df["YearsExperience"]
• y=df["Salary"]
• plt.scatter(x, y, s=[100], marker='*', c='m')
• plt.xlabel("Years of Experience")
• plt.ylabel("Salary")
• plt.show()
WITH THE
PREDICTION
FEATURE
• from sklearn import linear_model
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• regr = linear_model.LinearRegression()
• df = pd.read_csv('Salary.csv')
• x=df[["YearsExperience"]] # a list of predictors is expected here - we only have 1 this time
• y=df["Salary"]
• regr.fit(x,y)
• years=int(input("Enter number of years of experience: "))
• PredSalary = regr.predict([[years]]) # e.g. years = 20 (Years of Experience)
• PredSalary=round(PredSalary[0],2) # predict returns an array, so take the first element
• print("For an Employee with ",years," years of experience the predicted salary is ",PredSalary)
• plt.scatter(x, y, s=[100], marker='*', c='m') # plot the graph with original data
• plt.plot(x,regr.predict(x)) # add on the regression line
• plt.show()
THE NEXT FEW SECTIONS

• Here the concepts of Machine Learning (ML), Artificial Intelligence (AI), Big Data (BD) and Data Science will be presented in an overview manner
DATA MINING

• Data mining is described as "statistics at scale and speed"
• Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence).
• A variety of techniques for exploring data and building models have
been around for a long time in the world of statistics:
• linear regression, logistic regression, discriminant analysis, and principal
components analysis, for example.
• But the core tenets of classical statistics—computing is difficult and
data are scarce—do not apply in data mining applications where both
data and computing power are plentiful
DM & BD

• Data mining and Big Data go hand in hand. Big Data is a relative term—data today are
big by reference to the past, and to the methods and devices available to deal with
them.
• The challenge Big Data presents is often characterized by the four Vs: volume, velocity, variety, and veracity.
• Volume refers to the amount of data.
• Velocity refers to the flow rate—the speed at which it is being generated and changed.
• Variety refers to the different types of data being generated (currency, dates, numbers, text,
etc.).
• Veracity refers to the fact that data is being generated by organic distributed processes
(e.g., millions of people signing up for services or free downloads) and not subject to the
controls or quality checks that apply to data collected for a study.
DATA SCIENCE

• The ubiquity, size, value, and importance of Big Data has given rise to
a new profession: the data scientist.
• Data science is a mix of skills in the areas of statistics, machine
learning, math, programming, business, and IT.
• The term itself is a reference to a rare individual who combines deep
skills in all the constituent areas.
• In their book Analyzing the Analyzers (Harris et al., 2013), the
authors describe the skill sets of most data scientists as resembling a
“T”—deep in one area (the vertical bar of the T), and shallower in
other areas (the top of the T).
STATS & MACHINE LEARNING

• A major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an "average effect":
• for example, "a R10 price increase will reduce average demand by 2 boxes."
• In contrast, the focus in machine learning is on predicting individual
records—“the predicted demand for person i given a R10 price increase is
1 box, while for person j it is 3 boxes.”
• The emphasis that classical statistics places on inference (determining
whether a pattern or interesting result might have happened by chance in
our sample) is absent from data mining.
ML

• We use machine learning to refer to algorithms that learn directly from data, especially local patterns, often in layered or iterative fashion.
• In contrast, we use statistical models to refer to methods that apply
global structure to the data.
• A simple example is a linear regression model (statistical) vs. a k-
nearest-neighbors algorithm (machine learning).
• A given record would be treated by linear regression in accord with an
overall linear equation that applies to all the records.
• In k-nearest neighbours, that record would be classified in accord with the
values of a small number of nearby records.
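The contrast above can be made concrete on a tiny invented dataset: linear regression answers from a single global line fitted to all records, while k-nearest neighbours answers from whichever records happen to be nearby.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1], [2], [3], [10]])
y = np.array([1.0, 2.0, 3.0, 10.0])   # records lie exactly on y = x

lin = LinearRegression().fit(X, y)                  # one global line
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # local: 2 nearest records

lin_pred = lin.predict([[9]])[0]   # the global line gives 9
knn_pred = knn.predict([[9]])[0]   # mean of the 2 nearest targets, 10 and 3
```

For the query point 9, the global model interpolates along the line, while k-NN averages the targets of the two closest records (10 and 3), giving 6.5.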
• Big data analytics is the process in which we collect and analyze large volumes of data sets (called Big Data), which helps in discovering useful hidden patterns and other information, such as customer choices and market trends, that is beneficial for organizations wanting to remain informed and to make customer-oriented business decisions.
• Machine learning is a subset of AI (Artificial Intelligence) which helps computers and machines predict future actions without the intervention of human beings. So it could be said that, with the help of machine learning, software applications can learn how to improve their accuracy in predicting outcomes.
• The normal procedure of big data analytics is to gather and transform the particular data into extracted information; after that, that gathered data is used by machine learning in order to predict better results.
BDA & ML

• Big data provides a massive amount of data: volume & variety
• Machine learning is a subset of AI - it only works well when it can control/manage its data; if there is too much data, ML algorithms will struggle to provide an accurate output
• The data has to be reduced: this means it needs to be summarized, extrapolated and represented by statistical measures of aggregation (e.g. sum, average, median, mode, standard deviation, regression line, correlation coefficients, …)
• In this way masses of data are reduced to single values through Big Data Analytics (BDA)
• At this point, machine learning gets involved and leverages the BDA values to provide additional insights into the data and to make predictions from the data that it has. This is referred to as supervised learning, where training data is used to make future predictions, as is the case for linear regression.
• In unsupervised learning, the big data is used as a starting point (a blank slate, with no rules or prior patterns provided); the machine is responsible for identifying patterns and associations. Clustering is an important concept when it comes to unsupervised learning: it mainly deals with finding a structure or pattern in a collection of uncategorized data.
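The clustering idea above can be sketched with scikit-learn's KMeans: two obvious groups of unlabelled points are recovered without any training labels being supplied. The points are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Uncategorised data with two visually obvious groups.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_   # cluster assignment discovered by the algorithm
```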
