PythonDASE_2025 Version1
HTTPS://MORIOH.COM/P/0BC57432AB32
Mark Twain said that the secret of getting ahead is getting started.
Programming can seem daunting for beginners, but the best way to
get started is to dive right in and start writing code.
DATA ANALYTICS IN SE
• The integration of data analytics into software engineering practices is paving the way for the future of this field.
• Data analytics refers to the process of examining large sets of data to uncover patterns, correlations, and insights that can be used to
make data-driven decisions. Incorporating data analytics into software engineering processes provides numerous benefits.
• Improved decision-making: By analyzing data, software engineers can make more informed decisions based on factual evidence
rather than relying on intuition alone. This leads to better software development strategies and more accurate predictions of user
needs.
• Optimized efficiency: Data analytics helps identify bottlenecks and inefficiencies in the software development lifecycle, allowing
engineers to optimize processes, reduce errors, and deliver high-quality software in a shorter time frame.
• Enhanced user experience: By analyzing user data, software engineers can gain insights into user behavior, preferences, and pain
points. This information can be used to improve user interfaces, personalize experiences, and create tailored solutions.
• Early detection of issues: Data analytics enables software engineers to proactively identify and address potential issues before they
escalate. By monitoring and analyzing software performance metrics, engineers can spot anomalies and implement preventive
measures.
DA AND ML
• One of the key drivers behind the future of software engineering is machine learning. Machine
learning algorithms can analyze vast amounts of data, recognize patterns, and make accurate
predictions or decisions based on this information. Incorporating machine learning into software
engineering processes brings numerous benefits.
• Automated testing: Machine learning algorithms can be trained to analyze code and
automatically detect bugs, reducing the need for manual testing and increasing efficiency.
• Intelligent debugging: By leveraging machine learning, software engineers can build models
that identify recurring error patterns and suggest potential fixes, making the debugging process
faster and more effective.
• Code optimization: Machine learning algorithms can analyze code repositories and identify
patterns of efficient code, helping software engineers optimize their codebase for improved
performance.
• Personalized software: Machine learning can analyze user data and preferences to create
personalized software experiences, tailored to individual needs.
DA AND SOFTWARE DEV
• NumPy
• NumPy is a Python package; the name stands for ‘Numerical Python’. It is a library consisting of multidimensional
array objects and a collection of routines for processing those arrays (a short sketch follows).
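• A minimal sketch of the array object and a few vectorised routines (the values below are made up for illustration):
• import numpy as np
• a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-d (multidimensional) array
• print(a.shape)                          # (2, 3)
• print(a * 10)                           # a vectorised routine applied to every element
• print(a.mean(axis=0))                   # column means: [2.5 3.5 4.5]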
• Pandas - Python’s popular data analysis library, pandas, provides several different options for visualizing
your data with .plot(). Even if you’re at the beginning of your pandas journey, you’ll soon be creating basic
plots that will yield valuable insights into your data.
• pandas arranges data into a 2-d table, similar to a database table; the library calls this a DataFrame (abbreviated df, so plotting is
df.plot()). It was created for data analysis, data cleaning, data handling and data discovery, and it is one of the most common libraries
used for data analytics and data visualisations, as well as in machine learning workflows. The open-source community intended pandas to
be the most powerful library for data analytics (a short sketch follows after the import line below).
• import pandas as pd
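• A minimal sketch of a DataFrame and its .plot() method, using made-up values (the column names are my own choice):
• import pandas as pd
• df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})   # a 2-d table (dataframe)
• print(df.head())          # quick look at the first rows
• df.plot(x="x", y="y")     # pandas calls matplotlib behind the scenes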
• Matplotlib
• This is another popular data visualisation library. Data visualisation helps you to better understand your
data, discover things that you wouldn’t discover in the raw format, and communicate your findings more
efficiently to others. Matplotlib is the most famous and commonly used plotting library in Python.
• import matplotlib.pyplot as plt
USING MATPLOTLIB
• A Quick plot
• from matplotlib import pyplot as plt
• x=[1,2,3,4]
• y=[1,4,9,16]
• plt.plot(x,y)
• # plt.show()   # uncomment when running as a script (e.g. in VS Code); not needed in a notebook
DOING THE SAME THING WITH PANDAS
• import pandas as pd
• x=[1,2,3,4]
• y=[1,4,9,16]
• df=pd.DataFrame(x,y)
• df.plot()   # add plt.show() when running as a script (e.g. in VS Code)
So what happened?
• https://colab.research.google.com/notebooks/welcome.ipynb#scrollTo=JKYK1Qh-b66n
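• So what happened? pd.DataFrame(x, y) passes x as the data and y as the index, so df.plot() draws the x values against the y index; the axes are effectively swapped compared with plt.plot(x, y). A minimal sketch of a construction that reproduces the matplotlib picture (the column names are my own choice):
• import pandas as pd
• x = [1, 2, 3, 4]
• y = [1, 4, 9, 16]
• df = pd.DataFrame({"x": x, "y": y})   # name the columns explicitly
• df.plot(x="x", y="y")                 # now y is plotted against x, as with plt.plot(x, y)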
PLOTTING OPTIONS
• import matplotlib.pyplot as plt
• # x axis values
• x = [1, 2, 3, 4, 5, 6]
• # corresponding y axis values
• y = [2, 4, 1, 5, 2, 6]
• # setting the x and y axis range
• plt.ylim(1, 8)
• plt.xlim(1, 8)
• # plotting the points
• plt.plot(x, y, color='green', linestyle='dashed', linewidth=3,
•          marker='o', markerfacecolor='blue', markersize=12)
BAR PLOT
• x = [1, 2, 3, 4, 5]
• y = [10, 24, 36, 40, 5]
• plt.xlabel('Entries')
• plt.ylabel('Values')
• plt.bar(x, y, width=0.8, color=['red', 'green', 'blue'])
HISTOGRAM AND PIE PLOTS
• import numpy as np
• import matplotlib.pyplot as plt
• x = 20*np.random.randn(10000)
• plt.hist(x, 25, range=(-60, 60), histtype='stepfilled',
• align='mid', color='g', label='Test Data')
• plt.legend()
• plt.title('Step Filled Histogram')
• plt.show()
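• The heading also names a pie plot; a minimal sketch with plt.pie, using made-up category counts and labels:
• import matplotlib.pyplot as plt
• sizes = [35, 25, 20, 20]                        # hypothetical category shares
• labels = ['Bugs', 'Features', 'Docs', 'Tests']  # hypothetical labels
• plt.pie(sizes, labels=labels, autopct='%1.1f%%')
• plt.title('Pie Plot')
• plt.show()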
DATA MINING - NETWORK RELATIONSHIPS
• A Network or Graph is a special representation of entities which have relationships among
themselves.
• It is made up of a collection of two generic objects: (1) a node, which represents an entity, and
(2) an edge, which represents the connection between any two nodes. In a complex network, we
also have attributes or features associated with each node and edge (a short sketch follows after this list).
• For example, a person represented as a node may have attributes like age, gender, salary, etc.
Similarly, an edge between two persons which represents a ‘friend’ connection may have attributes
like friends_since, last_meeting, etc.
• Because of this complex nature, it becomes imperative that we present a network intuitively, so
that it showcases as much information as possible.
• To do so we first need to get acquainted with the different available tools, and that is the topic of
this section: to go through the different options that help us visualize a network.
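• A minimal sketch of nodes and edges carrying attributes, using NetworkX (the people and attribute values are invented for illustration):
• import networkx as nx
• G = nx.Graph()
• G.add_node("Dave", age=34, gender="M")    # a node with attributes
• G.add_node("Jenny", age=29, gender="F")
• G.add_edge("Dave", "Jenny", relation="friend", friends_since=2015)   # an edge with attributes
• print(G.nodes(data=True))   # nodes together with their attribute dictionaries
• print(G.edges(data=True))   # edges together with their attribute dictionaries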
A GRAPH WITH NODES
• Link analysis refers to the process of analyzing the links (or relationships) between
any kind of entity. This can include analyzing links between web pages, emails, financial
transactions, or any other data type where relationships between entities are
relevant.
• Social network analysis is a specific kind of link analysis that focuses exclusively on
people and groups and their relationships with each other.
• One aspect that makes SNA such a powerful tool is the ability to visualize these
relationships in a graph, using nodes to represent individuals and edges to represent
the connections between them.
• Visualizing individuals and relationships allows us to more easily intuit the dynamics of
social influence, the formation of social groups, and the flow of information between
groups and individuals.
NETWORKX
• A graph can represent, for example, the places in a city that people generally visit and the paths
followed between them; in general, let us consider V as the set of nodes and E as the set of edges.
In the example below the nodes are people and the edges are the connections between them.
• import pandas as pd
• import matplotlib.pyplot as plt
• import networkx as nx
• df = pd.DataFrame([
•     ("Dave", "Ntando"), ("Peter", "Shalulile"), ("John", "Jenny"),
•     ("Mohamad", "Jabulani"), ("Dave", "John"), ("Peter", "Sameera"),
•     ("Sameera", "Albert"), ("Peter", "John"), ("Peter", "Jabulani")
• ], columns=['from', 'to'])
• G = nx.from_pandas_edgelist(df, 'from', 'to')
• nx.draw(G, with_labels=True, node_color="red", node_size=2000)
• plt.show()
SOCIAL NETWORK ANALYSIS (SNA).
• import os
• os.chdir("c:\\SE_2025\\Python\\Data")
• import pandas as pd
• import matplotlib.pyplot as plt
• import networkx as nx
• df = pd.read_csv("us_president.csv")
• df.head()
• G = nx.from_pandas_edgelist(df, 'From', 'To')
• nx.draw(G, with_labels=True, node_color='lightblue')
• plt.show()
SNA ANALYSIS
• Analysis
• A node's degree is simply a count of
how many social connections (i.e.,
edges) it has. A node with 10 social
connections has a degree of 10. Note
that nx.degree_centrality() returns a
normalised value (the degree divided by
the number of other nodes in the graph).
• Degree of Centrality
• nx.degree(G)
• Most Influential
• nx.degree_centrality(G)
• …if you wanted a sorted list:
• most_influential = nx.degree_centrality(G)
• for w in sorted(most_influential, key=most_influential.get, reverse=True):
•     print(w, most_influential[w])
WE ALL NEED IMPORTANT CONNECTIONS
• Most important connection
• Eigenvector centrality measures a node's importance based on the importance of its neighbors in a network.
A node connected to many influential nodes will have a higher eigenvector centrality score than a node
connected to many less influential nodes, even if the number of connections is the same. In essence, it's not
just about the quantity of connections, but also the quality or influence of those connections.
• If you wanted a sorted version:
• most_influential = nx.degree_centrality(G)
• for w in sorted(most_influential, key=most_influential.get, reverse=True):
•     print(w, most_influential[w])
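• The sorted list above still ranks nodes by degree centrality; a minimal sketch of ranking by eigenvector centrality instead, using NetworkX's nx.eigenvector_centrality on the same graph G:
• eig = nx.eigenvector_centrality(G)   # importance weighted by the neighbours' importance
• for w in sorted(eig, key=eig.get, reverse=True):
•     print(w, round(eig[w], 3))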
SHORTEST PATH
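• A minimal sketch of finding a shortest path with NetworkX's nx.shortest_path, reusing the friendship edge list from the earlier NetworkX example (the chosen endpoints are my own):
• import pandas as pd
• import networkx as nx
• df = pd.DataFrame([
•     ("Dave", "Ntando"), ("Peter", "Shalulile"), ("John", "Jenny"),
•     ("Mohamad", "Jabulani"), ("Dave", "John"), ("Peter", "Sameera"),
•     ("Sameera", "Albert"), ("Peter", "John"), ("Peter", "Jabulani")
• ], columns=['from', 'to'])
• G = nx.from_pandas_edgelist(df, 'from', 'to')
• print(nx.shortest_path(G, "Dave", "Jabulani"))          # ['Dave', 'John', 'Peter', 'Jabulani']
• print(nx.shortest_path_length(G, "Dave", "Jabulani"))   # 3 edges on that path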
SCIKIT-LEARN
• Scikit-learn is the package for machine learning and data science experimentation
favored by most data scientists. It contains a wide range of well-established learning
algorithms, error functions, and testing procedures.
• Data Science
• Classification problem: Guessing that a new observation is from a certain group
• Regression problem: Guessing the value of a new observation
• It works with the method fit(X, y), where X is the two-dimensional array of predictors (the set
of observations to learn from) and y is the target outcome (another, one-dimensional array).
• The next slide shows an example of simple linear regression. The data is read from
the CSV file named Salary (found in Learn); the data file contains information on
employees’ years of experience and their salary amounts.
GETTING AN INITIAL VIEW OF THE DATA
• import os
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• os.chdir("c:\\SE_2025\\Python\\Data")
• df = pd.read_csv('Salary.csv')
• df.head(10)
THE SCATTER PLOT OF SALARY DATA
• import os
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• os.chdir("c:\\SE_2025\\Python\\Data")
• df = pd.read_csv('Salary.csv')
• df.head(10)
• x=df["YearsExperience"]
• y=df["Salary"]
• plt.scatter(x, y, s=[100], marker='*', c='m')
• plt.xlabel("Years of Experience")
• plt.ylabel("Salary")
• plt.show()
WITH THE PREDICTION FEATURE
• from sklearn import linear_model
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• regr = linear_model.LinearRegression()
• df = pd.read_csv('Salary.csv')
• x=df[["YearsExperience"]] # a list of predictors is expected here - we only have 1 this time
• y=df["Salary"]
• regr.fit(x,y)
• years = int(input("Enter number of years of experience: "))
• PredSalary = regr.predict([[years]])   # e.g. years = 20 (years of experience)
• PredSalary = round(PredSalary[0], 2)   # predict() returns an array, so take the first element
• print("For an employee with", years, "years of experience the predicted salary is", PredSalary)
• plt.scatter(x, y, s=[100], marker='*', c='m') # plot the graph with original data
• plt.plot(x,regr.predict(x)) # add on the regression line
• plt.show()
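• A short follow-up sketch for inspecting the fitted line; coef_, intercept_ and score() are standard attributes/methods of scikit-learn's LinearRegression:
• print("Slope (salary increase per year):", round(regr.coef_[0], 2))
• print("Intercept (predicted salary at zero years):", round(regr.intercept_, 2))
• print("R-squared on the training data:", round(regr.score(x, y), 3))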
THE NEXT FEW SECTIONS
• Data mining and Big Data go hand in hand. Big Data is a relative term—data today are
big by reference to the past, and to the methods and devices available to deal with
them.
• The challenge Big Data presents is often characterized by the four Vs: volume, velocity, variety, and veracity.
• Volume refers to the amount of data.
• Velocity refers to the flow rate, the speed at which the data is being generated and changed.
• Variety refers to the different types of data being generated (currency, dates, numbers, text,
etc.).
• Veracity refers to the fact that data is being generated by organic distributed processes
(e.g., millions of people signing up for services or free downloads) and is not subject to the
controls or quality checks that apply to data collected for a study.
DATA SCIENCE
• The ubiquity, size, value, and importance of Big Data has given rise to
a new profession: the data scientist.
• Data science is a mix of skills in the areas of statistics, machine
learning, math, programming, business, and IT.
• The term itself is a reference to a rare individual who combines deep
skills in all the constituent areas.
• In their book Analyzing the Analyzers (Harris et al., 2013), the
authors describe the skill sets of most data scientists as resembling a
“T”—deep in one area (the vertical bar of the T), and shallower in
other areas (the top of the T).
STATS & MACHINE LEARNING