
FUNDAMENTALS OF DATA SCIENCE

UNIT I:

Data Science: definition, Datafication, Exploratory Data Analysis, The Data
Science process, A data scientist's role in this process.

NumPy Basics: The NumPy ndarray: A Multidimensional Array Object, Creating
ndarrays, Data Types for ndarrays, Operations between Arrays and Scalars, Basic
Indexing and Slicing, Boolean Indexing, Fancy Indexing, Data Processing Using
Arrays, Expressing Conditional Logic as Array Operations, Methods for Boolean
Arrays, Sorting, Unique

----------------------------------------------------------------------------------------------------

Data Science is a branch of computer science in which we study how to store, use,
and analyze data to derive information from it.

What is Data Science?

Data science is a deep study of the massive amount of data, which involves
extracting meaningful insights from raw, structured, and unstructured data that is
processed using the scientific method, different technologies, and algorithms.

Therefore, data science is all about:

o Asking the correct questions and analyzing the raw data.


o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.

Example:

Suppose we want to travel from station A to station B by car. We need to
make some decisions, such as which route will get us to the destination fastest,
which route will avoid traffic jams, and which will be the most cost-effective.
All these decision factors act as input data, and we derive an appropriate
answer from them; this analysis of data is called data analysis, which is a part
of data science.

Need for Data Science:

o With the help of data science technology, we can convert the massive
amount of raw and unstructured data into meaningful insights.

o Data science is being adopted by various companies, whether big
brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, use data science algorithms for a better customer
experience.
o Data science is used to automate transportation, such as creating
self-driving cars, which are the future of transportation.
o Data science can help with different predictions, such as surveys,
elections, flight ticket confirmation, etc.

Data science Jobs:


Data Scientist
Data Analyst
Machine learning expert
Data engineer
Data Architect
Data Administrator
Business Analyst
Business Intelligence Manager

Data Science Components:

The main components of Data Science are given below:

1. Statistics: Statistics is one of the most important components of data science.
Statistics is a way to collect and analyze numerical data in large amounts and
find meaningful insights from it.

2. Domain Expertise: In data science, domain expertise is what binds the other
components of data science together.

3. Data engineering: Data engineering is a part of data science, which involves
acquiring, storing, retrieving, and transforming data. Data engineering also
includes adding metadata (data about data) to the data.

4. Visualization: Data visualization means representing data in a visual
context so that people can easily understand the significance of the data. Data
visualization makes it easy to grasp huge amounts of data through visuals.

5. Advanced computing: Advanced computing involves designing, writing,


debugging, and maintaining the source code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. Mathematics
involves the study of quantity, structure, space, and change. For a data scientist,
good knowledge of mathematics is essential.

7. Machine learning: Machine learning is the backbone of data science. Machine
learning is all about training a machine so that it can act like a human
brain. In data science, we use various machine learning algorithms to solve
problems.

Tools for Data Science

Following are some tools required for data science:

Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB,


Excel, RapidMiner.

Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift

Data Visualization tools: R, Jupyter, Tableau, Cognos.

Machine learning tools: Spark, Mahout, Azure ML studio.

Data Science Lifecycle

The life cycle of data science is illustrated in the diagram below.

The main phases of data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right
questions. In this phase, we need to determine all the requirements of the project,
such as the number of people, technology, time, data, and the end goal; then we can
frame the business problem at a first hypothesis level.

2. Data preparation: Data preparation is also known as Data Munging (the process
of preparing raw data for reporting and analysis). In this phase, we need to
perform the following tasks:

o Data cleaning
o Data Reduction
o Data integration
o Data transformation

3. Model Planning: In this phase, we need to determine the various methods and
techniques to establish the relations between input variables. We will apply
Exploratory Data Analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see what
the data can tell us. Common tools used for model planning are:

o SQL Analysis Services


o R
o SAS
o Python

4. Model-building: In this phase, the process of model building starts. We will
create datasets for training and testing purposes. We will apply different techniques,
such as association, classification, and clustering, to build the model (a brief
sketch follows).
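As an illustration only, here is a minimal sketch of this phase in Python, assuming the scikit-learn library is available; the dataset and the choice of a clustering model are placeholders for whatever the project actually uses:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# hypothetical dataset: 100 samples with 2 features each
X = np.random.rand(100, 2)

# split the data into training and testing sets
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# fit a clustering model (one of the techniques mentioned above)
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X_train)

# assign cluster labels to the unseen test data
print(model.predict(X_test))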

Following are some common Model building tools:

o SAS Enterprise Miner


o WEKA
o SPSS Modeler
o MATLAB

5. Operationalize: In this phase, we will deliver the final reports of the project,
along with briefings, code, and technical documents. This phase provides a
clear overview of the complete project performance and other components on a
small scale before the full deployment.

6. Communicate results: In this phase, we check whether we have reached the goal
we set in the initial phase. We communicate the findings and final result
to the business team.

Applications of Data Science:

o Image recognition and speech recognition:


Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends. This automatic tagging suggestion uses an image recognition algorithm,
which is part of data science.
When you say something using "Ok Google", etc., and these devices respond
to your voice, this is made possible by a speech recognition algorithm.
o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing
day by day. EA Sports, Sony, and Nintendo are widely using data science to
enhance the user experience.
o Internet search:
When we want to search for something on the internet, then we use different
types of search engines such as Google, Yahoo, Bing, Ask, etc. All these
search engines use data science technology to make the search experience
better, and you can get search results within a fraction of a second.
o Transport:
Transport industries are also using data science technology to create self-driving
cars. With self-driving cars, it will be easy to reduce the number of road
accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data
science is being used for tumor detection, drug discovery, medical image
analysis, virtual medical bots, etc.
o Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., are using
data science technology to create a better user experience with
personalized recommendations. For example, when you search for something on
Amazon, you start getting suggestions for similar products; this is
because of data science technology.

o Risk detection:
Finance industries have always faced issues of fraud and risk of losses, but with
the help of data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk
and losses while increasing customer satisfaction.

What Does Datafication Mean?

Datafication refers to the collective tools, technologies, and processes used to
transform an organization into a data-driven enterprise.

Datafication is a process of transforming the aspects such as processes, behaviors,


and activities of businesses, customers, and users into quantifiable, usable, and
actionable data. It involves behind-the-scenes activities (dark data) that customers
are mostly unaware of, turning them into meaningful insights to make the business
a fully data-driven operation.

Example: Think about the amount of data we generate while talking on the phone,
engaging on social media, shopping online using a credit card, or simply
walking past a security camera. This kind of data is being monitored and
tracked by data scientists or miners in such a way that it creates a wide array of
opportunities. After proper investigation, they pass this valued information to
businesses who are keen on increasing their market share, profits, and brand value.

Advantages Of Datafication

1. Insight and Decision-Making:

Datafication provides valuable insights that can drive informed decision-making.


By collecting and analyzing data, organizations and individuals can gain a deeper
understanding of patterns, trends, and correlations, empowering them to make
data-driven decisions.
2. Enhanced Efficiency and Productivity:

Datafication enables organizations to optimize their operations, streamline


processes, and improve productivity. By leveraging data, organizations can
identify process inefficiencies, bottlenecks, or areas for improvement and
implement targeted solutions for enhanced efficiency.

3. Personalization and Customer Experience:

Datafication allows for personalized experiences and tailored offerings. By


collecting and analyzing customer data, organizations can deliver more
personalized products, services, and recommendations, resulting in improved
customer satisfaction and loyalty.

4. Innovation and New Business Models:

Datafication drives innovation by providing organizations with insights and


opportunities to develop new products, services, and business models. By
understanding data patterns and customer needs, organizations can identify unmet
demands and create innovative solutions.

5. Problem Solving and Optimization:

Datafication enables organizations to solve complex problems and optimize


processes. By analyzing data, organizations can identify areas of improvement,
perform predictive analysis, and optimize resource allocation to drive efficiency
and effectiveness.

6. Evidence-Based Policymaking:

Datafication plays a crucial role in informing evidence-based policymaking.


Governments and policymakers can leverage data to understand social, economic,

and environmental trends, address societal challenges, and develop targeted
policies for the benefit of citizens.

What is Exploratory Data Analysis (EDA)?


Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.
It involves analyzing and visualizing data to understand its key characteristics,
uncover patterns, and identify relationships between variables.
EDA is normally carried out as a preliminary step before undertaking more
formal statistical analyses or modeling.
Types of Exploratory Data Analysis
Depending on the number of columns we are analyzing, we can divide EDA into
three types: univariate, bivariate, and multivariate.
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal
structure. It is primarily concerned with describing the data and finding patterns
existing in a single feature. Common techniques include:
 Histograms: Used to visualize the distribution of a variable.
 Box plots: Useful for detecting outliers and understanding the spread and
skewness of the data.
 Bar charts: Employed for categorical data to show the frequency of each
category.
 Summary statistics: Calculations like mean, median, mode, variance, and
standard deviation that describe the central tendency and dispersion of the
data.
2. Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables. It
helps find associations, correlations, and dependencies between pairs of
variables, and is a crucial form of exploratory data analysis. Some key
techniques used in bivariate analysis (a short sketch follows this list):
 Scatter Plots: These are one of the most common tools used in bivariate
analysis. A scatter plot helps visualize the relationship between two
continuous variables.
 Correlation Coefficient: This statistical measure (often Pearson’s
correlation coefficient for linear relationships) quantifies the degree to which
two variables are related.
 Cross-tabulation: Also known as contingency tables, cross-tabulation is
used to analyze the relationship between two categorical variables. It shows
the frequency distribution of categories of one variable in rows and the other
in columns, which helps in understanding the relationship between the two
variables.
 Line Graphs: In the context of time series data, line graphs can be used to
compare two variables over time. This helps in identifying trends, cycles, or
patterns that emerge in the interaction of the variables over the specified
period.
 Covariance: Covariance is a measure used to determine how much two
random variables change together. However, it is sensitive to the scale of the
variables, so it’s often supplemented by the correlation coefficient for a more
standardized assessment of the relationship.
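As a small illustration of the correlation coefficient, here is a hedged sketch in Python using NumPy; the two variables are made up for demonstration:

import numpy as np

# two illustrative variables
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Pearson's correlation coefficient, read off the correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(r)  # approximately 0.77: a moderately strong positive relationship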
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more variables
in the dataset. It aims to understand how variables interact with one another,
which is crucial for most statistical modeling techniques. Techniques include:
 Pair plots: Visualize relationships across several variables simultaneously to
capture a comprehensive view of potential interactions.
 Principal Component Analysis (PCA): A dimensionality reduction
technique used to reduce the dimensionality of large datasets, while
preserving as much variance as possible.
Tools for Performing Exploratory Data Analysis
Exploratory Data Analysis (EDA) can be effectively performed using a variety
of tools and software, each offering unique features suitable for handling
different types of data and analysis requirements.
1. Python Libraries
 Pandas: Provides extensive functions for data manipulation and analysis,
including data structure handling and time series functionality.
 Matplotlib: A plotting library for creating static, interactive, and animated
visualizations in Python.
 Seaborn: Built on top of Matplotlib, it provides a high-level interface for
drawing attractive and informative statistical graphics.
 Plotly: An interactive graphing library for making interactive plots and offers
more sophisticated visualization capabilities.
2. R Packages
 ggplot2: Part of the tidyverse, it’s a powerful tool for making complex plots
from data in a data frame.
 dplyr: A grammar of data manipulation, providing a consistent set of verbs
that help you solve the most common data manipulation challenges.

 tidyr: Helps to tidy your data. Tidying your data means storing it in a
consistent form that matches the semantics of the dataset with the way it is
stored.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed
to help you understand the data, uncover underlying patterns, identify
anomalies, test hypotheses, and ensure the data is clean and suitable for further
analysis.
Step 1: Understand the Problem and the Data
The first step in any data analysis project is to clearly understand the problem
you are trying to solve and the data you have at your disposal. This
involves asking questions such as:
 What is the business goal or research question you are trying to
address?
 What are the variables in the data, and what do they mean?
 What are the data types (numerical, categorical, text, etc.)?
 Are there any known data quality issues or limitations?
 Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better
formulate your analysis approach and avoid making incorrect assumptions or
drawing misguided conclusions. It is also valuable to consult domain
experts or stakeholders at this stage to ensure you have a complete
understanding of the context and requirements.
Step 2: Import and Inspect the Data
Once you have a clear understanding of the problem and the data, the next
step is to import the data into your analysis environment (e.g., Python, R, or a
spreadsheet program). During this step, inspecting the data is critical to
gain an initial understanding of its structure, variable types, and potential issues.
Here are a few tasks you can carry out at this stage:
 Load the data into your analysis environment, ensuring that it is
imported correctly and without errors or truncation.
 Examine the size of the data (number of rows and columns) to get a sense
of its scale and complexity.
 Check for missing values and their distribution across variables, as missing
data can notably affect the quality and reliability of your analysis.
 Identify the data type and format of each variable, as this information is
needed for the subsequent data manipulation and analysis steps.
 Look for any apparent errors or inconsistencies in the data, such as
invalid values, mismatched units, or outliers, that can indicate
data quality issues.
Step 3: Handle Missing Data
Missing data is a common problem in many datasets, and it can significantly
impact the quality and reliability of your analysis. During the EDA process,
it's critical to identify and deal with missing data appropriately, as
ignoring or mishandling it can result in biased or misleading
outcomes.
Here are some techniques you can use to handle missing data:
 Understand the patterns and potential causes of missing data: Is the
data missing completely at random (MCAR), missing at random (MAR),
or missing not at random (MNAR)? Understanding the underlying
mechanism can inform the proper method for handling missing data.
 Decide whether to eliminate observations with missing values (listwise
deletion) or to impute (fill in) missing values: Removing observations with
missing values can result in a loss of information and potentially biased
outcomes, especially if the missing data are not MCAR. Imputing
missing values can help preserve valuable information. However, the
imputation approach needs to be chosen cautiously.
 Use suitable imputation strategies, such as mean/median imputation,
regression imputation, multiple imputation, or machine-learning-based
imputation methods like k-nearest neighbors (KNN) or decision trees.
The choice of imputation technique has to be based on the
characteristics of the data and the assumptions underlying each
method.
 Consider the impact of missing data: Even after imputation, missing
data can introduce uncertainty and bias. It is important to acknowledge these
limitations and interpret your results with caution.
Handling missing data properly can improve the accuracy and reliability of
your analysis and prevent biased or misleading conclusions. It is likewise vital
to document the techniques used to handle missing data and the reasoning
behind your choices.
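As an illustration of deletion versus simple imputation, here is a minimal sketch in Python, assuming the pandas library; the column names and values are hypothetical:

import numpy as np
import pandas as pd

# hypothetical dataset with missing values
df = pd.DataFrame({'age': [25, np.nan, 31, 40],
                   'income': [50, 62, np.nan, 58]})

# option 1: listwise deletion - drop rows with any missing value
dropped = df.dropna()

# option 2: mean imputation - fill missing values with each column's mean
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)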
Step 4: Explore Data Characteristics
After addressing missing data, the next step in the EDA process is to explore
the characteristics of your data. This entails examining your variables'
distribution, central tendency, and variability and identifying any potential
outliers or anomalies. Understanding the characteristics of your data is critical
in selecting appropriate analytical techniques, identifying potential data
quality issues, and gaining insights that can inform subsequent analysis and
modeling decisions.
Calculate summary statistics (mean, median, mode, standard deviation,
skewness, kurtosis, and so on) for numerical variables: These statistics
provide a concise overview of the distribution and central tendency of each
variable, aiding in the identification of potential issues or deviations from expected
patterns.
Step 5: Perform Data Transformation
Data transformation is a critical step within the EDA process because it enables
you to prepare your data for further analysis and modeling. Depending on
the characteristics of your data and the requirements of your analysis, you may
need to carry out various transformations to ensure that your data are in the
most appropriate form.
Here are a few common data transformation techniques:
 Scaling or normalizing numerical variables to a common range (e.g., min-
max scaling, standardization)
 Encoding categorical variables for use in machine learning models
(e.g., one-hot encoding, label encoding)
 Applying mathematical transformations to numerical variables (e.g.,
logarithmic, square root) to correct for skewness or non-linearity
 Creating derived variables or features based on existing
variables (e.g., calculating ratios, combining variables)
 Aggregating or grouping data based on specific variables or
conditions
By transforming your data appropriately, you can ensure that your
analysis and modeling techniques are applied correctly and that your
results are reliable and meaningful.
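A minimal sketch of a few of these transformations in Python, assuming NumPy and pandas; the data are made up for illustration:

import numpy as np
import pandas as pd

x = np.array([1.0, 10.0, 100.0, 1000.0])

# min-max scaling to the [0, 1] range
scaled = (x - x.min()) / (x.max() - x.min())

# logarithmic transform to correct for right skew
logged = np.log10(x)  # [0. 1. 2. 3.]

# one-hot encoding of a categorical variable
colors = pd.Series(['red', 'green', 'red'])
encoded = pd.get_dummies(colors)

print(scaled)
print(logged)
print(encoded)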
Step 6: Visualize Data Relationships
Visualization is a powerful tool in the EDA process, as it helps reveal
relationships between variables and uncover patterns or trends that may
not be immediately apparent from summary statistics or numerical outputs. To
visualize data relationships, explore univariate, bivariate, and multivariate
views of the data.
 Create frequency tables, bar plots, and pie charts for categorical variables:
These visualizations can help you understand the distribution of categories
and spot any potential imbalances or unusual patterns.
 Generate histograms, box plots, violin plots, and density plots to
visualize the distribution of numerical variables. These visualizations can
reveal important information about the shape, spread, and potential outliers
in the data.
 Examine the correlation or association between variables using scatter plots,
correlation matrices, or statistical tests like Pearson's correlation
coefficient or Spearman's rank correlation: Understanding the relationships
between variables can inform feature selection, dimensionality reduction, and
modeling choices.
Step 7: Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the
(so-called normal) objects. Outliers can be caused by measurement or execution
errors. The analysis for outlier detection is referred to as outlier mining. There
are many ways to detect outliers, and removing them from a dataframe works the
same way as removing any other data item from a pandas dataframe.
Identify and inspect potential outliers using techniques like
the interquartile range (IQR), Z-scores, or domain-specific rules: Outliers can
considerably impact the results of statistical analyses and machine learning
models, so it is essential to identify and handle them appropriately (a short
sketch follows).
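For instance, a minimal sketch of the IQR method in Python using NumPy; the data are made up so that one value is an obvious outlier:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 14])

# interquartile range (IQR) method
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only values outside the [lower, upper] fences
outliers = data[(data < lower) | (data > upper)]

print(outliers)  # [95]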
Step 8: Communicate Findings and Insights
The final step in the EDA process is to communicate your findings and
insights effectively. This includes summarizing your analysis, highlighting key
discoveries, and presenting your results clearly and compellingly.
Here are a few tips for effective communication:
 Clearly state the objectives and scope of your analysis
 Provide context and background information to help others understand your
approach
 Use visualizations and graphics to support your findings and make them more
accessible
 Highlight key insights, patterns, or anomalies discovered during
the EDA process
 Discuss any limitations or caveats related to your analysis
 Suggest potential next steps or areas for further investigation

Data Science Process

Data science process consists of six stages :

1. Discovery or Setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation

• Fig.: the data science design process.

• Step 1: Discovery or Defining research goal

This step involves defining the research goal: understanding the business problem
and identifying the internal and external data sources that can help answer the
business question.

• Step 2: Retrieving data

This step is the collection of the data required for the project. It is the process of
gaining a business understanding of the data you have and deciphering what each
piece of data means. This could entail determining exactly what data is required
and the best methods for obtaining it. It also entails determining what each of the
data points means in terms of the company. If we are given a data set from a
client, for example, we need to know what each column and row represents.

• Step 3: Data preparation

Data can have many inconsistencies, like missing values, blank columns, and
incorrect data formats, which need to be cleaned. We need to process, explore, and
condition data before modeling. Clean data gives better predictions.

• Step 4: Data exploration

Data exploration is about building a deeper understanding of the data. Try to
understand how variables interact with each other, the distribution of the data,
and whether there are outliers. To achieve this, use descriptive statistics, visual
techniques, and simple modeling. This step is also called Exploratory Data
Analysis.

• Step 5: Data modeling

In this step, the actual model building process starts. Here, Data scientist
distributes datasets for training and testing. Techniques like association,
classification and clustering are applied to the training data set. The model, once
prepared, is tested against the "testing" dataset.

• Step 6: Presentation and automation

Deliver the final baselined model with reports, code and technical documents in
this stage. Model is deployed into a real-time production environment after
thorough testing. In this stage, the key findings are communicated to all
stakeholders. This helps to decide if the project results are a success or a failure
based on the inputs from the model.

Roles & Responsibilities of a Data Scientist


 Management: The Data Scientist plays a supporting managerial role where he
helps build the foundation of futuristic and technical
abilities within the Data and Analytics field in order to assist various planned
and ongoing data analytics projects.
 Analytics: The Data Scientist represents a scientific role where he plans,
implements, and assesses high-level statistical models and strategies for
application in the business’s most complex issues. The Data Scientist
develops econometric and statistical models for various problems including
projections, classification, clustering, pattern analysis, sampling, simulations,
and so forth.
 Strategy/Design: The Data Scientist performs a vital role in the
advancement of innovative strategies to understand the business’s consumer
trends and management as well as ways to solve difficult business problems,
for instance, the optimization of product fulfillment and entire profit.
 Collaboration: The role of the Data Scientist is not a solitary one, and in this
position, he collaborates with senior data scientists to communicate
obstacles and findings to relevant stakeholders in an effort to enhance
business performance and decision-making.
 Knowledge: The Data Scientist also takes the lead in exploring different
technologies and tools with the vision of creating innovative data-driven
insights for the business at the most agile pace feasible. In this situation, the
Data Scientist also uses initiative in assessing and utilizing new and
enhanced data science methods for the business, which he delivers to senior
management for approval.
 Other Duties: A Data Scientist also performs related tasks and duties as
assigned by the Senior Data Scientist, Head of Data Science, Chief Data
Officer, or the Employer.

What is NumPy?

NumPy is a Python library used for working with arrays.

It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.

NumPy was created in 2005 by Travis Oliphant. It is an open source project.


NumPy stands for Numerical Python.

Why Use NumPy?

In Python we have lists that serve the purpose of arrays, but they are slow to
process.
NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists.

The array object in NumPy is called ndarray, it provides a lot of supporting


functions that make working with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are
very important.

Why is NumPy Faster Than Lists?

NumPy arrays are stored at one continuous place in memory unlike lists, so
processes can access and manipulate them very efficiently.

This behavior is called locality of reference in computer science.

Which Language is NumPy written in?

NumPy is a Python library and is written partially in Python, but most of the parts
that require fast computation are written in C or C++.

What is NumPy Used for?

NumPy is an important library generally used for:

 Machine Learning

 Data Science

 Image and Signal Processing

 Scientific Computing

 Quantum Computing

20
NumPy Ndarray

ndarray is the n-dimensional array object defined in NumPy, which stores a
collection of elements of the same type. In other words, we can define an ndarray
as a collection of objects of a single data type (dtype).

The ndarray object is accessed using 0-based indexing. Each element of the array
occupies the same size in memory.

Creating a ndarray object

The ndarray object can be created by using the array routine of the numpy module.
For this purpose, we need to import numpy.

Ex 1:
>>> import numpy
>>> a = numpy.array([1, 2, 3])

Ex 2:
# numpy library imported
import numpy as np
# creating a single-dimensional array
arr_s = np.arange(5)
print(arr_s)

Ex 3: Finding the dimensions of the array

The ndim attribute can be used to find the number of dimensions of an array.
>>> import numpy as np
>>> arr = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [9, 10, 11, 23]])

>>> print(arr.ndim)
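Output:

2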

Import NumPy

 Once NumPy is installed, import it in your applications by adding


the import keyword:

import numpy

ex:

import numpy

arr = numpy.array([1, 2, 3, 4, 5])

print(arr)

output:

[1 2 3 4 5]

NumPy as np

NumPy is usually imported under the np alias.

Create an alias with the as keyword while importing:

import numpy as np

Now the NumPy package can be referred to as np instead of numpy.

Example
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

Dimensions in Arrays

A dimension in arrays is one level of array depth (nested arrays).

nested array: are arrays that have arrays as their elements.


0-D Arrays

0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D
array.

Example

Create a 0-D array with value 42

import numpy as np

arr = np.array(42)

print(arr)

output:

42

1-D Arrays

An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.

These are the most common and basic arrays.

Example

Create a 1-D array containing the values 1,2,3,4,5:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

output:

[1 2 3 4 5]

2-D Arrays

An array that has 1-D arrays as its elements is called a 2-D array.

These are often used to represent a matrix or a 2nd order tensor.

NumPy has a whole submodule dedicated to matrix operations
called numpy.matlib.
Example

Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr)

output:

[[1 2 3]
[4 5 6]]
3-D arrays

An array that has 2-D arrays (matrices) as its elements is called a 3-D array.

These are often used to represent a 3rd order tensor.

Example

Create a 3-D array with two 2-D arrays, both containing two arrays with the values
1,2,3 and 4,5,6:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

Output:

[[[1 2 3]
[4 5 6]]

[[1 2 3]
[4 5 6]]]
Data Types in NumPy

Below is a list of all data types in NumPy and the characters used to represent
them.

 i - integer
 b - boolean
 u - unsigned integer
 f - float
 c - complex float
 m - timedelta
 M - datetime
 O - object
 S - string
 U - unicode string
 V - fixed chunk of memory for other type ( void )

Checking the Data Type of an Array

The NumPy array object has a property called dtype that returns the data type of
the array:

Example

Get the data type of an array object:

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.dtype)

output:

int64

Get the data type of an array containing strings:

import numpy as np

arr = np.array(['apple', 'banana', 'cherry'])

print(arr.dtype)

output:

<U6
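Creating Arrays With a Defined Data Type

The array() function can take an optional dtype argument that sets the data type
of the array elements. For example:

import numpy as np

# create an array with a 4-byte (32-bit) integer data type
arr = np.array([1, 2, 3, 4], dtype='i4')

print(arr)
print(arr.dtype)

output:

[1 2 3 4]
int32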

Scalar operations on Numpy arrays


Scalar operations on NumPy arrays include performing addition, subtraction,
multiplication, or division on each element of a NumPy array. Let us see the
following examples:

 Addition Operation on Numpy Arrays


 Subtraction Operation on Numpy Arrays
 Multiplication Operation on Numpy Arrays
 Division Operation on Numpy Arrays

Addition Operation on Numpy Arrays


The Addition Operation is adding a value to each element of a NumPy array. Let
us see an example:

import numpy as np

# Create two arrays
n1 = np.array([15, 20, 25, 30])
n2 = np.array([65, 75, 85, 95])

print("Array1 =", n1)


print("Array2 =", n2)

# performing addition on each element


n1 = n1 + 10
n2 = n2 + 20

print("Array1 (updated) ", n1)


print("Array2 (updated) ", n2)

Output

Array1 = [15 20 25 30]


Array2 = [65 75 85 95]
Array1 (updated) [25 30 35 40]
Array2 (updated) [ 85 95 105 115]

Subtraction Operation on Numpy Arrays


The Subtraction operation is subtracting a value from each element of
a NumPy array. Let us see an example:

import numpy as np

# Create two Numpy arrays


n1 = np.array([15, 20, 25, 30])
n2 = np.array([65, 75, 85, 95])

print("Array1 =", n1)


print("Array2 =", n2)

# performing subtraction on each element

n1 = n1 - 5
n2 = n2 - 10

print("Array1 (updated) ", n1)


print("Array2 (updated) ", n2)

Output
Array1 = [15 20 25 30]

Array2 = [65 75 85 95]


Array1 (updated) [10 15 20 25]
Array2 (updated) [55 65 75 85]

Multiplication Operation on Numpy Arrays


The Multiplication operation multiplies each element of
a NumPy array by a value. Let us see an example:

import numpy as np

# Create two arrays


n1 = np.array([15, 20, 25, 30])
n2 = np.array([65, 75, 85, 95])

print("Array1 =", n1)


print("Array2 =", n2)

# performing multiplication on each element


n1 = n1 * 5
n2 = n2 * 10

print("Array1 (updated) ", n1)


print("Array2 (updated) ", n2)

Output

Array1 = [15 20 25 30]

Array2 = [65 75 85 95]


Array1 (updated) [ 75 100 125 150]
Array2 (updated) [650 750 850 950]

Division Operation on Numpy Arrays


The Division operation divides each element of a NumPy array by a value.
Let us see an example:

import numpy as np

# Create two arrays


n1 = np.array([15, 20, 25, 30])
n2 = np.array([65, 75, 85, 95])

print("Array1 =", n1)


print("Array2 =", n2)

# performing division on each element


n1 = n1 / 5
n2 = n2 / 5

print("Array1 (updated) ", n1)


print("Array2 (updated) ", n2)

Output

Array1 = [15 20 25 30]


Array2 = [65 75 85 95]
Array1 (updated) [3. 4. 5. 6.]

Array2 (updated) [13. 15. 17. 19.]

Array Indexing
Array indexing is accessing array elements. In NumPy, an array element is
accessed using its index number. Index 0 refers to the first element, and the
sequence continues as shown below:

 index 0 – element 1
 index 1 – element 2
 index 2 – element 3
 index 3 – element 4
In this lesson, we will cover the following topics to understand Array Indexing in
NumPy:

 Access elements from a 1D Array


 Access elements from a 2D Array
 Access elements from a 3D Array
 Access elements from the last with Negative Indexing
Let us begin with accessing elements from a One-Dimensional i.e. 1D array:

Access elements from a One-Dimensional array


The following are some examples to access specific elements from a 1D array:

Example: Access the 1st element (index 0) from a One-Dimensional array

import numpy as np

n = np.array([10, 20, 30, 40, 50])


print(n[0])

Output

10

Example: Access the 4th element (index 3) from a One-Dimensional array

import numpy as np

n = np.array([10, 20, 30, 40, 50])


print(n[3])

Output

40

Access elements from a Two-Dimensional array


Accessing elements works like a matrix in a 2D array, i.e.

a[0,0] = row 1, element 1
a[0,1] = row 1, element 2
a[0,2] = row 1, element 3

a[1,0] = row 2, element 1
a[1,1] = row 2, element 2
a[1,2] = row 2, element 3

a[2,0] = row 3, element 1
a[2,1] = row 3, element 2
a[2,2] = row 3, element 3

Following are some examples to access specific elements from a 2D array:


Example: Accessing 1st dimension elements from a 2D array

import numpy as np

n = np.array([[1,3,5],[4,8,12]])

print(n[0,0])
print(n[0,1])
print(n[0,2])

Output

1
3
5

Example: Accessing 2nd dimension elements from a 2D array

import numpy as np

n = np.array([[1,3,5],[4,8,12]])
print(n[1,0])
print(n[1,1])
print(n[1,2])

Output
4
8
12

Access elements from a Three-Dimensional Array


Following are some examples to access specific elements from a 3D array:

Example1

import numpy as np

n = np.array([[[5,10,15],[20,25,30]],[[35,40,45],[50,55,60]]])
print(n[0,0,0])
print(n[0,0,1])
print(n[0,0,2])

Output

5
10
15

Example2

import numpy as np

n = np.array([[[5,10,15],[20,25,30]],[[35,40,45],[50,55,60]]])

print(n[1,0,0])
print(n[1,0,1])
print(n[1,0,2])

Output
35
40
45

Access the array from the last with Negative Indexing

Array elements can also be accessed with negative indexing, which counts from
the end of the array: index -1 refers to the last element.

Example 1: Access the last element from a 1D array with negative indexing

import numpy as np

n = np.array([5, 10, 15])


print('Last element = ', n[-1])
Output

Last element = 15
Example 2: Access the last element from a 2D array with negative indexing

import numpy as np

n = np.array([[1, 3, 5], [4, 8, 12]])

print('Last element = ', n[0, -1])

Output

Last element = 5

Let us see another example:

import numpy as np

n = np.array([[1, 3, 5], [4, 8, 12]])

print('Last element = ', n[1, -1])

Output

Last element = 12

Basic Slicing and Advanced Indexing in NumPy


Indexing a NumPy array means accessing the elements of the NumPy array at the
given index.
There are two types of indexing in NumPy: basic indexing and advanced
indexing.
Slicing a NumPy array means accessing the subset of the array. It
means extracting a range of elements from the data.
Indexing a NumPy Array
Indexing is used to extract individual elements from a one-dimensional array.
It can also be used to extract rows, columns, or planes in a multi-
dimensional NumPy array.
Example: Index in NumPy array

Element 23 21 55 65 23

Index 0 1 2 3 4

In the above example, we have highlighted the element “55” which is at index
“2”.
Indexing Using Index arrays
Indexing can be done in NumPy by using an array as an index.
NumPy arrays can be indexed with other arrays or any other sequence, with the
exception of tuples. The last element is indexed by -1, the second last by -2, and
so on.
In the case of slicing, a view (shallow copy) of the array is returned, but with an
index array, a copy of the original array is returned.
import numpy as np
# Create a sequence of integers from 10 to 1 with a step of -2
a = np.arange(10, 1, -2)
print("\n A sequential array with a negative step: \n", a)
# Indexes are specified inside the np.array method.
newarr = a[np.array([3, 1, 2])]
print("\n Elements at these indices are:\n", newarr)
Output :
A sequential array with a negative step:
[10 8 6 4 2]
Elements at these indices are:
[4 8 6]
Types of Indexing in NumPy Array
There are two types of indexing used in Python NumPy:
 Basic slicing and indexing
 Advanced indexing
o Purely integer indexing
o Boolean indexing
Basic Slicing and indexing
Basic slicing and indexing is used to access a specific element or range of
elements from a NumPy array.
Basic slicing and indexing only return a view of the array.
Consider the syntax x[obj] where “x” is the array and “obj” is the index. The
slice object is the index in the case of basic slicing.
Basic slicing occurs when obj is :
1. A slice object that is of the form start: stop: step
2. An integer
3. Or a tuple of slice objects and integers
All arrays generated by basic slicing are always views of the original array.
Example: Basic slicing in a NumPy array

import numpy as np
# Arrange elements from 0 to 19
a = np.arange(20)
print("\n Array is:\n ", a)
print("\n a[15]=", a[15])
# a[start:stop:step]
print("\n a[-8:17:1] = ", a[-8:17:1])
print("\n a[10:] = ", a[10:])

Output:

Array is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
a[15]= 15
a[-8:17:1] = [12 13 14 15 16]
a[10:] = [10 11 12 13 14 15 16 17 18 19]
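Slicing extends to multidimensional arrays, where a slice can be given per axis; a small sketch (the values are illustrative):

import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# rows 0-1 and columns 1-2; the result is a view of the original array
print(a[0:2, 1:3])

Output:

[[2 3]
 [6 7]]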

NumPy Boolean Indexing

In NumPy, boolean indexing allows us to filter elements from an array based on a


specific condition.

We use boolean masks to specify the condition.

Boolean Masks in NumPy

A boolean mask is a NumPy array containing truth values (True/False) that


correspond to each element in the array.

Suppose we have an array named array1.

array1 = np.array([12, 24, 16, 21, 32, 29, 7, 15])

Now let's create a mask that selects all elements of array1 that are greater than 20.

boolean_mask = array1 > 20

Here, array1 > 20 creates a boolean mask that evaluates to True for elements that
are greater than 20, and False for elements that are less than or equal to 20.
The resulting mask is an array stored in the boolean_mask variable as:

[False, True, False, True, True, True, False, False]

1D Boolean Indexing in NumPy

Boolean Indexing allows us to create a filtered subset of an array by passing a


boolean mask as an index.

The boolean mask selects only those elements in the array that have a True value
at the corresponding index position.
Let's create a boolean indexing of the boolean mask in the above example.

array1[boolean_mask]

This results in

[24, 21, 32, 29]

Now let's see another example.

We'll use the boolean indexing to select only the odd numbers from an array.

import numpy as np

# create an array of numbers


array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# create a boolean mask


boolean_mask = array1 % 2 != 0

# boolean indexing to filter the odd numbers


result = array1[boolean_mask]

print(result)

# Output: [ 1 3 5 7 9]

In this example, we have used boolean indexing to select only the odd numbers
from the array1 array.
Here, the expression array1 % 2 != 0 is a boolean mask. Elements of array1 that
meet the condition specified in the boolean mask (odd numbers) map to True, and
even numbers map to False.
With boolean indexing, a filtered array containing only the True-valued elements
is returned. Hence, we get an array with odd numbers.
import numpy as np

# create an array of integers


array1 = np.array([1, 2, 4, 9, 11, 16, 18, 22, 26, 31, 33, 47, 51, 52])

# create a boolean mask using combined logical operators


boolean_mask = (array1 < 10) | (array1 > 40)

# apply the boolean mask to the array
result = array1[boolean_mask]

print(result)

# Output: [ 1 2 4 9 47 51 52]
NumPy Fancy Indexing
In NumPy, fancy indexing allows us to use an array of indices to access multiple
array elements at once.
Fancy indexing can perform more advanced and efficient array operations,
including conditional filtering, sorting, and so on.
Example: NumPy Fancy Indexing
import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# select a single element


simple_indexing = array1[3]

print("Simple Indexing:",simple_indexing) # 4

# select multiple elements


fancy_indexing = array1[[1, 2, 5, 7]]

print("Fancy Indexing:",fancy_indexing) # [2 3 6 8]
output:
Simple Indexing: 4
Fancy Indexing: [2 3 6 8]
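As a small illustration of the sorting use case mentioned above, np.argsort returns the indices that would sort an array, and using them as a fancy index rearranges the elements:

import numpy as np

array1 = np.array([3, 1, 4, 1, 5])

# indices that would sort the array
order = np.argsort(array1)

print(order)          # [1 3 0 2 4]
print(array1[order])  # [1 1 3 4 5]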

Data processing using arrays

With the NumPy package, we can easily solve many kinds of data processing tasks
without writing complex loops. This helps keep our code simple as well as
improving the performance of the program. In this part, we introduce some
mathematical and statistical functions.

See the following table for a listing of mathematical and statistical functions:

sum: Calculate the sum of all the elements in an array or along the given axis.
    >>> a = np.array([[2, 4], [3, 5]])
    >>> np.sum(a, axis=0)
    array([5, 9])

prod: Compute the product of array elements over the given axis.
    >>> np.prod(a, axis=1)
    array([ 8, 15])

diff: Calculate the discrete difference along the given axis.
    >>> np.diff(a, axis=0)
    array([[1, 1]])

gradient: Return the gradient of an array.
    >>> np.gradient(a)
    [array([[1., 1.], [1., 1.]]),
     array([[2., 2.], [2., 2.]])]

cross: Return the cross product of two arrays.
    ...
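The statistical functions work the same way; a short, runnable illustration using the same array (outputs shown in comments):

import numpy as np

a = np.array([[2, 4], [3, 5]])

# statistical summaries over the whole array
print(np.sum(a))   # 14
print(np.mean(a))  # 3.5
print(np.std(a))   # 1.118033988749895

# or along an axis
print(np.mean(a, axis=0))  # [2.5 4.5]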
NumPy Array – Logical Operations
Logical operations are used to find the logical relation between two arrays or lists
or variables. We can perform logical operations using NumPy between two data.
Below are the various logical operations we can perform on Numpy arrays:
AND
The numpy module supports the logical_and operator, which is used to relate
two values. If both values are 0 (false), the output is False; if both are 1 (true),
the output is True; and if one is 0 and the other is 1, the output is False.
Syntax:
numpy.logical_and(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
Example 1:
This code gives demo on boolean operations with logical_and operator.

# importing numpy module

import numpy as np

# list 1 represents an array with boolean values

list1 = [True, False, True, False]

# list 2 represents an array with boolean values

list2 = [True, True, False, True]

# logical operations between boolean values

print('Operation between two lists = ',
np.logical_and(list1, list2))

Output:

Operation between two lists =  [ True False False False]

Example 2:

# importing numpy module

import numpy as np

# list 1 represents an array

# with integer values

list1 = [1, 2, 3, 4, 5, 0]

# list 2 represents an array

# with integer values

list2 = [0, 1, 2, 3, 4, 0]

# logical operations between integer values

print('Operation between two lists:',

np.logical_and(list1, list2))

Output:

Operation between two lists: [False  True  True  True  True False]

OR
The NumPy module supports the logical_or operator, which is also used to relate
two values. If both values are 0 (false), the output is False; if both are 1 (true),
the output is True; and if one is 0 and the other is 1, the output is True.
Syntax:
numpy.logical_or(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
Example:

# importing numpy module

import numpy as np

# logical operations between boolean values

print('logical_or operation = ',

np.logical_or(True, False))

a=2

b=6

print('logical or Operation between two variables = ',

np.logical_or(a, b))

a=0

b=0

print('logical or Operation between two variables = ',

np.logical_or(a, b))

# list 1 represents an array with integer values

list1 = [1, 2, 3, 4, 5, 0]

# list 2 represents an array with integer values

list2 = [0, 1, 2, 3, 4, 0]

# logical operations between integer values

print('Operation between two lists = ',

np.logical_or(list1, list2))

Output:

logical_or operation =  True
logical or Operation between two variables =  True
logical or Operation between two variables =  False
Operation between two lists =  [ True  True  True  True  True False]

NOT
The logical_not operation takes one value and converts it into its opposite truth
value. If the value is 0, the output is True; if the value is non-zero, the output is
False.
Syntax:
numpy.logical_not(var1)
Where, var1 is a single variable or a list/array.
Return type: Boolean value (True or False)
Example:
# importing numpy module

import numpy as np

# logical not operations for boolean value

print('logical_not operation = ',

np.logical_not(True))

a=2

b=6

print('logical_not Operation = ',

np.logical_not(a))

print('logical_not Operation = ',

np.logical_not(b))

# list 1 represents an array with integer values

list1 = [1, 2, 3, 4, 5, 0]

# logical operations between integer values

print('Operation in list = ',

np.logical_not(list1))

Output:

logical_not operation =  False
logical_not Operation =  False
logical_not Operation =  False
Operation in list =  [False False False False False  True]

XOR
The logical_xor operation performs the xor operation between two variables or
lists. If the two values have the same truth value, it returns False; otherwise, True.
Syntax:
numpy.logical_xor(var1,var2)
Where, var1 and var2 are a single variable or a list/array.
Return type: Boolean value (True or False)
Example:

# importing numpy module

import numpy as np

# logical operations between boolean values

print('Operation between true and true ( 1 and 1) = ',

np.logical_xor(True, True))

print('Operation between true and false ( 1 and 0) = ',

np.logical_xor(True, False))

print('Operation between false and true ( 0 and 1) = ',

np.logical_xor(False, True))

print('Operation between false and false (0 and 0)= ',

np.logical_xor(False, False))

# list 1 represents an array with boolean values

list1 = [True, False, True, False]

# list 2 represents an array with boolean values

list2 = [True, True, False, True]

# logical operations between boolean values

print('Operation between two lists = ',

np.logical_xor(list1, list2))

# list 1 represents an array

# with integer values

list1 = [1, 2, 3, 4, 5, 0]

# list 2 represents an array

# with integer values

list2 = [0, 1, 2, 3, 4, 0]

# logical operations between integer values

print('Operation between two lists = ',

np.logical_xor(list1, list2))

Output:

Operation between true and true ( 1 and 1) =  False
Operation between true and false ( 1 and 0) =  True
Operation between false and true ( 0 and 1) =  True
Operation between false and false (0 and 0)=  False
Operation between two lists =  [False  True  True  True]
Operation between two lists =  [ True False False False False False]

Python – Boolean Array in NumPy


In NumPy, boolean arrays are straightforward NumPy arrays whose
components are either “True” or “False.”
Note: 0 and None are considered False and everything else is considered True.
import numpy as np

arr = np.array([1, 0, 1, 0, 0, 1, 0])

print(f'Original Array: {arr}')

bool_arr = np.array(arr, dtype='bool')

print(f'Boolean Array: {bool_arr}')


Output:
Original Array: [1 0 1 0 0 1 0]
Boolean Array: [ True False True False False True False]
NumPy Sorting Arrays
Sorting Arrays

Sorting means putting elements in an ordered sequence.

Ordered sequence is any sequence that has an order corresponding to elements,


like numeric or alphabetical, ascending or descending.

NumPy provides a function called sort() that will return a sorted copy of a
specified array.

Example

Sort the array:

import numpy as np

arr = np.array([3, 2, 0, 1])

print(np.sort(arr))

output:

[0 1 2 3]
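sort() also works on 2-D arrays, sorting each row independently:

import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])

print(np.sort(arr))

output:

[[2 3 4]
 [0 1 5]]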

Numpy np.unique() method

With the help of np.unique() method, we can get the unique values from an array
given as parameter in np.unique() method.
Syntax : np.unique(Array)
Return : Return the unique of an array.
Example #1 :
In this example, we can see that by using the np.unique() method, we are able to
get the unique values from an array.

# import numpy

import numpy as np

a = [1, 2, 2, 4, 3, 6, 4, 8]

# using np.unique() method

gfg = np.unique(a)

print(gfg)

Output :
[1 2 3 4 6 8]
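np.unique() can also report how many times each unique value occurs by passing return_counts=True:

# import numpy
import numpy as np

a = [1, 2, 2, 4, 3, 6, 4, 8]

# unique values together with their occurrence counts
values, counts = np.unique(a, return_counts=True)

print(values)  # [1 2 3 4 6 8]
print(counts)  # [1 2 1 2 1 1]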
