Machine Learning Cookbook with Python: Create ML and Data Analytics Projects Using Some Amazing Open Datasets (English Edition)
By Rehan Guha
CHAPTER 1
Boston Crime
Introduction
Everyone has heard that "Data¹ is the new oil," and data is freely available everywhere, from newspapers to Twitter and beyond. But just like crude oil, raw data has no value by itself, so we will be using different techniques to make sense of the data and gain some information² out of it.
Let us start with the basics, like the O.S.E.M.N. framework, E.D.A., and some visualization techniques. This chapter will mostly cover the basics of Data Exploration and how to implement some data cleaning techniques as well.
This chapter is extremely important as it is the introduction to Machine Learning and Data Science. The reader will gain the skills and the confidence to play with the data and draw various insights out of it.
Structure
Types of data
Let’s talk about the Boston dataset
O.S.E.M.N. framework
Objective
In this chapter, we will mostly look into the concept of data analysis and see some techniques to clean the data. By the end of the chapter, the reader should be able to process the data and gain good insights into the data we are using.
What is Data?
As per the dictionary definition, data is the facts and statistics collected together for reference or analysis.
But we will define Data as a set of values of qualitative or quantitative variables about one or more subjects. Data and information or knowledge are often used interchangeably; however, data becomes information when it is viewed in context or post-analysis³.
Types of Data
Structured data - Structured data is generally stored in tabular form and can be kept in a relational database. It can be names, phone numbers, locations, or other metrics like distance, loan amount, etc., and we can generally query such a relational table with SQL.
Semi-structured data - Semi-structured data is similar to structured data, but it does not follow the conventional relational table structure. Files like XML, JSON, etc. are examples of semi-structured data.
Unstructured data - As the name suggests, unstructured data follows no formal structure or relational table, e.g., free text, tweets from Twitter, media (audio, video), etc.
These are some of the building blocks of data types⁴ which are used in machine learning.
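To make the distinction concrete, here is a small illustrative contrast in Python (the field names and values below are made up purely for illustration):

```python
# Structured: fixed columns, fits naturally into a relational table or DataFrame.
structured_row = {"name": "John Doe", "phone": "617-555-0100", "loan_amount": 12000}

# Semi-structured: JSON-like, fields can nest and vary from record to record.
semi_structured_record = {
    "name": "John Doe",
    "contacts": {"phone": "617-555-0100", "twitter": "@johndoe"},
    "notes": ["repeat customer"],
}

# Unstructured: free text with no fixed schema at all.
unstructured_text = "Customer called about a loan; follow up next week."
```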
For this chapter, we will use structured data taken from the Boston Police Department.
Let’s talk about the Boston dataset
Boston crime dataset⁵ is a collection of crime incident reports that are provided by the Boston Police Department (BPD). This dataset documents the initial details surrounding an incident to which BPD officers respond. This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. Records in the new system begin from June of 2015.
So, the first thing we should do is get to know the different features in the dataset.
Data Dictionary
Any standard dataset will contain a data dictionary. As per definition, a data dictionary is a set of information describing the contents, format, and structure of a database and the relationship between its elements, used to control access to and manipulation of the database.
As mentioned before, we are using structured data, so this can be stored in a relational database. Table 1.1 shows the data dictionary with all the details.
Table 1.1: Data Dictionary
For any dataset, we need to know about the data and analyze each feature and its contribution.
The best way to know more about the data is to get your hands dirty.
Let’s start with Python and Jupyter Notebook to explore the dataset.
First, we need to set up the machine and install the required packages to get things going. Please refer to the GitHub link:
https://github.com/bpbpublications/Machine-Learning-Cookbook-with-Python.
The entire code can be compiled and executed online, just with a web browser using Binder. Please use the above GitHub link to find more details about it.
O.S.E.M.N. framework
All Machine Learning and Data Science projects follow a basic framework named O.S.E.M.N. (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting)⁶, and within this framework, Data Fetching, Data Cleaning, and Data Exploring take up about 60% of the pipeline.
What is Data Obtaining?
In this chapter, we are using data from the Boston Police Department repository, and downloading it is considered Data Obtaining/Fetching. There can be cases where we need to scrape⁷ the data from websites, media files, log files, etc. All the steps required to gather the data are considered Data Obtaining/Fetching.
After downloading the data from the given URL, we need to load the dataset using pandas and store it in a DataFrame (Figure 1.1) before we can clean it for exploration.
Figure 1.1: Data loading
We are using a DataFrame from pandas to store the Boston Crime Dataset. pandas is a popular library used for Machine Learning and Data Science.
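If you want to follow this step locally, the loading shown in Figure 1.1 comes down to a single pandas call. Here is a minimal sketch, assuming the downloaded file is saved as crime.csv (the file name and the Latin-1 encoding are assumptions; adjust them to match your copy of the BPD export):

```python
import pandas as pd

# Load the Boston crime incident reports into a DataFrame (cf. Figure 1.1).
# "crime.csv" and the encoding are assumptions about the downloaded file.
df = pd.read_csv("crime.csv", encoding="latin-1")

print(df.shape)  # (number of incident records, number of features)
```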
What is Data Scrubbing?
As the name suggests, Data Scrubbing⁸ is the process of cleaning the data so that it is fit for use in the next stage, that is, Data Exploration and Analysis. Data Scrubbing, also known as Data Cleaning, takes up the maximum time during the process of Data Analysis. During this phase, we will mostly focus on handling incorrect data, missing values, and errors related to the data structures.
Finding Data Types
Depending on the data type, different data cleaning techniques can be applied. And it’s not just cleaning; we need to scrub the data logically to reduce ambiguity. Let’s find the data type for every column in the dataset, as shown in Figure 1.2.
Figure 1.2: Data types of all the features
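The dtype report in Figure 1.2 can be reproduced with the DataFrame's dtypes attribute; a minimal sketch, assuming df is the DataFrame loaded earlier:

```python
# Inspect the inferred data type of every column (cf. Figure 1.2).
print(df.dtypes)

# Optionally, count how many columns fall under each dtype.
print(df.dtypes.value_counts())
```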
As you can see, there are multiple data types, so this is a non-homogeneous dataset.
The different data types⁹ in the Boston Crime dataset are as follows:
int64
float64
object
Referring to the Data Dictionary of the dataset, we can see how the features relate to the real world.
In Figure 1.3, we can see what the data looks like and why each feature has the given data type.
Figure 1.3: Sample data
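To reproduce a view like Figure 1.3 yourself, DataFrame.head() is enough; a short sketch, again assuming df from the loading step:

```python
# Show the first few records to see how each feature actually looks (cf. Figure 1.3).
print(df.head())

# With many columns, the transposed view is often easier to read.
print(df.head(3).T)
```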
Comparing the sample with Figure 1.2, we can see both the types and the structure of the data. Looking at the data this way gives us an idea of how to clean it.
How to Handle Missing Data?
Handling missing values improves the quality of the data we are using and, in turn, yields a more accurate analysis.
To handle missing data, we generally ask questions as given below:
Are there any missing values?
Are the missing values significant enough to be worth handling?
We will answer the above set of questions in due time.
So, let us first see all the missing values of all the features of the Boston Crime Dataset (Figure 1.4).
Figure 1.4: Missing value count for all columns
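The per-column counts in Figure 1.4 come from combining isnull() with sum(); a minimal sketch, assuming df:

```python
# Count missing values per column, largest first (cf. Figure 1.4).
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])
```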
Let us consider the variables SHOOTING and STREET and find the count of missing values.
Figure 1.5: Frequency table for SHOOTING
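A frequency table like Figure 1.5 can be produced with value_counts(); passing dropna=False keeps the missing entries visible alongside the observed values. A sketch, assuming the column is named SHOOTING as in the data dictionary:

```python
# Frequency table for SHOOTING, including the NaN entries (cf. Figure 1.5).
print(df["SHOOTING"].value_counts(dropna=False))
```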
As we can see from the details of the feature and the data, SHOOTING is a Boolean field (Figure 1.5), so its values can only be True or False: either a shooting took place, or it did not. As per the missing value report, around 1,723 records have the value Yes/True, and all the remaining records are missing. The good thing about a Boolean feature is that a missing entry must still be either True or False. In the case of SHOOTING, the missing values will be No/False, as we already have explicit values for Yes/True.
Now that we know how to tackle the SHOOTING feature, let us proceed with it.
You can see in Figure 1.6 that we have first replaced all the missing values with N, as we concluded earlier that, in this case, a NaN means the same as N.
Figure 1.6: Code for replacing the missing values
After confirming that there are no missing values anymore, we then replaced the Y and N with True and False, respectively, as we know that to represent a Boolean value, we either need 1/0 or True/False.
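The two steps described above can be sketched in a few lines of pandas; treat this as an illustration of the idea rather than the exact code shown in Figure 1.6:

```python
# Step 1: a missing SHOOTING entry is treated as "no shooting took place".
df["SHOOTING"] = df["SHOOTING"].fillna("N")
assert df["SHOOTING"].isnull().sum() == 0  # confirm nothing is missing anymore

# Step 2: convert the Y/N codes into proper booleans.
df["SHOOTING"] = df["SHOOTING"].map({"Y": True, "N": False})
print(df["SHOOTING"].value_counts())
```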
Figure 1.7: Frequency table for STREET
But if we look at the other feature, i.e., STREET (Figure 1.7), we can see that it is not a binary value, so figuring out the right missing value is next to impossible. In such cases, we need a different strategy to tackle the problem. Finding a solution for each column individually can be fruitful, but here we will take a general approach to solving the missing values.
Similarly, we need to find the missing values for all the columns and handle them in a generic pattern.
As we can see from the above figure (Figure 1.4), several features have missing values, each with its respective missing count.
Now we have a list of the features that have missing values, along with their respective dtypes. Looking at the data and its contents, we can design a strategy to fill the missing values.
In this chapter, we will discuss a few techniques for handling missing values. While designing the strategy, we need to take care of all the data types, like numeric, string/object, etc., and fill them logically.
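One way to express such a generic, dtype-aware strategy is sketched below: numeric columns fall back to their median, and string/object columns get an explicit placeholder label. This is only an illustration of the idea, not the final recipe, and we will see next why some columns deserve special treatment.

```python
import pandas as pd

# A generic, dtype-aware filling strategy (an illustrative sketch, not a rule).
for col in df.columns:
    if not df[col].isnull().any():
        continue  # nothing to fill in this column
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric feature: fall back to the median of the observed values.
        df[col] = df[col].fillna(df[col].median())
    else:
        # String/object feature: mark the value as explicitly unknown.
        df[col] = df[col].fillna("UNKNOWN")
```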
Let us take a look at another unique column and a unique way to handle its missing values. How should we fill the missing values for the columns Lat and Long? (refer to Figure 1.2 and Figure 1.3).
One possible strategy is to fill the missing numeric values with the mean or median of that feature column. Still, in this case, Lat and Long are coordinates, so filling them with the mean of the Lat and Long columns, respectively, may give a location that is not in Boston, and it would not be accurate to do that. Filling them with the median value will not give us a wrong location, but it will increase the rate of the Crime Incident by 27,378 for that median