Machine Learning Cookbook with Python: Create ML and Data Analytics Projects Using Some Amazing Open Datasets (English Edition)
Ebook · 425 pages · 3 hours


About this ebook

Machine Learning does not have to be intimidating at all. This book focuses on the concepts of Machine Learning and Data Analytics with mathematical explanations and programming examples. All the code is written in Python, as it is one of the most popular programming languages used for Data Science and Machine Learning. I have leveraged multiple libraries like NumPy, pandas, and scikit-learn to ease our task and avoid reinventing the wheel. There are five projects in total, each addressing a unique problem. With the recipes in this cookbook, you will learn how to solve Machine Learning problems for real-time data and perform Data Analysis and Analytics, Classification, and beyond. The datasets used are also unique and will help you think, understand the problem, and proceed toward the goal. The book is not saturated with mathematics, but the mathematical concepts behind the important topics are covered. Every chapter typically starts with some theory and prerequisites, and then gradually dives into implementing the same concept in Python, keeping a project in the background.
Language: English
Release date: Nov 9, 2020
ISBN: 9789389845983


    Book preview

    Machine Learning Cookbook with Python - Rehan Guha

    CHAPTER 1

    Boston Crime

    Introduction

    Everyone has heard that "Data¹ is the new oil," and data is freely available everywhere, from newspapers to Twitter. But just like crude oil, raw data has no value by itself, so we will be using different techniques to make sense of the data and gain some information² out of it.

    Let us start with basics like the O.S.E.M.N. framework, E.D.A., and some visualization techniques. This chapter will mostly cover the basics of Data Exploration, along with how to implement some data cleaning techniques.

    This chapter is extremely important, as it is the introduction to Machine Learning and Data Science. Readers will gain the skills and the confidence to play with data and draw various insights out of it.

    Structure

    Types of data

    Let’s talk about the Boston dataset

    O.S.E.M.N. framework

    Objective

    In this chapter, we will mostly look into the concepts of data analysis and see some techniques to clean the data. By the end of the chapter, the reader should be able to process the data and have great insight into the data we are using.

    What is Data?

    As per the dictionary definition, data is facts and statistics collected together for reference or analysis. But we will define data as a set of values of qualitative or quantitative variables about subjects. Data and information (or knowledge) are often used interchangeably; however, data only becomes information when it is viewed in context or after analysis³.

    Types of Data

    Structured data - Structured data is generally stored in tabular form and can be kept in a relational database. It can be names, phone numbers, and locations, or other metrics like distance, loan amount, etc., and we can generally query the relational table with SQL.

    Semi-structured data - Semi-structured data is similar to structured data, but it does not follow the conventional relational table structure. Files like XML, JSON, etc. are examples of semi-structured data.

    Unstructured data - As the name suggests, unstructured data follows no formal structure or relational table, e.g., texts, tweets from Twitter, and media (audio, video, etc.).

    These are some of the building blocks of the data types⁴ used in machine learning.

    For this chapter, we will use structured data taken from the Boston Police Department.

    Let’s talk about the Boston dataset

    The Boston crime dataset⁵ is a collection of crime incident reports provided by the Boston Police Department (BPD). It documents the initial details surrounding incidents to which BPD officers respond. The dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. Records in the new system begin in June of 2015.

    So, the first thing we should do is get to know the different features in the dataset.

    Data Dictionary

    Any standard dataset will contain a data dictionary. As per definition, a data dictionary is a set of information describing the contents, format, and structure of a database and the relationship between its elements, used to control access to and manipulation of the database. As mentioned before, we are using structured data, so this can be stored in a relational database. Table 1.1 shows the data dictionary with all the details.

    Table 1.1: Data Dictionary

    For any dataset, we need to know about the data and analyze each feature and its contribution.

    The best way to know more about the data is to get your hands dirty.

    Let’s start with Python and Jupyter Notebook to explore the dataset.

    First, we need to set up the machine and install the required packages to get things going. Please refer to the GitHub link:

    https://github.com/bpbpublications/Machine-Learning-Cookbook-with-Python.

    The entire code can be executed online, with just a web browser, using Binder. Please use the above GitHub link to find more details about it.

    O.S.E.M.N. framework

    All Machine Learning and Data Science projects follow a basic framework named O.S.E.M.N. (Obtaining, Scrubbing, Exploring, Modeling, INterpreting)⁶, and within this framework, Data Fetching, Data Cleaning, and Data Exploring take up around 60% of the pipeline.

    What is Data Obtaining?

    In this chapter, we are using data from the Boston Police Department repository, and downloading from it is considered Data Obtaining/Fetching. There can be cases where we need to scrape⁷ the data from websites, media files, log files, etc. All the steps required to gather the data are considered Data Obtaining/Fetching.

    After downloading the data from the given URL, we need to load the dataset using pandas and store it in a DataFrame (Figure 1.1) so that we can start cleaning the data for exploration.

    Figure 1.1: Data loading

    We are using a DataFrame from pandas to store the Boston Crime Dataset. pandas is a popular library used for Machine Learning and Data Science.
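
    Figure 1.1 shows the loading step; a minimal sketch is given below (the file name crime.csv is an illustrative assumption, so use the name of the file you actually downloaded):

        import pandas as pd

        # Load the Boston crime incident reports into a DataFrame
        # ("crime.csv" is a placeholder for the downloaded file)
        df = pd.read_csv("crime.csv")

        # Number of records and features
        print(df.shape)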

    What is Data Scrubbing?

    As the name suggests, Data Scrubbing⁸ is the process of cleaning the data so that it is fit for use in the next stage, Data Exploration and Analysis. Data Scrubbing, also known as Data Cleaning, takes up the most time in the process of Data Analysis. During this phase, we will mostly focus on handling incorrect data, missing values, and errors related to the data structures.

    Finding Data Types

    Depending on the data type, different data cleaning techniques can be applied. And it’s not just cleaning; we need to scrub the data logically to reduce ambiguity. Let’s find the data type for every column in the dataset, as shown in Figure 1.2.

    Figure 1.2: Data types of all the features
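
    The dtype listing in Figure 1.2 can be reproduced with a one-liner, assuming df holds the DataFrame loaded earlier:

        # Inspect the data type pandas inferred for each column
        print(df.dtypes)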

    As you can see, there are multiple data types here, so this is a non-homogeneous dataset.

    The different data types⁹ in the Boston Crime dataset are as follows:

    int64

    float64

    object

    Referring to the Data Dictionary of the dataset, we can see how the features correspond to the real world.

    In Figure 1.3, we can see what the data looks like and why each column has its given data type.

    Figure 1.3: Sample data
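
    The sample in Figure 1.3 corresponds to peeking at the first few records, for example:

        # Show the first few rows to eyeball the values and their types
        print(df.head())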

    Comparing the data with Figure 1.2, we can see the types and the structure of the data. Looking at the data gives us information about how to clean it.

    How to Handle Missing Data?

    Handling missing values in the data will improve the quality of the data we are using and, in turn, yield a more accurate analysis.

    To handle missing data, we generally ask questions as given below:

    Are there any missing values?

    Are the missing values significant enough to be worth handling?

    We will answer the above set of questions in due time.

    So, let us first see all the missing values of all the features of the Boston Crime Dataset (Figure 1.4).

    Figure 1.4: Missing value count for all columns
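
    The per-column counts in Figure 1.4 come from a sketch like the following:

        # Count the missing values in every column
        print(df.isnull().sum())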

    Let us consider the variables SHOOTING and STREET and find the count of missing values.

    Figure 1.5: Frequency table for SHOOTING
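
    A frequency table like the one in Figure 1.5 can be produced with value_counts; passing dropna=False keeps the missing entries visible:

        # Frequency table for SHOOTING, including the missing (NaN) entries
        print(df["SHOOTING"].value_counts(dropna=False))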

    From the details of the features and the data, we can see that SHOOTING is a Boolean field (Figure 1.5), so its values can only be True or False: either a shooting took place, or it did not. As per the missing value report, around 1723 records have True/Yes, and all the remaining records are missing. But the good thing about a Boolean feature is that if a value is missing, it must be either True or False. In the case of SHOOTING, the missing values must be No/False, as we already have explicit values for Yes/True.

    As we now know how to tackle the SHOOTING feature, let us proceed with it.

    You can see in Figure 1.6 that we first replaced all the missing values with N, as we concluded above that in this case, NaN means the same as N.

    Figure 1.6: Code for replacing the missing values

    After confirming that there are no missing values anymore, we then replaced Y and N with True and False, respectively, since a Boolean value is represented as either 1/0 or True/False.
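
    A sketch of this two-step replacement is given below; the exact code in Figure 1.6 may differ, but the logic is the same:

        # Step 1: fill the missing SHOOTING entries with "N",
        # since a missing value here means no shooting took place
        df["SHOOTING"] = df["SHOOTING"].fillna("N")

        # Confirm that no missing values remain
        assert df["SHOOTING"].isnull().sum() == 0

        # Step 2: encode the column as a proper Boolean
        df["SHOOTING"] = df["SHOOTING"].map({"Y": True, "N": False})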

    Figure 1.7: Frequency table for STREET

    But if we look at the other feature, STREET (Figure 1.7), we can see that it is not a binary value, so figuring out the right missing value is next to impossible. In such cases, we need a different strategy to tackle the problem. Finding a solution for each column individually can be fruitful, but here we will take a general approach to solving the missing values.

    Similarly, we need to find the missing values for all the columns and handle them in a generic pattern.

    As we can see from the above figure (Figure 1.4), several features have missing values, each with its respective missing count.

    Now we have a list of the features which have missing values, along with their respective dtypes. Looking at the data and its contents, we can design a strategy to fill the missing values.

    In this chapter, we will discuss a few techniques for handling missing values. While designing the strategy, we need to take care of all the data types, like numeric, string/object, etc., and fill them logically, as in the sketch below.
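
    One generic, dtype-aware pattern is sketched below; the fill values are illustrative assumptions, and, as the Lat and Long discussion that follows shows, numeric columns still deserve case-by-case judgment:

        # Illustrative dtype-aware fill strategy (not the only option)
        for col in df.columns[df.isnull().any()]:
            if df[col].dtype == object:
                # Placeholder category for string/object columns
                df[col] = df[col].fillna("UNKNOWN")
            else:
                # Central value for numeric columns
                df[col] = df[col].fillna(df[col].median())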

    Let us take a look at another unique column and a unique way to handle its missing values. How should we fill the missing values for the columns Lat and Long? (refer to Figure 1.2 and Figure 1.3).

    Suppose we fill the missing numeric values with the mean or median of that feature column; that can be one of the strategies. Still, in this case, we see that Lat and Long are coordinates, and therefore, filling them with the mean value of the Lat and Long columns respectively may give a location which is not in Boston, so it won't be accurate to perform that action. Filling them with the median value won't give us any wrong location, but it will increase the rate of the Crime Incident by 27,378 for that median
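
    Before committing to either choice, it helps to inspect what each candidate fill value would actually insert, for example:

        # Compare the candidate fill values for the coordinate columns
        print("mean:  ", df["Lat"].mean(), df["Long"].mean())
        print("median:", df["Lat"].median(), df["Long"].median())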
