FEATURES

o A feature is an attribute of a data set that is used in a machine learning process.


o Some machine learning practitioners hold the view that only those attributes which are meaningful to a machine learning problem should be called features.
o In fact, selecting the subset of features that are meaningful for machine learning is a sub-area of feature engineering.
o The features in a data set are also called its dimensions. So, a data set having ‘n’ features is called
an n-dimensional data set.
o Example: a small illustrative data set is sketched below.
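The following is a minimal sketch (the column names and values are hypothetical, not from the original material) showing that each column of a data set is a feature, so a data set with three columns is three-dimensional:

import pandas as pd

# Hypothetical data set with three features (columns)
df = pd.DataFrame({
    "age": [25, 32, 47],
    "height_cm": [170, 165, 180],
    "weight_kg": [68, 59, 82],
})

# Each column is a feature, so this is a 3-dimensional data set
print(df.shape[1], "features:", list(df.columns))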

Types of Feature
There are three distinct types of features: quantitative, ordinal, and categorical. We can also consider a
fourth type of feature — the Boolean — as this type does have a few distinct qualities, although it is
actually a type of categorical feature. These feature types can be ordered in terms of how much
information they convey. Quantitative features have the highest information capacity followed by ordinal,
categorical, and Boolean.
Let's take a look at the tabular analysis:

Feature type    Order   Scale   Tendency   Dispersion                                 Shape
Quantitative    Yes     Yes     Mean       Range, variance, and standard deviation    Skewness, kurtosis
Ordinal         Yes     No      Median     Quantiles                                  NA
Categorical     No      No      Mode       NA                                         NA
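As a quick illustration (a minimal sketch with hypothetical values), the statistics in the table can be computed for a quantitative feature using pandas:

import pandas as pd

# Hypothetical quantitative feature
x = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0])

print(x.mean())                 # central tendency: mean
print(x.max() - x.min())        # dispersion: range
print(x.var(), x.std())         # dispersion: variance, standard deviation
print(x.skew(), x.kurt())       # shape: skewness, kurtosis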
Feature Construction
o Feature construction involves transforming a given set of input features to generate a new set of
more powerful features.
o To understand more clearly, let’s take the example of a real estate data set having details of all
apartments sold in a specific region.
o The data set has three features –apartment length, apartment breadth, and price of the apartment.
o If this data set is used as input to a regression problem, it can serve as training data for the regression
model.
o So given the training data, the model should be able to predict the price of an apartment whose
price is not known or which has just come up for sale.
o However, instead of using the length and breadth of the apartment as predictors, it is much more
convenient and makes more sense to use the area of the apartment, which is not an existing feature of
the data set.
o So, we transform the three-dimensional data set into a four-dimensional data set, with the newly
'discovered' feature apartment area added to the original data set, as sketched below.
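A minimal sketch of this construction in Spark (the values are hypothetical; only the three column names come from the example above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical apartment records: length, breadth and price
apartments = spark.createDataFrame(
    [(40.0, 25.0, 250000.0), (60.0, 30.0, 410000.0)],
    ["length", "breadth", "price"])

# Construct the new feature 'area' from the two existing features
apartments = apartments.withColumn("area", F.col("length") * F.col("breadth"))
apartments.show()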

Feature Transformation
Feature transformation is simply a function that transforms features from one representation to another.
But why would we transform our features? Well, there are many reasons, such as:
1. Data types are not suitable to be fed into a machine learning algorithm, e.g. text, categories
2. Feature values may cause problems during the learning process, e.g. data represented in different
scales
3. We want to reduce the number of features to plot and visualize data, speed up training or improve
the accuracy of a specific model

In this article we will focus on three main transformation techniques:


o Handling categorical variables
o Feature scaling
o Principal Component Analysis
Handling Categorical Variables
Categorical Data
Categorical values are values that can be represented as categories or groups. They can be grouped into
two main types: Nominal and Ordinal.
o Nominal values are simply names or labels with no ordering defined. Example: gender, color.
o Ordinal values are categories where order does matter. Example: t-shirt size, rank, grade.

Most machine learning algorithms cannot handle data represented as categories or labels directly. Therefore,
we need to transform such values into a more suitable, numeric format.

Dataset
We will work with a very simple dataset so that we can focus entirely on the techniques we are going to learn.
The dataset is simply a CSV file with two fields: ID and Color.
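For reference, a hypothetical colors.csv (the exact contents are an assumption, chosen to be consistent with the four colors used later in this article) might look like:

id,color
1,red
2,white
3,orange
4,blue
5,red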
Setting up the Environment
For the rest of the article, the following steps are common to setup the development environment:
1. Open a Jupyter notebook
2. Import findspark and initialize it
3. Create a spark session
4. Load and show the data
Program:
import findspark
findspark.init('/opt/spark')

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Load the CSV; header=True reads the column names, inferSchema=True infers column types
data = spark.read.csv('./datasets/colors.csv', header=True, inferSchema=True)
data.show()

Data
Even though the data is very simple, we cannot work with the color column as it is, since it contains
categorical data.
In order to solve this problem, we will introduce two main methods and how to implement them in Spark ML:
StringIndexing and OneHotEncoding.
String Indexing
The concept behind String Indexing is very intuitive. We simply replace each category with a number. Then
we use this number in our models instead of the label.
Here is how we do it. First, we need to define a StringIndexer.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="color", outputCol="color_indexed")
Note that indexer here is an object of type Estimator.
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data.
Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model,
which is a Transformer.
The objective of an estimator here is to learn the mappings from a color label to a color index.
Next, we call the fit() method to initiate the learning process.
indexer_model = indexer.fit(data)
The returned indexer_model is an object of type Transformer.
A Transformer is an abstraction that includes feature transformers and learned models. It implements a
method transform(), which converts one DataFrame into another, generally by appending one or more
columns.
After fitting the estimator and getting our transformer, it is time to use it on our data by calling transform().
indexed_data = indexer_model.transform(data)
# to view the data
indexed_data.show()
Notice how a new column "color_indexed" is added, as specified in our outputCol parameter.

The new column represents an index for each color value. Identical color values get the same index. Here we
see that red, white, orange and blue were given the numbers 0, 1, 2 and 3 respectively.
These numbers will be the ones collected in the features vector with the VectorAssembler to be passed to
the machine learning algorithm.
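As a hedged sketch of that step (the output column name "features" here is an assumption):

from pyspark.ml.feature import VectorAssembler

# Collect the indexed color into a single vector column for the ML algorithm
assembler = VectorAssembler(inputCols=["color_indexed"], outputCol="features")
assembled_data = assembler.transform(indexed_data)
assembled_data.show()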
But wait! We still have a problem. A color is a nominal value, not an ordinal one. This means that there is no
order between the color names. For example: red is not greater than, less than, or equal to green. However,
based on the current representation, the machine learning model may somehow consider that there is an
order based on the values given. Don't worry, we will fix this with another technique called One Hot
Encoding.
One Hot Encoding
We use One Hot Encoding (OHE) to break the ordering within a categorical column. The process to apply
OHE is the following:
1. Break the categorical column into n different columns, where n is the number of unique
categories in the column
2. Assign a binary value (0 or 1) in each column that represents the existence of the color in the data
point
Going back to our example, we have four unique colors: red, white, orange and blue. Therefore, we need
four columns. We will name them: is_red, is_white, is_orange and is_blue. Now, instead of having a single
index value for the color red, we will put 1 in the is_red column and 0 in the others. Then, we will group
the values in an array to be used as the color feature instead of the single-value index calculated by
StringIndexer. See the table below to get a better idea:

color     is_red   is_white   is_orange   is_blue
red       1        0          0           0
white     0        1          0           0
orange    0        0          1           0
blue      0        0          0           1
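A hedged sketch of how this can be done with Spark ML's OneHotEncoder (assuming Spark 3.x, where OneHotEncoder is an estimator; note that Spark returns a compact vector column rather than separate is_* columns, and dropLast=False keeps a slot for every category):

from pyspark.ml.feature import OneHotEncoder

# dropLast=False keeps a slot for every category (four colors -> four slots)
encoder = OneHotEncoder(inputCols=["color_indexed"],
                        outputCols=["color_ohe"],
                        dropLast=False)

# fit() learns the number of categories; transform() appends the encoded column
encoder_model = encoder.fit(indexed_data)
encoded_data = encoder_model.transform(indexed_data)
encoded_data.show()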
