03_Building_Your_First_Dataset.ipynb - Colab


Learning Objectives

By the end of this chapter, you should be able to:

1. Load and manipulate tensors in PyTorch, sending them to different devices

2. Perform basic preprocessing on your data, such as standardizing continuous attributes

3. Build a dataset out of tensors

4. Split your dataset into mini-batches using data loaders

Tensors, Devices & CUDA


In Deep Learning, we see tensors everywhere. But, what is a Tensor, anyway?

Before answering this (in the context of deep learning models), let's take a step back and learn
the difference between scalars, vectors, and multi-dimensional arrays such as matrices. Since
we'll be using tabular data to train our first model, let's draw analogies from a spreadsheet.

Scalars
A single value is called a scalar.

Let's create a scalar in PyTorch:

import torch
scalar = torch.tensor(18)
scalar

tensor(18)

Vectors
A list or one-dimensional array of values, like a single column in a spreadsheet, is called a
vector.


Let's create a vector in PyTorch:

vector = torch.tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
vector

tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

Matrices
A two-dimensional array of values, like a table in a spreadsheet, is called a matrix.

Let's create a matrix in PyTorch (we'll use just two columns, mpg and horsepower, to keep it
simple):

matrix = torch.tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
                       [130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
matrix

tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])


Of course, you'll never have to type in the values from a spreadsheet. We'll conveniently load the
values directly from the file, first using pandas, and then using PyTorch's own data pipes. We'll
get back to it in the "Datasets" section.

Tensors
A three-dimensional array of values, like a collection of spreadsheets, each containing data for a
given month, is called a tensor.

From then on, be it four or forty-two dimensions, a multi-dimensional array is called a tensor. So,
technically speaking, if an array has three or more dimensions, it is a tensor.

You can easily create tensors in PyTorch using the tensor() method to create either a scalar or a
tensor, as we've been doing in the examples above. Moreover, there are methods to create
tensors filled with ones, zeros, or random numbers: ones(), zeros(), rand(), and randn(), to name
a few.

matrix_of_ones = torch.ones((2, 3), dtype=torch.float)


random_tensor = torch.randn((2, 3, 4), dtype=torch.float)
matrix_of_ones, random_tensor

(tensor([[1., 1., 1.],


[1., 1., 1.]]),
tensor([[[ 0.5540, -0.5200, 0.2645, 0.0977],
[ 1.1350, 1.0698, -0.9284, -0.6075],
[ 0.0084, 0.0668, 0.7904, 0.5460]],

[[-0.8818, 0.8938, 0.0853, -0.4218],


[ 0.8283, 1.0426, 1.5458, 1.3809],
[ 0.6804, -0.5616, -0.0655, -1.6407]]]))

You can get the shape of a tensor using its shape attribute, but PyTorch also implements a
size() method that accomplishes the same thing.


vector.shape, vector.size()

(torch.Size([14]), torch.Size([14]))

As expected, the shape of a scalar is an empty list since scalars are dimensionless (zero
dimensions).

matrix.size(), scalar.size()

(torch.Size([2, 14]), torch.Size([]))

While scalars are single numbers, thus having zero dimensions, one- and two-dimensional
arrays are called vectors and matrices, respectively, as we've seen in the examples above. But, in
order to make matters simple, it is commonplace to refer to any array with one or more
dimensions as a tensor.

In summary, everything is either a scalar or a tensor. There are tensors for data, and tensors for
parameters. Right now, we're dealing with the former, and we'll move on to the latter in the next
chapter.

Numpy
NumPy brings the computational power of languages like C and Fortran to Python, a language
much easier to learn and use. Thanks to its performance, Numpy sits at the core of many
machine and deep learning libraries such as Scikit-Learn, Scipy, Pandas, and Matplotlib. For this
reason, it is fairly common to load tabular data from other sources, such as CSV or Excel files,
into a collection of Numpy arrays. Even when dealing with images, pixel values are often stored
inside Numpy arrays.

PyTorch tensors and Numpy arrays have a lot in common. You may create Numpy arrays using
its identically-named methods such as zeros(), ones(), rand(), and randn(), for example.

Moreover, you can easily switch between the two of them, arrays and tensors, using PyTorch's
numpy() and as_tensor() methods. The former converts a PyTorch tensor into a Numpy array,
while the latter creates a PyTorch tensor out of a Numpy array. Let's see them in action.

Numpy array:

import numpy as np
numpy_array = vector.numpy()
numpy_array

array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

PyTorch tensor:

back_to_tensor = torch.as_tensor(numpy_array)
back_to_tensor

tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])

There's one caveat, though: only "CPU" tensors can be converted into Numpy arrays. Every
tensor we created thus far is, by default, a "CPU" tensor. We'll learn about different types of
tensors shortly, in the "Devices" section.

Reshaping Tensors
One of the most common operations you'll need to perform is to reshape a tensor into a
different, well, shape!

One typical case, especially in computer vision, is to convert a multi-dimensional tensor
representing features into a single sequence of features. The figure below illustrates this:

There are two data points, and their corresponding features are organized in a two-by-three
shape in the tensor at the top. In order to use these features to train a linear or a logistic
regression, however, you'd need to have the features lined up instead. The flattened tensor at the
bottom represents this, the flattened version of both tensors.
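
As a minimal sketch of what the figure describes (with made-up values), each data point's
two-by-three block of features can be lined up into a single row of six:

features = torch.arange(12).view(2, 2, 3)  # two data points, each with a 2x3 block of features
flattened = features.view(2, -1)           # the same features lined up: one row of 6 per data point
features.shape, flattened.shape

(torch.Size([2, 2, 3]), torch.Size([2, 6]))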


Although the operation itself is quite simple, there are a few pitfalls you need to avoid while
reshaping your tensors. Let's go over a few examples.

Reshaping Tensors: Avoiding Copies


In PyTorch, you can reshape a tensor using its view() or reshape() methods. The latter may or
may not create a copy, so the former is preferred, since it doesn't make copies of the data.

original_tensor = torch.ones((2, 3), dtype=torch.float)


reshaped_tensor = original_tensor.view(1, 6)
original_tensor, reshaped_tensor

(tensor([[1., 1., 1.],


[1., 1., 1.]]),
tensor([[1., 1., 1., 1., 1., 1.]]))

Reshaping Tensors: Sharing Underlying Data


The view() method only returns a tensor with the desired shape that happens to share the
underlying data with the original tensor. It does not create a new, independent, tensor. This
means that, if you make changes to one of the two tensors, the original, or the reshaped one,
these changes will be reflected in both of them.

original_tensor[0, 1] = 2
original_tensor, reshaped_tensor


(tensor([[1., 2., 1.],


[1., 1., 1.]]),
tensor([[1., 2., 1., 1., 1., 1.]]))

Moreover, if you created your tensor from a Numpy array, the two of them, array and tensor, are
also sharing the underlying data.

numpy_array[-1] = 1000
numpy_array, vector

(array([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15,
14, 15, 1000]),
tensor([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14,
15, 1000]))

In order to effectively duplicate the data and create a new, independent, tensor, you can use the
clone() method instead.

cloned_tensor = original_tensor.clone()
cloned_tensor

tensor([[1., 2., 1.],


[1., 1., 1.]])

Now, if you make changes to the original tensor, they won't be reflected in the new tensor
anymore.

original_tensor[0, 1] = 3
original_tensor, cloned_tensor

(tensor([[1., 3., 1.],


[1., 1., 1.]]),
tensor([[1., 2., 1.],
[1., 1., 1.]]))

Reshaping Tensors: Contiguous Tensors


The view() method is a convenient way of reshaping a tensor, but it may fail if the underlying
tensor is not contiguous in memory.

transposed_tensor = original_tensor.t()
transposed_tensor.view(1, 6)


---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-40-13ca45c67fa7> in <cell line: 2>()
1 transposed_tensor = original_tensor.t()
----> 2 transposed_tensor.view(1, 6)

RuntimeError: view size is not compatible with input tensor's size and stride (at
least one dimension spans across two contiguous subspaces). Use .reshape(...)
instead.

Remember, view() never makes copies of the data, while reshape() copies it if needed, so the
latter will always work, even if the tensor is not contiguous.

But, what does it mean to be contiguous? Simply put, it means two elements in the same row
must be next to each other in memory. This is always the case whenever a tensor is created
(like our original_tensor), but once we transpose it, we're not actually changing its allocation in
memory. Transposing, in this case, means traversing it differently, that is, jumping to a different
position in memory.

Contiguous vs Non-Contiguous Tensors

We can see the "rules" for moving to the next row or column by checking the tensor's stride
method:

original_tensor.stride(), transposed_tensor.stride()

((3, 1), (1, 3))

In the original tensor, the stride is telling us that we need to skip three positions in memory to
get to the next row, while only one position for the next column. But, in the transposed tensor, it
is the other way around: we need to skip three positions to get to the next column.


If we need to skip two or more positions to get to the next column, it means our tensor is not
contiguous anymore. Let's check it out:

transposed_tensor.is_contiguous(), original_tensor.is_contiguous()

(False, True)

Transposed tensor:

transposed_tensor

tensor([[1., 1.],
[3., 1.],
[1., 1.]])

Luckily, you can simply call the contiguous() method, and PyTorch will rearrange the data in
memory in such a way that it can be traversed in its typical fashion (a stride of one in the last
dimension). If the underlying data happens to be contiguous already, this is a zero-cost
operation.

transposed_tensor.contiguous().view(1, 6)

tensor([[1., 1., 3., 1., 1., 1.]])

Making a Tensor Contiguous

Finally, it is also possible to use the flatten() method instead, in case you're trying to make your
tensor one-dimensional.

transposed_tensor.flatten()


tensor([1., 1., 3., 1., 1., 1.])

Don't worry much about memory allocation, though. The purpose of this section was to make
you aware of, and capable of addressing, the error message above, should you ever run into it.

Named Tensors

Named tensors are a long-awaited feature, even if they're still a prototype feature. Many, if not
most, implementation bugs - even worse, the silent kind of bug - in deep learning models arise
from the fact that the wrong dimensions are being used in a given operation.

You may be wondering how it is possible that such a serious bug can be a silent one, that is,
one that does not raise an exception and crash the application.

In many cases, broadcasting is to blame. Broadcasting is both a blessing and a curse. While it
makes it extremely easy to perform operations using tensors of different shapes without the
need to explicitly replicate data along some dimension, it may also give you the illusion that your
operation is the right one, even when it's not, because you messed up the dimensions.

Named Tensors: Broadcasting


Broadcasting happens whenever you're trying to perform an operation on tensors of different
shapes. For example, if you try to multiply a one-dimensional tensor by a scalar (zero
dimensions):

a = np.array([1.0, 2.0, 3.0])


b = 2.0
a * b

array([2., 4., 6.])

You've probably done similar operations many times without giving a second thought to why it
works so seamlessly. As it turns out, you have broadcasting to thank for this behavior. Under the
hood, PyTorch (or Numpy) will "stretch" the variable b so its shape matches that of variable a,
thus allowing the desired element-wise multiplication.


Moreover, it is actually more efficient to use broadcasting like that than building a tensor full of
2.0s to match the shapes!

Let's go over an example:

mat1 = torch.ones((3, 3))


mat2 = torch.tensor([[1, 2, 3]])
mat1, mat2

(tensor([[1., 1., 1.],


[1., 1., 1.],
[1., 1., 1.]]),
tensor([[1, 2, 3]]))

What if we'd like to perform an element-wise multiplication? Broadcasting has us covered: it will
"understand" that mat2 was "meant" to be 3x3 instead.

mat1 * mat2

tensor([[1., 2., 3.],


[1., 2., 3.],
[1., 2., 3.]])

Broadcasting works by comparing dimensions of both tensors from right to left, and it will
"match" them if they are equal or one of them is one (so that particular value will be replicated
along that dimension). In the example above, these are the dimensions:

mat1.size(), mat2.size()

(torch.Size([3, 3]), torch.Size([1, 3]))

The right-most dimension is 3 for both tensors, so it is matched. Moving to the left, the first
dimension of one tensor is 1, so it is also matched. There we go, broadcasting can work its
magic! But, beware, if you were to transpose the second tensor (mat2) by mistake, broadcasting
still works!


mat2_wrong_shape = mat2.t()
mat1 * mat2_wrong_shape

tensor([[1., 1., 1.],


[2., 2., 2.],
[3., 3., 3.]])

What does this mean? It means that, if you transposed one of the tensors by mistake, it may still
produce a valid output. If you think it's unlikely that you'll ever get the dimensions in the wrong
order, think again: when it comes to tensors representing batches of images or sequences, it
isn't so uncommon to mix dimensions up.
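
For contrast, broadcasting only complains when the shapes cannot be matched at all. A minimal
sketch (not part of the original notebook):

torch.ones(3, 2) * torch.tensor([1, 2, 3])  # raises a RuntimeError: sizes 2 and 3 cannot be matched at the last dimension

Shapes that do satisfy one of the matching rules, like the transposed mat2 above, slip through
silently, which is exactly the kind of bug named tensors are meant to catch.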

Luckily, named tensors can help us keep broadcasting eagerness in check.

named_mat1 = torch.ones((3, 3), names=['R', 'C'])


named_mat2 = torch.tensor([[1, 2, 3]], names=['R', 'C'])

<ipython-input-51-607113069adf>:1: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change.
  named_mat1 = torch.ones((3, 3), names=['R', 'C'])

Input:

named_mat1 * named_mat2

tensor([[1., 2., 3.],


[1., 2., 3.],
[1., 2., 3.]], names=('R', 'C'))

All is well and good: rows and columns are properly aligned, and the result is as expected. Also,
notice that the names are propagated to the resulting tensor.

Now, what happens if we transpose (supposedly by mistake) the second matrix?

named_mat1 * named_mat2.t()

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-53-7deb473f8df5> in <cell line: 1>()
----> 1 named_mat1 * named_mat2.t()

RuntimeError: Error when attempting to broadcast dims ['R', 'C'] and dims ['C',
'R']: dim 'C' and dim 'R' are at the same position from the right but do not match.

Great, we got an error! Even though broadcasting would happily return a 3x3 matrix, the
misalignment of the dimensions' names prevented that and rightfully raised an exception
warning us of our mistake.

Of course, it only works if both tensors are named. If one of them isn't named, broadcasting
keeps working as expected.


named_mat1 * mat2.t()

tensor([[1., 1., 1.],


[2., 2., 2.],
[3., 3., 3.]], names=('R', 'C'))

Devices
So far, all the tensors we have created are "CPU" tensors. It means the tensor is stored in the
computer's main memory and any operations performed on these tensors are handled by its
central processing unit, the CPU (e.g. an Intel Core i9 processor). The type of tensor is
designated by the device, a CPU in this case, that handles its operations.

We can easily check the device responsible for a given tensor by checking its device attribute:

device = original_tensor.device
device

device(type='cpu')

But the CPU is not the only device we can use to manipulate tensors. We can also use graphics
processing units (GPUs), tensor processing units (TPUs) or even "meta" (fake) devices. Let's
take a look at them!

Devices: GPU


Graphics processing cards are a powerful tool in the deep learning practitioner's toolbelt. They
were originally designed for gamers, and they are especially fast in handling matrix
multiplication at scale, since this is the most common operation performed for rendering the 3D
scenes of a game.

It turns out, though, that matrix multiplication at scale can also be used to train deep learning
models. Initially, it wasn't easy to leverage their power for that purpose since programming a
GPU was quite challenging. It was NVIDIA's release of CUDA (Compute Unified Device
Architecture) and, later on, AMD's ROCm (Radeon Open Compute Ecosystem) that allowed deep
learning frameworks such as PyTorch to more easily use them to dramatically speed up training
times.

GPUs are freely available on most platforms, such as Google Colab and Kaggle, and you should
always check the availability of a GPU before you start training a model. Since these platforms
offer CUDA-compatible GPUs, we'll be focusing solely on them.

PyTorch makes it really easy to accomplish that: you only have to make a call to
torch.cuda.is_available() and name your device accordingly.

device = 'cuda' if torch.cuda.is_available() else 'cpu'

Once we specify a device, we can send our tensor to it using the aptly named method to():

sent_tensor = original_tensor.to(device)
sent_tensor.device

device(type='cpu')

If a GPU is not available, nothing will happen, and calling to() comes at no cost. So, it's safe to
always send your tensors (and later on, your models) to the specified device. This way, if you
share your code with someone else, or if you happen to run it in a different environment in the
future, your code will always leverage the power of a GPU, if one is available to you.

If a GPU is indeed available, the tensor's device will read cuda:0, as it now resides in the memory
of the first (and in most cases, only) GPU available. Moreover, if you check the tensor's type, it
will read torch.cuda.FloatTensor.

sent_tensor.type()

'torch.FloatTensor'

If you're lucky enough to have multiple GPUs at your disposal, you can check how many are
available to you, and their corresponding names using torch.cuda.device_count() and
torch.cuda.get_device_name(), respectively:

n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
print(torch.cuda.get_device_name(i))

Once a tensor is sent to a GPU, it cannot be directly brought back to Numpy anymore. (In the run
shown here no GPU was available, so the tensor stayed on the CPU and the call below still
succeeds; for an actual GPU tensor, numpy() raises an error telling you to copy the tensor back
to host memory first.)


sent_tensor.numpy()

array([[1., 3., 1.],


[1., 1., 1.]], dtype=float32)

You need to bring them back to the CPU first (either using to('cpu') or cpu()), and only then call
the numpy() method.

sent_tensor.cpu().numpy()

array([[1., 3., 1.],


[1., 1., 1.]], dtype=float32)

Devices: TPU
Unlike GPUs, which were originally designed for gamers, TPUs - tensor processing units - as
their name suggests, were designed by Google to be used for training deep learning models in
TensorFlow.

TPUs are available on some platforms, such as Google Colab and Kaggle. Although they have
been designed to work with TensorFlow, it's also possible to leverage their immense power in
PyTorch using PyTorch/XLA, a package that connects PyTorch to Google's XLA (accelerated
linear algebra) library.

Unfortunately, TPUs aren't freely available in Google Colab anymore.

TPUs can be used to speed up training even more by using all its cores at once through
multiprocessing.
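
For reference only, since we won't be running on a TPU here, a minimal sketch of what this
typically looks like, assuming the torch_xla package is installed:

import torch_xla.core.xla_model as xm   # requires PyTorch/XLA to be installed

xla_device = xm.xla_device()             # gets an available TPU core as a device
tpu_tensor = torch.ones((2, 3)).to(xla_device)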

Devices: "Meta" (Fake)


The "meta" device is an elegant solution for a problem you may run into if your models grow
really large. After training a model, you may save it to disk, and later load it back as the backend
of an application, for example (we'll do that later).

In order to load a model from disk, though, you need to create an instance of the untrained
model first, so you have something to load the weights into. Meta devices allow you to create
"dummy", empty models that would be too large to fit in memory, thus making it possible to
work around hardware constraints by only partially loading a model.

meta_tensor = torch.zeros(2, 3, device='meta')


meta_tensor

tensor(..., device='meta', size=(2, 3))


Mission accomplished! The fake tensor does not contain any data, as expected. Now let's create
a REALLY huge fake tensor.

huge_tensor = torch.zeros(100000, 100000, device='meta')


huge_tensor

tensor(..., device='meta', size=(100000, 100000))

The tensor above, should it be a real tensor, would have 10 BILLION 32-bit float elements. Let's
see what happens if we try to create a regular tensor of the same size.

#huge_tensor = torch.zeros(100000, 100000)

Unless your computer has over 40 gigabytes of free RAM, you'll get an error. Fake tensors are
useful to handle tensors - and models - that are too large to fit into memory.

We won't be using these tensors in this course, but if you venture into using really large models,
you're already aware of your options.

Datasets

It is time to get our hands a little dirty with some tiny, yet real, data. Let's start by loading the
Auto MPG Dataset directly from the UCI Machine Learning Repository using pandas' read_csv()
method and a URL.

Its description reads: "The data concerns city-cycle fuel consumption in miles per gallon, to be
predicted in terms of 3 multivalued discrete and 5 continuous attributes."

In this dataset, values are separated by spaces, and missing values are represented by a
question mark. The columns, or attributes, as stated in the repository, are as follows:

mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous


weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)

The last column, car name, is actually separated by tabs (instead of spaces), so we're
considering the cars' names as comments while loading the dataset.

Pandas

To load tabular data, such as CSV or Excel files, one of the most popular choices is the Pandas
package, an open-source data analysis and manipulation tool. Pandas' strength lies in its
dataframes, a spreadsheet-like structure that contains two-dimensional data and its
corresponding labels. A dataframe is composed of a sequence of series, each series
representing a column, its values stored as Numpy arrays.

You can use methods such as read_csv() and read_excel() to load your data, each method
offering plenty of arguments to account for different separators, the existence or not of column
headers, comments, missing data, and more. We'll be using the former to load our dataset.

import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']
df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
df

mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

... ... ... ... ... ... ... ... ...

393 27.0 4 140.0 86.0 2790.0 15.6 82 1

394 44.0 4 97.0 52.0 2130.0 24.6 82 2

395 32.0 4 135.0 84.0 2295.0 11.6 82 1

396 28.0 4 120.0 79.0 2625.0 18.6 82 1

397 31.0 4 119.0 82.0 2720.0 19.4 82 1

398 rows × 8 columns


A dataframe can be easily sliced, both column- and row-wise. Retrieving values from a single
column (thus resulting in a Pandas series) is as simple as that:

df['mpg']

mpg

0 18.0

1 15.0

2 18.0

3 16.0

4 17.0

... ...

393 27.0

394 44.0

395 32.0

396 28.0

397 31.0

398 rows × 1 columns

dtype: float64

A Pandas series works as a wrapper around the underlying Numpy array that contains its data,
which you can retrieve using the values attribute:

df['mpg'].values[:5]

array([18., 15., 18., 16., 17.])

Selecting multiple columns will return a sliced dataframe:

df[['mpg', 'hp']]


mpg hp

0 18.0 130.0

1 15.0 165.0

2 18.0 150.0

3 16.0 150.0

4 17.0 140.0

... ... ...

393 27.0 86.0

394 44.0 52.0

395 32.0 84.0

396 28.0 79.0

397 31.0 82.0

398 rows × 2 columns

The dataframe itself also has its own values attribute, which will give you access to a two-
dimensional Numpy array containing the whole data:

df[['mpg', 'hp']].values[:5]

array([[ 18., 130.],


[ 15., 165.],
[ 18., 150.],
[ 16., 150.],
[ 17., 140.]])

To subset the dataframe, you can use its iloc attribute, which allows for selecting rows based on
their index:

df.iloc[:5]

mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

It is also possible to use a boolean Series to conditionally subset the rows of a dataframe:

cond = (df['year'] == 70)


df[cond]


mpg cyl disp hp weight acc year origin

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

1 15.0 8 350.0 165.0 3693.0 11.5 70 1

2 18.0 8 318.0 150.0 3436.0 11.0 70 1

3 16.0 8 304.0 150.0 3433.0 12.0 70 1

4 17.0 8 302.0 140.0 3449.0 10.5 70 1

5 15.0 8 429.0 198.0 4341.0 10.0 70 1

6 14.0 8 454.0 220.0 4354.0 9.0 70 1

7 14.0 8 440.0 215.0 4312.0 8.5 70 1

8 14.0 8 455.0 225.0 4425.0 10.0 70 1

9 15.0 8 390.0 190.0 3850.0 8.5 70 1

10 15.0 8 383.0 170.0 3563.0 10.0 70 1

11 14.0 8 340.0 160.0 3609.0 8.0 70 1

12 15.0 8 400.0 150.0 3761.0 9.5 70 1

13 14.0 8 455.0 225.0 3086.0 10.0 70 1

14 24.0 4 113.0 95.0 2372.0 15.0 70 3

15 22.0 6 198.0 95.0 2833.0 15.5 70 1

16 18.0 6 199.0 97.0 2774.0 15.5 70 1

17 21.0 6 200.0 85.0 2587.0 16.0 70 1

18 27.0 4 97.0 88.0 2130.0 14.5 70 3

19 26.0 4 97.0 46.0 1835.0 20.5 70 2

20 25.0 4 110.0 87.0 2672.0 17.5 70 2

21 24.0 4 107.0 90.0 2430.0 14.5 70 2

22 25.0 4 104.0 95.0 2375.0 17.5 70 2

23 26.0 4 121.0 113.0 2234.0 12.5 70 2

24 21.0 6 199.0 90.0 2648.0 15.0 70 1

25 10.0 8 360.0 215.0 4615.0 14.0 70 1

26 10.0 8 307.0 200.0 4376.0 15.0 70 1

27 11.0 8 318.0 210.0 4382.0 13.5 70 1

28 9.0 8 304.0 193.0 4732.0 18.5 70 1

Train-Validation-Test Split

The purpose of the split is to simulate the arrival of new data, so you can make adjustments to
your model if training doesn't go well and, once you're happy with it, to make a final assessment
before going live with it. Each split, training, validation, and test, has its own purpose, as
described below. It is also important to highlight that the split should always be the first thing
you do—no preprocessing, no transformations; nothing happens before the split.

Training Set: the data you use to train your model - you can use and abuse this data!
Validation Set: the data you should only use for hyper-parameter tuning, that is, comparing
differently parameterized models trained on the training data, to decide which parameters
are best. You should use, but not abuse this data, as it is intended to provide an unbiased
evaluation of your model and, if you mess around with it too much, you'll end up
incorporating knowledge about it in your model without even noticing.
Test Set: the data you should use only once, when you are done with everything else, to
check if your model is still performing well. We like to pretend this is data from the "future"
- that particular day in the future when our model is ready to give it a go in the real world!
So, until that day, we cannot know this data, as the future hasn't arrived yet.

Train-Validation-Test Split: Shuffling


In most cases, you will need to shuffle the data - the rows in a dataframe - before the split, so
your data isn't in any particular order anymore. Perhaps you noticed that our Auto MPG dataset
is ordered by year:

df['year'].values[:50]

array([70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71,
71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71])

Although it may make sense if you're handling data in a spreadsheet (making it easier for you to
look something up in it), a predefined ordering is potentially an issue for training and evaluating
a model. Ideally, we would like to have all our sets - training, validation, and test - containing
similar information. Taking the example of the year column, we'd like to have cars from 1970 to
1982 in all three sets. The easiest and fastest way to accomplish that is to simply shuffle the
data first.

We can use the sample() method of a Pandas dataframe to sample, that is, to draw data points
from the dataframe in random order. The trick here is to draw the whole dataset using
sample(frac=1), so we get our full dataframe back, but in a different order. The resulting
dataframe still has its original index values, but we can easily discard them using the
reset_index() method.

shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

To actually perform the split, we'll use Scikit-Learn's train_test_split() twice, once for splitting the
data into train and test sets, and then to subdivide the training data into train and validation sets.

from sklearn.model_selection import train_test_split


trainval, test = train_test_split(shuffled, test_size=0.16, shuffle=False)
train, val = train_test_split(trainval, test_size=0.2, shuffle=False)

Cleaning Data

Ensuring the quality and consistency of your data is of the utmost importance. The most basic
checks you can do are looking for missing values and outliers in your data. Let's start with the
latter. Outliers are values that are, literally, "off the charts": they may be produced by
measurement or input (e.g. typing) errors, in which case they are not real and thus must be
handled; but they may also be legitimate, sometimes indicating an anomaly, in which case they
may be exactly the target of your model. Outliers of the first kind, the errors, may affect model
training negatively and badly skew its predictions. There are many techniques for removing
outliers, but we won't be delving into this topic here.


Unlike outliers, missing values are easy to spot, they show up as NaN (Not a Number) in a
dataframe. In Deep Learning models, NaN values propagate like an infectious disease: any
operation between an actual number and a NaN value results in another NaN value.
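
A quick illustration of that propagation (not part of the original notebook):

nan_tensor = torch.tensor([1.0, float('nan'), 3.0])
nan_tensor + 10, nan_tensor.sum()   # the NaN survives the addition and contaminates the sum

(tensor([11., nan, 13.]), tensor(nan))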

The process of "fixing" missing values is called imputation, that is, replacing the missing data
with substituted values. There are many techniques to accomplish this, from using the mean or
median of the corresponding column in the training data, to more sophisticated approaches
using other Machine Learning algorithms to "predict" the missing value.

Any imputation is based on assumptions you make about the nature of the missing data, so it
will necessarily lead to slightly modifying the data distribution. Alternatively, if you can afford to
lose some data points, it's also possible to simply discard any data points containing missing
values.
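
As a sketch of the simplest kind of imputation (we won't actually do this here; the affected rows
will be dropped instead), one could fill the missing horsepower values with the mean computed
on the training set and reuse that same mean for the other sets:

hp_mean = train['hp'].mean()                  # statistic computed on the training set only
train_imputed = train.fillna({'hp': hp_mean})
val_imputed = val.fillna({'hp': hp_mean})     # the training-set mean is reused here
test_imputed = test.fillna({'hp': hp_mean})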

Let's check our data for missing values:

is_missing_attr = train.isna()
n_missing_attr = is_missing_attr.sum(axis=1)
train[n_missing_attr > 0]

mpg cyl disp hp weight acc year origin

89 34.5 4 100.0 NaN 2320.0 15.8 81 2

208 25.0 4 98.0 NaN 2046.0 19.0 71 1

211 40.9 4 85.0 NaN 1835.0 17.3 80 2

There are three cars with missing horsepower information in our training set. While tree-based
algorithms such as Random Forests (RF) or Gradient-Boosted Trees (GBT) can easily handle
missing data, missing values are a big no-no when it comes to neural networks.

In order to keep things simple in our small example, let's simply drop any rows that contain
missing values.

train.dropna(inplace=True)
train


mpg cyl disp hp weight acc year origin

0 18.0 6 171.0 97.0 2984.0 14.5 75 1

1 28.1 4 141.0 80.0 3230.0 20.4 81 2

2 19.4 8 318.0 140.0 3735.0 13.2 78 1

3 20.3 5 131.0 103.0 2830.0 15.9 78 2

4 20.2 6 232.0 90.0 3265.0 18.2 79 1

... ... ... ... ... ... ... ... ...

262 26.0 4 91.0 70.0 1955.0 20.5 71 1

263 26.4 4 140.0 88.0 2870.0 18.1 80 1

264 31.9 4 89.0 71.0 1925.0 14.0 79 2

265 19.2 8 267.0 125.0 3605.0 15.0 79 1

266 33.0 4 91.0 53.0 1795.0 17.5 75 3

264 rows × 8 columns

Then, let's do the same for our validation and test sets. If we had chosen to perform missing
value imputation, we would have to apply the same rules used in the training set for the
validation and test sets as well.

val.dropna(inplace=True)
test.dropna(inplace=True)

Beware of Data Leakage!

You should never use the validation or test sets as a source for any kind of data preprocessing
(such as imputing data). Using statistics computed on the validation or test sets is akin to
using statistics from the future, that is, computed on the data your users will eventually send to
your application or model. Obviously, you cannot know these values beforehand, and using
statistics based on the validation or test sets is a serious data leakage that will make your
models look great during evaluation, even though they're likely to perform poorly when
effectively deployed.

Continuous Attributes


You can see that mpg, displacement, horsepower, weight, and acceleration are continuous
attributes, that is, they may be any numeric value.

cont_attr = ['mpg', 'disp', 'hp', 'weight', 'acc']

Continuous attributes are the bread and butter of deep learning models. We've already
discussed that these models cannot handle missing values and, as it turns out, they may also
have issues with values spread over wildly different ranges. When it comes to deep learning
models, predictable and, better yet, zero-centered ranges for features are a must.

Let's see how the attributes (other than fuel consumption, mpg, which is the target of our
prediction) fare in their own ranges of values:

train_features = train[cont_attr[1:]]
train_features.hist()


array([[<Axes: title={'center': 'disp'}>, <Axes: title={'center': 'hp'}>],


[<Axes: title={'center': 'weight'}>,
<Axes: title={'center': 'acc'}>]], dtype=object)

It doesn't look very good: not only are the ranges quite different from one another, but the ranges
are nowhere near zero-centered, as expected from real-world physical attributes such as weight.
There's no such thing as a negative weight (except, maybe, for theoretical physicists!).

So, what do we do about it and, most importantly, why do we have to do something about it?
Let's start with the latter: without going into much detail, it suffices to know for now that deep
learning models are more easily trained if the attributes or features used to train them display
values in symmetrical ranges, preferably in low values, such as from minus three to three.
Otherwise, they may exhibit problematic behaviors during training, failing to converge to a
solution.

Therefore, it is best practice to bring all values to a more "digestible" range for the sake of the
model's health. The procedure that accomplishes this is called standardization or, sometimes,
normalization. It consists of subtracting the mean of the attribute (thus zero-centering it) and
dividing the result by the standard deviation (thus producing unit standard deviation). The
resulting attributes will exhibit similar ranges.
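
In formula form, for each continuous attribute x: standardized_x = (x - mean_train) / std_train,
with both statistics computed on the training set only.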

Let's start by computing both means and standard deviations:


train_means = train_features.mean()
train_standard_deviations = train_features.std()
train_means, train_standard_deviations

(disp 195.456439
hp 105.087121
weight 2984.075758
acc 15.432955
dtype: float64,
disp 106.255830
hp 39.017837
weight 869.802063
acc 2.743941
dtype: float64)

Then, let's standardize our features:

train_standardized_features = (train_features - train_means)/train_standard_deviations


train_standardized_features.mean(), train_standardized_features.std()

(disp -1.059758e-16
hp -1.648513e-16
weight 7.569702e-17
acc 4.003532e-16
dtype: float64,
disp 1.0
hp 1.0
weight 1.0
acc 1.0
dtype: float64)

Their means are zero, and their standard deviations are one, so it looks good. Let's visualize
them:

train_standardized_features.hist()


array([[<Axes: title={'center': 'disp'}>, <Axes: title={'center': 'hp'}>],


[<Axes: title={'center': 'weight'}>,
<Axes: title={'center': 'acc'}>]], dtype=object)

As you can see, standardization doesn't change the shape of the distribution, it only brings all
the features to a similar footing when it comes to their ranges.

Even though we've standardized our continuous features manually, we don't have to do it like
that. Scikit-Learn offers a StandardScaler class that can do this for us and, as we'll see later, we
can also use PyTorch's own transformations to standardize values, even if they are pixel values
on images!
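
As a teaser of the image case, a minimal sketch assuming torchvision is installed (the
per-channel statistics below are just illustrative placeholders):

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # illustrative statistics
fake_image = torch.rand(3, 32, 32)        # a random 3-channel "image"
standardized_image = normalize(fake_image)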

We used the training set to define the standardization parameters, namely, the means and
standard deviations of our features. Now, we need to standardize the validation and test sets
using those same parameters.

Never use the validation and test sets to compute parameters for standardization, or for any
other preprocessing step!

val_features = val[cont_attr[1:]]
val_standardized_features = (val_features - train_means)/train_standard_deviations
val_standardized_features.mean(), val_standardized_features.std()

(disp -0.089282
hp -0.151349
weight -0.121501
acc 0.139288
dtype: float64,
disp 0.946465
hp 0.917051
weight 0.918898
acc 0.958077
dtype: float64)

Notice that the resulting means and standard deviations aren't quite zero and one, respectively.
That's expected since the validation set should have a similar, yet not quite exactly the same,
distribution as the training set.

If you ever get perfect zero mean and unit standard deviation on a standardized validation set,
there's a good chance you're making a mistake using statistics computed on top of the
validation set itself.

Finally, let's standardize the test set as well:

test_features = test[cont_attr[1:]]
test_standardized_features = (test_features - train_means)/train_standard_deviations

We'll get back to the topic of standardization/normalization a couple more times. First, we'll use
Scikit-Learn's StandardScaler for the task, and then we'll learn about normalizing batches of
data using PyTorch's own batch normalization.

The StandardScaler is part of Scikit-Learn, "an open source machine learning library that
supports supervised and unsupervised learning. It also provides various tools for model fitting,
data preprocessing, model selection, model evaluation, and many other utilities."

It is a convenient way of avoiding manually standardizing continuous features as we just did. All
it takes is to call its fit() method on the training set to compute the appropriate statistics (mean
and standard deviation), and then apply the standardization to all datasets using its transform()
method.

The fit() method takes a feature matrix X, a Numpy array usually in the shape (n_samples,
n_features). We can easily retrieve the two-dimensional Numpy array that contains the
underlying data of our dataframe and we have everything we need to have a functioning
StandardScaler.

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaler.fit(train_features.values)

StandardScaler()


If we want, we can also check the computed statistics (it computes variance instead of
standard deviation, though):

scaler.mean_, scaler.var_

(array([ 195.45643939, 105.08712121, 2984.07575758, 15.43295455]),


array([1.12475350e+04, 1.51662499e+03, 7.53689888e+05, 7.50069430e+00]))

Once it has statistics (computed on the training set only), you can apply it to all your datasets:

standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)

/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
  warnings.warn(

The warnings simply tell us that the scaler was fitted on a plain Numpy array (we passed .values
to fit()) while transform() received a dataframe with column names; the results are unaffected.

To better streamline the process, we can write a standardize() function that:

takes a Pandas dataframe, a list of column names that are continuous attributes, and an
optional scaler
creates and trains a Scikit-Learn's StandardScaler if one isn't provided as an argument
returns a PyTorch tensor containing the standardized features and an instance of Scikit-
Learn's StandardScaler

from sklearn.preprocessing import StandardScaler


def standardize(df, cont_attr, scaler=None):
    cont_X = df[cont_attr].values
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(cont_X)
    cont_X = scaler.transform(cont_X)
    cont_X = torch.as_tensor(cont_X, dtype=torch.float32)
    return cont_X, scaler

Using the above function, our standardization looks like this:

standardized_data = {}
# The training set is used to fit a scaler
standardized_data['train'], scaler = standardize(train_features, cont_attr[1:])


# The scaler is used as argument to the other datasets


standardized_data['val'], _ = standardize(val_features, cont_attr[1:], scaler)
standardized_data['test'], _ = standardize(test_features, cont_attr[1:], scaler)

Discrete and Categorical Attributes

The three remaining attributes, cylinders, model year, and origin are multi-valued discrete, that
is, there's a set of values each one of them may assume. The cars in the dataset may have
either 4, 6, or 8 cylinders, but no car may have 3.45 cylinders, for example. Even though the
values are discrete, there's still an underlying order to them: 8 cylinders are, indeed, twice as
many as 4 cylinders, but that's not always the case for discrete attributes.

Let's take a look at the origin attribute. The cars come from three different, although unnamed,
countries: 1, 2, and 3. The choice of numerical representation for countries may be misleading,
since country "3" is not three times as much as country "1". It would have probably been better
to use letters or abbreviations instead just to make the categorical nature of the attribute more
evident.

Sometimes, like in the case of cylinders, discrete attributes can be grouped together with
continuous attributes as numeric attributes. More often than not, though, discrete attributes are
considered categorical attributes, thus requiring some extra pre-processing to be handled by
deep learning models.

These pre-processing techniques involve converting each possible value in a categorical
attribute into a numerical array of a given length, but not necessarily the same length as the
number of unique values. The process may be called encoding or embedding, depending on
how it's performed.

Let's take a look at this process. Our goal here is to convert each possible value in a discrete or
categorical attribute into a numerical array of a given length (that does not need to match the
number of unique values). Before converting them into arrays, though, we need to encode them
as sequential numbers first.

Let's see what this looks like for the cyl attribute of our training dataset. It has only five unique
values: 3, 4, 5, 6, and 8 cylinders.


cyls = sorted(train['cyl'].unique())
cyls

[3, 4, 5, 6, 8]

year = sorted(train['year'].unique())
year

[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]

origin = sorted(train['origin'].unique())
origin

[1, 2, 3]

We can easily build a dictionary to map them into sequential numbers:

cyls_map = dict((v, i) for i, v in enumerate(cyls))


cyls_map

{3: 0, 4: 1, 5: 2, 6: 3, 8: 4}

Now imagine there's a lookup table with as many entries as unique values, each entry being a
numerical array of a given length (say, eight elements). Let's create such a lookup table filled
with random values as an illustration:

n_dim = 8
lookup_table = torch.randn((len(cyls), n_dim))
lookup_table

tensor([[ 0.6300, 0.4335, -0.6521, 0.1708, 1.2063, -0.7340, -0.7439, 0.0028],


[-1.0897, 1.1370, 1.7471, 0.3194, 0.4317, 0.1624, -0.0836, -0.6387],
[-1.3730, 0.5371, 0.6504, 0.4326, -0.2706, -0.3949, -0.1482, -0.9970],
[-0.3008, -0.8808, -0.0040, -1.0591, 1.2760, 2.4210, -1.0005, 0.7834],
[-0.0731, 0.7370, -0.2961, -1.6068, 0.1728, -1.2520, 0.3217, -0.1416]])

There are five rows, each corresponding to a unique number of cylinders. Three cylinders,
according to our mapping dictionary, corresponds to the first (index zero) row. Four cylinders, to
the second (index one) row, and so on, and so forth.

Let's say we'd like to retrieve the numerical array corresponding to six cylinders. We apply the
mapping to find the corresponding index (cyls_map[6]) and use the result to actually slice the
corresponding row from the lookup table (lookup_table[idx]):


idx = cyls_map[6]
lookup_table[idx]

tensor([-0.3008, -0.8808, -0.0040, -1.0591, 1.2760, 2.4210, -1.0005, 0.7834])

There we go! Now, any number of cylinders can easily be mapped to a sequence of eight
numerical values. It is as if any given number of cylinders, a categorical attribute, were now
represented by eight numerical features instead. We have just (re)invented embeddings! The
fact that these numbers are random is not necessarily an issue: we can simply turn the whole
lookup table into parameters of the model itself, so they are also learned during training. The
model will learn the best way to represent each value in a categorical attribute as a sequence of
numerical attributes! How cool is that?

PyTorch offers an Embedding class that wraps a random tensor like the one we've just created.
This is actually a layer, and we'll see how layers work in more detail in the next chapter. For now,
it should suffice to know that its arguments are the same as our own: the number of unique
values, and the desired number of elements - or dimensions - in the returned numerical array.

import torch.nn as nn
emb_table = nn.Embedding(len(cyls), n_dim)

The embedding layer, like any other layer in PyTorch, is also a model. Its weights are, surprise,
surprise, the lookup table itself. Besides, since it's a model, it can be called as such and its
expected input is a batch of indices. Let's try it out and see what we get out of it:

idx = cyls_map[6]
emb_table(torch.as_tensor([idx]))

tensor([[-2.4736, -1.1113, -0.0137, 0.4004, -0.0134, 0.0216, 0.0412, 0.1218]],


grad_fn=<EmbeddingBackward0>)

There we go, you created your first embeddings! Embeddings are an important part of modern
deep learning, and a fundamental piece of natural language processing, as we'll see in later
chapters. Notice that the values are actually different from our previous example because the
newly created emb_table instance initializes its own random tensor under the hood.

A special case of embedding is the one-hot encoding (OHE) approach: instead of letting the
model learn it during training, the mapping is fixed. In OHE, the numerical array has the same
length as the number of unique values and it has only one nonzero element. It works as if each
unique value were a dummy variable, for example: cyl3, cyl4, cyl5, cyl6, and cyl8, and only one
of those dummy variables may have a nonzero value.


ohe_table = torch.eye(len(cyls))
ohe_table

tensor([[1., 0., 0., 0., 0.],


[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])

idx = cyls_map[6]
ohe_table[idx]

tensor([0., 0., 0., 1., 0.])

Even though the embeddings themselves are going to be part of the model, we still need to
convert our categorical features into their corresponding sequential indices, so we can use them
to retrieve the right values from the embeddings' internal lookup table.

Instead of building dictionaries to manually encode categorical values into their sequential
indices, though, we can use yet another Scikit-Learn preprocessing utility: the OrdinalEncoder. It
works in a similar fashion as the StandardScaler: you can use its fit() method so it builds the
mapping between the original values and their corresponding sequential indices, and then you
can call its transform() method to actually perform the conversion. Let's see an example of this:

from sklearn.preprocessing import OrdinalEncoder


disc_attr = ['cyl', 'year', 'origin']
encoder = OrdinalEncoder()
encoder.fit(train[disc_attr])

OrdinalEncoder()

We can check the categories found for each one of the attributes (cylinders, year, and origin, in
our case):

encoder.categories_

[array([3, 4, 5, 6, 8]),
 array([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]),
 array([1, 2, 3])]

Each value in a given list will be converted into its corresponding sequential index, and that's
exactly what the transform() method does:


train_cat_features = encoder.transform(train[disc_attr])
train_cat_features[:5]

array([[ 3., 5., 0.],


[ 1., 11., 1.],
[ 4., 8., 0.],
[ 2., 8., 1.],
[ 3., 9., 0.]])

Let's take a quick look at the resulting encoding for the first row:

- the first column (cylinders) is three, thus corresponding to the fourth value in the first list of categories, that is, six
- the second column (year) is five, thus corresponding to the sixth value in the second list of categories, that is, 75
- the third column (origin) is zero, thus corresponding to the first value in the third list of categories, that is, one
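We can double-check these lookups directly against the fitted encoder, reusing the categories_
attribute from above:

encoder.categories_[0][3], encoder.categories_[1][5], encoder.categories_[2][0]
# (6, 75, 1) - the cyl, year, and origin values of the first training row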

If we compare it to the original values in the first row, it's a match:

train[disc_attr].iloc[0]

cyl        6
year      75
origin     1
dtype: int64
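We can also go the other way around: the fitted encoder's inverse_transform() method converts
encoded indices back into the original values. A quick check on the first encoded row:

encoder.inverse_transform(train_cat_features[:1])
# recovers the original values of the first row: cyl=6, year=75, origin=1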

Once again, to better streamline the process, we can write a function quite similar to the
previous one:

- takes a Pandas dataframe, a list of column names that are categorical attributes, and an optional encoder
- creates and fits a Scikit-Learn OrdinalEncoder if one isn't provided as an argument
- returns a PyTorch tensor containing the encoded categorical features, and the instance of Scikit-Learn's OrdinalEncoder that was used

def encode(df, cat_attr, encoder=None):
    cat_X = df[cat_attr].values
    if encoder is None:
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    cat_X = encoder.transform(cat_X)
    cat_X = torch.as_tensor(cat_X, dtype=torch.int)
    return cat_X, encoder

Using the above function, our encoding looks like this:

cat_data = {}
cat_data['train'], encoder = encode(train, disc_attr)
cat_data['val'], _ = encode(val, disc_attr, encoder)
cat_data['test'], _ = encode(test, disc_attr, encoder)

The resulting features are nothing but indices now. Later on, for each column in the results
(which corresponds to a particular categorical attribute) we'll use its values to retrieve their
embeddings. In our example with the cyl column (the first categorical attribute), it will look like
this:

emb_table(cat_data['train'][:, 0]) # cylinders is the first (zero) column

tensor([[-2.4736, -1.1113, -0.0137,  ...,  0.0216,  0.0412,  0.1218],
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684],
        [ 0.7061,  1.5819, -0.1650,  ...,  0.6547,  0.9400,  0.2905],
        ...,
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684],
        [ 0.7061,  1.5819, -0.1650,  ...,  0.6547,  0.9400,  0.2905],
        [-0.9056, -0.4992, -0.1807,  ..., -0.0689,  0.4786,  0.4684]],
       grad_fn=<EmbeddingBackward0>)

keyboard_arrow_down Target and Task


The target is the attribute you're trying to predict. If the target is a continuous attribute, such as
fuel consumption, we're dealing with a regression task. If the target is a categorical attribute,
such as the country of origin, we're dealing with a classification task.

In our example, we're indeed trying to predict fuel consumption (the mpg attribute), so ours is a
regression task. We're starting with a simple linear regression with a single feature, that is, we'll
be using only one (continuous) attribute to predict our target, fuel consumption. Of course, later
on, we'll expand our problem into a multivariate linear regression, thus including all (continuous)
attributes at first, and then add the categorical attributes to the mix while training a non-linear
model in Lab 2.

For now, let's pick hp as our single feature:

# _pt stands for PyTorch, in case you're wondering :-)


hp_idx = cont_attr.index('hp')
train_target_pt = torch.as_tensor(train[['mpg']].values, dtype=torch.float32)
train_single_feature_pt = standardized_data['train'][:, [hp_idx]]

train_target_pt[:5], train_single_feature_pt[:5]

(tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]),
tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]))

import matplotlib.pyplot as plt


plt.scatter(train_single_feature_pt, train_target_pt)
plt.xlabel('Horsepower (standardized)')
plt.ylabel('Fuel Consumption - miles per gallon')
plt.title('Training Set - HP x MPG')

Text(0.5, 1.0, 'Training Set - HP x MPG')

The relationship isn't quite linear, but there's clearly an inverse correlation between a car's power
and its fuel consumption, as you'd expect. A small 50 HP car is certainly much more fuel-
efficient (hence more miles per gallon) than a high-powered 200 HP sports car.

keyboard_arrow_down TensorDataset

Cool, we have two tensors now, let's use them to build a TensorDataset! Tensor datasets are one
of the most basic types of datasets you'll find in PyTorch. They simply wrap a couple of tensors
containing your data - feature(s) and target(s) - so you can conveniently load your data in mini-
batches at will for training your model. We'll get back to it when we discuss PyTorch's data
loader in the next section.

from torch.utils.data import TensorDataset


train_ds = TensorDataset(train_single_feature_pt, train_target_pt)

PyTorch's datasets work pretty much like Python lists. You can think of a dataset as a list of
tuples, each tuple corresponding to one data point (features, target).
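For instance, we can check its length and retrieve a single data point, exactly as we would with a
list; the first data point below matches the first rows of our feature and target tensors:

len(train_ds), train_ds[0]
# the length is the number of training rows; the first data point is (tensor([-8.7263e-05]), tensor([18.]))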

You can create your own custom dataset by inheriting from the Dataset class. Datasets need to
implement some basic methods, such as __init__(self), __getitem__(self, index), and __len__(self).

If we check the source code of the TensorDataset, that's what we'll find:

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

File "<ipython-input-109-bf10485439c3>", line 7


tensors: Tuple[Tensor, ...]
^
IndentationError: unexpected indent

In the constructor (__init__()) method, it makes sure all tensors have the same size along their
first dimension and assigns them to its tensors attribute. In the __getitem__() method, which makes
a dataset indexable (and sliceable) just like a Python list, it loops over all tensors and builds a
tuple containing the index-th element of each one. Finally, the __len__() method simply returns the
size of the first dimension of the first tensor (since it is guaranteed they all match).
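Following the same recipe, here is a minimal sketch of a custom dataset for our problem. The class
name and the choice of columns are just for illustration (hp as the feature, mpg as the target, the
same ones we've been using):

from torch.utils.data import Dataset

class CarDataset(Dataset):
    def __init__(self, df):
        # Converts the relevant columns of a dataframe into feature and target tensors
        self.features = torch.as_tensor(df[['hp']].values, dtype=torch.float32)
        self.targets = torch.as_tensor(df[['mpg']].values, dtype=torch.float32)

    def __getitem__(self, index):
        # Returns one (features, target) tuple, just like TensorDataset does
        return (self.features[index], self.targets[index])

    def __len__(self):
        return len(self.targets)

custom_ds = CarDataset(train)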

Simple enough, right? Let's retrieve a few elements from our dataset:

train_ds[:5]

(tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]),
tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]))

As expected, we got a tuple back, the first element being five data points from the first (feature)
tensor, the second element being the corresponding five data points from the second (target)
tensor. It really works like a list of tuples!

Tensor datasets are as simple as they can be, but PyTorch offers many other datasets, such as
the ImageFolder dataset that you can use with your own images, or many other built-in
datasets. We'll see them in more detail in the second part of this course while tackling computer
vision tasks.

Let's create datasets for our validation and test sets as well. We'll skip the intermediate steps
and build the tensor datasets in one go from the standardized feature tensors and the target
column of each dataframe:

val_ds = TensorDataset(standardized_data['val'][:, [hp_idx]],
                       torch.as_tensor(val[['mpg']].values, dtype=torch.float32))
test_ds = TensorDataset(standardized_data['test'][:, [hp_idx]],
                        torch.as_tensor(test[['mpg']].values, dtype=torch.float32))

PyTorch offers plenty of built-in datasets in both computer vision and natural language
processing areas.

There are datasets for image classification (e.g. CIFAR10, MNIST, SHVN), object detection,
image segmentation, optical flow, stereo matching, image pairs, image captioning, video
classification and prediction. For a complete list of available datasets, please check the
Datasets section of Torchvision documentation.

There are also datasets for text classification (e.g. AG News, IMDb, MNLI, SST2), language
modeling, machine translation, sequence tagging, question answering, and unsupervised
learning. For a complete list of available datasets for natural language processing, please check
the Datasets section of Torchtext documentation.
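As a quick, hedged illustration (we won't actually use it in this chapter, and it assumes
torchvision is installed), loading one of these built-in datasets takes just a couple of lines:

from torchvision import datasets, transforms

# Downloads MNIST to ./data (a hypothetical path) and converts images to tensors on access
mnist_train = datasets.MNIST(root='./data', train=True, download=True,
                             transform=transforms.ToTensor())
image, label = mnist_train[0]   # built-in datasets also behave like lists of (input, target) tuples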

Perhaps you noticed that, so far, we've been handling "CPU" tensors only. That is actually by
design: while building a dataset, you may want to keep your data out of your precious, and
expensive, GPU memory. Only the data that is going to be actively used for training in any given
step - a mini-batch of data - should be sent to the GPU.
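A minimal sketch of what this looks like in practice (the actual training loop comes in the next
chapter; the slice below just stands in for a mini-batch):

# Pick a device, falling back to the CPU if no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Pretend this slice is one mini-batch - only it gets sent to the GPU, not the whole dataset
batch_features, batch_targets = train_ds[:32]
batch_features = batch_features.to(device)
batch_targets = batch_targets.to(device)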

Mini-Batches
A mini-batch is a subset of a dataset, usually drawn randomly from it, and the number of data
points in a mini-batch is usually a power of two. Typical mini-batch sizes are 32, 64, 128, etc.,
but, in many cases, mini-batch size may be limited by the size of the available memory. This is
especially true for large models that take up a lot of space, where sometimes it is only feasible
to load one data point at a time. In these cases, the restriction imposed by hardware may be
circumvented by accumulating the results over time thus simulating a mini-batch.
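A heavily hedged sketch of this accumulation trick, just to make the idea concrete - the model,
loss function, and optimizer below are placeholders, since we only build them properly in the next
chapter:

import torch.nn as nn

# Placeholder components, purely for illustration
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4   # four "micro-batches" of 8 points behave like one mini-batch of 32
optimizer.zero_grad()
for step in range(accumulation_steps):
    features, targets = train_ds[step * 8:(step + 1) * 8]   # one small slice at a time
    loss = loss_fn(model(features), targets)
    (loss / accumulation_steps).backward()   # gradients add up across backward() calls
optimizer.step()       # a single update, as if we had used the full mini-batch
optimizer.zero_grad()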

For now, let's draw mini-batches from our dataset using PyTorch's DataLoader!

keyboard_arrow_down DataLoaders

Data loaders can be used to randomly draw a given number of data points - the mini-batch size -
out of a dataset. By default, they will return different mini-batches every time until the underlying
dataset runs out of available data points. At this point - pun very much intended - it will start
over.

The data loader is a rich class and it has many parameters. At first, we're focusing on a few of
them only:

- dataset: the underlying dataset it will be drawing samples from
- batch_size: the number of data points in each mini-batch returned by it
- drop_last: whether to drop the last mini-batch if there aren't batch_size data points in it
- shuffle: whether or not to shuffle the data

The last parameter, shuffle, is quite important. In the vast majority of cases, you should set
shuffle=True for the training set, the major exception to this rule being time series. Shuffling
the training data prevents the model from learning anything from the order in which the data
points happen to be stored, and it generally makes training more stable; for the validation and
test sets, which are only used for evaluation, there's no need to shuffle.
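Putting it all together, here's a minimal sketch of how we might instantiate data loaders for our
three datasets (the batch size of 32 is an arbitrary choice here; we'll settle on the configuration
actually used for training in the next chapter):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_ds, batch_size=32, shuffle=True, drop_last=False)
val_loader = DataLoader(dataset=val_ds, batch_size=32)
test_loader = DataLoader(dataset=test_ds, batch_size=32)

# Fetching one mini-batch: a tensor of features and a tensor of targets
batch_features, batch_targets = next(iter(train_loader))
batch_features.shape, batch_targets.shape   # should be (torch.Size([32, 1]), torch.Size([32, 1]))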
