03_Building_Your_First_Dataset.ipynb - Colab
Learning Objectives
Before we start building deep learning models, let's take a step back and learn the difference between scalars, vectors, and multi-dimensional arrays such as matrices. Since we'll be using tabular data to train our first model, let's draw analogies from a spreadsheet.
Scalars
A single value is called a scalar.
import torch
scalar = torch.tensor(18)
scalar
tensor(18)
Vectors
A list or one-dimensional array of values, like a single column in a spreadsheet, is called a
vector.
vector = torch.tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
vector
tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
Matrices
A two-dimensional array of values, like a table in a spreadsheet, is called a matrix.
Let's create a matrix in PyTorch (we'll use just two columns, mpg and horsepower, to keep it
simple):
matrix = torch.tensor([[ 18,  15,  18,  16,  17,  15,  14,  14,  14,  15,  15,  14,  15,  14],
                       [130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
matrix
tensor([[ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14],
[130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225]])
Of course, you'll never have to type in the values from a spreadsheet. We'll conveniently load the
values directly from the file, first using pandas, and then using PyTorch's own data pipes. We'll
get back to it in the "Datasets" section.
Tensors
A three-dimensional array of values, like a collection of spreadsheets, each containing data for a
given month, is called a tensor.
From three dimensions on, be it four or forty-two dimensions, a multi-dimensional array is called a tensor. So, technically speaking, if an array has three or more dimensions, it is a tensor.
You can easily create tensors in PyTorch using the tensor() method, as we've been doing in the examples above. Moreover, there are methods to create tensors filled with ones, zeros, or random numbers: ones(), zeros(), rand(), and randn(), to name a few.
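For instance, here is a quick, minimal illustration of these creation methods (the shapes are arbitrary):

# ones/zeros create constant-filled tensors; rand draws from U(0, 1) and randn from N(0, 1)
torch.ones((2, 3)), torch.zeros((2, 3)), torch.rand((2, 3)), torch.randn((2, 3))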
You can get the shape of a tensor using its shape attribute, but PyTorch also implements a
size() method that accomplishes the same thing.
vector.shape, vector.size()
(torch.Size([14]), torch.Size([14]))
As expected, the shape of a scalar is empty (torch.Size([])) since scalars are dimensionless (zero dimensions):
matrix.size(), scalar.size()
While scalars are single numbers, thus having zero dimensions, one- and two-dimensional
arrays are called vectors and matrices, respectively, as we've seen in the examples above. But, in
order to make matters simple, it is commonplace to refer to any array with one or more
dimensions as a tensor.
In summary, everything is either a scalar or a tensor. There are tensors for data, and tensors for parameters. Right now, we're dealing with the former; we'll move on to the latter in the next chapter.
NumPy
NumPy brings the computational power of languages like C and Fortran to Python, a language
much easier to learn and use. Thanks to its performance, Numpy sits at the core of many
machine and deep learning libraries such as Scikit-Learn, Scipy, Pandas, and Matplotlib. For this
reason, it is fairly common to load tabular data from other sources, such as CSV or Excel files,
into a collection of Numpy arrays. Even when dealing with images, pixel values are often stored
inside Numpy arrays.
PyTorch tensors and Numpy arrays have a lot in common. You may create Numpy arrays using
its identically-named methods such as zeros(), ones(), rand(), and randn(), for example.
Moreover, you can easily switch between the two of them, arrays and tensors, using PyTorch's
numpy() and as_tensor() methods. The former converts a PyTorch tensor into a Numpy array,
while the latter creates a PyTorch tensor out of a Numpy array. Let's see them in action.
Numpy array:
import numpy as np
numpy_array = vector.numpy()
numpy_array
array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
PyTorch tensor:
back_to_tensor = torch.as_tensor(numpy_array)
back_to_tensor
tensor([18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14])
There's one caveat, though: only "CPU" tensors can be converted into Numpy arrays. Every
tensor we created thus far is, by default, a "CPU" tensor. We'll learn about different types of
tensors shortly, in the "Devices" section.
Reshaping Tensors
One of the most common operations you'll need to perform is to reshape a tensor into a different, well, shape!
Imagine two data points whose features are organized in a two-by-three tensor. In order to use these features to train a linear or a logistic regression, however, you'd need the features lined up in a single row instead, that is, a flattened version of the tensor.
Although the operation itself is quite simple, there are a few pitfalls you need to avoid while
reshaping your tensors. Let's go over a few examples.
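The cell that creates original_tensor and reshaped_tensor didn't survive this export; below is a minimal sketch that is consistent with the outputs shown further down (a two-by-three tensor of ones, and a one-by-six view of it). Keep in mind that view() does not copy any data: both tensors share the same underlying storage.

# reconstructed cell (values assumed): a 2x3 tensor and a reshaped view of it
original_tensor = torch.ones((2, 3))
reshaped_tensor = original_tensor.view(1, 6)
original_tensor, reshaped_tensor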
original_tensor[0, 1] = 2
original_tensor, reshaped_tensor
Moreover, if you created your tensor from a Numpy array, the two of them, array and tensor, are
also sharing the underlying data.
numpy_array[-1] = 1000
numpy_array, vector
(array([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15,
14, 15, 1000]),
tensor([ 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14,
15, 1000]))
In order to effectively duplicate the data and create a new, independent, tensor, you can use the
clone() method instead.
cloned_tensor = original_tensor.clone()
cloned_tensor
Now, if you make changes to the original tensor, they won't be reflected in the new tensor
anymore.
original_tensor[0, 1] = 3
original_tensor, cloned_tensor
transposed_tensor = original_tensor.t()
transposed_tensor.view(1, 6)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-40-13ca45c67fa7> in <cell line: 2>()
1 transposed_tensor = original_tensor.t()
----> 2 transposed_tensor.view(1, 6)
RuntimeError: view size is not compatible with input tensor's size and stride (at
least one dimension spans across two contiguous subspaces). Use .reshape(...)
instead.
Remember, view() never makes a copy of the data, while reshape() will copy it if necessary, so it always works, even if the tensor is not contiguous.
But, what does it mean to be contiguous? Simply put, it means two elements in the same row
must be next to each other in memory. This is always the case whenever a tensor is created
(like our original_tensor), but once we transpose it, we're not actually changing its allocation in
memory. Transposing, in this case, means traversing it differently, that is, jumping to a different
position in memory.
We can see the "rules" for moving to the next row or column by checking the tensor's stride
method:
original_tensor.stride(), transposed_tensor.stride()
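For our two-by-three original_tensor and its three-by-two transpose, this should return strides of (3, 1) and (1, 3), respectively.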
In the original tensor, the stride is telling us that we need to skip three positions in memory to
get to the next row, while only one position for the next column. But, in the transposed tensor, it
is the other way around: we need to skip three positions to get to the next column.
If we need to skip two or more positions to get to the next column, it means our tensor is not
contiguous anymore. Let's check it out:
transposed_tensor.is_contiguous(), original_tensor.is_contiguous()
(False, True)
Transposed tensor:
transposed_tensor
tensor([[1., 1.],
[3., 1.],
[1., 1.]])
Luckily, you can simply call the contiguous() method, and PyTorch will rearrange the data in memory in such a way that it can be traversed in its typical fashion (a stride of one in the last dimension). If the underlying data happens to be contiguous already, this is a zero-cost operation.
transposed_tensor.contiguous().view(1, 6)
Finally, it is also possible to use the flatten() method instead, in case you're trying to make your
tensor one-dimensional.
transposed_tensor.flatten()
Don't worry much about memory allocation, though. The purpose of this section was to make you aware of, and capable of addressing, the error message above, should you ever run into it.
Named Tensors
Named tensors are a long-awaited feature, even if still a prototype one. Many, if not most, implementation bugs in deep learning models - even worse, the silent kind of bug - arise from using the wrong dimensions in a given operation.
You may be wondering how such a serious bug can be a silent one, that is, one that does not raise an exception and crash the application.
In many cases, broadcasting is to blame. Broadcasting is both a blessing and a curse. While it makes it extremely easy to perform operations using tensors of different shapes without the need to explicitly replicate data along some dimension, it may also give you the illusion that your operation is the right one, even when it's not, because you messed up the dimensions.
You've probably done similar operations many times without giving a second thought to why it
works so seamlessly. As it turns out, you have broadcasting to thank for this behavior. Under the
hood, PyTorch (or Numpy) will "stretch" the variable b so its shape matches that of variable a,
thus allowing the desired element-wise multiplication.
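The cell being referred to isn't shown in this export; a minimal sketch of that kind of operation, assuming a was a three-by-three tensor and b a scalar, would look like this:

# broadcasting "stretches" the scalar b to match a's shape before the element-wise multiplication
a = torch.ones((3, 3))
b = torch.tensor(2.0)
a * b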
Moreover, it is actually more efficient to use broadcasting like that than to build a tensor full of 2.0s just to match the shapes!
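The cells defining mat1 and mat2 aren't shown in this export; a minimal sketch consistent with the shapes discussed below (mat1 is three-by-three, mat2 is one-by-three; the actual values are assumptions):

mat1 = torch.ones((3, 3))
mat2 = torch.tensor([[1., 2., 3.]])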
What if we'd like to perform an element-wise multiplication between these two matrices of different shapes? Broadcasting has us covered: it will "understand" that mat2 was "meant" to be 3x3 instead.
mat1 * mat2
Broadcasting works by comparing dimensions of both tensors from right to left, and it will
"match" them if they are equal or one of them is one (so that particular value will be replicated
along that dimension). In the example above, these are the dimensions:
mat1.size(), mat2.size()
The right-most dimension is 3 for both tensors, so it is matched. Moving to the left, the next dimension is 3 for one tensor and 1 for the other, so it is also matched (the dimension of size one gets replicated). There we go, broadcasting can work its magic! But, beware, if you were to transpose the second tensor (mat2) by mistake, broadcasting still works!
mat2_wrong_shape = mat2.t()
mat1 * mat2_wrong_shape
What does this mean? It means that, if you transposed one of the tensors by mistake, it may still
produce a valid output. If you think it's unlikely that you'll ever get the dimensions in the wrong
order, think again: when it comes to tensors representing batches of images or sequences, it
isn't so uncommon to mix dimensions up.
Let's repeat the multiplication, this time using named versions of our matrices.
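The cells creating named_mat1 and named_mat2 aren't shown in this export; a minimal sketch, assuming both were created with their dimensions named 'R' (rows) and 'C' (columns), as suggested by the error message further down:

# named tensors are still a prototype feature; the values here are assumptions
named_mat1 = torch.ones((3, 3), names=('R', 'C'))
named_mat2 = torch.tensor([[1., 2., 3.]], names=('R', 'C'))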
named_mat1 * named_mat2
All is well: rows and columns are aligned, and the result is as expected. Also, notice that the names are propagated to the resulting tensor.
named_mat1 * named_mat2.t()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-53-7deb473f8df5> in <cell line: 1>()
----> 1 named_mat1 * named_mat2.t()
RuntimeError: Error when attempting to broadcast dims ['R', 'C'] and dims ['C',
'R']: dim 'C' and dim 'R' are at the same position from the right but do not match.
Great, we got an error! Even though broadcasting would happily return a 3x3 matrix, the
misalignment of the dimensions' names prevented that and rightfully raised an exception
warning us of our mistake.
Of course, it only works if both tensors are named. If one of them isn't named, broadcasting
keeps working as expected.
named_mat1 * mat2.t()
Devices
So far, all the tensors we have created are "CPU" tensors. It means the tensor is stored in the
computer's main memory and any operations performed on these tensors are handled by its
central processing unit, the CPU (e.g. an Intel Core i9 processor). The type of tensor is
designated by the device, a CPU in this case, that handles its operations.
We can easily check the device responsible for a given tensor by checking its device attribute:
device = original_tensor.device
device
device(type='cpu')
But the CPU is not the only device we can use to manipulate tensors. We can also use graphics
processing units (GPUs), tensor processing units (TPUs) or even "meta" (fake) devices. Let's
take a look at them!
Devices: GPU
GPUs - graphics processing units - were originally designed to render graphics, which boils down to performing matrix multiplication at a massive scale. It turns out that matrix multiplication at scale is also exactly what's needed to train deep learning models. Initially, it wasn't easy to leverage GPUs for that purpose, since programming them was quite challenging. It was NVIDIA's release of CUDA (Compute Unified Device Architecture) and, later on, AMD's ROCm (Radeon Open Compute Ecosystem) that allowed deep learning frameworks such as PyTorch to use them more easily and dramatically speed up training times.
GPUs are freely available on most platforms, such as Google Colab and Kaggle, and you should always check the availability of a GPU before you start training a model. Since these platforms offer CUDA-compatible GPUs, we'll focus solely on them.
PyTorch makes it really easy to accomplish that: you only have to call torch.cuda.is_available() and name your device accordingly.
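A minimal sketch of that idiom:

# fall back to the CPU whenever a CUDA-capable GPU isn't available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device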
Once we specify a device, we can send our tensor to it using the aptly named method to():
sent_tensor = original_tensor.to(device)
sent_tensor.device
device(type='cpu')
If a GPU is not available, nothing will happen, and calling to() comes at no cost. So, it's safe to
always send your tensors (and later on, your models) to the specified device. This way, if you
share your code with someone else, or if you happen to run it in a different environment in the
future, your code will always leverage the power of a GPU, if one is available to you.
If a GPU is indeed available, the tensor's device will read cuda:0, as it now resides in the memory
of the first (and in most cases, only) GPU available. Moreover, if you check the tensor's type, it
will read torch.cuda.FloatTensor.
sent_tensor.type()
'torch.FloatTensor'
If you're lucky enough to have multiple GPUs at your disposal, you can check how many are
available to you, and their corresponding names using torch.cuda.device_count() and
torch.cuda.get_device_name(), respectively:
n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
    print(torch.cuda.get_device_name(i))
Once a tensor is sent to a GPU, it cannot be directly brought back to Numpy anymore.
sent_tensor.numpy()
You need to bring them back to the CPU first (either using to('cpu') or cpu()), and only then call
the numpy() method.
sent_tensor.cpu().numpy()
Devices: TPU
Unlike GPUs, which were originally designed for gamers, TPUs - tensor processing units - as
their name suggests, were designed by Google to be used for training deep learning models in
TensorFlow.
TPUs are available on some platforms, such as Google Colab and Kaggle. Although they have
been designed to work with TensorFlow, it's also possible to leverage their immense power in
PyTorch using PyTorch/XLA, a package that connects PyTorch to Google's XLA (accelerated
linear algebra) library.
TPUs can be used to speed up training even more by using all its cores at once through
multiprocessing.
Devices: Meta
In order to load a model from disk, though, you need to create an instance of the untrained model first, so you have somewhere to load the model into. Meta devices allow you to create "dummy", empty models that would be too large to fit in memory, thus making it possible to work around hardware constraints by only partially loading a model.
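The cells from this part of the notebook aren't shown in this export; a minimal sketch of creating a small "meta" (fake) tensor, with an assumed shape:

# a meta tensor has a shape and a dtype, but no actual data behind it
fake_tensor = torch.empty((2, 3), device='meta')
fake_tensor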
Mission accomplished! The fake tensor does not contain any data, as expected. Now let's create
a REALLY huge fake tensor.
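Again, the original cell isn't shown here; a sketch with an assumed shape that matches the description below (100,000 x 100,000 = 10 billion elements):

huge_fake_tensor = torch.empty((100000, 100000), device='meta')
huge_fake_tensor.shape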
The tensor above, should it be a real tensor, would have 10 BILLION 32-bit float elements. Let's
see what happens if we try to create a regular tensor of the same size.
Unless your computer has over 40 gigabytes of free RAM, you'll get an error. Fake tensors are
useful to handle tensors - and models - that are too large to fit into memory.
We won't be using these tensors in this course, but if you venture into using really large models,
you're already aware of your options.
Datasets
It is time to get our hands a little dirty with some tiny, yet real, data. Let's start by loading the
Auto MPG Dataset directly from the UCI Machine Learning Repository using pandas' read_csv()
method and a URL.
Its description reads: "The data concerns city-cycle fuel consumption in miles per gallon, to be
predicted in terms of 3 multivalued discrete and 5 continuous attributes."
In this dataset, values are separated by spaces, and missing values are represented by a question mark. The columns, or attributes, as stated in the repository, are as follows:
mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
The last column, car name, is actually separated by tabs (instead of spaces), so we're
considering the cars' names as comments while loading the dataset.
Pandas
To load tabular data, such as CSV or Excel files, one of the most popular choices is the Pandas package, an open-source data analysis and manipulation tool. Pandas' strength lies in its dataframe, a spreadsheet-like structure that contains two-dimensional data and its corresponding labels. A dataframe is composed of a sequence of series, each series representing a column, with its values stored as Numpy arrays.
You can use methods such as read_csv() and read_excel() to load your data, each method
offering plenty of arguments to account for different separators, the existence or not of column
headers, comments, missing data, and more. We'll be using the former to load our dataset.
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']
df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
df
A dataframe can be easily sliced, both column- and row-wise. Retrieving values from a single
column (thus resulting in a Pandas series) is as simple as that:
df['mpg']
mpg
0 18.0
1 15.0
2 18.0
3 16.0
4 17.0
... ...
393 27.0
394 44.0
395 32.0
396 28.0
397 31.0
dtype: float64
A Pandas series works as a wrapper around the underlying Numpy array that contains its data,
which you can retrieve using the values attribute:
df['mpg'].values[:5]
df[['mpg', 'hp']]
mpg hp
0 18.0 130.0
1 15.0 165.0
2 18.0 150.0
3 16.0 150.0
4 17.0 140.0
The dataframe itself also has its own values attribute, which will give you access to a two-
dimensional Numpy array containing the whole data:
df[['mpg', 'hp']].values[:5]
To subset the dataframe, you can use its iloc attribute, which allows for selecting rows based on
their index:
df.iloc[:5]
It is also possible to use a boolean Series to conditionally subset the rows of a dataframe:
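For example, a minimal sketch of a boolean filter (the condition is arbitrary):

# keep only the rows where the car has four cylinders
df[df['cyl'] == 4]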
Train-Validation-Test Split
The purpose of the split is to simulate the arrival of new data, so you can make adjustments to
your model if training doesn't go well and, once you're happy with it, to make a final assessment
before going live with it. Each split, training, validation, and test, has its own purpose, as
described below. It is also important to highlight that the split should always be the first thing
you do—no preprocessing, no transformations; nothing happens before the split.
Training Set: the data you use to train your model - you can use and abuse this data!
Validation Set: the data you should only use for hyper-parameter tuning, that is, comparing
differently parameterized models trained on the training data, to decide which parameters
are best. You should use, but not abuse this data, as it is intended to provide an unbiased
evaluation of your model and, if you mess around with it too much, you'll end up
incorporating knowledge about it in your model without even noticing.
Test Set: the data you should use only once, when you are done with everything else, to
check if your model is still performing well. We like to pretend this is data from the "future"
- that particular day in the future when our model is ready to give it a go in the real world!
So, until that day, we cannot know this data, as the future hasn't arrived yet.
Notice that the rows in the dataset come ordered by model year:
df['year'].values[:50]
array([70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71,
71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71])
Although it may make sense if you're handling data in a spreadsheet (making it easier for you to
look something up in it), a predefined ordering is potentially an issue for training and evaluating
a model. Ideally, we would like to have all our sets - training, validation, and test - containing
similar information. Taking the example of the year column, we'd like to have cars from 1970 to
1982 in all three sets. The easiest and fastest way to accomplish that is to simply shuffle the
data first.
We can use the sample() method of a Pandas dataframe to sample, that is, to draw data points from the dataframe in random order. The trick here is to draw the whole dataset using sample(frac=1), so we get our full dataframe back, just in a different order. The resulting dataframe still has its original index values, but we can easily reset them using the reset_index() method.
To actually perform the split, we'll use Scikit-Learn's train_test_split() twice, once for splitting the
data into train and test sets, and then to subdivide the training data into train and validation sets.
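The original splitting cell isn't shown in this export; a minimal sketch of the procedure just described, with assumed split sizes and random seeds:

from sklearn.model_selection import train_test_split

# shuffle the full dataframe and reset its index
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
# first split off the test set, then split the remainder into train and validation
trainval, test = train_test_split(shuffled, test_size=0.2, random_state=42)
train, val = train_test_split(trainval, test_size=0.2, random_state=42)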
Ensuring the quality and consistency of your data is of the utmost importance. The most basic
checks you can do are looking for missing values and outliers in your data. Let's start with the
latter. Outliers are values that are, literally, "off the charts": they may be produced by a
measurement or input (e.g. typing) errors, in which case they are not real and thus must be
handled; but they may also be legitimate, sometimes indicating an anomaly, in which case they
may be exactly the target of your model. Outliers of the first kind, the errors, may affect model
training negatively and badly skew its predictions. There are many techniques for removing
outliers but we won't be delving into this topic here.
Unlike outliers, missing values are easy to spot, they show up as NaN (Not a Number) in a
dataframe. In Deep Learning models, NaN values propagate like an infectious disease: any
operation between an actual number and a NaN value results in another NaN value.
The process of "fixing" missing values is called imputation, that is, replacing the missing data
with substituted values. There are many techniques to accomplish this, from using the mean or
median of the corresponding column in the training data, to more sophisticated approaches
using other Machine Learning algorithms to "predict" the missing value.
Any imputation is based on assumptions you make about its nature, so it will necessarily lead to
slightly modifying the data distribution. Alternatively, if you can afford to lose some data points,
it's also possible to simply discard any data points containing missing values.
is_missing_attr = train.isna()
n_missing_attr = is_missing_attr.sum(axis=1)
train[n_missing_attr > 0]
There are three cars with missing horsepower information in our training set. While tree-based
algorithms such as Random Forests (RF) or Gradient-Boosted Trees (GBT) can easily handle
missing data, missing values are a big no-no when it comes to neural networks.
In order to keep things simple in our small example, let's simply drop any rows that contain
missing values.
train.dropna(inplace=True)
train
Then, let's do the same for our validation and test sets. If we had chosen to perform missing
value imputation, we would have to apply the same rules used in the training set for the
validation and test sets as well.
val.dropna(inplace=True)
test.dropna(inplace=True)
You should never use the validation or test sets as a source for any kind of data preprocessing
(such as imputing data). Using statistics computed on the validation or test sets is akin to using statistics from the future, that is, computed on the data your users will eventually send to
your application or model. Obviously, you cannot know these values beforehand, and using
statistics based on the validation or test sets is a serious data leakage that will make your
models look great during evaluation, even if they're likely to perform poorly when effectively
deployed.
You can see that mpg, displacement, horsepower, weight, and acceleration are continuous
attributes, that is, they may be any numeric value.
Continuous attributes are the bread and butter of deep learning models. We've already
discussed that these models cannot handle missing values and, as it turns out, they may also
have issues with values spread over wildly different ranges. When it comes to deep learning
models, predictable and, better yet, zero-centered ranges for features are a must.
Let's see how the attributes (other than fuel consumption, mpg, which is the target of our
prediction) fare in their own ranges of values:
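The cell defining cont_attr isn't shown in this export; it presumably lists the continuous attributes with the target (mpg) first, which is consistent with the cont_attr[1:] slices used below:

# assumed definition, matching the column names used when loading the data
cont_attr = ['mpg', 'disp', 'hp', 'weight', 'acc']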
train_features = train[cont_attr[1:]]
train_features.hist()
It doesn't look very good: not only are the ranges quite different from one another, but the ranges
are nowhere near zero-centered, as expected from real-world physical attributes such as weight.
There's no such thing as a negative weight (except, maybe, for theoretical physicists!).
So, what do we do about it and, most importantly, why do we have to do something about it?
Let's start with the latter: without going into much detail, it suffices to know for now that deep
learning models are more easily trained if the attributes or features used to train them display
values in symmetrical ranges, preferably in low values, such as from minus three to three.
Otherwise, they may exhibit problematic behaviors during training, failing to converge to a
solution.
Therefore, it is best practice to bring all values to a more "digestible" range for the sake of the model's health. The procedure that accomplishes this is called standardization or, sometimes, normalization. It consists of subtracting the mean of the attribute (thus zero-centering it) and dividing the result by the standard deviation (thus bringing it to unit standard deviation). The resulting attributes will exhibit similar ranges.
train_means = train_features.mean()
train_standard_deviations = train_features.std()
train_means, train_standard_deviations
(disp 195.456439
hp 105.087121
weight 2984.075758
acc 15.432955
dtype: float64,
disp 106.255830
hp 39.017837
weight 869.802063
acc 2.743941
dtype: float64)
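The standardization cell itself isn't shown in this export; a minimal sketch, mirroring the formula described above (and the validation-set version further down):

train_standardized_features = (train_features - train_means) / train_standard_deviations
train_standardized_features.mean(), train_standardized_features.std()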
(disp -1.059758e-16
hp -1.648513e-16
weight 7.569702e-17
acc 4.003532e-16
dtype: float64,
disp 1.0
hp 1.0
weight 1.0
acc 1.0
dtype: float64)
Their means are zero, and their standard deviations are one, so it looks good. Let's visualize
them:
train_standardized_features.hist()
As you can see, standardization doesn't change the shape of the distribution, it only brings all
the features to a similar footing when it comes to their ranges.
Even though we've standardized our continuous features manually, we don't have to do it like
that. Scikit-Learn offers a StandardScaler class that can do this for us and, as we'll see later, we
can also use PyTorch's own transformations to standardize values, even if they are pixel values
on images!
We used the training set to define the standardization parameters, namely, the means and standard deviations of our features. Now, we need to standardize the validation and test sets using those same parameters.
Never use the validation and test sets to compute parameters for standardization, or for any
other preprocessing step!
val_features = val[cont_attr[1:]]
val_standardized_features = (val_features - train_means)/train_standard_deviations
val_standardized_features.mean(), val_standardized_features.std()
(disp -0.089282
hp -0.151349
weight -0.121501
acc 0.139288
dtype: float64,
disp 0.946465
hp 0.917051
weight 0.918898
acc 0.958077
dtype: float64)
Notice that the resulting means and standard deviations aren't quite zero and one, respectively.
That's expected since the validation set should have a similar, yet not quite exactly the same,
distribution as the training set.
If you ever get perfect zero mean and unit standard deviation on a standardized validation set,
there's a good chance you're making a mistake using statistics computed on top of the
validation set itself.
test_features = test[cont_attr[1:]]
test_standardized_features = (test_features - train_means)/train_standard_deviations
We'll get back to the topic of standardization/normalization a couple more times. First, we'll use
Scikit-Learn's StandardScaler for the task, and then we'll learn about normalizing batches of
data using PyTorch's own batch normalization.
The StandardScaler is part of Scikit-Learn, "an open source machine learning library that
supports supervised and unsupervised learning. It also provides various tools for model fitting,
data preprocessing, model selection, model evaluation, and many other utilities."
It is a convenient way of avoiding manually standardizing continuous features as we just did. All
it takes is to call its fit() method on the training set to compute the appropriate statistics (mean
and standard deviation), and then apply the standardization to all datasets using its transform()
method.
The fit() method takes a feature matrix X, a Numpy array usually in the shape (n_samples,
n_features). We can easily retrieve the two-dimensional Numpy array that contains the
underlying data of our dataframe and we have everything we need to have a functioning
StandardScaler.
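The cell creating and fitting the scaler isn't shown in this export; a minimal sketch:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features only
scaler = StandardScaler()
scaler.fit(train_features.values)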
StandardScaler()
If we want, we can also check the computed statistics (it computes variance instead of
standard deviation, though):
scaler.mean_, scaler.var_
Once it has statistics (computed on the training set only), you can apply it to all your datasets:
standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)
To streamline this process, let's write a helper function, standardize() (a sketch follows this list), that:
takes a Pandas dataframe, a list of column names that are continuous attributes, and an optional scaler
creates and trains a Scikit-Learn StandardScaler if one isn't provided as an argument
returns a PyTorch tensor containing the standardized features and an instance of Scikit-Learn's StandardScaler
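The original implementation isn't shown in this export; a minimal sketch consistent with the description above and with how the function is called below:

def standardize(df, cont_attr, scaler=None):
    cont_X = df[cont_attr].values
    if scaler is None:
        # fit a new scaler on the provided (training) data
        scaler = StandardScaler()
        scaler.fit(cont_X)
    # standardize the features and convert them into a float tensor
    cont_X = torch.as_tensor(scaler.transform(cont_X)).float()
    return cont_X, scaler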
standardized_data = {}
# The training set is used to fit a scaler
standardized_data['train'], scaler = standardize(train_features, cont_attr[1:])
# (reconstructed) the validation and test sets reuse the scaler fitted on the training set
standardized_data['val'], _ = standardize(val_features, cont_attr[1:], scaler)
standardized_data['test'], _ = standardize(test_features, cont_attr[1:], scaler)
The three remaining attributes, cylinders, model year, and origin, are multi-valued discrete, that is, there's a set of values each one of them may assume. The cars in the dataset may have either 4, 6, or 8 cylinders, but no car may have 3.45 cylinders, for example. Even though the values are discrete, there's still an underlying order to them: 8 cylinders are, indeed, twice as many as 4 cylinders, but that's not always the case for discrete attributes.
Let's take a look at the origin attribute. The cars come from three different, although unnamed,
countries: 1, 2, and 3. The choice of numerical representation for countries may be misleading,
since country "3" is not three times as much as country "1". It would have probably been better
to use letters or abbreviations instead just to make the categorical nature of the attribute more
evident.
Sometimes, like in the case of cylinders, discrete attributes can be grouped together with
continuous attributes as numeric attributes. More often than not, though, discrete attributes are
considered categorical attributes, thus requiring some extra pre-processing to be handled by
deep learning models.
Let's take a look at this process. Our goal here is to convert each possible value in a discrete or
categorical attribute into a numerical array of a given length (that does not need to match the
number of unique values). Before converting them into arrays, though, we need to encode them
as sequential numbers first.
Let's see what this looks like for the cyl attribute of our training dataset. It has only five unique
values: 3, 4, 5, 6, and 8 cylinders.
cyls = sorted(train['cyl'].unique())
cyls
[3, 4, 5, 6, 8]
year = sorted(train['year'].unique())
year
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]
origin = sorted(train['origin'].unique())
origin
[1, 2, 3]
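The cell building the mapping dictionary isn't shown in this export; a minimal sketch that maps each unique value to a sequential index:

cyls_map = {v: i for i, v in enumerate(cyls)}
cyls_map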
{3: 0, 4: 1, 5: 2, 6: 3, 8: 4}
Now imagine there's a lookup table with as many entries as unique values, each entry being a
numerical array of a given length (say, eight elements). Let's create such a lookup table filled
with random values as an illustration:
n_dim = 8
lookup_table = torch.randn((len(cyls), n_dim))
lookup_table
There are five rows, each corresponding to a unique number of cylinders. Three cylinders,
according to our mapping dictionary, corresponds to the first (index zero) row. Four cylinders, to
the second (index one) row, and so on, and so forth.
Let's say we'd like to retrieve the numerical array corresponding to six cylinders. We apply the
mapping to find the corresponding index (cyls_map[6]) and use the result to actually slice the
corresponding row from the lookup table (lookup_table[idx]):
idx = cyls_map[6]
lookup_table[idx]
There we go! Now, any number of cylinders can easily be mapped to a sequence of eight
numerical values. It is as if any given number of cylinders, a categorical attribute, were now
represented by eight numerical features instead. We have just (re)invented embeddings! The
fact that these numbers are random is not necessarily an issue: we can simply turn the whole
lookup table into parameters of the model itself, so they are also learned during training. The
model will learn the best way to represent each value in a categorical attribute as a sequence of
numerical attributes! How cool is that?
PyTorch offers an Embedding class that wraps a random tensor like the one we've just created.
This is actually a layer, and we'll see how layers work in more detail in the next chapter. For now,
it should suffice to know that its arguments are the same as our own: the number of unique
values, and the desired number of elements - or dimensions - in the returned numerical array.
import torch.nn as nn
emb_table = nn.Embedding(len(cyls), n_dim)
The embedding layer, like any other layer in PyTorch, is also a model. Its weights are, surprise,
surprise, the lookup table itself. Besides, since it's a model, it can be called as such and its
expected input is a batch of indices. Let's try it out and see what we get out of it:
idx = cyls_map[6]
emb_table(torch.as_tensor([idx]))
There we go, you created your first embeddings! Embeddings are an important part of modern
deep learning, and a fundamental piece of natural language processing, as we'll see in later
chapters. Notice that the values are actually different from our previous example because the
newly created emb_table instance initializes its own random tensor under the hood.
A special case of embedding is the one-hot encoding (OHE) approach: instead of letting the
model learn it during training, the mapping is fixed. In OHE, the numerical array has the same
length as the number of unique values and it has only one nonzero element. It works as if each
unique value were a dummy variable, for example: cyl3, cyl4, cyl5, cyl6, and cyl8, and only one
of those dummy variables may have a nonzero value.
ohe_table = torch.eye(len(cyls))
ohe_table
idx = cyls_map[6]
ohe_table[idx]
Even though the embeddings themselves are going to be part of the model, we still need to
convert our categorical features into their corresponding sequential indices, so we can use them
to retrieve the right values from the embeddings' internal lookup table.
Instead of building dictionaries to manually encode categorical values into their sequential
indices, though, we can use yet another Scikit-Learn preprocessing utility: the OrdinalEncoder. It
works in a similar fashion as the StandardScaler: you can use its fit() method so it builds the
mapping between the original values and their corresponding sequential indices, and then you
can call its transform() method to actually perform the conversion. Let's see an example of this:
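The cell creating and fitting the encoder isn't shown in this export; a minimal sketch, assuming disc_attr lists the discrete columns:

from sklearn.preprocessing import OrdinalEncoder

# assumed definition of the discrete attributes
disc_attr = ['cyl', 'year', 'origin']
# fit the encoder on the training set only
encoder = OrdinalEncoder()
encoder.fit(train[disc_attr])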
OrdinalEncoder()
We can check the categories found for each one of the attributes (cylinders, year, and origin, in our case) using the encoder's categories_ attribute:
encoder.categories_
Each value in a given list will be converted into its corresponding sequential index, and that's
exactly what the transform() method does:
train_cat_features = encoder.transform(train[disc_attr])
train_cat_features[:5]
Let's take a quick look at the resulting encoding for the first row:
the first column (cylinders) is three, thus corresponding to the fourth value in the first list
of categories, that is, six
the second column (year) is five, thus corresponding to the sixth value in the second list of
categories, that is, 75
the third column (origin) is zero, thus corresponding to the first value in the third list of
categories, that is, one
train[disc_attr].iloc[0]
cyl 6
year 75
origin 1
dtype: int64
Once again, to better streamline the process, we can write an encode() function quite similar to the previous one (a sketch follows this list), that:
takes a Pandas dataframe, a list of column names that are categorical attributes, and an
optional encoder
creates and trains a Scikit-Learn's OrdinalEncoder if one isn't provided as an argument
returns a PyTorch tensor containing the encoded categorical features and an instance of
Scikit-Learn's OrdinalEncoder
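A minimal sketch of such a function, consistent with the description above and with how it is called below:

def encode(df, disc_attr, encoder=None):
    cat_X = df[disc_attr].values
    if encoder is None:
        # fit a new encoder on the provided (training) data
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    # encode the categorical features as sequential indices (long integers)
    cat_X = torch.as_tensor(encoder.transform(cat_X)).long()
    return cat_X, encoder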
cat_data = {}
cat_data['train'], encoder = encode(train, disc_attr)
cat_data['val'], _ = encode(val, disc_attr, encoder)
cat_data['test'], _ = encode(test, disc_attr, encoder)
The resulting features are nothing but indices now. Later on, for each column in the results
(which corresponds to a particular categorical attribute) we'll use its values to retrieve their
embeddings. In our example with the cyl column (the first categorical attribute), it will look like
this:
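A minimal sketch of that lookup, reusing the emb_table created earlier for the cyl attribute:

# the first column of the encoded features corresponds to cyl; indices must be long integers
cyl_indices = cat_data['train'][:, 0].long()
cyl_embeddings = emb_table(cyl_indices)
cyl_embeddings.shape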
In our example, we're indeed trying to predict fuel consumption (the mpg attribute), so ours is a
regression task. We're starting with a simple linear regression with a single feature, that is, we'll
be using only one (continuous) attribute to predict our target, fuel consumption. Of course, later
on, we'll expand our problem into a multivariate linear regression, thus including all (continuous)
attributes at first, and then add the categorical attributes to the mix while training a non-linear
model in Lab 2.
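The cells building these tensors aren't shown in this export; a minimal sketch, assuming horsepower (hp) is the single feature being used, which matches the discussion further down:

# target: fuel consumption (mpg), as a column vector of floats
train_target_pt = torch.as_tensor(train[['mpg']].values).float()
# single feature: the standardized hp column from the standardized training features
hp_idx = cont_attr[1:].index('hp')
train_single_feature_pt = standardized_data['train'][:, [hp_idx]].float()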
train_target_pt[:5], train_single_feature_pt[:5]
(tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]),
tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]))
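The scatter plot itself didn't survive this export; a minimal sketch of how it could be reproduced:

import matplotlib.pyplot as plt

# standardized horsepower on the x-axis, fuel consumption (mpg) on the y-axis
plt.scatter(train_single_feature_pt.numpy(), train_target_pt.numpy())
plt.xlabel('standardized hp')
plt.ylabel('mpg')
plt.show()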
The relationship isn't quite linear, but there's clearly an inverse correlation between a car's power
and its fuel consumption, as you'd expect. A small 50 HP car is certainly much more fuel-
efficient (hence more miles per gallon) than a high-powered 200 HP sports car.
TensorDataset
Cool, we have two tensors now, let's use them to build a TensorDataset! Tensor datasets are one
of the most basic types of datasets you'll find in PyTorch. They simply wrap a couple of tensors
containing your data - feature(s) and target(s) - so you can conveniently load your data in mini-
batches at will for training your model. We'll get back to it when we discuss PyTorch's data
loader in the next section.
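The construction of the training dataset isn't shown in this export; a minimal sketch wrapping the two tensors created above:

from torch.utils.data import TensorDataset

train_ds = TensorDataset(train_single_feature_pt, train_target_pt)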
PyTorch's datasets work pretty much like Python lists. You can think of a dataset as a list of
tuples, each tuple corresponding to one data point (features, target).
You can create your own, custom, dataset by inheriting from the Dataset class. Datasets need to implement some basic methods such as __init__(self), __getitem__(self, index), and __len__(self).
If we check the source code of the TensorDataset, that's what we'll find:
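The source listing didn't survive this export; below is a simplified sketch of what the implementation looks like (not the verbatim source):

from torch.utils.data import Dataset

class TensorDataset(Dataset):
    def __init__(self, *tensors):
        # all tensors must have the same size in their first dimension
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        # return a tuple with the index-th element of each tensor
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)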
In the constructor (__init__()) method, it makes sure all tensors are of the same size, and assigns them to its tensors attribute. In the __getitem__() method, which makes a dataset "sliceable" just like a Python list, it loops over all tensors and builds a tuple containing the index-th element of each tensor. Finally, in the __len__() method, it simply returns the first dimension of the first tensor (since it is guaranteed they're all of the same size).
Simple enough, right? Let's retrieve a few elements from our dataset:
train_ds[:5]
(tensor([[-8.7263e-05],
[ 2.8327e-01],
[ 8.6497e-01],
[-1.7748e-01],
[ 3.2359e-01]]),
tensor([[18.0000],
[28.1000],
[19.4000],
[20.3000],
[20.2000]]))
As expected, we got a tuple back, the first element being five data points from the first (feature)
tensor, the second element being the corresponding five data points from the second (target)
tensor. It really works like a list of tuples!
Tensor datasets are as simple as they can be, but PyTorch offers many other datasets, such as
the ImageFolder dataset that you can use with your own images, or many other built-in
datasets. We'll see them in more detail in the second part of this course while tackling computer
vision tasks.
Let's create datasets for our validation and test sets as well. We'll be skipping some
intermediate steps and creating tensor datasets directly out of the pandas dataframes:
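Those cells aren't shown in this export; a minimal sketch using a hypothetical helper that standardizes hp with the training set's statistics and wraps feature and target tensors:

# make_hp_dataset() is a hypothetical helper, not part of the original notebook
def make_hp_dataset(df):
    hp = torch.as_tensor(((df['hp'] - train_means['hp']) / train_standard_deviations['hp']).values)
    mpg = torch.as_tensor(df['mpg'].values)
    return TensorDataset(hp.float().view(-1, 1), mpg.float().view(-1, 1))

val_ds = make_hp_dataset(val)
test_ds = make_hp_dataset(test)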
PyTorch offers plenty of built-in datasets in both computer vision and natural language
processing areas.
There are datasets for image classification (e.g. CIFAR10, MNIST, SVHN), object detection, image segmentation, optical flow, stereo matching, image pairs, image captioning, video classification and prediction. For a complete list of available datasets, please check the Datasets section of the Torchvision documentation.
There are also datasets for text classification (e.g. AG News, IMDb, MNLI, SST2), language
modeling, machine translation, sequence tagging, question answering, and unsupervised
learning. For a complete list of available datasets for natural language processing, please check
the Datasets section of Torchtext documentation.
Perhaps you noticed that, so far, we've been handling "CPU" tensors only. That is actually by
design: while building a dataset, you may want to keep your data out of your precious, and
expensive, GPU memory. Only the data that is going to be actively used for training in any given
step - a mini-batch of data - should be sent to the GPU.
Mini-Batches
A mini-batch is a subset of a dataset, usually drawn randomly from it, and the number of data
points in a mini-batch is usually a power of two. Typical mini-batch sizes are 32, 64, 128, etc.,
but, in many cases, mini-batch size may be limited by the size of the available memory. This is
especially true for large models that take up a lot of space, where sometimes it is only feasible
to load one data point at a time. In these cases, the restriction imposed by hardware may be
circumvented by accumulating the results over time thus simulating a mini-batch.
For now, let's draw mini-batches from our dataset using PyTorch's DataLoader!
DataLoaders
Data loaders can be used to randomly draw a given number of data points - the mini-batch size -
out of a dataset. By default, they will return different mini-batches every time until the underlying
dataset runs out of available data points. At this point - pun very much intended - it will start
over.
The data loader is a rich class and it has many parameters. At first, we're focusing on a few of
them only:
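The cell creating the data loader isn't shown in this export; a minimal sketch using the parameters the text goes on to discuss - the dataset, the batch size, and shuffle (the batch size value is an assumption):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_ds, batch_size=32, shuffle=True)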
The last parameter, shuffle, is quite important. In the vast majority of cases, you should set
shuffle=True for the training set, the major exception to this rule being time series. Shuffling