A Visual Intro To Numpy and Data Representation
In this post, we’ll look at some of the main ways to use NumPy and how it can
represent different types of data (tables, images, text…etc) before we can serve
them to machine learning models.
import numpy as np
Creating Arrays
We can create a NumPy array (a.k.a. the mighty ndarray
(https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)) by passing a
Python list to `np.array()`. In this case, Python creates the array for us:
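For example (the values here are illustrative, standing in for the figure):

```python
import numpy as np

# create an ndarray from a plain Python list
data = np.array([1, 2, 3])
print(data)        # [1 2 3]
print(data.shape)  # (3,)
```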
There are often cases when we want NumPy to initialize the values of the array
for us. NumPy provides methods like ones(), zeros(), and random.random() for
these cases. We just pass them the number of elements we want them to generate:
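In code, that looks like this:

```python
import numpy as np

np.ones(3)           # array([1., 1., 1.])
np.zeros(3)          # array([0., 0., 0.])
np.random.random(3)  # three random floats in the half-open interval [0, 1)
```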
Once we’ve created our arrays, we can start to manipulate them in interesting
ways.
Array Arithmetic
Let’s create two NumPy arrays to showcase their usefulness. We’ll call them
data and ones:
Adding them up position-wise (i.e. adding the values of each row) is as simple as
typing data + ones:
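With the same two small arrays the figures used (values illustrative), all four arithmetic operators work position-wise:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2)

data + ones  # array([2., 3.])
data - ones  # array([0., 1.])
data * data  # array([1, 4])
data / data  # array([1., 1.])
```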
When I started learning such tools, I found it refreshing that an abstraction like
this makes me not have to program such a calculation in loops. It’s a wonderful
abstraction that allows you to think about problems at a higher level.
There are often cases when we want to carry out an operation between an array
and a single number (we can also call this an operation between a vector and a
scalar). Say, for example, our array represents distance in miles and we want to
convert it to kilometers. We simply say data * 1.6:
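With illustrative values:

```python
import numpy as np

data = np.array([1, 2])  # distances in miles
data * 1.6               # array([1.6, 3.2]) -- distances in kilometers
```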
See how NumPy understood that operation to mean that the multiplication should
happen with each cell? That concept is called broadcasting, and it’s very useful.
Indexing
We can index and slice NumPy arrays in all the ways we can slice Python lists:
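For example (values illustrative):

```python
import numpy as np

data = np.array([1, 2, 3])
data[0]    # 1
data[1]    # 2
data[0:2]  # array([1, 2])
data[1:]   # array([2, 3])
```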
Aggregation
Additional benefits NumPy gives us are aggregation functions:
In addition to min, max, and sum, you get all the greats like mean to get the
average, prod to get the result of multiplying all the elements together, std to get
standard deviation, and plenty of others
(https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-
arrays-aggregates.html).
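A quick sampler of those aggregation methods (values illustrative):

```python
import numpy as np

data = np.array([1, 2, 3])
data.max()   # 3
data.min()   # 1
data.sum()   # 6
data.mean()  # 2.0
data.prod()  # 6
data.std()   # ~0.816 (standard deviation)
```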
In more dimensions
All the examples we’ve looked at deal with vectors in one dimension. A key part
of the beauty of NumPy is its ability to apply everything we’ve looked at so far to
any number of dimensions.
Creating Matrices
We can pass Python lists of lists in the following shape to have NumPy create a
matrix to represent them:
np.array([[1,2],[3,4]])
We can also use the same methods we mentioned above (ones(), zeros(),
and random.random()) as long as we give them a tuple describing the
dimensions of the matrix we are creating:
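For example, to create 3 x 2 matrices:

```python
import numpy as np

np.ones((3, 2))           # 3x2 matrix of ones
np.zeros((3, 2))          # 3x2 matrix of zeros
np.random.random((3, 2))  # 3x2 matrix of random floats in [0, 1)
```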
Matrix Arithmetic
We can add and multiply matrices using arithmetic operators (+-*/) if the two
matrices are the same size. NumPy handles those as position-wise operations:
We can get away with doing these arithmetic operations on matrices of different
sizes only if the differing dimension is one (e.g. the matrix has only one column or
one row), in which case NumPy uses its broadcast rules for that operation:
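Both cases in code (values illustrative):

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])

# same size: position-wise addition
ones = np.ones((2, 2))
data + ones      # array([[2., 3.], [4., 5.]])

# different sizes, but the differing dimension is one:
# the (1, 2) row is broadcast across each row of data
ones_row = np.ones((1, 2))
data + ones_row  # array([[2., 3.], [4., 5.]])
```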
Dot Product
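A key distinction from the position-wise arithmetic above is matrix multiplication via the dot product. NumPy gives every matrix a dot() method for this; a minimal sketch with illustrative values:

```python
import numpy as np

data = np.array([1, 2, 3])                   # shape (3,)
powers_of_ten = np.array([[1, 10],
                          [100, 1000],
                          [10000, 100000]])  # shape (3, 2)

# (1, 3) row times (3, 2) matrix gives a (2,) result:
# each output value is a sum of products along the shared dimension
data.dot(powers_of_ten)  # array([ 30201, 302010])
```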
Matrix Indexing
Indexing and slicing operations become even more useful when we’re
manipulating matrices:
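For example (values illustrative):

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
data[0, 1]    # 2 -- row 0, column 1
data[1:3]     # array([[3, 4], [5, 6]]) -- rows 1 and 2
data[0:2, 0]  # array([1, 3]) -- first column of rows 0 and 1
```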
Matrix Aggregation
In a lot of ways, dealing with a new dimension is just adding a comma to the
parameters of a NumPy function:
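Concretely, the axis parameter controls whether we aggregate over the whole matrix, down the columns, or across the rows (values illustrative):

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])
data.max()        # 6 -- max of the whole matrix
data.max(axis=0)  # array([5, 6]) -- max of each column
data.max(axis=1)  # array([2, 5, 6]) -- max of each row
```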
Practical Usage
And now for the payoff. Here are some examples of the useful things NumPy will
help you through.
Formulas
Both the predictions and labels vectors contain three values, which means n has
a value of three. After we carry out the subtraction, we end up with the values
looking like this:
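The formula in question is a mean squared error calculation; a sketch with made-up values for the two vectors:

```python
import numpy as np

# illustrative values -- the original figure shows specific vectors
predictions = np.array([1, 1, 1])
labels      = np.array([1, 2, 3])
n = predictions.shape[0]  # 3

# mean squared error: average of the squared differences
error = (1 / n) * np.sum(np.square(predictions - labels))
# (0^2 + (-1)^2 + (-2)^2) / 3 = 5/3
```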
Data Representation
Think of all the data types you’ll need to crunch and build models around
(spreadsheets, images, audio…etc). So many of them are perfectly suited for
representation in an n-dimensional array:
The same goes for time-series data (for example, the price of a stock over time).
Images
◦ If the image is black and white (a.k.a. grayscale), each pixel can be
represented by a single number (commonly between 0 (black) and 255
(white)). Want to crop the top left 10 x 10 pixel part of the image? Just tell
NumPy to get you image[:10,:10].
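That crop, with a hypothetical grayscale image:

```python
import numpy as np

# a made-up 28x28 grayscale image: each pixel is 0 (black) to 255 (white)
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

top_left = image[:10, :10]  # crop the top-left 10 x 10 pixel region
top_left.shape              # (10, 10)
```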
Language
If we’re dealing with text, the story is a little different. The numeric representation
of text requires a step of building a vocabulary (an inventory of all the unique
words the model knows) and an embedding step (/illustrated-word2vec/). Let us
see the steps of numerically representing this (translated) quote by an ancient
spirit:
The sentence can then be broken into an array of tokens (words or parts of words
based on common rules):
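A naive sketch of the tokenize-then-look-up-ids steps (the sentence and vocabulary here are made up, and real tokenizers use far more elaborate rules than whitespace splitting):

```python
import numpy as np

sentence = "the cat sat on the mat"
tokens = sentence.split()  # naive whitespace tokenization

# a made-up vocabulary mapping each unique token to an integer id
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
ids = np.array([vocab[t] for t in tokens])
# ids -> array([4, 0, 3, 2, 4, 1])
```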
These ids still don’t provide much information value to a model. So before feeding
a sequence of words to a model, the tokens/words need to be replaced with their
embeddings (50 dimension word2vec embedding (/illustrated-word2vec/) in this
case):
You can see that this NumPy array has the dimensions [embedding_dimension x
sequence_length]. In practice these would be the other way around, but I’m
presenting it this way for visual consistency. For performance reasons, deep
learning models tend to preserve the first dimension for batch size (because the
model can be trained faster if multiple examples are trained in parallel). This is a
clear case where reshape() becomes super useful. A model like BERT
(/illustrated-bert/), for example, would expect its inputs in the shape: [batch_size,
sequence_length, embedding_size].
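A sketch of that reshape with made-up sizes (a sequence of 4 tokens, 50-dimensional embeddings, batch size of 1):

```python
import numpy as np

embedding_dim, seq_len = 50, 4
# [embedding_dimension x sequence_length], as presented in the text
embeddings = np.random.random((embedding_dim, seq_len))

# transpose to [sequence_length, embedding_dimension], then add a
# leading batch dimension to match [batch_size, seq_len, embedding_dim]
batch = embeddings.T.reshape(1, seq_len, embedding_dim)
batch.shape  # (1, 4, 50)
```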
This is now a numeric volume that a model can crunch and do useful things with.
I left the other rows empty, but they’d be filled with other examples for the model
to train on (or predict).