In this article, we will learn about memory management in pandas.
When working with pandas, you will often store large datasets for analysis. Small datasets rarely cause problems, but with larger data we have to be more careful about the memory we use, because we cannot simply ignore memory issues at that scale.
Now we will see how to measure and reduce memory consumption, which helps avoid errors and speeds up computation.
Finding Memory usage
info():
The info() method returns a concise summary of the dataframe.
Syntax: DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
This prints a short summary of the dataframe. It also reports the memory used by the data frame when we pass the memory_usage parameter; for an accurate figure that includes the contents of object (string) columns, set memory_usage to "deep".
Python
import pandas as pd
# Reading the csv file
df = pd.read_csv('data.csv')
# Printing the summary, including accurate memory usage
df.info(memory_usage="deep")
Output: the printed summary, ending with a line reporting the data frame's total memory usage (for this dataset, memory usage: 4.8 MB).
memory_usage():
The pandas memory_usage() function returns the memory usage of each column of the dataframe in bytes (and of the index as well when index=True).
Syntax: DataFrame.memory_usage(index=True, deep=False)
While info() only gives the overall memory used by the data, this function returns the memory usage of each column in bytes, which makes it a convenient way to find which column uses the most memory in the data frame.
Python
import pandas as pd
# Reading the csv file
df = pd.read_csv('data.csv')
# Per-column memory usage in bytes
df.memory_usage()
Output: a Series with one entry for the index and one per column, giving each column's memory usage in bytes.
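To find the heaviest columns quickly, pass deep=True so that object (string) columns are measured by their actual contents rather than just pointer size, then sort the result. A minimal sketch, assuming the same data.csv used above:
Python
import pandas as pd
# Reading the csv file
df = pd.read_csv('data.csv')
# deep=True measures the real size of object (string) columns
per_column = df.memory_usage(deep=True)
# Sort so the most memory-hungry columns come first
print(per_column.sort_values(ascending=False))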
Ways to optimize memory in Pandas
Changing numeric columns to smaller dtype:
This is a very simple way to save memory. By default, pandas stores integer values as int64 and float values as float64, which takes more memory than many columns need. Instead, we can downcast the data types: convert int64 values to int32, int16, or int8 and float64 values to float32 or float16, whenever the column's value range allows it. By converting the data types without losing information we can often cut the memory usage to nearly half.
Syntax: columnName.astype('float16')
Note: You can't store every value in int16 or float16. Values outside the smaller type's range (or needing more precision) still have to be stored as a larger datatype.
Code:
Python
import pandas as pd
df = pd.read_csv('data.csv')
# Memory usage before downcasting (float64)
print(df['price'].memory_usage())
# Downcasting float64 to float16
df['price'] = df['price'].astype('float16')
# Memory usage after downcasting
print(df['price'].memory_usage())
Output:
173032
43354
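Rather than picking the target dtype by hand, pandas.to_numeric can choose the smallest safe type through its downcast parameter (note that downcast='float' goes no smaller than float32). A minimal sketch, assuming the same data.csv with numeric price and sqft_living columns:
Python
import pandas as pd
df = pd.read_csv('data.csv')
# Let pandas pick the smallest float dtype that can hold the values
df['price'] = pd.to_numeric(df['price'], downcast='float')
# Likewise for integers (use downcast='unsigned' for non-negative columns)
df['sqft_living'] = pd.to_numeric(df['sqft_living'], downcast='integer')
print(df[['price', 'sqft_living']].dtypes)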
Stop Loading the whole columns
It is common that we work with larger datasets, but there is no need of loading the entire dataset is necessary. Instead, we can load the specific columns on which you are going to work. By doing this we can restrict the amount of memory consumed to a very low value.
To do this, simply form a temporary dataset that contains only the values that you are going to work on.
Python
# Summary of the full data frame
df.info(verbose=False, memory_usage='deep')
# Keep only the columns we need
df = df[['price', 'sqft_living']]
# Summary after dropping the other columns
df.info(verbose=False, memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Columns: 22 entries, Unnamed: 0 to sqft_lot15
dtypes: float16(1), float64(5), int64(15), object(1)
memory usage: 4.8 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Columns: 2 entries, price to sqft_living
dtypes: float16(1), int64(1)
memory usage: 211.2 KB
In the above example, we can clearly see that keeping only the specific columns we need in the data frame reduces memory usage.
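Note that subsetting after read_csv still parses and loads every column first. The usecols parameter of read_csv avoids materialising the unused columns in the first place; a minimal sketch, assuming the same data.csv:
Python
import pandas as pd
# Parse only the two columns we care about
df = pd.read_csv('data.csv', usecols=['price', 'sqft_living'])
df.info(verbose=False, memory_usage='deep')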
Deleting
Deleting is one of the simplest ways to save space, and it is often the easiest fix for a memory problem. We may unknowingly keep many data frames around for training and testing that are never used later in the process, and data frames often carry unused or entirely null columns. Deleting these frees memory. We can use the del keyword followed by the item we want to delete, which removes our reference to it.
Python
# importing the module
import pandas as pd
# Reading the csv file
df = pd.read_csv('data.csv')
# Deleting an unwanted column
del df["Unnamed: 0"]

Changing categorical columns
When dealing with some datasets we might have categorical columns, where the whole column consists of a fixed set of values that repeat. Such values are basically classifications or groups: a large number of rows share the same categorical value. While dealing with this type of data, we can simply map each categorical value to a short code such as a single character or an integer (basically encoding it). This can significantly reduce the amount of memory used by the program.
Syntax: df['column_name'] = df['column_name'].replace('largerValue', 'code')
Python
# importing the modules
import pandas
import numpy
# Reading the csv file
df = pd.read_csv('data.csv')
# Memory usage before replacing
df['bedrooms'].memory_usage()
# output--> 314640
# Replacing the categorical values
df['bedrooms'].replace('more than 2', 1, inplace=True)
df['bedrooms'].replace('less than 2', 0, inplace=True)
# Memory usage after replacing
df['bedrooms'].memory_usage()
# output--> 173032
Output:
314640
173032
After replacing the categorical values of the bedrooms column with 1s and 0s, the memory usage drops to roughly half, as the output shows.
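Pandas also has a built-in category dtype for exactly this situation: it stores each distinct value once and keeps only a small integer code per row. A minimal sketch, assuming the same data.csv with its repetitive bedrooms column:
Python
import pandas as pd
df = pd.read_csv('data.csv')
# True size of the string column, contents included
print(df['bedrooms'].memory_usage(deep=True))
# Store each distinct value once; rows hold small integer codes
df['bedrooms'] = df['bedrooms'].astype('category')
print(df['bedrooms'].memory_usage(deep=True))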
Importing data in chunks
While dealing with data, in some situations we need to work with big datasets that may run into gigabytes. Your machine probably does not have enough memory to load all of this data at once, and trying to do so can be nearly impossible. For this situation, pandas has a very handy option to load the data in chunks, which lets us iterate through the dataset in pieces instead of loading everything at once and pushing the machine to its limit.
To do this we pass the chunksize parameter while reading the data, which gives us chunks that we can process one at a time, or concatenate into a complete dataset when needed. This keeps the machine's peak memory usage low.
Syntax: pandas.read_csv('fileName.csv', chunksize=1000)
This returns an iterable reader object that yields the dataset chunk by chunk.
Python
# importing the module
import pandas as pd
# Reading the data in chunks of 1000 rows
data = pd.read_csv('data.csv', chunksize=1000)
# Concatenating the chunks back into a single data frame
df = pd.concat(data)
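Concatenating everything back together rebuilds the full frame in memory, so the real saving comes from processing each chunk as it arrives and keeping only the aggregate. A hedged sketch, assuming data.csv has a numeric price column:
Python
import pandas as pd
total = 0.0
rows = 0
# Process one 1000-row chunk at a time; only the running totals stay in memory
for chunk in pd.read_csv('data.csv', chunksize=1000):
    total += chunk['price'].sum()
    rows += len(chunk)
print('mean price:', total / rows)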