Sparse Matrices: What, When, and Why
If you are a machine learning practitioner, you have probably faced dreaded memory errors. In most cases, the go-to solution is reducing the sample size, changing the algorithm, or upgrading the machine with more memory. Nevertheless, those solutions are not always feasible. Since most machine learning algorithms expect the dataset (a so-called DataFrame) as an in-memory object, it's hard to get around this issue. Here I'm introducing one more method to add to your toolbox. We call it "Sparse Matrices."
Introduction
As most of us use pandas DataFrames day to day, what is the real need for sparse matrices? The answer is space complexity and time complexity. In my experience, pandas DataFrames perform worst at millions of rows and hundreds of columns, and the reason is the way a pandas DataFrame stores its data.
Here is an on-disk vs in-memory size comparison of a CSV file. For demonstration purposes, I have created a sample dataset with 70,000 samples and 450 columns. The dataset creation and all the relevant code are available in my Github repo. For the sake of simplicity, let me explain this with a simple bar graph that compares the disk usage of the CSV file before reading with the memory usage after reading it as a DataFrame.
import os
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file into a DataFrame
data = pd.read_csv("train.csv")

# In-memory size of the DataFrame's values, in MB
memory_usage = data.to_numpy().nbytes / 1e6

# On-disk size of the original file (via the os module), in MB
disk_usage = os.path.getsize('/content/train.csv') / 1e6

# Let's plot the results
plt.figure(figsize=(10, 8))
plt.bar(x=["CSV", "DataFrame"], height=[disk_usage, memory_usage])
plt.title("Size comparison - CSV vs DataFrame")
plt.ylabel("Usage (MB)")
plt.show()
There is a significant difference. The likely reasons are inefficient data types and null/empty fields in the dataset. Since that is not the main focus here, I'm leaving that discussion aside. Feel free to try out the above code and experiment with different datasets.
What are Sparse Matrices?
There are two common matrix types: dense and sparse. The major difference is that a sparse matrix holds mostly zero values, while a dense matrix holds mostly non-zero values. Here is an example of a sparse matrix with four columns and four rows.
In the above matrix, 12 out of 16 elements are zeros. Just a simple question:
Can we store only non-zero values to compress the size of matrices while following our regular machine learning pipelines?
The simple answer is: yes, we can!
How so? Let me explain. We can easily convert a highly sparse matrix into a compressed sparse row matrix, or CSR matrix for short. It's not the only way to store a sparse matrix, but it's what we need when we want to apply matrix operations and access the matrix efficiently. Other options for storing sparse matrices include:
- Dictionary of keys (DOK)
- List of Lists (LIL)
- Coordinate list (COO)
- Compressed row storage (CRS)
One of the drawbacks of sparse matrices is that accessing individual elements becomes more complex. Here is a quick guide for selecting a suitable data structure for your use case (a short construction-and-conversion sketch follows the list).
- If your concern is efficient modification, use DOK, LIL, or COO. These are typically used to construct the matrices.
- If your concern is efficient access and matrix operations, use CSR or CSC.
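For example, here is a minimal sketch (using scipy.sparse, with the same values as the example matrix below) of building a matrix incrementally in the modification-friendly LIL format and then converting it to CSR once construction is done:

import numpy as np
from scipy import sparse

# Build incrementally in a modification-friendly format (LIL)
lil = sparse.lil_matrix((4, 4))
lil[0, 0] = 1
lil[1, 1] = 1
lil[1, 2] = 2
lil[3, :] = [2, 1, 1, 1]

# Convert to CSR for fast access and matrix operations
csr = lil.tocsr()
print(csr.toarray())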
For simplicity, let's dive into an example. Consider the matrix below.
Now, let's convert this matrix into a CSR matrix. Here, I'm using the scipy.sparse module.
import numpy as np
from scipy import sparse

# Create the matrix with numpy
m = np.array([[1, 0, 0, 0],
              [0, 1, 2, 0],
              [0, 0, 0, 0],
              [2, 1, 1, 1]])

# Convert the numpy array into a scipy csr_matrix
csr_m = sparse.csr_matrix(m)
An important point to note here is that while our original matrix stores the data in a single 2-D array, the converted CSR matrix stores it in three 1-D arrays.
Value array
As the name implies, this stores all non-zero elements of the original matrix. The length of the array equals the number of non-zero entries in the original matrix. In our example, we have 7 non-zero elements, so the length of the value array is 7.
Column index array
This array stores the column indices of elements in the value array. (Note that zero-based indices are used here)
Row index array
This array stores the cumulative count of the non-zero values across rows: row_index_array[j] encodes the total number of non-zero elements above row j, and the last element equals the total number of non-zero elements in the original matrix. Its length is m + 1, where m is the number of rows in the original matrix.
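You can inspect these three arrays directly on the csr_m object we created above; in scipy.sparse they are exposed as the data, indices, and indptr attributes (the value, column index, and row index arrays, respectively):

# The three 1-D arrays backing the CSR matrix
print(csr_m.data)     # value array        -> [1 1 2 2 1 1 1]
print(csr_m.indices)  # column index array -> [0 1 2 0 1 2 3]
print(csr_m.indptr)   # row index array    -> [0 1 3 3 7]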
Well, now we have successfully converted our original matrix into CSR format. With the above explanation and image, I hope you understand how a csr_matrix works under the hood.
When to use sparse matrices
As we discussed, we can achieve a decent memory reduction by storing only the non-zero entries. The important question at this point is,
If our dataset does not fit into memory, can we consider converting it into a sparse matrix?
It is hard to answer that without checking the dataset. I use two methods to check whether converting a dataset from a dense matrix to a sparse matrix makes sense.
Plot sample data
As a data scientist, you should plot your data before doing anything else. So let's apply that here as well.
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

# Read the dataset
data = pd.read_csv("train.csv")

# Plot the sparsity pattern of the first 500 samples
plt.figure(figsize=(8, 8))
plt.spy(data.head(500).T)
plt.axis('off')
plt.grid(False)
plt.show()
How do we read this image, and what does it tell us? First of all, here is the docstring for the plt.spy() function.
Plot the sparsity pattern of a 2D array. This visualizes the non-zero values of the array.
In the above image, every black dot represents a non-zero value, and most of the plot is empty. Hence, converting this data to a sparse matrix is worth your time.
Calculate sparsity in a numerical way
This is my second approach: simply calculate the sparsity using NumPy.
# Sparsity = fraction of entries that are zero
sparsity = 1 - np.count_nonzero(data) / data.size
print(sparsity)
If you run this code for the dataset mentioned above, you should get 0.886 as sparsity. This means more than 88% of data points are filled with zeros.
Why should we care about sparse matrices?
Well, there are many good reasons to use sparse matrices. The main ones are:
- Huge savings in memory compared to the basic (dense) approach.
- Often a reduction in model training time compared to the traditional (dense) approach.
And the good news is that almost all native algorithms in the sklearn API now accept a csr_matrix as input.
Here is an example from the sklearn.ensemble.RandomForestClassifier documentation:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.
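To illustrate, here is a minimal sketch of passing a CSR matrix directly to fit; the tiny feature matrix and labels below are made up purely for demonstration:

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# A small synthetic sparse feature matrix and labels (illustrative only)
X_sparse = sparse.csr_matrix(np.array([[1, 0, 0, 0],
                                       [0, 1, 2, 0],
                                       [0, 0, 0, 0],
                                       [2, 1, 1, 1]]))
y = np.array([0, 1, 0, 1])

# The estimator accepts the CSR matrix directly, no densifying needed
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_sparse, y)
print(clf.predict(X_sparse))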
Let's experiment with our dataset.
Memory Compression comparison
from sklearn.model_selection import train_test_split

# Train/test split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=1997)

# Plot compressed memory vs original memory (helper from the Github repo)
get_mem_usage(xtrain, xtest)
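The get_mem_usage helper lives in the Github repo mentioned earlier; its exact implementation is not shown in this article, but a minimal sketch of the idea (the helper name, the labels, and the plot styling are assumptions) could look like this:

import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt

def get_mem_usage(xtrain, xtest):
    # Compare memory usage of the dense splits vs their CSR equivalents
    def dense_mb(x):
        # Works for both numpy arrays and DataFrames
        return np.asarray(x).nbytes / 1e6

    def csr_mb(x):
        # CSR memory = value array + column index array + row index array
        csr = sparse.csr_matrix(np.asarray(x))
        return (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6

    dense_total = dense_mb(xtrain) + dense_mb(xtest)
    sparse_total = csr_mb(xtrain) + csr_mb(xtest)

    plt.figure(figsize=(8, 6))
    plt.bar(x=["Dense", "CSR"], height=[dense_total, sparse_total])
    plt.title("Memory usage - dense vs CSR")
    plt.ylabel("Usage (MB)")
    plt.show()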
It's a huge reduction: roughly, our dataset is compressed by a factor of 0.8, i.e. it is about 80% smaller. And remember, the sparsity of the dataset is 0.88, so this is roughly what we would expect.
By doing this simple trick, we have drastically reduced the memory usage of our dataset. Let's move on to the model training time comparison.
Model training time comparison
Here I'm going to test a few popular machine learning algorithms from the sklearn API (a sketch of the timing setup follows the list):
- LogisticRegression
- GradientBoostingClassifier
- LinearSVC
- KMeans
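The full timing code is in the Github repo; a minimal sketch of how such a comparison might be set up is shown below. The synthetic data, the choice of time.perf_counter, and the estimator settings are all assumptions made for illustration, and GradientBoostingClassifier can be added to the loop in the same way:

import time
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans

def time_fit(model, X, y=None):
    # Return the wall-clock time taken by model.fit
    start = time.perf_counter()
    if y is None:
        model.fit(X)       # unsupervised (e.g. KMeans)
    else:
        model.fit(X, y)    # supervised
    return time.perf_counter() - start

# Synthetic data with roughly 90% zeros (illustrative only)
rng = np.random.default_rng(0)
X_dense = rng.random((2000, 100)) * (rng.random((2000, 100)) > 0.9)
y = rng.integers(0, 2, size=2000)
X_csr = sparse.csr_matrix(X_dense)

for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("LinearSVC", LinearSVC()),
                    ("KMeans", KMeans(n_clusters=2, n_init=10))]:
    labels = None if name == "KMeans" else y
    t_dense = time_fit(model, X_dense, labels)
    t_sparse = time_fit(model, X_csr, labels)
    print(f"{name}: dense {t_dense:.2f}s vs sparse {t_sparse:.2f}s")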