[NumPy Tips] Using NumPy to Save and Centrally Manage Datasets in Machine Learning

The article was first published by 若绾

Abstract#

Data is central to every machine learning project, so how that data is managed matters. Data management covers collection, cleaning, storage, and processing. This article focuses on storage: how to use NumPy to save a dataset to disk so that it can be managed in one place.

Saving Datasets with NumPy#

NumPy is a Python library for scientific computing. It provides a powerful multidimensional array object and a set of functions for manipulating these arrays. NumPy arrays can store different types of data, including numbers, strings, and boolean values, and NumPy can write them to disk in its own binary formats. This makes them a convenient choice for storing datasets.

Saving a Single Array#

If your dataset consists of a single array, you can use NumPy's save function.

numpy.save#

Function parameters:

file: file, str, or pathlib.Path

The file or filename where the data is saved. If file is a file object, the filename will not be changed. If file is a string or Path, a .npy extension will be appended to the filename if it does not already have one.

arr: array_like

The array data to be saved.

allow_pickle: bool, optional

Allow saving object arrays using Python pickles. Reasons for disallowing pickles include security (loading pickled data can execute arbitrary code) and portability (pickled objects may not be loadable on different Python installations, for example, if the stored objects require library versions that are unavailable, and not all pickled data is compatible between Python 2 and Python 3). Default: True

fix_imports: bool, optional

Only useful on Python 3. If fix_imports is True, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.

For example, given a NumPy array named data, the following code saves it to a file named data.npy:

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)
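
Two of the parameter notes above can be checked directly. The sketch below is only an illustration under stated assumptions: the temporary directory, the file names, and the small object array are made up for the example. It saves to a filename without an extension and then tries to save an object array with allow_pickle=False.

```python
import os
import tempfile

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Work in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension is given here, so NumPy appends ".npy" to the filename.
    np.save(os.path.join(tmp_dir, "data"), data)
    saved_files = sorted(os.listdir(tmp_dir))

    # An object array can only be stored through pickle, so np.save
    # refuses it when allow_pickle=False.
    object_array = np.array([{"label": "cat"}, {"label": "dog"}], dtype=object)
    try:
        np.save(
            os.path.join(tmp_dir, "objects.npy"),
            object_array,
            allow_pickle=False,
        )
        pickle_rejected = False
    except ValueError:
        pickle_rejected = True

print(saved_files)       # names of the files present after the first save
print(pickle_rejected)   # whether the object array was refused
```

For purely numeric datasets, passing allow_pickle=False is a cheap way to make sure nothing pickled ends up in the file.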

Loading Arrays from Files#

To load a dataset, we can use the load function of NumPy. For example, the following code loads the array saved in the file named data.npy:

import numpy as np

data = np.load('data.npy')
print(data)

The output will be:

[[1 2 3]
 [4 5 6]
 [7 8 9]]
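
The round trip also keeps the array's dtype and shape, and the file argument does not have to be a path. As a hedged sketch, the example below uses an in-memory io.BytesIO buffer in place of a file on disk and an explicit int32 dtype; both are choices made for this illustration only.

```python
import io

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int32)

# np.save accepts an open file object as well as a filename.
buffer = io.BytesIO()
np.save(buffer, data)

# Rewind to the start of the buffer before reading the array back.
buffer.seek(0)
restored = np.load(buffer)

print(restored.dtype, restored.shape)
print(np.array_equal(data, restored))
```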

Saving Multiple Arrays Simultaneously#

If your dataset consists of multiple arrays, such as train_set, train_label, test_set, test_label, you can use the savez or savez_compressed function of NumPy to save the dataset. numpy.savez saves multiple arrays as an uncompressed .npz file, while numpy.savez_compressed saves the arrays as a compressed .npz file, which can save storage space.

numpy.savez#

Function parameters:

file: str or file

The filename (string) or a file (file-like object) where the data will be saved. If file is a string or a Path, a .npz extension will be appended to the filename if it does not already have one.

args: Arguments, optional

Arrays to be saved to the file, passed positionally. NumPy cannot know the variable names of these arrays, so they are stored as "arr_0", "arr_1", and so on. To give the arrays meaningful names, pass them as keyword arguments instead (see below).

kwds: Keyword arguments, optional

Arrays to be saved to the file. Each array will be saved under the name of its corresponding keyword.

In this example, we create two NumPy arrays, array1 and array2, and use numpy.savez to save them to a file named arrays.npz. Each array is passed as a keyword argument, and the keyword becomes the name of the array in the file. Keywords are not required, but arrays passed positionally fall back to the generic names "arr_0", "arr_1", and so on.

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Save arrays to file
np.savez('arrays.npz', arr1=array1, arr2=array2)
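
To see how the two naming schemes differ, the following sketch passes one array positionally and one by keyword, then lists the names stored in the archive. The in-memory buffer is an assumption made to keep the example self-contained; a filename works the same way.

```python
import io

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()

# array1 is passed positionally, array2 by keyword.
np.savez(buffer, array1, arr2=array2)

# Rewind, then read back the names stored in the archive.
buffer.seek(0)
with np.load(buffer) as archive:
    names = sorted(archive.files)
    positional = archive["arr_0"]

print(names)
print(positional)
```

The positional array is only reachable under its generic name, which is why keyword arguments are the safer habit for datasets with several parts.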

numpy.savez_compressed#

Function parameters:

file: str or file

The filename (string) or a file (file-like object) where the data will be saved. If file is a string or a Path, a .npz extension will be appended to the filename if it does not already have one.

args: Arguments, optional

Arrays to be saved to the file, passed positionally. NumPy cannot know the variable names of these arrays, so they are stored as "arr_0", "arr_1", and so on. To give the arrays meaningful names, pass them as keyword arguments instead (see below).

kwds: Keyword arguments, optional

Arrays to be saved to the file. Each array will be saved under the name of its corresponding keyword.

This example is similar to the previous one, but uses numpy.savez_compressed to save the arrays as a compressed .npz file.

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Save arrays to compressed file
np.savez_compressed('compressed_arrays.npz', arr1=array1, arr2=array2)
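
How much space compression saves depends on the data. The sketch below compares the two functions on an all-zero array, which is a deliberately favorable case chosen for illustration; the array shape, file names, and temporary directory are likewise assumptions, and real datasets will usually shrink far less.

```python
import os
import tempfile

import numpy as np

# Highly repetitive data (all zeros) compresses very well.
features = np.zeros((1000, 100), dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.npz")
    packed_path = os.path.join(tmp_dir, "packed.npz")

    # Same array, stored once uncompressed and once compressed.
    np.savez(plain_path, features=features)
    np.savez_compressed(packed_path, features=features)

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

print(plain_size, packed_size)  # file sizes in bytes
```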

Loading Arrays from .npz Files#

To load arrays from an .npz file, you can use the numpy.load function:

import numpy as np

# Load saved arrays
loaded_arrays = np.load('arrays.npz')

# Access arrays by the names specified in the file
loaded_array1 = loaded_arrays['arr1']
loaded_array2 = loaded_arrays['arr2']

In this example, we use numpy.load to load the file named arrays.npz, and access the arrays within it by the names specified earlier. The same method applies to loading compressed .npz files.
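
Putting the pieces together, the sketch below stores the four parts of a dataset mentioned earlier (train_set, train_label, test_set, test_label) in one compressed file and reads them back. The random toy data, the dataset.npz file name, and the temporary directory are hypothetical stand-ins for a real dataset and its location.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
train_set = rng.random((8, 4))
train_label = rng.integers(0, 2, size=8)
test_set = rng.random((2, 4))
test_label = rng.integers(0, 2, size=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")
    np.savez_compressed(
        path,
        train_set=train_set,
        train_label=train_label,
        test_set=test_set,
        test_label=test_label,
    )

    # The context manager closes the underlying file when the block ends,
    # so every array is read into memory inside the block.
    with np.load(path) as dataset:
        names = sorted(dataset.files)
        loaded = {name: dataset[name] for name in names}

print(names)
print(loaded["train_set"].shape, loaded["test_label"].shape)
```

Opening the archive in a with block is a design choice rather than a requirement: it makes sure the file handle is released once the arrays have been read.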

Conclusion#

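In this article, we discussed how to use NumPy to save datasets for centralized management: numpy.save for a single array, numpy.savez and numpy.savez_compressed for several named arrays in one .npz file, and numpy.load to read either format back. Keeping all parts of a dataset together in one file makes the data easier to store, share, and reload, which is worth the small effort in any machine learning project.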