The article was first published by 若绾
Abstract#
In machine learning, data is crucial. Therefore, for any machine learning project, the management and processing of data are essential. Data management involves many aspects, including data collection, cleaning, storage, and processing. In this article, we will discuss how to use NumPy to save datasets for centralized management.
Saving Datasets with NumPy#
NumPy is a Python library for scientific computing. It provides a powerful multidimensional array object and a set of functions for manipulating these arrays. A NumPy array can hold numbers, strings, or boolean values (each array has a single element type, its dtype), which makes arrays a good fit for storing datasets.
Saving a Single Array#
If your dataset consists of a single array, you can use NumPy's save function.
numpy.save#
Function parameters:
file: file, str, or pathlib.Path
The file or filename where the data is saved. If file is a file object, the filename is left unchanged. If file is a string or Path, a .npy extension is appended to the filename if it does not already have one.
arr: array_like
The array data to be saved.
allow_pickle: bool, optional
Allow saving object arrays using Python pickles. Reasons for disallowing pickles include security (loading pickled data can execute arbitrary code) and portability (pickled objects may not be loadable on different Python installations, for example, if the stored objects require library versions that are unavailable, and not all pickled data is compatible between Python 2 and Python 3). Default: True
fix_imports: bool, optional
Only useful on Python 3. If fix_imports is True, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.
For example, suppose we have a NumPy array named data. We can save it to a file named data.npy using the following code:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)
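The extension rule described for the file parameter can be checked with a small sketch. The temporary directory and the file names "data" and "labels.npy" are assumptions chosen for illustration, not part of the original example:

```python
import tempfile
from pathlib import Path

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

with tempfile.TemporaryDirectory() as tmp:
    # No extension given: NumPy appends ".npy" to the name.
    np.save(Path(tmp) / "data", data)
    # Extension already present: the name is used as-is.
    np.save(Path(tmp) / "labels.npy", data[:, 0])
    saved_names = sorted(p.name for p in Path(tmp).iterdir())

print(saved_names)  # ['data.npy', 'labels.npy']
```

Passing a pathlib.Path instead of a string works the same way, which can be convenient when dataset files live under a common directory.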
Loading Arrays from Files#
To load a dataset, we can use NumPy's load function. For example, the following code loads the array saved in the file named data.npy:
import numpy as np
data = np.load('data.npy')
print(data)
The output will be:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
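A quick way to confirm that a save and load round trip keeps the values, dtype, and shape is sketched below. The float32 array and the temporary path are illustrative assumptions. Passing allow_pickle=False makes explicit that pickled object arrays will not be loaded, in line with the security note above:

```python
import tempfile
from pathlib import Path

import numpy as np

original = np.array([[1.5, 2.5], [3.5, 4.5]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.npy"
    np.save(path, original)
    # Refuse pickled object arrays when loading.
    restored = np.load(path, allow_pickle=False)

same_values = np.array_equal(original, restored)
print(same_values, restored.dtype, restored.shape)  # True float32 (2, 2)
```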
Saving Multiple Arrays Simultaneously#
If your dataset consists of multiple arrays, such as train_set, train_label, test_set, and test_label, you can use NumPy's savez or savez_compressed function to save them in a single file. numpy.savez saves multiple arrays into an uncompressed .npz file, while numpy.savez_compressed saves them into a compressed .npz file, which usually takes less storage space.
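The size difference can be observed with a rough sketch like the one below. The all-zeros array is an assumption picked because it compresses very well; real datasets will typically shrink less, and random data may barely shrink at all:

```python
import os
import tempfile

import numpy as np

# Highly repetitive placeholder data; real datasets will compress less.
train_set = np.zeros((1000, 100), dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.npz")
    packed_path = os.path.join(tmp, "packed.npz")
    np.savez(plain_path, train_set=train_set)
    np.savez_compressed(packed_path, train_set=train_set)
    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

print(f"uncompressed: {plain_size} bytes, compressed: {packed_size} bytes")
```

Compression trades extra CPU time when saving and loading for a smaller file, so which function fits better depends on the data and how often it is read.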
numpy.savez#
Function parameters:
file: file, str, or pathlib.Path
The filename (string) or a file (file-like object) where the data will be saved. If file is a string or a Path, a .npz extension will be appended to the filename if it does not already have one.
args: Arguments, optional
Arrays to be saved to the file. Arrays passed positionally are named "arr_0", "arr_1", and so on, because NumPy cannot know their variable names. Use keyword arguments (see below) to give the arrays meaningful names.
kwds: Keyword arguments, optional
Arrays to be saved to the file. Each array will be saved under the name of its corresponding keyword.
In this example, we create two NumPy arrays, array1 and array2, and then use numpy.savez to save them to a file named arrays.npz. Note that each array is passed as a keyword argument, and the keyword becomes the name of the array in the file.
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Save arrays to file
np.savez('arrays.npz', arr1=array1, arr2=array2)
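For comparison, the following sketch mixes one positional array with one keyword array to show the automatic "arr_0" naming described in the parameter list. The temporary path is an assumption for illustration:

```python
import os
import tempfile

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    # array1 is positional, so it is stored as "arr_0";
    # array2 is stored under its keyword, "arr2".
    np.savez(path, array1, arr2=array2)
    with np.load(path) as archive:
        names = set(archive.files)
        first = archive["arr_0"]

print(names)
print(first)
```

Keyword names are usually the better choice for datasets, since "arr_0" says nothing about what the array contains.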
numpy.savez_compressed#
Function parameters:
file: file, str, or pathlib.Path
The filename (string) or a file (file-like object) where the data will be saved. If file is a string or a Path, a .npz extension will be appended to the filename if it does not already have one.
args: Arguments, optional
Arrays to be saved to the file. Arrays passed positionally are named "arr_0", "arr_1", and so on, because NumPy cannot know their variable names. Use keyword arguments (see below) to give the arrays meaningful names.
kwds: Keyword arguments, optional
Arrays to be saved to the file. Each array will be saved under the name of its corresponding keyword.
This example is similar to the previous one, but uses numpy.savez_compressed to save the arrays as a compressed .npz file.
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Save arrays to compressed file
np.savez_compressed('compressed_arrays.npz', arr1=array1, arr2=array2)
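Applied to the four-array dataset mentioned earlier (train_set, train_label, test_set, test_label), a minimal sketch might look like this. The shapes, the random placeholder values, and the file name dataset.npz are all assumptions for illustration:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
# Placeholder data standing in for a real dataset.
train_set = rng.random((8, 4))
train_label = rng.integers(0, 2, size=8)
test_set = rng.random((2, 4))
test_label = rng.integers(0, 2, size=2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.npz")
    # One compressed file holds the whole dataset.
    np.savez_compressed(
        path,
        train_set=train_set,
        train_label=train_label,
        test_set=test_set,
        test_label=test_label,
    )
    with np.load(path) as dataset:
        shapes = {name: dataset[name].shape for name in dataset.files}

print(shapes)
```

Keeping all splits in one file means they are copied, versioned, and loaded together, which is the centralized management this article is about.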
Loading Arrays from Files#
To load arrays from an .npz file, you can use the numpy.load function:
import numpy as np
# Load saved arrays
loaded_arrays = np.load('arrays.npz')
# Access arrays by the names specified in the file
loaded_array1 = loaded_arrays['arr1']
loaded_array2 = loaded_arrays['arr2']
In this example, we use numpy.load to load the file named arrays.npz and access the arrays within it by the names specified earlier. The same method applies to loading compressed .npz files.
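The object returned by numpy.load for an .npz file keeps the file open until it is closed, so wrapping it in a with statement is a common pattern. The sketch below uses a hypothetical helper, load_named_arrays, which is not part of NumPy, to check that the requested names exist before reading them:

```python
import os
import tempfile

import numpy as np


def load_named_arrays(path, names):
    """Hypothetical helper: return {name: array} for the requested names."""
    with np.load(path) as archive:
        missing = [name for name in names if name not in archive.files]
        if missing:
            raise KeyError(f"arrays not found: {missing}")
        # Indexing reads each array into memory, so the results stay
        # usable after the archive is closed.
        return {name: archive[name] for name in names}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez(path, arr1=np.array([1, 2, 3]), arr2=np.array([4, 5, 6]))
    loaded = load_named_arrays(path, ["arr1", "arr2"])

print(loaded["arr1"], loaded["arr2"])  # [1 2 3] [4 5 6]
```

The files attribute lists the names stored in the archive, which helps when the file was written by someone else.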
Conclusion#
In this article, we discussed how to use NumPy to save datasets for centralized management. This is an important aspect of data management that should be given due attention in any machine learning project.