PyTorch Datasets: Converting entire Dataset to NumPy

Andrew Fan · Feb 27, 2019

I'm trying to convert the Torchvision MNIST train and test datasets into NumPy arrays but can't find documentation to actually perform the conversion.

My goal would be to take an entire dataset and convert it into a single NumPy array, preferably without iterating through the entire dataset.

I've looked at "How do I turn a Pytorch Dataloader into a numpy array to display image data with matplotlib?", but it doesn't address my issue.

So my question is, utilizing torch.utils.data.DataLoader, how would I go about converting the datasets (train/test) into two NumPy arrays such that all of the examples are present?

Note: I've left the batch size as the default of 1 for now; I could set it to 60,000 for train and 10,000 for test, but I'd prefer to not use magic numbers of that sort.

Thank you.

Answer

damavrom · Oct 21, 2019

There is no need to use torch.utils.data.DataLoader for this task.

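from torchvision import datasets

# No transform is passed, so the raw images stay in the .data tensor.
train_set = datasets.MNIST('./data', train=True, download=True)
test_set = datasets.MNIST('./data', train=False, download=True)

# .data holds every image in one tensor, so no iteration is needed.
train_set_array = train_set.data.numpy()
test_set_array = test_set.data.numpy()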

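Note that this returns only the images; the targets (labels) are excluded. They are stored separately in train_set.targets and test_set.targets.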
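If a DataLoader is still preferred, as the question asks, one possible approach is to set batch_size to len(dataset), which avoids hard-coding 60,000 or 10,000. The sketch below is only an illustration: dataset_to_numpy is a hypothetical helper (not part of torch or torchvision), and a small synthetic TensorDataset stands in for MNIST so it runs without a download. With the real MNIST dataset you would likely need transform=transforms.ToTensor() so that each item is a tensor the default collate function can stack.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def dataset_to_numpy(dataset):
    """Load a whole map-style dataset as one batch and return NumPy arrays.

    Hypothetical helper for illustration; assumes each item is an
    (input, target) pair of tensors.
    """
    # batch_size=len(dataset) makes a single batch cover every example,
    # so no magic number such as 60000 is needed.
    loader = DataLoader(dataset, batch_size=len(dataset), shuffle=False)
    inputs, targets = next(iter(loader))
    return inputs.numpy(), targets.numpy()


# Synthetic stand-in for MNIST: 10 blank 1x28x28 images with labels 0..9.
demo_set = TensorDataset(torch.zeros(10, 1, 28, 28), torch.arange(10))

images, labels = dataset_to_numpy(demo_set)
print(images.shape, labels.shape)
```

Unlike reading .data directly, this route still visits every item internally, but it applies any transform attached to the dataset and returns the targets alongside the images.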