How to use very large dataset in RNN TensorFlow?

Question 1

pandas machine-learning tensorflow dataset data-processing

afagarap · Jul 25, 2017 · Viewed 8.3k times · Source

Answer

For large datasets (and 6.2 GB already counts as large), reading all the data into memory at once is usually not the best idea. Since you are going to train your network batch by batch anyway, it is sufficient to load only the data needed for the next batch.
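A minimal sketch of that idea, assuming pandas is used for reading as in the question: when `read_csv` is given a `chunksize`, it returns an iterator of DataFrames rather than one large frame. The helper name `iter_batches` and the batch size are illustrative assumptions, not part of any library.

```python
import pandas as pd


def iter_batches(csv_files, col_names, batch_size=1024):
    """Yield DataFrames of at most `batch_size` rows, one file at a time.

    `iter_batches` is a hypothetical helper; only `pd.read_csv` is pandas API.
    """
    for csv_file in csv_files:
        # With `chunksize` set, read_csv returns an iterator of DataFrames
        # instead of loading the whole file into memory.
        for chunk in pd.read_csv(csv_file, names=col_names, chunksize=batch_size):
            yield chunk
```

Each chunk can be converted and passed to one training step before the next chunk is read, so only one batch is held in memory at a time. Batches in this sketch do not span file boundaries, so the last chunk of each file may be smaller than `batch_size`.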

The TensorFlow documentation provides a good overview of how to implement a data-reading pipeline. According to that documentation, the stages are:

The list of filenames

Optional filename shuffling

Optional epoch limit

Filename queue

A Reader for the file format

A decoder for a record read by the reader

Optional preprocessing

Example queue
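The stages listed above can be sketched with plain Python generators. This is only an analogue that mirrors the order of the stages, not TensorFlow's queue-based reader API. The function names (`read_records`, `decode`, `batches`) are hypothetical, and the decoder assumes, purely for illustration, that every column is numeric and the label comes last.

```python
import csv
import random


def read_records(filenames, num_epochs=1, shuffle=True, seed=None):
    """Filename list, optional shuffling, epoch limit, and a CSV reader."""
    rng = random.Random(seed)
    for _ in range(num_epochs):                 # epoch limit
        names = list(filenames)                 # the list of filenames
        if shuffle:
            rng.shuffle(names)                  # optional filename shuffling
        for name in names:                      # stands in for the filename queue
            with open(name, newline="") as handle:
                for row in csv.reader(handle):  # reader for the file format
                    yield row


def decode(row):
    """Decoder: turn one raw CSV row into (features, label).

    Assumption for illustration: all columns numeric, label in the last column.
    """
    values = [float(v) for v in row]
    return values[:-1], values[-1]


def batches(records, batch_size, preprocess=None):
    """Decode each record, optionally preprocess it, and group into batches."""
    batch = []
    for row in records:
        example = decode(row)
        if preprocess is not None:
            example = preprocess(example)       # optional preprocessing
        batch.append(example)
        if len(batch) == batch_size:            # plays the role of the example queue
            yield batch
            batch = []
    if batch:                                   # final, possibly smaller batch
        yield batch
```

Because every stage is a generator, rows are pulled from disk only when the training loop asks for the next batch, for example `for batch in batches(read_records(files, num_epochs=10), batch_size=256): ...`.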

Question 2

I have a very large dataset: 7.9 GB of CSV files, 80% of which will serve as training data and the remaining 20% as test data. When loading the training data (6.2 GB), I get a MemoryError at the 80th iteration (80th file). Here's the script I'm using to load the data:

import pandas as pd
import os

col_names = ['duration', 'service', 'src_bytes', 'dest_bytes', 'count', 'same_srv_rate',
        'serror_rate', 'srv_serror_rate', 'dst_host_count', 'dst_host_srv_count',
        'dst_host_same_src_port_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
        'flag', 'ids_detection', 'malware_detection', 'ashula_detection', 'label', 'src_ip_add',
        'src_port_num', 'dst_ip_add', 'dst_port_num', 'start_time', 'protocol']

# create a list to store the filenames
files = []

# create a dataframe to store the contents of CSV files
df = pd.DataFrame()

# get the filenames in the specified PATH
for (dirpath, dirnames, filenames) in os.walk(path):
    ''' Append to the list the filenames under the subdirectories of the <path> '''
    files.extend(os.path.join(dirpath, filename) for filename in filenames)

for file in files:
    df = df.append(pd.read_csv(filepath_or_buffer=file, names=col_names, engine='python'))
    print('Appending file : {file}'.format(file=file))

pd.set_option('display.max_colwidth', -1)
print(df)

There are 130 files in the 6.2 GB worth of CSV files.
