Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

Andrew Sturges · Jun 4, 2015

It seems like the nature of the MapReduce framework is to work with many files. So when I get errors that tell me I'm using too many files, I suspect I'm doing something wrong.

If I run the job with the inline runner and three directories, it works:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

But if I run it using the local runner (and the same three directories), it fails:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/

[...output clipped...]

> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files

Furthermore, if I go back to using the inline runner and include even more directories (11 total) in my input, then I get a different error again:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

[...clipped...]

Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run 
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run 
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long

The mrjob docs include a discussion of the differences between the inline and local runners, but I don't understand how it would explain this behavior.

Lastly, I'll mention that the number of files in the directories I'm globbing isn't huge:

$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do   printf "%-25.25s : " "$dir";   find "$dir" -type f | wc -l; done | sort
./01                      :      236
./02                      :      169
./03                      :      176
./04                      :      185
./05                      :      176
./06                      :      235
./07                      :      275
./08                      :      265
./09                      :      186
./10                      :      171
./11                      :      161

I don't think this has to do with the job itself, but here it is:

from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

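    def mapper(self, _, line):
        # CSV columns used: 1 = latitude, 2 = longitude, 4 = horizontal accuracy.
        try:
            fields = line.split(',')
            lat = float(fields[1])
            lng = float(fields[2])
            horizontalAccuracy = float(fields[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except Exception:
            # Skip lines that are malformed or cannot be encoded.
            pass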

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()

Answer

Brent · Feb 22, 2017

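The "Argument list too long" problem is not the job or Python; it's the operating system's limit on how long a command line can be. The asterisks in the command line you use to kick off the job expand to every file that matches, which makes a really long argument list. In your second traceback the limit is hit when mrjob's _invoke_sort hands the input paths to an external command through check_call.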

That error has nothing to do with ulimit. The "Too many open files" error, on the other hand, does come from ulimit (the per-process limit on open file descriptors), so you would bump into it as well once the command actually runs.
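To look at the open-files limit from inside Python rather than with `ulimit -n`, something like the following could work. It is only a sketch, written for Python 3: the helper name `raise_open_file_limit` and the target of 4096 are my own assumptions, not anything from mrjob, and I have not confirmed that raising the limit is enough to make the local runner succeed.

```python
import resource


def raise_open_file_limit(target=4096):
    """Try to raise the soft open-files limit toward `target`.

    Returns (old_soft, new_soft). `target` is an arbitrary example value.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # The soft limit can never exceed the hard limit.
    new_soft = target
    if hard != resource.RLIM_INFINITY:
        new_soft = min(target, hard)

    # Nothing to do if the limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or new_soft <= soft:
        return soft, soft

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some systems refuse values above a kernel-wide cap.
        return soft, soft
    return soft, new_soft
```

Resource limits are inherited by child processes, so this would have to be called in the script that launches the job, before `MRGPSQuality.run()`.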

If you are interested, you can check the limit like this: getconf ARG_MAX
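The same number is available from Python, and you can roughly estimate how much of it a given argument list would use. This is a sketch for Python 3 with hypothetical helper names (`system_arg_max`, `arg_list_bytes`); the estimate only counts the strings themselves, and the kernel's exact accounting may differ.

```python
import os


def system_arg_max():
    """Return the limit that `getconf ARG_MAX` reports, in bytes."""
    return os.sysconf("SC_ARG_MAX")


def arg_list_bytes(args, env=None):
    """Approximate the bytes needed for an argument list plus environment.

    Each argument costs its encoded length plus a NUL terminator; each
    environment entry is stored as "KEY=VALUE" plus a NUL terminator.
    """
    env = os.environ if env is None else env
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in args)
    env_bytes = sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items()
    )
    return arg_bytes + env_bytes
```

Comparing `arg_list_bytes(list_of_paths)` against `system_arg_max()` gives a feel for how close a long list of input paths is to the limit. Note that the environment counts toward it too.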

To get around the maximum-arguments problem, you can concatenate all the files into one like this:

for f in *; do cat "$f" >> ../directory/bigfile.log; done

Then run your mrjob pointed at the big file.

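If it's a lot of files, you can concatenate them with several processes using GNU parallel, because the command above runs in a single process and is slow. Be aware that several cat processes appending to the same file at once may interleave their output, so check the result if line integrity matters.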

ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"

*Change 8 to the number of parallel jobs you want.
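If you would rather do the concatenation from Python, a minimal sketch could look like this. The function name `concatenate_logs` is made up for illustration. Because the glob is expanded inside Python, no long argument list is handed to another program, and only one input file is open at a time.

```python
import glob
import shutil


def concatenate_logs(pattern, dest_path):
    """Copy every file matching `pattern` into `dest_path`, in sorted order.

    Returns the number of files copied. `dest_path` should not itself
    match `pattern`, or it would be read while it is being written.
    """
    paths = sorted(glob.glob(pattern))
    with open(dest_path, "wb") as dest:
        for path in paths:
            # Only the destination and one source file are open at a time.
            with open(path, "rb") as src:
                shutil.copyfileobj(src, dest)
    return len(paths)
```

For the layout in the question that would be something like `concatenate_logs("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log", "bigfile.log")`, after which the job can be pointed at the single file. As with cat, a file that does not end in a newline will have its last line joined to the first line of the next file.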