It seems to be in the nature of the MapReduce framework to work with many files, so when I get errors telling me I'm using too many files, I suspect I'm doing something wrong.
If I run the job with the inline runner and three directories, it works:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
But if I run it using the local runner (and the same three directories), it fails:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/
[...output clipped...]
> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files
Furthermore, if I go back to using the inline runner and include even more directories (11 in total) in my input, I get a different error:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
[...clipped...]
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long
The mrjob docs include a discussion of the differences between the inline and local runners, but I don't see how those differences would explain this behavior.
Lastly, I'll mention that the number of files in the directories I'm globbing isn't huge:
$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do printf "%-25.25s : " "$dir"; find "$dir" -type f | wc -l; done | sort
./01 : 236
./02 : 169
./03 : 176
./04 : 185
./05 : 176
./06 : 235
./07 : 275
./08 : 265
./09 : 186
./10 : 171
./11 : 161
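(That's 2,235 files in total across the eleven directories.)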
I don't think this has to do with the job itself, but here it is:
from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

    def mapper(self, _, line):
        try:
            # CSV fields: lat in column 1, lng in column 2,
            # horizontal accuracy in column 4
            fields = line.split(',')
            lat = float(fields[1])
            lng = float(fields[2])
            horizontalAccuracy = float(fields[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except (IndexError, ValueError):
            # Skip malformed lines
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to an array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()
The "Argument list too long" problem is not the job or Python; it's bash. The asterisk in the command line you use to kick off the job expands to every matching file, which produces a really long command line that exceeds the kernel's argument-length limit (ARG_MAX).
That error has nothing to do with ulimit. The "Too many open files" error, however, is a ulimit problem: it's the per-process limit on open file descriptors, which you would still bump into if the command actually ran.
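If you mainly want the local runner to get through this many inputs, raising the descriptor limit for your shell session may be enough. A minimal sketch (defaults and hard caps vary by OS; on OS X the soft limit is often 256, well below the ~2,200 files you're globbing):
ulimit -n        # show the current soft limit on open file descriptors
ulimit -n 4096   # raise it for this shell session and the jobs it launches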
You can check the argument-length limit like this, if you're interested:
getconf ARG_MAX
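For reference, on OS X this typically prints 262144: the limit, in bytes, on the combined length of the arguments passed to a new process.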
To get around the max-args problem, you can concatenate all the files into one:
for f in *; do cat "$f" >> ../directory/bigfile.log; done
Then run your mrjob pointed at the big file.
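Alternatively, you can build the big file without relying on shell globbing at all: find prints the paths and xargs batches them so that no single cat invocation exceeds ARG_MAX. A sketch using the paths from your question (../files/bigfile.log is just an example destination):
# find emits every .log path; xargs -0 feeds them to cat in
# batches that stay under ARG_MAX
find /Volumes/Logs/gps/ByCityLogs/city1 -name '*.log' -print0 | xargs -0 cat >> ../files/bigfile.log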
If it's a lot of files, you can concatenate them in parallel using GNU parallel, because the command above is single-threaded and slow:
ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"
Change 8 to the amount of parallelism you want.
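One caveat: with several jobs appending concurrently, the order of the chunks in bigfile.log is nondeterministic. That doesn't matter here, since your mapper treats every line independently.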