I'm testing subprocesses pipelines with python. I'm aware that I can do what the programs below do in python directly, but that's not the point. I just want to test the pipeline so I know how to use it.
My system is Linux Ubuntu 9.04 with default python 2.6.
I started with this documentation example.
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
print output
That works, but since p1
's stdin
is not being redirected, I have to type stuff in the terminal to feed the pipe. When I type ^D
closing stdin, I get the output I want.
However, I want to send data to the pipe using a python string variable. First I tried writing on stdin:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
p1.stdin.write('test\n')
output = p2.communicate()[0] # blocks forever here
Didn't work. I tried using p2.stdout.read()
instead on last line, but it also blocks. I added p1.stdin.flush()
and p1.stdin.close()
but it didn't work either. I Then I moved to communicate:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
p1.communicate('test\n') # blocks forever here
output = p2.communicate()[0]
So that's still not it.
I noticed that running a single process (like p1
above, removing p2
) works perfectly. And passing a file handle to p1
(stdin=open(...)
) also works. So the problem is:
Is it possible to pass data to a pipeline of 2 or more subprocesses in python, without blocking? Why not?
I'm aware I could run a shell and run the pipeline in the shell, but that's not what I want.
UPDATE 1: Following Aaron Digulla's hint below I'm now trying to use threads to make it work.
First I've tried running p1.communicate on a thread.
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
t = threading.Thread(target=p1.communicate, args=('some data\n',))
t.start()
output = p2.communicate()[0] # blocks forever here
Okay, didn't work. Tried other combinations like changing it to .write()
and also p2.read()
. Nothing. Now let's try the opposite approach:
def get_output(subp):
output = subp.communicate()[0] # blocks on thread
print 'GOT:', output
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
t = threading.Thread(target=get_output, args=(p2,))
t.start()
p1.communicate('data\n') # blocks here.
t.join()
code ends up blocking somewhere. Either in the spawned thread, or in the main thread, or both. So it didn't work. If you know how to make it work it would make easier if you can provide working code. I'm trying here.
UPDATE 2
Paul Du Bois answered below with some information, so I did more tests.
I've read entire subprocess.py
module and got how it works. So I tried applying exactly that to code.
I'm on linux, but since I was testing with threads, my first approach was to replicate the exact windows threading code seen on subprocess.py
's communicate()
method, but for two processes instead of one. Here's the entire listing of what I tried:
import os
from subprocess import Popen, PIPE
import threading
def get_output(fobj, buffer):
while True:
chunk = fobj.read() # BLOCKS HERE
if not chunk:
break
buffer.append(chunk)
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
b = [] # create a buffer
t = threading.Thread(target=get_output, args=(p2.stdout, b))
t.start() # start reading thread
for x in xrange(100000):
p1.stdin.write('hello world\n') # write data
p1.stdin.flush()
p1.stdin.close() # close input...
t.join()
Well. It didn't work. Even after p1.stdin.close()
was called, p2.stdout.read()
still blocks.
Then I tried the posix code on subprocess.py
:
import os
from subprocess import Popen, PIPE
import select
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
numwrites = 100000
to_read = [p2.stdout]
to_write = [p1.stdin]
b = [] # create buffer
while to_read or to_write:
read_now, write_now, xlist = select.select(to_read, to_write, [])
if read_now:
data = os.read(p2.stdout.fileno(), 1024)
if not data:
p2.stdout.close()
to_read = []
else:
b.append(data)
if write_now:
if numwrites > 0:
numwrites -= 1
p1.stdin.write('hello world!\n'); p1.stdin.flush()
else:
p1.stdin.close()
to_write = []
print b
Also blocks on select.select()
. By spreading print
s around, I found out this:
p1.stdin
.numwrites
, p1.stdin.close()
is called.select()
starts blocking, only to_read
has something, p2.stdout
. to_write
is already empty.os.read()
call always returns something, so p2.stdout.close()
is never called.Conclusion from both tests: Closing the stdin
of the first process on the pipeline (grep
in the example) is not making it dump its buffered output to the next and die.
No way to make it work?
PS: I don't want to use a temporary file, I've already tested with files and I know it works. And I don't want to use windows.
I found out how to do it.
It is not about threads, and not about select().
When I run the first process (grep
), it creates two low-level file descriptors, one for each pipe. Lets call those a
and b
.
When I run the second process, b
gets passed to cut
sdtin
. But there is a brain-dead default on Popen
- close_fds=False
.
The effect of that is that cut
also inherits a
. So grep
can't die even if I close a
, because stdin is still open on cut
's process (cut
ignores it).
The following code now runs perfectly.
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
p1.stdin.write('Hello World\n')
p1.stdin.close()
result = p2.stdout.read()
assert result == "Hello Worl\n"
close_fds=True
SHOULD BE THE DEFAULT on unix systems. On windows it closes all fds, so it prevents piping.
EDIT:
PS: For people with a similar problem reading this answer: As pooryorick said in a comment, that also could block if data written to p1.stdin
is bigger than the buffers. In that case you should chunk the data into smaller pieces, and use select.select()
to know when to read/write. The code in the question should give a hint on how to implement that.
EDIT2: Found another solution, with more help from pooryorick - instead of using close_fds=True
and close ALL fds, one could close the fd
s that belongs to the first process, when executing the second, and it will work. The closing must be done in the child so the preexec_fn
function from Popen comes very handy to do just that. On executing p2 you can do:
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE, stderr=devnull, preexec_fn=p1.stdin.close)