I have a command line tool (actually, several) that I am writing a wrapper for in Python.
The tool is generally used like this:
$ path_to_tool -option1 -option2 > file_out
The user gets the output written to file_out, and is also able to see various status messages of the tool as it is running.
I want to replicate this behavior, while also logging stderr (the status messages) to a file.
What I have is this:
from subprocess import call
call(['path_to_tool', '-option1', 'option2'], stdout=file_out, stderr=log_file)
This works fine EXCEPT that stderr is not written to the screen. I can add code to print the contents of the log_file to the screen of course, but then the user will see it after everything is done rather than while it is happening.
To recap, the desired behavior is:
- use call() or subprocess to run the tool
- direct stdout to a file
- direct stderr to a log file, while also writing stderr to the screen in real time, as if the tool had been called directly from the command line
I have a feeling I'm either missing something really simple, or this is much more complicated than I thought...thanks for any help!
EDIT: this only needs to work on Linux.
You can do this with subprocess, but it's not trivial. If you look at the Frequently Used Arguments in the docs, you'll see that you can pass PIPE as the stderr argument, which creates a new pipe, passes one side of the pipe to the child process, and makes the other side available to use as the stderr attribute.*
So, you will need to service that pipe, writing to the screen and to the file. In general, getting the details right for this is very tricky.** In your case, there's only one pipe, and you're planning on servicing it synchronously, so it's not that bad.
import subprocess
import sys

proc = subprocess.Popen(['path_to_tool', '-option1', 'option2'],
                        stdout=file_out, stderr=subprocess.PIPE)
for line in proc.stderr:
    sys.stdout.write(line)
    log_file.write(line)
proc.wait()
(Note that there are some issues with using for line in proc.stderr. Basically, if what you're reading turns out not to be line-buffered for any reason, you can sit around waiting for a newline even though there's actually half a line worth of data to process. You can read chunks at a time with, say, read(128), or even read(1), to get the data more smoothly if necessary. If you need to actually get every byte as soon as it arrives, and can't afford the cost of read(1), you'll need to put the pipe in non-blocking mode and read asynchronously.)
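For instance, a minimal sketch of the read(1) variant (assuming Python 3 and the same proc and log_file as above, with log_file opened in binary mode):

import sys

# Pull stderr one byte at a time so output shows up as soon as the tool
# writes it, even when a full line hasn't arrived yet.
for chunk in iter(lambda: proc.stderr.read(1), b''):
    sys.stdout.buffer.write(chunk)   # echo to the screen
    sys.stdout.buffer.flush()
    log_file.write(chunk)            # and log it
proc.wait()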
But if you're on Unix, it might be simpler to use the tee command to do it for you.
For a quick&dirty solution, you can use the shell to pipe through it, using bash process substitution to route stderr through tee. Something like this:
subprocess.call('path_to_tool -option1 option2 2> >(tee log_file >&2)',
                shell=True, executable='/bin/bash', stdout=file_out)
But I don't want to debug shell piping; let's do it in Python, as shown in the docs:
tool = subprocess.Popen(['path_to_tool', '-option1', 'option2'],
                        stdout=file_out, stderr=subprocess.PIPE)
tee = subprocess.Popen(['tee', 'log_file'], stdin=tool.stderr)
tool.stderr.close()
tee.communicate()
Finally, there are a dozen or more higher-level wrappers around subprocesses and/or the shell on PyPI: sh, shell, shell_command, shellout, iterpipes, sarge, cmd_utils, commandwrapper, etc. Search for "shell", "subprocess", "process", "command line", etc. and find one you like that makes the problem trivial.
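For example, with the sh library, something along these lines should be close (a sketch only; _out and _err are sh's special keyword arguments, but check its docs for the exact callback semantics):

import sys
import sh  # third-party: pip install sh

def tee_stderr(line):
    # sh calls this for each piece of stderr output; echo it and log it.
    sys.stderr.write(line)
    log_file.write(line)

tool = sh.Command('path_to_tool')
tool('-option1', 'option2', _out=file_out, _err=tee_stderr)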
What if you need to gather both stderr and stdout?
The easy way to do it is to just redirect one to the other, as Sven Marnach suggests in a comment. Just change the Popen parameters like this:
tool = subprocess.Popen(['path_to_tool', '-option1', 'option2'],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
And then everywhere you used tool.stderr, use tool.stdout instead; e.g., for the last example:
tee = subprocess.Popen(['tee', 'log_file'], stdin=tool.stdout)
tool.stdout.close()
tee.communicate()
But this has some tradeoffs. Most obviously, mixing the two streams together means you can't log stdout to file_out and stderr to log_file, or copy stdout to your stdout and stderr to your stderr. It also means the ordering can be non-deterministic: if the subprocess always writes two lines to stderr before writing anything to stdout, you might end up getting a bunch of stdout between those two lines once you mix the streams. And it means the merged output is subject to stdout's buffering: writing to a pipe, the subprocess's stdout is typically block-buffered, while glibc normally leaves stderr unbuffered (unless the subprocess explicitly changes it), so the merged stream can arrive later, and in a different order, than you'd expect.
If you need to handle the two streams separately, it gets more difficult. Earlier, I said that servicing the pipe on the fly is easy as long as you only have one pipe and can service it synchronously. If you have two pipes, that's obviously no longer true. Imagine you're waiting on tool.stdout.read(), and new data comes in from tool.stderr. If there's too much data, it can cause the pipe to overflow and the subprocess to block. But even if that doesn't happen, you obviously won't be able to read and log the stderr data until something comes in from stdout.
If you use the pipe-through-tee solution, that avoids the initial problem… but only by creating a new problem that's just as bad. You have two tee instances, and while you're calling communicate on one, the other one is sitting around waiting forever.
So, either way, you need some kind of asynchronous mechanism. You can do this with threads, a select reactor, something like gevent, etc.
Here's a quick and dirty example:
import subprocess
import sys
import threading

proc = subprocess.Popen(['path_to_tool', '-option1', 'option2'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Copy everything from one pipe to two file-like targets, line by line.
def tee_pipe(pipe, f1, f2):
    for line in pipe:
        f1.write(line)
        f2.write(line)

t1 = threading.Thread(target=tee_pipe, args=(proc.stdout, file_out, sys.stdout))
t2 = threading.Thread(target=tee_pipe, args=(proc.stderr, log_file, sys.stderr))
t3 = threading.Thread(target=proc.wait)
t1.start(); t2.start(); t3.start()
t1.join(); t2.join(); t3.join()
However, there are some edge cases where that won't work. (The problem is the order in which SIGCHLD and SIGPIPE/EPIPE/EOF arrive. I don't think any of that will affect us here, since we're not sending any input… but don't trust me on that without thinking it through and/or testing.) The communicate function from 3.3+ gets all the fiddly details right. But you may find it a lot simpler to use one of the async-subprocess wrapper implementations you can find on PyPI and ActiveState, or even the subprocess stuff from a full-fledged async framework like Twisted.
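If you'd rather not use threads, a rough select-based version of the same tee_pipe idea might look like this (a sketch, Unix-only; it assumes the same proc as above, with file_out and log_file opened in binary mode):

import os
import select
import sys

# Map each pipe to the pair of places its data should go.
streams = {proc.stdout: (file_out, sys.stdout.buffer),
           proc.stderr: (log_file, sys.stderr.buffer)}
while streams:
    ready, _, _ = select.select(list(streams), [], [])
    for pipe in ready:
        data = os.read(pipe.fileno(), 4096)   # read whatever is available
        if not data:                          # EOF on this pipe
            del streams[pipe]
            continue
        for f in streams[pipe]:
            f.write(data)
proc.wait()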
* The docs don't really explain what pipes are, almost as if they expect you to be an old Unix C hand… But some of the examples, especially in the Replacing Older Functions with the subprocess Module section, show how they're used, and it's pretty simple.
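For reference, the core pattern those examples demonstrate is just chaining Popen objects through a pipe; the docs' shell-pipeline replacement looks roughly like this:

import subprocess

# Equivalent of the shell pipeline: dmesg | grep hda
p1 = subprocess.Popen(['dmesg'], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['grep', 'hda'], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # so p1 gets SIGPIPE if p2 exits early
output = p2.communicate()[0]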
** The hard part is sequencing two or more pipes properly. If you wait on one pipe, the other may overflow and block, preventing your wait on the other one from ever finishing. The only easy way to get around this is to create a thread to service each pipe. (On most *nix platforms, you can use a select or poll reactor instead, but making that cross-platform is amazingly difficult.) The source to the module, especially communicate and its helpers, shows how to do it. (I linked to 3.3, because in earlier versions, communicate itself gets some important things wrong…) This is why, whenever possible, you want to use communicate if you need more than one pipe. In your case, you can't use communicate, but fortunately you don't need more than one pipe.
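(For comparison, the communicate pattern that footnote recommends looks like this; it only returns once the process has finished, which is exactly why it can't give you the live stderr output you want here:)

import subprocess

proc = subprocess.Popen(['path_to_tool', '-option1', 'option2'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()   # drains both pipes without deadlocking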