I went to check how to remove a directory in Python, and was led to use shutil.rmtree(). Its speed surprised me, compared to what I'd expect from rm --recursive. Are there faster alternatives, short of using the subprocess module?
The implementation does a lot of extra processing:
def rmtree(path, ignore_errors=False, onerror=None):
    """Recursively delete a directory tree.
    If ignore_errors is set, errors are ignored; otherwise, if onerror
    is set, it is called to handle the error with arguments (func,
    path, exc_info) where func is os.listdir, os.remove, or os.rmdir;
    path is the argument to that function that caused it to fail; and
    exc_info is a tuple returned by sys.exc_info(). If ignore_errors
    is false and onerror is None, an exception is raised.
    """
    if ignore_errors:
        def onerror(*args):
            pass
    elif onerror is None:
        def onerror(*args):
            raise
    try:
        if os.path.islink(path):
            # symlinks to directories are forbidden, see bug #1669
            raise OSError("Cannot call rmtree on a symbolic link")
    except OSError:
        onerror(os.path.islink, path, sys.exc_info())
        # can't continue even if onerror hook returns
        return
    names = []
    try:
        names = os.listdir(path)
    except os.error, err:
        onerror(os.listdir, path, sys.exc_info())
    for name in names:
        fullname = os.path.join(path, name)
        try:
            mode = os.lstat(fullname).st_mode
        except os.error:
            mode = 0
        if stat.S_ISDIR(mode):
            rmtree(fullname, ignore_errors, onerror)
        else:
            try:
                os.remove(fullname)
            except os.error, err:
                onerror(os.remove, fullname, sys.exc_info())
    try:
        os.rmdir(path)
    except os.error:
        onerror(os.rmdir, path, sys.exc_info())
Note the os.path.join() used to create new filenames; string operations do take time. The rm(1) implementation instead uses the unlinkat(2) system call, which doesn't do any additional string operations. (And, in fact, it saves the kernel from walking through an entire namei() just to find the common directory, over and over and over again. The kernel's dentry cache is good and useful, but that can still be a fair amount of in-kernel string manipulation and comparisons.) The rm(1) utility gets to bypass all that string manipulation, and just use a file descriptor for the directory.
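For what it's worth, Python 3.3 and later expose the same *at() family of calls through the dir_fd parameter, so the file-descriptor approach can be approximated without leaving Python. The following is only a minimal sketch, assuming Python 3 on a Unix-like system where os.listdir() accepts a file descriptor and os.O_DIRECTORY / os.O_NOFOLLOW are available; the name rmtree_fd is made up for illustration, and there is no error handling or onerror hook.

```python
import os
import stat


def rmtree_fd(name, parent_fd=None):
    """Remove directory `name`, resolved relative to `parent_fd`.

    Hypothetical helper for illustration only. With parent_fd=None the
    name is resolved like an ordinary path.
    """
    # Open the directory itself. O_NOFOLLOW makes the open fail if `name`
    # is a symlink, and O_DIRECTORY makes it fail if it is not a directory.
    flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    fd = os.open(name, flags, dir_fd=parent_fd)
    try:
        # Entries are bare names; no os.path.join() is needed because
        # every call below is made relative to the open descriptor.
        for entry in os.listdir(fd):
            mode = os.stat(entry, dir_fd=fd, follow_symlinks=False).st_mode
            if stat.S_ISDIR(mode):
                rmtree_fd(entry, fd)
            else:
                # With dir_fd set, this maps to unlinkat(2).
                os.unlink(entry, dir_fd=fd)
    finally:
        os.close(fd)
    # The directory is empty now; remove it relative to its parent.
    os.rmdir(name, dir_fd=parent_fd)
```

Note that this keeps one descriptor open per directory level, so a very deep tree could hit the open-file limit. Newer versions of shutil.rmtree() use a descriptor-based implementation automatically on platforms that support it, so this mostly matters if you are stuck on the old code quoted above.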
Furthermore, both rm(1) and rmtree() check the st_mode of every file and directory in the tree; but the C implementation does not need to turn every struct stat into a Python object just to perform a simple integer mask operation. I don't know how long this process takes, but it happens once for every file, directory, pipe, symlink, etc. in the directory tree.
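One way to sidestep the per-entry stat object, at least in newer Pythons, is os.scandir() (added in 3.5). Its DirEntry.is_dir() can usually answer from the file type the directory listing already returned, without a separate stat call, on platforms where the directory entry carries that information. A rough sketch, assuming Python 3.6+ for the with-statement support; rmtree_scandir is a hypothetical name, and I have not measured how much this actually saves:

```python
import os


def rmtree_scandir(path):
    """Remove the tree at `path`, classifying entries via os.scandir().

    Hypothetical helper for illustration only; no error handling.
    """
    # Collect the entries first so the directory iterator is closed
    # before anything is deleted or recursed into.
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        # follow_symlinks=False: a symlink to a directory is treated as
        # a plain entry and unlinked, never descended into.
        if entry.is_dir(follow_symlinks=False):
            rmtree_scandir(entry.path)
        else:
            os.unlink(entry.path)
    os.rmdir(path)
```

This still builds a full path string for every entry (entry.path), so it only addresses the stat overhead, not the string handling mentioned earlier.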