Why is shutil.rmtree() so slow?

tshepang · Mar 29, 2011 · Viewed 7.1k times

I went to check how to remove a directory in Python, and was led to use shutil.rmtree(). Its speed surprised me, compared to what I'd expect from rm --recursive. Are there faster alternatives, short of using the subprocess module?

Answer

sarnold · Mar 29, 2011

The implementation (quoted here from the Python 2 standard library of the time) does a lot of extra processing:

def rmtree(path, ignore_errors=False, onerror=None):
    """Recursively delete a directory tree.

    If ignore_errors is set, errors are ignored; otherwise, if onerror
    is set, it is called to handle the error with arguments (func,
    path, exc_info) where func is os.listdir, os.remove, or os.rmdir;
    path is the argument to that function that caused it to fail; and
    exc_info is a tuple returned by sys.exc_info(). If ignore_errors
    is false and onerror is None, an exception is raised.

    """
    if ignore_errors:
        def onerror(*args):
            pass
    elif onerror is None:
        def onerror(*args):
            raise
    try:
        if os.path.islink(path):
            # symlinks to directories are forbidden, see bug #1669
            raise OSError("Cannot call rmtree on a symbolic link")
    except OSError:
        onerror(os.path.islink, path, sys.exc_info())
        # can't continue even if onerror hook returns
        return
    names = []
    try:
        names = os.listdir(path)
    except os.error, err:
        onerror(os.listdir, path, sys.exc_info())
    for name in names:
        fullname = os.path.join(path, name)
        try:
            mode = os.lstat(fullname).st_mode
        except os.error:
            mode = 0
        if stat.S_ISDIR(mode):
            rmtree(fullname, ignore_errors, onerror)
        else:
            try:
                os.remove(fullname)
            except os.error, err:
                onerror(os.remove, fullname, sys.exc_info())
    try:
        os.rmdir(path)
    except os.error:
        onerror(os.rmdir, path, sys.exc_info())

Note the os.path.join() used to build a full pathname for every entry; string operations do take time. The rm(1) implementation instead uses the unlinkat(2) system call, which takes a directory file descriptor plus a name relative to that directory, so it doesn't need any additional string operations. (In fact, this also saves the kernel from walking the entire path through namei() just to find the common directory, over and over again. The kernel's dentry cache is good and useful, but that can still be a fair amount of in-kernel string manipulation and comparison.) The rm(1) utility gets to bypass all that string manipulation and just use a file descriptor for the directory.
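
For illustration, here is a minimal sketch of that file-descriptor approach in Python 3, using the dir_fd parameter of os.unlink(), os.rmdir() and os.open(), plus os.scandir() on a descriptor. The names rmtree_fd and _remove_contents are made up for this example, there is no error handling, and it assumes a Unix-like platform with Python 3.7 or later where these dir_fd features are available.

```python
import os


def _remove_contents(dir_fd):
    """Delete everything inside the directory that dir_fd refers to."""
    # Collect the entries first so the scandir iterator is closed
    # before we start recursing and deleting.
    with os.scandir(dir_fd) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Open the child relative to the parent descriptor; no full
            # path is ever assembled with os.path.join().
            child_fd = os.open(
                entry.name,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=dir_fd,
            )
            try:
                _remove_contents(child_fd)
            finally:
                os.close(child_fd)
            os.rmdir(entry.name, dir_fd=dir_fd)
        else:
            # Files, symlinks, pipes and so on are unlinked by name,
            # relative to the open directory.
            os.unlink(entry.name, dir_fd=dir_fd)


def rmtree_fd(path):
    """Hypothetical fd-based rmtree sketch (no onerror handling)."""
    top_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW)
    try:
        _remove_contents(top_fd)
    finally:
        os.close(top_fd)
    os.rmdir(path)
```

Calling rmtree_fd("some/dir") removes the tree rooted there. Whether it is actually faster than shutil.rmtree() on a given system is something to measure rather than assume; recent Python 3 releases of shutil.rmtree() already take a similar fd-based route on platforms that support it.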

Furthermore, both rm(1) and rmtree() check the st_mode of every file and directory in the tree, but the C implementation does not need to turn every struct stat into a Python object just to perform a simple integer mask operation. I don't know how long this conversion takes, but it happens once for every file, directory, pipe, symlink, etc. in the directory tree.
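
As a rough way to see that per-entry cost, the sketch below finds the subdirectories of a directory in two ways: with os.listdir() plus os.lstat(), as in the code quoted above, and with os.scandir() from Python 3.5 and later, whose entries can typically report their type from the directory listing itself without building a full stat result. The function names dirs_via_lstat and dirs_via_scandir are invented for this example.

```python
import os
import stat


def dirs_via_lstat(path):
    """Subdirectory names, classified the way the quoted rmtree() does."""
    found = set()
    for name in os.listdir(path):
        fullname = os.path.join(path, name)
        try:
            # One lstat() call and one Python stat object per entry.
            mode = os.lstat(fullname).st_mode
        except OSError:
            mode = 0
        if stat.S_ISDIR(mode):
            found.add(name)
    return found


def dirs_via_scandir(path):
    """Subdirectory names, classified from the scandir() entries."""
    found = set()
    with os.scandir(path) as it:
        for entry in it:
            # Symlinks to directories are not counted, matching lstat().
            if entry.is_dir(follow_symlinks=False):
                found.add(entry.name)
    return found
```

Both functions should return the same set for the same directory, so timing them with timeit on a large directory gives an estimate of how much the stat-object step costs on your system.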