Workaround on the threads limit in Grand Central Dispatch?

howanghk picture howanghk · Mar 1, 2013 · Viewed 9.9k times · Source

With Grand Central Dispatch, one can easily perform time consuming task on non-main thread, avoid blocking the main thead and keep the UI responsive. Simply by using dispatch_async and perform the task on a global concurrent queue.

dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    // code
});

However, something sounds too good to be true like this one usually have their downside. After we use this a lot in our iOS app project, recently we found that there's a 64 thread limit on it. Once we hit the limit, the app will freeze / hang. By pausing the app with Xcode, we can see that the main thread is held by semaphore_wait_trap.

Googling on the web confirms that others are encountering this problem too but so far no solution found for this.

Dispatch Thread Hard Limit Reached: 64 (too many dispatch threads blocked in synchronous operations)

Another stackoverflow question confirms that this problem occur when using dispatch_sync and dispatch_barrier_async too.

Question:
As the Grand Central Dispatch have a 64 threads limit, is there any workaround for this?

Thanks in advance!

Answer

ipmcc picture ipmcc · Mar 1, 2013

Well, if you're bound and determined, you can free yourself of the shackles of GCD, and go forth and slam right up against the OS per-process thread limit using pthreads, but the bottom line is this: if you're hitting the queue-width limit in GCD, you might want to consider reevaluating your concurrency approach.

At the extremes, there are two ways you can hit the limit:

  1. You can have 64 threads blocked on some OS primitive via a blocking syscall. (I/O bound)
  2. You can legitimately have 64 runnable tasks all ready to rock at the same time. (CPU bound)

If you're in situation #1, then the recommended approach is to use non-blocking I/O. In fact, GCD has a whole bunch of calls, introduced in 10.7/Lion IIRC, that facilitate asynchronous scheduling of I/O and improve thread re-use. If you use the GCD I/O mechanism, then those threads won't be tied up waiting on I/O, GCD will just queue up your blocks (or functions) when data becomes available on your file descriptor (or mach port). See the documentation for dispatch_io_create and friends.

In case it helps, here's a little example (presented without warranty) of a TCP echo server implemented using the GCD I/O mechanism:

in_port_t port = 10000;
void DieWithError(char *errorMessage);

// Returns a block you can call later to shut down the server -- caller owns block.
dispatch_block_t CreateCleanupBlockForLaunchedServer()
{
    // Create the socket
    int servSock = -1;
    if ((servSock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
        DieWithError("socket() failed");
    }

    // Bind the socket - if the port we want is in use, increment until we find one that isn't
    struct sockaddr_in echoServAddr;
    memset(&echoServAddr, 0, sizeof(echoServAddr));
    echoServAddr.sin_family = AF_INET;
    echoServAddr.sin_addr.s_addr = htonl(INADDR_ANY);
    do {
        printf("server attempting to bind to port %d\n", (int)port);
        echoServAddr.sin_port = htons(port);
    } while (bind(servSock, (struct sockaddr *) &echoServAddr, sizeof(echoServAddr)) < 0 && ++port);

    // Make the socket non-blocking
    if (fcntl(servSock, F_SETFL, O_NONBLOCK) < 0) {
        shutdown(servSock, SHUT_RDWR);
        close(servSock);
        DieWithError("fcntl() failed");
    }

    // Set up the dispatch source that will alert us to new incoming connections
    dispatch_queue_t q = dispatch_queue_create("server_queue", DISPATCH_QUEUE_CONCURRENT);
    dispatch_source_t acceptSource = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, servSock, 0, q);
    dispatch_source_set_event_handler(acceptSource, ^{
        const unsigned long numPendingConnections = dispatch_source_get_data(acceptSource);
        for (unsigned long i = 0; i < numPendingConnections; i++) {
            int clntSock = -1;
            struct sockaddr_in echoClntAddr;
            unsigned int clntLen = sizeof(echoClntAddr);

            // Wait for a client to connect
            if ((clntSock = accept(servSock, (struct sockaddr *) &echoClntAddr, &clntLen)) >= 0)
            {
                printf("server sock: %d accepted\n", clntSock);

                dispatch_io_t channel = dispatch_io_create(DISPATCH_IO_STREAM, clntSock, q, ^(int error) {
                    if (error) {
                        fprintf(stderr, "Error: %s", strerror(error));
                    }
                    printf("server sock: %d closing\n", clntSock);
                    close(clntSock);
                });

                // Configure the channel...
                dispatch_io_set_low_water(channel, 1);
                dispatch_io_set_high_water(channel, SIZE_MAX);

                // Setup read handler
                dispatch_io_read(channel, 0, SIZE_MAX, q, ^(bool done, dispatch_data_t data, int error) {
                    BOOL close = NO;
                    if (error) {
                        fprintf(stderr, "Error: %s", strerror(error));
                        close = YES;
                    }

                    const size_t rxd = data ? dispatch_data_get_size(data) : 0;
                    if (rxd) {
                        // echo...
                        printf("server sock: %d received: %ld bytes\n", clntSock, (long)rxd);
                        // write it back out; echo!
                        dispatch_io_write(channel, 0, data, q, ^(bool done, dispatch_data_t data, int error) {});
                    }
                    else {
                        close = YES;
                    }

                    if (close) {
                        dispatch_io_close(channel, DISPATCH_IO_STOP);
                        dispatch_release(channel);
                    }
                });
            }
            else {
                printf("accept() failed;\n");
            }
        }
    });

    // Resume the source so we're ready to accept once we listen()
    dispatch_resume(acceptSource);

    // Listen() on the socket
    if (listen(servSock, SOMAXCONN) < 0) {
        shutdown(servSock, SHUT_RDWR);
        close(servSock);
        DieWithError("listen() failed");
    }

    // Make cleanup block for the server queue
    dispatch_block_t cleanupBlock = ^{
        dispatch_async(q, ^{
            shutdown(servSock, SHUT_RDWR);
            close(servSock);
            dispatch_release(acceptSource);
            dispatch_release(q);
        });
    };

    return Block_copy(cleanupBlock);
}

Anyway... back to the topic at hand:

If you're in situation #2, you should ask yourself, "Am I really gaining anything through this approach?" Let's say you have the most studly MacPro out there -- 12 cores, 24 hyperthreaded/virtual cores. With 64 threads, you've got an approx. 3:1 thread to virtual core ratio. Context switches and cache misses aren't free. Remember, we presumed that you weren't I/O bound for this scenario, so all you're doing by having more tasks than cores is wasting CPU time with context switches and cache thrash.

In reality, if your application is hanging because you've hit the queue width limit, then the most likely scenario is that you've starved your queue. You've likely created a dependency that reduces to a deadlock. The case I've seen most often is when multiple, interlocked threads are trying to dispatch_sync on the same queue, when there are no threads left. This is always fail.

Here's why: Queue width is an implementation detail. The 64 thread width limit of GCD is undocumented because a well-designed concurrency architecture shouldn't depend on the queue width. You should always design your concurrency architecture such that a 2 thread wide queue would eventually finish the job to the same result (if slower) as a 1000 thread wide queue. If you don't, there will always be a chance that your queue will get starved. Dividing your workload into parallelizable units should be opening yourself to the possibility of optimization, not a requirement for basic functioning. One way to enforce this discipline during development is to try working with a serial queue in places where you use concurrent queues, but expect non-interlocked behavior. Performing checks like this will help you catch some (but not all) of these bugs earlier.

Also, to the precise point of your original question: IIUC, the 64 thread limit is 64 threads per top-level concurrent queue, so if you really feel the need, you can use all three top level concurrent queues (Default, High and Low priority) to achieve more than 64 threads total. Please don't do this though. Fix your design such that it doesn't starve itself instead. You'll be happier. And anyway, as I hinted above, if you're starving out a 64 thread wide queue, you'll probably eventually just fill all three top level queues and/or run into the per-process thread limit and starve yourself that way too.