Radix Sort on an Array of Strings?

Kavix0 picture Kavix0 · Apr 13, 2014 · Viewed 9.4k times · Source

I've been researching around, and while I've figured out the general idea of using Radix Sort to alphabetize an array of strings, I know I'm going the wrong direction.

This is what I have so far:

void radixSort(string* sortMe, int l)
{
    queue<string>* sections = new queue<string>[27];    //Have a-z, and also one for strings that are null terminated.
    for(int i = 0; i < numElements; i++)
    {
        if(!(sortMe[i][l] == 32))
            sections[sortMe[i][l]-96].push(sortMe[i]);      //-96 because the ascii code for a is 97. If, for example a is the character, it will be placed at 1. 0 is left for null characters
    }

    for(int i =0; i < 26; i++)
    {
        while(!sections[i].empty())
        {
            temp.push_back(sections[i].front());
            sections[i].pop();
        }
    }
}

What I have so far sorts all the strings by the first character, and I know that I then have to go through and make subarrays of the remaining characters and sort them, but how can I implement it efficiently? The strings are of variable size and can include spaces, for example:

  • subdivides
  • main street
  • pants
  • impaled decolonizing
  • argillaceous
  • axial satisfactoriness
  • temperamental
  • hypersensitiveness
  • bears
  • hairbreadths
  • creams surges
  • unlaboured
  • hoosier
  • buggiest
  • mauritanians
  • emanators
  • acclaiming
  • zouaves dishpan
  • traipse
  • solarisms
  • remunerativeness
  • solubilizing
  • chiseled
  • jugular
  • ooziness
  • toastier
  • baud
  • suffixed
  • powerless tiding
  • disassimilated
  • gasps
  • flirtier
  • uh

This is something I found that seems like it will be of use: http://algs4.cs.princeton.edu/lectures/51DemoKeyIndexedCounting.pdf

Answer

iavr picture iavr · Apr 13, 2014

The slides you've found are great! But where did those queues come from in your code?

Anyway, here you are (live example):

template <typename E>
size_t bin(const E& elem, size_t digit)
{
    return elem.size() > digit ? size_t(elem[digit]) + 1 : 0;
}

template <size_t R, typename C, typename P>
void radix_sort(P& pos, const C& data, size_t digit)
{
    using A = std::array<size_t, R + 1>;
    A count = {};
    P prev(pos);

    for (auto i : prev)
        ++count[bin(data[i], digit)];

    A done = {}, offset = {{0}};
    std::partial_sum(count.begin(), count.end() - 1, offset.begin() + 1);

    for (auto i : prev)
    {
        size_t b = bin(data[i], digit);
        pos[offset[b] + done[b]++] = i;
    }
}

struct shorter
{
    template <typename A>
    bool operator()(const A& a, const A& b) { return a.size() < b.size(); }
};

template <size_t R, typename C>
std::vector<size_t> radix_sort(const C& data)
{
    std::vector<size_t> pos(data.size());
    std::iota(pos.begin(), pos.end(), 0);

    size_t width = std::max_element(data.begin(), data.end(), shorter())->size();

    for (long digit = long(width) - 1; digit >= 0; --digit)
        radix_sort<R>(pos, data, size_t(digit));

    return pos;
}

which you can use like that

int main()
{
    std::vector<std::string> data = generate();
    std::vector<size_t> pos = radix_sort<128>(data);
    for (auto i : pos)
        std::cout << data[i] << std::endl;
}

where generate() is included in the live example and generates the strings given in your question.

I am not trying to explain how this works here, I assume you can figure out since you are working on the problem. But a few comments are in order.

  • We are neither sorting the input sequence in-place, nor returning a sorted copy; we are just returning a sequence of positions of input elements in the sorted sequence.

  • We are processing strings from right to left.

  • The complexity is O(lw) where l is the input length (number of input strings) and w is the maximum input width (max. length of all input strings). So this algorithm makes sense if the string width does not vary too much.

  • The first template parameter R of radix_sort() is the number of possible values for each digit (letter) in the input. E.g. R = 128 means that possible values are 0..127. This should be fine for your input. I haven't tried to do anything clever with respect to ASCII codes, but you can customize function bin() for that.

  • In the output of bin(), value 0 is reserved to mean "we are past the end of this string". Such strings are placed before others that are still continuing.

  • I have tried to give self-explanatory names to variables and functions, and use standard library calls for common tasks where possible.

  • The code is generic, e.g. it can sort any random access container containing random access containers, not just vectors of strings.

  • I am using C++11 features here and there for convenience, but nothing is really necessary: one could easily do the same just with C++03.