How do you sort and efficiently find elements in a cell array (of strings) in Octave?

B Seven picture B Seven · Nov 22, 2011 · Viewed 12k times · Source

Is there built-in functionality for this?

Answer

Eric Leschinski picture Eric Leschinski · Dec 1, 2012

GNU Octave search a cell array of strings in linear time O(n):

The other answer has cellidx which is depreciated by octave, it still runs but they say to use ismember instead, like this:

%linear time string index search.
a = ["hello"; "unsorted"; "world"; "moobar"]
b = cellstr(a)
%b =
%{
%  [1,1] = hello
%  [2,1] = unsorted
%  [3,1] = world
%  [4,1] = moobar
%}
find(ismember(b, 'world'))   %returns 3

'world' is in index slot 3. This is a expensive linear time O(n) operation because it has to iterate through all elements whether or not it is found.

To achieve a logarathmic time O(log n) solution, then your list needs to come pre-sorted and then you can use binary search:

If your cell array is already sorted, you can do O(log-n) worst case:

function i = binsearch(array, val, low, high)
  %binary search algorithm for numerics, Usage:
  %myarray = [ 30, 40, 50.15 ];        %already sorted list
  %binsearch(myarray, 30,    1, 3)     %item 30 is in slot 1
  if ( high < low )
    i = 0;
  else
    mid = floor((low + high) / 2);
    if ( array(mid) > val )
      i = binsearch(array, val, low, mid-1);
    elseif ( array(mid) < val )
      i = binsearch(array, val, mid+1, high);
    else
      i = mid;
    endif
  endif
endfunction

function i = binsearch_str(array, val, low, high)
  % binary search for strings, usage:
  %myarray2 = [ "abc"; "def"; "ghi"];       #already sorted list
  %binsearch_str(myarray2, "abc", 1, 3)     #item abc is in slot 1
  if ( high < low )
    i = 0;
  else
    mid = floor((low + high) / 2);
    if ( mystrcmp(array(mid, [1:end]), val) == 1 )
      i = binsearch(array, val, low, mid-1);
    elseif ( mystrcmp(array(mid, [1:end]), val) == -1 )
      i = binsearch_str(array, val, mid+1, high);
    else
      i = mid;
    endif
  endif
endfunction
function ret = mystrcmp(a, b)
    %this function is just an octave string compare, it's exactly like 
    %strcmp(str1,str2) in C, or java.lang.String.compareTo(...) in Java.
    %returns 1 if string a > b
    %returns 0 if string a == b
    %return -1 if string a < b
    letters_gt = gt(a, b);      %list of boolean a > b
    letters_lt = lt(a, b);      %list of boolean a < b
    ret = 0;
    %octave makes us roll our own string compare because 
    %strings are arrays of numerics
    len = length(letters_gt);
    for i = 1:len
        if letters_gt(i) > letters_lt(i)
            ret = 1;
            return
        elseif letters_gt(i) < letters_lt(i)
            ret = -1;
            return
        endif
    end;
endfunction

%Assuming that myarray is already sorted, (it must be for binary 
%search to finish in logarithmic time `O(log-n))` worst case, then do

myarray = [ 30, 40, 50.15 ];        %already sorted list
binsearch(myarray, 30,    1, 3)     %item 30 is in slot 1
binsearch(myarray, 40,    1, 3)     %item 40 is in slot 2
binsearch(myarray, 50,    1, 3)     %50 does not exist so return 0
binsearch(myarray, 50.15, 1, 3)     %50.15 is in slot 3

%same but for strings:
myarray2 = [ "abc"; "def"; "ghi"];       %already sorted list
binsearch_str(myarray2, "abc", 1, 3)     %item abc is in slot 1
binsearch_str(myarray2, "def", 1, 3)     %item def is in slot 2
binsearch_str(myarray2, "zzz", 1, 3)     %zzz does not exist so return 0
binsearch_str(myarray2, "ghi", 1, 3)     %item ghi is in slot 3

To sort your array if it isn't already:

Complexity of sorting depends on the kind of data you have and whatever sorting algorithm GNU octave language writers selected, it's somewhere between O(n*log(n)) and O(n*n).

myarray = [ 9, 40, -3, 3.14, 20 ];        %not sorted list 
myarray = sort(myarray) 

myarray2 = [ "the"; "cat"; "sat"; "on"; "the"; "mat"];       %not sorted list 
myarray2 = sortrows(myarray2)

Don't be shy on the duct tape with this dob, it's the only thing holding the unit together.