Is there any possible optimization for random access on a very big array (I currently use uint8_t
, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
So, is there anything better than a uint8_t
array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth, so when having more than a few threads doing that at the same time for different arrays, currently the whole RAM bandwidth is quickly saturated.
I'm asking since it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. So when 95% of all values in the array would only actually need 1 bit instead of 8 bit, this would reduce memory usage by almost an order of magnitude. It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.
A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separated 4 byte per value (24 bit for original element index, 8 bit for actual value, so (idx << 8) | value)
) sorted array for the other ones.
When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log(n) with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).
If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>
using namespace std::chrono;
/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
/// This stuff allows to use this class wherever a library function
/// requires a UniformRandomBitGenerator (e.g. std::shuffle)
typedef uint32_t result_type;
static uint32_t min() { return 1; }
static uint32_t max() { return uint32_t(-1); }
/// PRNG state
uint32_t y;
/// Initializes with seed
XorShift32(uint32_t seed = 0) : y(seed) {
if(y == 0) y = 2463534242UL;
}
/// Returns a value in the range [1, 1<<32)
uint32_t operator()() {
y ^= (y<<13);
y ^= (y>>17);
y ^= (y<<15);
return y;
}
/// Returns a value in the range [0, limit); this conforms to the RandomFunc
/// requirements for std::random_shuffle
uint32_t operator()(uint32_t limit) {
return (*this)()%limit;
}
};
struct mean_variance {
double rmean = 0.;
double rvariance = 0.;
int count = 0;
void operator()(double x) {
++count;
double ormean = rmean;
rmean += (x-rmean)/count;
rvariance += (x-ormean)*(x-rmean);
}
double mean() const { return rmean; }
double variance() const { return rvariance/(count-1); }
double stddev() const { return std::sqrt(variance()); }
};
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
volatile unsigned out;
int main() {
XorShift32 xs;
std::vector<uint8_t> vec;
int size = 10000000;
for(int i = 0; i<size; ++i) {
uint32_t v = xs();
if(v < 1825361101) v = 0; // 42.5%
else if(v < 4080218931) v = 1; // 95.0%
else if(v < 4252017623) v = 2; // 99.0%
else {
while((v & 0xff) < 3) v = xs();
}
vec.push_back(v);
}
populate(vec.data(), vec.size());
mean_variance lk_t, arr_t;
for(int i = 0; i<50; ++i) {
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += lookup(xs() % size);
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "lookup: %10d µs\n", dur);
lk_t(dur);
}
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += vec[xs() % size];
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "array: %10d µs\n", dur);
arr_t(dur);
}
}
fprintf(stderr, " lookup | ± | array | ± | speedup\n");
printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
lk_t.mean(), lk_t.stddev(),
arr_t.mean(), arr_t.stddev(),
arr_t.mean()/lk_t.mean());
return 0;
}
(code and data always updated in my Bitbucket)
The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:
(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)
These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).
I compiled the code above with g++ 5.4.0 (-O3 -static
, plus some warnings) on Ubuntu 16.04, and ran it on some machines; most of them are running Ubuntu 16.04, some some older Linux, some some newer Linux. I don't think the OS should be relevant at all in this case.
CPU | cache | lookup (µs) | array (µs) | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB | 60011 ± 3667 | 29313 ± 2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB | 66571 ± 7477 | 33197 ± 3619 | 0.50
Celeron G1610T @ 2.30GHz | 2048 KB | 172090 ± 629 | 162328 ± 326 | 0.94
Core i3-3220T @ 2.80GHz | 3072 KB | 111025 ± 5507 | 114415 ± 2528 | 1.03
Core i5-7200U @ 2.50GHz | 3072 KB | 92447 ± 1494 | 95249 ± 1134 | 1.03
Xeon X3430 @ 2.40GHz | 8192 KB | 111303 ± 936 | 127647 ± 1503 | 1.15
Core i7 920 @ 2.67GHz | 8192 KB | 123161 ± 35113 | 156068 ± 45355 | 1.27
Xeon X5650 @ 2.67GHz | 12288 KB | 106015 ± 5364 | 140335 ± 6739 | 1.32
Core i7 870 @ 2.93GHz | 8192 KB | 77986 ± 429 | 106040 ± 1043 | 1.36
Core i7-6700 @ 3.40GHz | 8192 KB | 47854 ± 573 | 66893 ± 1367 | 1.40
Core i3-4150 @ 3.50GHz | 3072 KB | 76162 ± 983 | 113265 ± 239 | 1.49
Xeon X5650 @ 2.67GHz | 12288 KB | 101384 ± 796 | 152720 ± 2440 | 1.51
Core i7-3770T @ 2.50GHz | 8192 KB | 69551 ± 1961 | 128929 ± 2631 | 1.85
The results are... mixed!