The problem I am trying to solve has a list of intervals on the number line, each with a pre-defined score. I need to return the maximum possible total score.
The catch is that the intervals overlap, and of the overlapping intervals I can use only one. Here is an example.
Intervals - Score
0- 5 - 15
4- 9 - 18
10-15 - 12
8-21 - 19
25-30 - 25
Here, the intervals 0-5, 4-9 and 8-21 overlap.
The intervals 10-15 and 8-21 also overlap.
The maximum sum would be 55 (18+12+25).
It is important to note here that we select the interval 4-9 of the first batch of overlapping intervals even though it does not have the highest score of the three.
This is because selecting the interval 8-21 would prevent us from using the interval 10-15 later on, thereby reducing the overall sum (In that case, the overall sum would be 19+25=44).
I am looking for an O(nlogn) or O(n) solution to this problem. I think dynamic programming can be used, but I may be wrong. Could someone suggest a solution/algorithm(s) that could do the trick here?
Edit: The intervals are in no particular order.
This is a weighted variation of interval scheduling; it's solvable in O(N log N)
with dynamic programming.
Let an interval be g(start, stop, score)
, and let them be sorted by stop
. For simplicity, let's assume for now that all stop
is unique.
Let best[i]
be the best score we can get when we're allowed to use g[1], ..., g[i]
. We don't have to use them all, of course, and generally we can't because the subset of intervals we use must be non-overlapping.
best[0] = 0
. That is, since we can't use any interval, the best score we can get is 0.1 <= k <= N
, we have:
best[k] = max( best[k-1], best[j] + g[k].score )
, where
j
is the largest index such that g[j].stop < g[k].start
(j
may be zero)That is, given that we're allowed to use g[1], ... g[k]
, the best we can do is the better scoring of these two options:
g[k]
. Thus, the score of this option is best[k-1]
.
g[1], ... g[k-1]
g[k]
, and to its left we do the best we can with all the genes that don't overlap with g[k]
, i.e. all g[1], ..., g[j]
, where g[j].stop < g[k].start
and j
is as large as possible. Thus, the score of this option is best[j] + g[k].score
.(Note the optimal substructure and overlapping subproblems components of dynamic programming embodied in the above equation).
The overall answer to the question is best[N]
, i.e. the best score we can get when we're allowed to use all the genes. Oops, did I say genes? I mean intervals.
This is O(N log N)
because:
O(N log N)
j
for each k
is O(log N)
using binary searchIf several genes can have the same stop
values, then nothing changed: you still have to search for the rightmost j
. In e.g. Python this is easy with bisect_right
. In Java where the standard library binary search doesn't guarantee which index is returned in case of ties, you can (among many options) follow it with a linear search (for O(N)
worst-case performance), or another series of binary searches to find the right most index.
Oops did I say genes again? I mean intervals.