A data structure problem

Given a sequence of integers, there are a number of queries. Each query has a range [l, r], and you are to find the median of the given range [l, r]

The number of queries can be as large as 100,000 The length of the sequence can be as large as 100,000

I wonder if there is any data structure can support such query

My solution:

I consult my partner today and he tells to use partition tree.

We can build a partition tree in nlog(n) time and answer each query in log(n) time

The partition tree actually is the process of merge sort, but for each node in the tree, it saves the number of integers that go to the left subtree. Thus, we can use this information to deal with the query.

here is my code:

This program is to find the x in a given interval [l, r], that minimize the following equation.

alt text

Explanation:

seq saves the sequence

pos saves the position after sort

ind saves the index

cntL saves the number of integers that go to the left tree in a given range

#include <cstdio>
#include <cstring>
#include <algorithm>
using namespace std;
#define N 100008
typedef long long LL;
int n, m, seq[N], ind[N], pos[N], next[N];
int cntL[20][N];
LL sum[20][N], sumL, subSum[N];

void build(int l, int r, int head, int dep)
{
    if (l == r)
    {
        cntL[dep][l] = cntL[dep][l-1];
        sum[dep][l] = sum[dep][l-1];
        return ;
    }
    int mid = (l+r)>>1;
    int hl = 0, hr = 0, tl = 0, tr = 0;
    for (int i = head, j = l; i != -1; i = next[i], j++)
    {
        cntL[dep][j] = cntL[dep][j-1];
        sum[dep][j] = sum[dep][j-1];
        if (pos[i] <= mid)
        {
            next[tl] = i;
            tl = i;
            if (hl == 0) hl = i;
            cntL[dep][j]++;
            sum[dep][j] += seq[i];
        }
        else
        {
            next[tr] = i;
            tr = i;
            if (hr == 0) hr = i;
        }
    }
    next[tl] = -1;
    next[tr] = -1;
    build(l, mid, hl, dep+1);
    build(mid+1, r, hr, dep+1);
}

int query(int left, int right, int ql, int qr, int kth, int dep)
{
    if (left == right)
    {
        return ind[left];
    }
    int mid = (left+right)>>1;
    if (cntL[dep][qr] - cntL[dep][ql-1] >= kth)
    {
        return query(left, mid, left+cntL[dep][ql-1]-cntL[dep][left-1], left+cntL[dep][qr]-cntL[dep][left-1]-1, kth, dep+1);
    }
    else
    {
        sumL += sum[dep][qr]-sum[dep][ql-1];
        return query(mid+1, right, mid+1+ql-left-(cntL[dep][ql-1]-cntL[dep][left-1]), mid+qr+1-left-(cntL[dep][qr]-cntL[dep][left-1]), \
                kth-(cntL[dep][qr]-cntL[dep][ql-1]), dep+1);
    }
}

inline int cmp(int x, int y)
{
    return seq[x] < seq[y];
}

int main()
{
    int ca, t, i, j, middle, ql, qr, id, tot;
    LL ans;
    scanf("%d", &ca);
    for (t = 1; t <= ca; t++)
    {
        scanf("%d", &n);
        subSum[0] = 0;
        for (i = 1; i <= n; i++) 
        {
            scanf("%d", seq+i);
            ind[i] = i;
            subSum[i] = subSum[i-1]+seq[i];
        }
        sort(ind+1, ind+1+n, cmp);
        for (i = 1; i <= n; i++)
        {
            pos[ind[i]] = i;
            next[i] = i+1;
        }
        next[n] = -1;
        build(1, n, 1, 0);
        printf("Case #%d:\n", t);
        scanf("%d", &m);
        while (m--)
        {
            scanf("%d%d", &ql, &qr);
            ql++, qr++;
            middle = (qr-ql+2)/2;
            sumL= 0;
            id = query(1, n, ql, qr, middle, 0);
            ans = subSum[qr]-subSum[ql-1]-sumL;
            tot = qr-ql+1;
            ans = ans-(tot-middle+1)*1ll*seq[id]+(middle-1)*1ll*seq[id]-sumL;
            printf("%lld\n", ans);
        }
        puts("");
    }
}

This is called the Range Median Query problem. The following paper might be relevant: Towards Optimal Range Medians. (Free link, thanks to belisarius).

From the abstract of the paper:

We consider the following problem: Given an unsorted array of n elements, and a sequence of intervals in the array, compute the median in each of the subarrays defined by the intervals. We describe a simple algorithm which needs O(nlogk+klogn) time to answer k such median queries. This improves previous algorithms by a logarithmic factor and matches a comparison lower bound for k=O(n). The space complexity of our simple algorithm is O(nlogn) in the pointer machine model, and O(n) in the RAM model. In the latter model, a more involved O(n) space data structure can be constructed in O(nlogn) time where the time per query is reduced to O(logn/loglogn). We also give efficient dynamic variants of both data structures, achieving O(log^2n) query time using O(nlogn) space in the comparison model and O((logn/loglogn)^2) query time using O(nlogn/loglogn) space in the RAM model, and show that in the cell-probe model, any data structure which supports updates in O(log^O(1)n) time must have Ω(logn/loglogn) query time.

Our approach naturally generalizes to higher-dimensional range median problems, where element positions and query ranges are multidimensional—it reduces a range median query to a logarithmic number of range counting queries.

Of course, you could preprocess the whole array in O(n^3) time (or perhaps even O(n^2logn) time) and O(n^2) space to be able to return the median in O(1) time.

Additional constraints might help simplify the solution. For instance, do we know that r-l will lesser than a known constant? etc...

ansaurus

tags:

views:

answers:

A data structure problem

related questions