Find the running median from a stream of integers

Possible duplicate: Rolling median algorithm

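Suppose integers are read from a data stream. Find the median of the elements read so far in an efficient way.

A solution I have read: we can use a max-heap on the left side to hold the elements less than the effective median, and a min-heap on the right side to hold the elements greater than the effective median.

After an incoming element is processed, the number of elements in the two heaps differs by at most 1. When both heaps contain the same number of elements, the effective median is the average of the two heap roots. When the heaps are not balanced, the effective median is the root of the heap containing more elements.

But how do we construct the max-heap and the min-heap; that is, how do we know the effective median here? I think we would insert 1 element into the max-heap, then the next element into the min-heap, and so on. Correct me if I am wrong.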
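The two-heap scheme described above can be sketched in Python as follows. This is a minimal illustration, not code from the original post: the function name running_medians and the variables low and high are made up for the example, and the max-heap is simulated by storing negated values in the standard heapq module (which only provides a min-heap). Note that elements are routed by comparing them with the root of the lower heap rather than by strict alternation; the rebalancing step is what keeps the sizes within one of each other.

```python
import heapq


def running_medians(stream):
    """Yield the median of the elements seen so far, after each element."""
    low = []   # max-heap of the smaller half, stored as negated values
    high = []  # min-heap of the larger half
    for x in stream:
        # Route the new element by comparing it with the largest of the lower half.
        if not low or x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)
        # Rebalance so that the heap sizes differ by at most one.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low) + 1:
            heapq.heappush(low, -heapq.heappop(high))
        # Equal sizes: average the two roots; otherwise take the larger heap's root.
        if len(low) == len(high):
            yield (-low[0] + high[0]) / 2
        elif len(low) > len(high):
            yield -low[0]
        else:
            yield high[0]


print(list(running_medians([5, 15, 1, 3])))  # [5, 10.0, 5, 4.0]
```

Each element costs O(log n) for the heap operations, and reading the current median is O(1).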

Current answer

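Can't you do this with just one heap? Update: no. See the comments.

Invariant: after reading 2*n inputs, the min-heap holds the n largest of them.

Loop: read 2 inputs. Add them both to the heap, and remove the heap's minimum. This re-establishes the invariant.

So when 2n inputs have been read, the heap's minimum is the n-th largest. Averaging the two elements around the median position, and handling queries after an odd number of inputs, needs a little extra work.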

2012-05-21 21:12:22

Other answers

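I can confirm that @schmil-the-cat's answer is correct.

Here is an implementation in JS. I am not an expert in algorithms, but I think it may be useful to other people.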


class Heap {
  constructor(isMin) {
    this.heap = [];
    this.isMin = isMin;
  }

  heapify() {
    if (this.heap.length === 1) {
      return;
    }

    let currentIndex = this.heap.length - 1; 

    while (true) {
      if (currentIndex === 0) {
        break;
      }

      const parentIndex = Math.floor((currentIndex - 1) / 2);
      const parentValue = this.heap[parentIndex];
      const currentValue = this.heap[currentIndex];

      if (
        (this.isMin && parentValue < currentValue) ||
        (!this.isMin && parentValue > currentValue)
      ) {
        break;
      }

      this.heap[parentIndex] = currentValue;
      this.heap[currentIndex] = parentValue;

      currentIndex = parentIndex;
    }
  }

  insert(val) {
    this.heap.push(val);

    this.heapify();
  }

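  pop() {
    if (this.heap.length === 0) {
      return undefined;
    }

    const val = this.heap[0];
    const last = this.heap.pop();

    if (this.heap.length > 0) {
      // Move the last element to the root, then sift it down until
      // neither child should sit above it.
      this.heap[0] = last;
      let index = 0;

      while (true) {
        let best = index;

        for (const child of [2 * index + 1, 2 * index + 2]) {
          if (child >= this.heap.length) {
            continue;
          }
          const childWins = this.isMin
            ? this.heap[child] < this.heap[best]
            : this.heap[child] > this.heap[best];
          if (childWins) {
            best = child;
          }
        }

        if (best === index) {
          break;
        }

        [this.heap[index], this.heap[best]] = [this.heap[best], this.heap[index]];
        index = best;
      }
    }

    return val;
  }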

  top() {
    return this.heap[0];
  }

  length() {
    return this.heap.length;
  }
}

function findMedian(arr) {
  const topHeap = new Heap(true);
  const bottomHeap = new Heap(false);

  const output = [];

  if (arr.length < 2) {
    // Zero or one element: return an array, like the general case does.
    return arr.slice();
  }

  topHeap.insert(Math.max(arr[0], arr[1]));
  bottomHeap.insert(Math.min(arr[0], arr[1]));

  for (let i = 0; i < arr.length; i++) {
    const currentVal = arr[i];

    if (i === 0) {
      output.push(currentVal);
      continue;
    }

    if (i > 1) {
      if (currentVal < bottomHeap.top()) {
        bottomHeap.insert(currentVal);
      } else {
        topHeap.insert(currentVal);
      }
    }

    if (bottomHeap.length() - topHeap.length() > 1) {
      const bottomVal = bottomHeap.pop();
      topHeap.insert(bottomVal);
    }

    if (topHeap.length() - bottomHeap.length() > 1) {
      const topVal = topHeap.pop();
      bottomHeap.insert(topVal);
    }

    if (bottomHeap.length() === topHeap.length()) {
      output.push((bottomHeap.top() + topHeap.top()) / 2);
      continue;
    }

    if (bottomHeap.length() > topHeap.length()) {
      output.push(bottomHeap.top());
    } else {
      output.push(topHeap.top());
    }
  }

  return output;
}

2022-09-18 15:36:42

Here is my simple but efficient algorithm (in C++) for calculating the running median from a stream of integers:

#include <algorithm>
#include <fstream>
#include <list>
#include <stdexcept>
#include <vector>

// Writes the median of the last bufSize values each time a full window is available.
void runningMedian(std::ifstream& ifs, std::ofstream& ofs, const unsigned bufSize) {
    if (bufSize < 1)
        throw std::invalid_argument("Wrong buffer size.");
    const bool evenSize = bufSize % 2 == 0;
    std::list<int> q;       // window contents in arrival order
    std::vector<int> nums;  // the same values, kept sorted
    int n;
    unsigned count = 0;
    while (ifs >> n) {      // stops cleanly at EOF or on a failed read
        q.push_back(n);
        nums.insert(std::upper_bound(nums.begin(), nums.end(), n), n);
        count++;
        if (nums.size() > bufSize) {
            // Evict the oldest value; any element with an equal value will do.
            nums.erase(std::lower_bound(nums.begin(), nums.end(), q.front()));
            q.pop_front();
        }
        if (nums.size() == bufSize) {
            const auto mid = nums.size() / 2;
            if (evenSize)
                ofs << count << ": "
                    << (static_cast<double>(nums[mid - 1]) + nums[mid]) / 2.0 << '\n';
            else
                ofs << count << ": " << static_cast<double>(nums[mid]) << '\n';
        }
    }
}

The bufSize parameter specifies the length of the window over which the running median is calculated. As numbers are read from the input stream ifs, a vector of at most bufSize elements is maintained in sorted order. The median is the middle element of the sorted vector if bufSize is odd, or the sum of the two middle elements divided by 2 if bufSize is even. Additionally, I maintain a list of the last bufSize elements read from the input. When a new element arrives, I put it in the right place in the sorted vector and remove from the vector the element added bufSize steps earlier (its value is kept at the front of the list). At the same time I remove that old element from the list: every new element is placed at the back of the list, and every old element is removed from the front. Once bufSize is reached, both the list and the vector stop growing, and every insertion of a new element is compensated by the deletion of the old element that was placed in the list bufSize steps earlier. Note that I do not care whether I remove from the vector exactly the element placed bufSize steps earlier or just an element with the same value; for the median it does not matter. All calculated medians are written to the output stream.
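For a quick worked example of the same sliding-window idea, here is a small Python sketch. It is an illustration under assumptions rather than a port of the code above: the name windowed_medians is invented for the example, and it takes a list and returns a list instead of reading and writing streams. It uses the standard bisect and collections modules.

```python
from bisect import bisect_left, insort
from collections import deque


def windowed_medians(values, size):
    """Return the median of each full window of `size` consecutive values."""
    if size < 1:
        raise ValueError("window size must be at least 1")
    window = deque()  # window contents in arrival order
    ordered = []      # the same values, kept sorted
    medians = []
    for x in values:
        window.append(x)
        insort(ordered, x)
        if len(ordered) > size:
            # Evict the oldest value; any element with an equal value will do.
            oldest = window.popleft()
            del ordered[bisect_left(ordered, oldest)]
        if len(ordered) == size:
            mid = size // 2
            if size % 2:
                medians.append(ordered[mid])
            else:
                medians.append((ordered[mid - 1] + ordered[mid]) / 2)
    return medians


print(windowed_medians([5, 1, 9, 3, 7], 3))  # [5, 3, 7]
```

With a window of 3 the windows are [5, 1, 9], [1, 9, 3] and [9, 3, 7], whose medians are 5, 3 and 7. Insertion into the sorted list is O(size) because elements have to be shifted, which is fine for small windows.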

2020-08-23 23:58:28

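An intuitive way to think about this is that if you had a perfectly balanced binary search tree, the root would be the median element, since there would be the same number of smaller and greater elements. If the tree is not full this will not quite be the case, since elements will be missing from the last level.

So what we can do instead is keep the median and two balanced binary trees, one for the elements less than the median and one for the elements greater than the median. The two trees must be kept at the same size, to within one element.

When we get a new integer from the data stream, we compare it to the median. If it is greater than the median, we add it to the right tree. If the two tree sizes then differ by more than 1, we remove the minimum element of the right tree, make it the new median, and put the old median in the left tree. The case of a smaller element is symmetric.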
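The update rule above can be sketched in Python as follows. One substitution is made for brevity and should be read as an assumption: heaps from the standard heapq module stand in for the two balanced trees, because the rule only ever needs the minimum of the right side and the maximum of the left side. The class name PivotMedian and its attributes are invented for this example.

```python
import heapq


class PivotMedian:
    """Keep an explicit median element with a smaller side and a larger side."""

    def __init__(self):
        self.median = None
        self.left = []   # elements <= median, as a max-heap of negated values
        self.right = []  # elements > median, as a min-heap

    def add(self, x):
        if self.median is None:
            self.median = x
            return
        # Compare with the current median to choose a side.
        if x > self.median:
            heapq.heappush(self.right, x)
        else:
            heapq.heappush(self.left, -x)
        # If one side is ahead by more than 1, shift the median toward it.
        if len(self.right) - len(self.left) > 1:
            heapq.heappush(self.left, -self.median)
            self.median = heapq.heappop(self.right)
        elif len(self.left) - len(self.right) > 1:
            heapq.heappush(self.right, self.median)
            self.median = -heapq.heappop(self.left)


pm = PivotMedian()
for value in [1, 2, 3, 4, 5]:
    pm.add(value)
print(pm.median)  # 3
```

After an even number of inputs the stored median is one of the two middle elements rather than their average; averaging it with the root of the larger side would give the conventional even-count median.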

2012-05-22 18:59:01

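If we want the median of the n most recently seen elements, this problem has an exact solution that only needs the n most recently seen elements to be kept in memory. It is fast and scales well.

An indexable skiplist supports O(log n) insertion, removal, and indexed lookup of arbitrary elements while maintaining sorted order. When coupled with a FIFO queue that tracks the n-th oldest entry, the solution is simple: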

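from collections import deque
from itertools import islice

# IndexableSkiplist is defined in the linked recipe; it is not part of the standard library.

class RunningMedian:
    'Fast running median with O(lg n) updates where n is the window size'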

    def __init__(self, n, iterable):
        self.it = iter(iterable)
        self.queue = deque(islice(self.it, n))
        self.skiplist = IndexableSkiplist(n)
        for elem in self.queue:
            self.skiplist.insert(elem)

    def __iter__(self):
        queue = self.queue
        skiplist = self.skiplist
        midpoint = len(queue) // 2
        yield skiplist[midpoint]
        for newelem in self.it:
            oldelem = queue.popleft()
            skiplist.remove(oldelem)
            queue.append(newelem)
            skiplist.insert(newelem)
            yield skiplist[midpoint]

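Here are links to complete working code (an easy-to-understand class version and an optimized generator version with the indexable skiplist code inlined):

http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/
http://code.activestate.com/recipes/577073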

2012-05-22 05:36:47

The most efficient way I have found to calculate percentiles of a stream is the P² algorithm: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985)

The algorithm is straightforward to implement and works extremely well. It is an estimate, however, so keep that in mind. From the abstract:

A heuristic algorithm is proposed for dynamic calculation of the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.
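To give a feel for how little state the method keeps, here is a rough Python sketch of a five-marker estimator in the spirit of P², fixed at the median (p = 0.5). It is written from a general understanding of the published algorithm, not copied from the paper, so treat the update formulas as an assumption and check them against the paper before relying on them. The class name P2Median and its attribute names are invented for this example.

```python
class P2Median:
    """Approximate running median with five markers (P-squared style sketch)."""

    def __init__(self):
        p = 0.5
        self.q = []                   # marker heights
        self.n = [1, 2, 3, 4, 5]      # actual marker positions (1-based)
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]            # growth per observation

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:                # bootstrap: keep the first five values sorted
            q.append(x)
            q.sort()
            return
        # Find the cell k that contains x, extending the extremes if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = 0
            while x >= q[k + 1]:
                k += 1
        for i in range(k + 1, 5):
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Nudge the three interior markers toward their desired positions.
        for i in (1, 2, 3):
            d = self.want[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                cand = self._parabolic(i, d)
                if not q[i - 1] < cand < q[i + 1]:
                    # Fall back to linear interpolation toward the neighbour.
                    cand = q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
                q[i] = cand
                n[i] += d

    def median(self):
        q = self.q
        if not q:
            raise ValueError("no observations yet")
        if len(q) < 5:                # exact median while bootstrapping
            mid = len(q) // 2
            return q[mid] if len(q) % 2 else (q[mid - 1] + q[mid]) / 2
        return q[2]


est = P2Median()
for i in range(1001):
    est.add((i * 618) % 1001)         # a scrambled permutation of 0..1000
print(est.median())                   # an estimate; the exact median is 500
```

The state is five heights and ten position values regardless of how many observations arrive, which is the trade-off the abstract describes: constant memory in exchange for an approximate answer.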

2012-05-21 23:14:09
