Write a program to find the 100 largest numbers in an array of 1 billion numbers

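I recently attended an interview where I was asked to "write a program to find the 100 largest numbers in an array of 1 billion numbers."

I could only offer a brute-force solution: sort the array in O(n log n) time and take the last 100 numbers.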

Arrays.sort(array);

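The interviewer was looking for a better time complexity. I tried a couple of other solutions, but none of them satisfied him. Is there a solution with better time complexity?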

Current answer

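I wrote up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list that it keeps sorted. This is similar to a priority queue implementation.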

import bisect

def kLargest(A, k):
    '''returns list of k largest integers in A'''
    ret = []
    for i, a in enumerate(A):
        # For first k elements, simply construct sorted temp list
        # It is treated similarly to a priority queue
        if i < k:
            bisect.insort(ret, a) # properly inserts a into sorted list ret
        # Iterate over rest of array
        # Replace and update return array when more optimal element is found
        else:
            if a > ret[0]:
                del ret[0] # pop min element off queue
                bisect.insort(ret, a) # properly inserts a into sorted list ret
    return ret

Usage with 100,000,000 elements and a worst-case input, which is a sorted list:

>>> from so import kLargest
>>> kLargest(range(100000000), 100)
[99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907,
 99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915,
 99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923,
 99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931,
 99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939,
 99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947,
 99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955,
 99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963,
 99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971,
 99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979,
 99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987,
 99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995,
 99999996, 99999997, 99999998, 99999999]

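It took about 40 seconds to compute this for 100,000,000 elements, so I'm scared to do it for 1 billion. To be fair, though, I was feeding it the worst-case input (ironically, an array that is already sorted).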
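For comparison, here is a minimal sketch of the same idea using an actual priority queue from Python's standard `heapq` module. It is not part of the answer above; the function name `k_largest_heap` is made up for illustration. Replacing the minimum costs O(log k) here, whereas `del ret[0]` on a list shifts all k elements.

```python
import heapq

def k_largest_heap(values, k):
    """Return the k largest items of `values` in ascending order."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current top k
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # Pop the current minimum and push v in one O(log k) step.
            heapq.heapreplace(heap, v)
    return sorted(heap)
```

The standard library also offers `heapq.nlargest(k, values)`, which returns the same items in descending order.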

2013-10-07 20:28:14

Other answers

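You can use the quickselect algorithm to find the number at (sorted-order) index [billion-101], and then iterate over the numbers to find those greater than that number.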

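array = {...the billion numbers...}
result[100];

pivot = QuickSelect(array, billion - 101); // O(N): the 101st-largest value

for (i = 0; i < billion; i++) // O(N)
   if (array[i] > pivot)      // strictly greater: exactly 100 values when all are distinct
      result.add(array[i]);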

The running time of this algorithm is: 2 X O(N) = O(N) (average-case performance)
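A minimal Python sketch of the quickselect approach, assuming the data fits in memory and that copying it is acceptable. The function name `quickselect_top_k` is made up for illustration. It uses a random pivot and a three-way partition, so duplicate values do not cause it to return more or fewer than k items.

```python
import random

def quickselect_top_k(values, k):
    """Return the k largest items (arbitrary order) in expected O(n) time."""
    a = list(values)  # work on a copy so the caller's data is untouched
    n = len(a)
    if k <= 0:
        return []
    if k >= n:
        return a
    target = n - k  # once a[target] is in place, a[target:] are the k largest
    lo, hi = 0, n - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        # Three-way partition of a[lo..hi] into [< pivot | == pivot | > pivot].
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        if target < lt:
            hi = lt - 1      # continue in the "< pivot" part
        elif target > gt:
            lo = gt + 1      # continue in the "> pivot" part
        else:
            break            # a[target] equals the pivot and is in its final position
    return a[target:]
```

Because the partitioning leaves the k largest items in the rightmost k positions, no second pass over the array is needed in this variant.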

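A second option, suggested by Thomas Jungblut, is:

Use a heap. Building a max-heap takes O(N); the 100 largest numbers are then at the top of the heap, and all you need to do is pop them off the heap (100 X O(Log(N))).

The running time of this algorithm is: O(N) + 100 X O(Log(N)) = O(N)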
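A short sketch of the heap option in Python, assuming numeric input. The standard `heapq` module only provides a min-heap, so this sketch negates the values to simulate a max-heap; the function name `k_largest_heapify` is made up for illustration.

```python
import heapq

def k_largest_heapify(values, k):
    """Return the k largest numbers in descending order."""
    neg = [-v for v in values]  # negate so the min-heap behaves like a max-heap
    heapq.heapify(neg)          # builds the heap in O(n)
    # Each pop costs O(log n); k pops yield the k largest values.
    return [-heapq.heappop(neg) for _ in range(min(k, len(neg)))]
```

Note that this holds all N values in memory at once, unlike a bounded heap of size 100.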

2013-10-07 15:13:08

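Although the other quickselect solution has been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. In terms of comparisons, quickselect has an expected running time of 2n + o(n). A very simple implementation would be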

array = input array of length n
r = Quickselect(array,n-100)
result = array of length 100
for(i = 1 to n)
  if(array[i]>r)
     add array[i] to result

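This takes 3n + o(n) comparisons on average. Moreover, quickselect leaves the 100 largest items of the array in the 100 rightmost positions, which can be used to make this more efficient. So in fact, the running time can be improved to 2n + o(n).

One issue is that this is the expected running time, not the worst case. However, with a decent pivot selection strategy (for example, pick 21 elements at random and use the median of those 21 as the pivot), the number of comparisons can be guaranteed, with high probability, to be at most (2+c)n for an arbitrarily small constant c.

In fact, with an optimized sampling strategy (for example, sample sqrt(n) elements at random and choose the 99th percentile), the running time can be brought down to (1+c)n + o(n) for arbitrarily small c (assuming that K, the number of elements to be selected, is o(n)).

On the other hand, using a queue of size 100 requires O(log(100)n) comparisons, and the base-2 logarithm of 100 is approximately 6.6.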

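If we think of this problem in the more abstract sense of choosing the largest K elements from an array of size N, where K = o(N) but both K and N go to infinity, then the running time of the quickselect version is O(N) and that of the queue version is O(N log K), so in this sense quickselect is also asymptotically better.

In the comments, it was mentioned that the queue solution runs in expected time N + K log N on a random input. Of course, the random-input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in a random order, but this would incur the additional cost of N calls to a random number generator, as well as either permuting the entire input array or allocating a new array of length N containing the random indices.

If the problem does not allow you to move the elements of the original array, and allocating memory is so expensive that copying the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.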

2013-10-07 15:42:58

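I realize this is tagged "algorithm", but I will toss out some other options, since it probably should also be tagged "interview".

What is the source of the 1 billion numbers? If it is a database, then 'select value from table order by value desc limit 100' would do the job quite nicely, although there may be dialect differences.

Is this a one-off, or something that will be repeated? If repeated, how often? If it is a one-off and the data is in a file, then 'cat srcfile | sort (options as needed, e.g. -nr for numeric descending order) | head -100' lets you get on with the productive work you are paid to do while the computer handles this trivial chore.

If it is repeated, you would advise picking any decent approach to get the initial answer and storing or caching the results, so that you can continuously report the top 100.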

Finally, there is this consideration. Are you looking for an entry-level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

Better luck on the next interview.

2013-10-08 22:09:02

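My immediate reaction would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.

Create an array of size 200 and fill it with the first 200 input values. Run QuickSelect and discard the lowest 100, leaving 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run through the entire input in batches of 100.

At the end you have the top 100 values. For N values, you have run QuickSelect roughly N/100 times. Each QuickSelect costs about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter that I hardwired to 100 in this explanation.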
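The batching scheme described above can be sketched in Python as follows. This is an illustration under assumptions, not the answerer's code: the names `top_k_of` and `k_largest_batched` are made up, and the selection step is a simple list-based quickselect variant rather than an in-place one.

```python
import random

def top_k_of(buf, k):
    """Quickselect-style selection: the k largest items of buf, in arbitrary order."""
    if k <= 0:
        return []
    if k >= len(buf):
        return list(buf)
    pivot = random.choice(buf)
    greater = [x for x in buf if x > pivot]
    equal = [x for x in buf if x == pivot]
    if k <= len(greater):
        return top_k_of(greater, k)
    if k <= len(greater) + len(equal):
        return greater + equal[:k - len(greater)]
    less = [x for x in buf if x < pivot]
    return greater + equal + top_k_of(less, k - len(greater) - len(equal))

def k_largest_batched(stream, k):
    """Keep at most 2k values in memory; prune to the top k whenever the buffer fills."""
    if k <= 0:
        return []
    buf = []
    for v in stream:
        buf.append(v)
        if len(buf) == 2 * k:
            buf = top_k_of(buf, k)  # discard the lower k, freeing k slots
    return top_k_of(buf, k)
```

Since `stream` can be any iterable, such as a generator reading from a file, the full input never needs to be held in memory.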

2013-10-07 18:50:36

If this is asked in an interview, the interviewer probably wants to see your problem-solving process, not just your knowledge of algorithms.

The description is quite general, so you could ask him about the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stand for people's ages, then it is a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people of each age in a single pass. Here the index is the age. After this, it is a piece of cake to find the 100 largest numbers. This technique is the counting step of counting sort.
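A small sketch of the counting idea, assuming every value is an integer age between 0 and an assumed upper bound of 200. The function name `k_largest_by_counting` and the `max_age` parameter are made up for illustration.

```python
def k_largest_by_counting(ages, k, max_age=200):
    """Return the k largest ages in descending order using one counting pass."""
    counts = [0] * (max_age + 1)  # index = age, value = number of people
    for age in ages:
        counts[age] += 1          # assumes 0 <= age <= max_age
    result = []
    # Walk from the oldest age downward, taking as many as are still needed.
    for age in range(max_age, -1, -1):
        take = min(counts[age], k - len(result))
        result.extend([age] * take)
        if len(result) == k:
            break
    return result
```

The counting pass is O(N) and the extra memory is a fixed 201 counters, independent of N.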

In any case, making the question more specific and clearer works in your favor in an interview.

2013-10-08 18:04:09
