最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。
我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。
Arrays.sort(array);
面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?
最近我参加了一个面试,面试官要求我“编写一个程序,从一个包含10亿个数字的数组中找出100个最大的数字”。
我只能给出一个蛮力解决方案,即以O(nlogn)时间复杂度对数组进行排序,并取最后100个数字。
Arrays.sort(array);
面试官正在寻找一个更好的时间复杂度,我尝试了几个其他的解决方案,但都没有回答他。有没有更好的时间复杂度解决方案?
当前回答
此代码用于在未排序数组中查找N个最大的数字。
#include <iostream>
using namespace std;
#define Array_Size 5 // No Of Largest Numbers To Find
#define BILLION 10000000000
void findLargest(int max[], int array[]);
int checkDup(int temp, int max[]);
int main() {
int array[BILLION] // contains data
int i=0, temp;
int max[Array_Size];
findLargest(max,array);
cout<< "The "<< Array_Size<< " largest numbers in the array are: \n";
for(i=0; i< Array_Size; i++)
cout<< max[i] << endl;
return 0;
}
void findLargest(int max[], int array[])
{
int i,temp,res;
for(int k=0; k< Array_Size; k++)
{
i=0;
while(i < BILLION)
{
for(int j=0; j< Array_Size ; j++)
{
temp = array[i];
res= checkDup(temp,max);
if(res == 0 && max[j] < temp)
max[j] = temp;
}
i++;
}
}
}
int checkDup(int temp, int max[])
{
for(int i=0; i<N_O_L_N_T_F; i++)
{
if(max[i] == temp)
return -1;
}
return 0;
}
这可能不是一个有效的方法,但可以完成工作。
希望这能有所帮助
其他回答
求n个元素中最大的m个元素,其中n >>> m
最简单的解决方案,每个人都应该很明显,就是简单地做m次冒泡排序算法。
然后打印出数组的最后n个元素。
它不需要外部数据结构,并且使用了一种大家都知道的算法。
运行时间估计为O(m*n)。到目前为止最好的答案是O(nlog (m)),所以这个解决方案对于小m来说并不显着昂贵。
我并不是说这不能改进,但这是迄今为止最简单的解决方案。
您可以使用快速选择算法在(按顺序)索引[十亿-101]处查找数字 然后遍历这些数字找出比这个数字更大的数。
array={...the billion numbers...}
result[100];
pivot=QuickSelect(array,billion-101);//O(N)
for(i=0;i<billion;i++)//O(N)
if(array[i]>=pivot)
result.add(array[i]);
该算法时间为:2 X O(N) = O(N)(平均情况性能)
Thomas Jungblut建议的第二个选择是:
使用堆构建最大堆将花费O(N),然后前100个最大的数字将在堆的顶部,所有你需要的是把它们从堆(100 X O(Log(N))。
该算法时间为:O(N) + 100 X O(Log(N)) = O(N)
The simplest solution is to scan the billion numbers large array and hold the 100 largest values found so far in a small array buffer without any sorting and remember the smallest value of this buffer. First I thought this method was proposed by fordprefect but in a comment he said that he assumed the 100 number data structure being implemented as a heap. Whenever a new number is found that is larger then the minimum in the buffer is overwritten by the new value found and the buffer is searched for the current minimum again. If the numbers in billion number array are randomly distributed most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very very small fraction of number the value must be inserted into the small array. So the difference of manipulating the data structure holding the small numbers can be neglected. For a small number of elements it is hard to determine if the usage of a priority queue is actually faster than using my naive approach.
I want to estimate the number of inserts in the small 100 element array buffer when the 10^9 element array is scanned. The program scans the first 1000 elements of this large array and has to insert at most 1000 elements in the buffer. The buffer contains 100 element of the 1000 elements scanned, that is 0.1 of the element scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1 Such an element has to be inserted in the buffer . Now the program scans the next 10^4 elements from the large array. Because the minimum of the buffer will increase every time a new element is inserted. We estimated that the ratio of elements larger than our current minimum is about 0.1 and so there are 0.1*10^4=1000 elements to insert. Actually the expected number of elements that are inserted into the buffer will be smaller. After the scan of this 10^4 elements fraction of the numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that not more than 0.01*10^5=1000 will be inserted in the buffer. Continuing this argumentation we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array with 10^9 elements of random size we expect not more than 10^4 (=7000 rounded up) insertions in the buffer. After each insertion into the buffer the new minimum must be found. If the buffer is a simple array we need 100 comparison to find the new minimum. If the buffer is another data structure (like a heap) we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as buffer and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So using a heap brings only a gain of 0.1% if performance is determined by the number of comparison. But what is the difference in execution time between inserting an element in a 100 element heap and replacing an element in an 100 element array and finding its new minimum?
在理论层面:在堆中插入需要多少比较。我知道它是O(log(n))但常数因子有多大呢?我 在机器级别:缓存和分支预测对堆插入和数组中线性搜索的执行时间有什么影响? 在实现级别:库或编译器提供的堆数据结构中隐藏了哪些额外成本?
我认为,在人们试图估计100个元素堆和100个元素数组的性能之间的真正区别之前,这些都是必须回答的一些问题。所以做一个实验并测量真实的表现是有意义的。
复杂度为O(N)
首先创建一个100个int的数组,将这个数组的第一个元素初始化为N个值的第一个元素, 用另一个变量CurrentBig来跟踪当前元素的索引
遍历N个值
if N[i] > M[CurrentBig] {
M[CurrentBig]=N[i]; ( overwrite the current value with the newly found larger number)
CurrentBig++; ( go to the next position in the M array)
CurrentBig %= 100; ( modulo arithmetic saves you from using lists/hashes etc.)
M[CurrentBig]=N[i]; ( pick up the current value again to use it for the next Iteration of the N array)
}
完成后,从CurrentBig中打印M数组100次模100:-) 对于学生:确保代码的最后一行在代码退出之前没有胜过有效数据
首先取1000个元素并将它们添加到一个max堆中。现在取出前最多100个元素并将其存储在某个地方。现在从文件中选择接下来的900个元素,并将它们与最后100个最高的元素一起添加到堆中。
一直重复这个过程,从堆中取出100个元素,从文件中添加900个元素。
从100个元素中最后选出的100个元素将从10亿个数字中选出最大的100个元素。