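How do I get a random record from MongoDB?

I want to get a random record from a huge collection (100 million records).

What is the fastest and most efficient way to do this?

The data is already there, and there is no field I could use to generate a random number and fetch a random row.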

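Current answer

An efficient and reliable way is:

Add a field named "random" to every document and assign it a random value, add an index on that field, then proceed as follows:

Let's assume we have a collection of web links named "links" and we want a random link from it: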

link = db.links.find().sort({random: 1}).limit(1)[0]

To make sure the same link does not come up a second time, update its random field with a new random number:

db.links.update({_id: link._id}, {$set: {random: Math.random()}})
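The pick-then-rerandomize cycle described in this answer can be tried without a database. The following is a minimal in-memory sketch, assuming a small array stands in for the "links" collection; the sample documents, the function name `pickAndRerandomize`, and the injectable `rng` parameter are all made up for illustration and are not part of MongoDB.

```javascript
// In-memory stand-in for the "links" collection; the values are invented.
const links = [
  { _id: 1, url: "https://example.com/a", random: 0.42 },
  { _id: 2, url: "https://example.com/b", random: 0.07 },
  { _id: 3, url: "https://example.com/c", random: 0.91 },
];

// Pick the document with the smallest "random" value, then give it a fresh
// random value so that it is unlikely to come first again.
// `rng` is a hypothetical parameter that makes the behaviour reproducible.
function pickAndRerandomize(docs, rng = Math.random) {
  if (docs.length === 0) return null;
  // Plays the role of find().sort({random: 1}).limit(1)
  const picked = docs.reduce((best, doc) =>
    doc.random < best.random ? doc : best
  );
  const snapshot = { ...picked }; // copy handed back to the caller
  picked.random = rng();          // plays the role of setting a new random value
  return snapshot;
}
```

On a real collection the index on the random field is what keeps the sorted lookup cheap; this sketch scans the array instead and only illustrates the selection logic.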

2011-03-25 13:56:27

Other answers

With Python (pymongo), the aggregate function works as well.

collection.aggregate([{'$sample': {'size': sample_size }}])

This approach is much faster than running a query against a random number (e.g. collection.find([random_int])). This is especially true for large collections.
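The same $sample stage can be used from the shell. Below is a small sketch in JavaScript (to match the other answers) that only builds the pipeline; `buildSamplePipeline` is a hypothetical helper name, and the optional filter argument is an assumption added for illustration. $sample itself is available from MongoDB 3.2 onward.

```javascript
// Build an aggregation pipeline that optionally filters first and then
// samples `size` documents at random. This helper is hypothetical and is
// not part of any MongoDB driver.
function buildSamplePipeline(size, filter) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const pipeline = [];
  if (filter && Object.keys(filter).length > 0) {
    pipeline.push({ $match: filter }); // narrow the candidates first
  }
  pipeline.push({ $sample: { size } }); // then sample at random
  return pipeline;
}

// In mongosh this could be used as, for example:
//   db.links.aggregate(buildSamplePipeline(1)).toArray()[0]
```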

2018-04-17 14:37:24

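With Map/Reduce you can certainly get a random record, just not necessarily very efficiently, depending on the size of the filtered collection you end up working with.

I have tested this method with 50,000 documents (the filter reduces them to about 30,000), and it executes in about 400 ms on an Intel i3 with 16 GB RAM and a SATA3 HDD...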

db.toc_content.mapReduce(
    /* map function */
    function() { emit( 1, this._id ); },

    /* reduce function */
    function(k,v) {
        var r = Math.floor((Math.random()*v.length));
        return v[r];
    },

    /* options */
    {
        out: { inline: 1 },
        /* Filter the collection to "A"ctive documents */
        query: { status: "A" }
    }
);

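The map function simply creates an array containing the ids of all documents that match the query. In my case I tested this with roughly 30,000 of the 50,000 possible documents.

The reduce function simply picks a random integer between 0 and the number of items in the array minus 1, and then returns that _id from the array.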
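To make the two steps concrete, here is a plain JavaScript model of the same logic that runs outside MongoDB. It is only a sketch: `pickRandomId`, the sample documents, and the injectable `rng` are illustrative assumptions, not part of the mapReduce API.

```javascript
// Plain JavaScript model of the map and reduce steps, run outside MongoDB.
// `docs`, `query` and `rng` are illustrative stand-ins.
function pickRandomId(docs, query, rng = Math.random) {
  // "map" step: collect the _id of every document that matches the filter
  const ids = docs.filter(query).map((doc) => doc._id);
  if (ids.length === 0) return undefined;
  // "reduce" step: random index between 0 and ids.length - 1
  const r = Math.floor(rng() * ids.length);
  return ids[r];
}

const sampleDocs = [
  { _id: "a", status: "A" },
  { _id: "b", status: "I" },
  { _id: "c", status: "A" },
  { _id: "d", status: "A" },
];
// Same filter as in the answer: only "A"ctive documents
const isActive = (doc) => doc.status === "A";
```

The model also shows where the cost comes from: every matching id is gathered before one is chosen, so the work grows with the size of the filtered set.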

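400 ms sounds like a long time, and it really is. If you had 50 million records rather than 50 thousand, the overhead could grow to the point where this becomes unusable in multi-user situations.

There is an open issue asking for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533

If this "random" selection were built into an index lookup, instead of collecting ids into an array and then selecting one, it would help enormously. (Go vote it up!)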

2014-01-29 23:26:46

You can also use shuffle-array after executing the query:

var shuffle = require('shuffle-array');

Accounts.find(qry, function (err, results_array) {
    newIndexArr = shuffle(results_array);
});

2019-05-12 05:43:50

Here is an approach that uses the default ObjectId values of _id plus a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

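That is the general logic in shell notation, and it is easy to adapt.

So, point by point:

Find the minimum and maximum primary key values in the collection. Generate a random number that falls between the timestamps of those documents. Add the random number to the minimum value, then find the first document that is greater than or equal to that value.

This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but the basic idea is the same.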
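The padding step can be isolated into small pure functions. This is a sketch under the assumption that the first 8 hex characters of an ObjectId hold a Unix timestamp in seconds, as the answer relies on; the function names and the `rng` parameter are hypothetical and chosen only for this example.

```javascript
// Build the 24-character hex string of the smallest possible ObjectId for a
// Unix timestamp in seconds: 8 hex digits of timestamp followed by 16 zeros.
function minObjectIdHexForSeconds(seconds) {
  if (!Number.isInteger(seconds) || seconds < 0 || seconds > 0xffffffff) {
    throw new RangeError("seconds must be an unsigned 32-bit integer");
  }
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

// Read the timestamp (in seconds) back from an ObjectId hex string.
function secondsFromObjectIdHex(hex) {
  return parseInt(hex.substring(0, 8), 16);
}

// Choose a random whole second in [minSeconds, maxSeconds].
function randomSecondsBetween(minSeconds, maxSeconds, rng = Math.random) {
  return minSeconds + Math.floor(rng() * (maxSeconds - minSeconds + 1));
}
```

In the shell, the resulting string could be passed to new ObjectId(...) and used in the $gte query. Padding with padStart also covers timestamps whose hex form is shorter than 8 characters.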

2015-06-26 11:06:04

My simplest solution to this is...

db.coll.find()
    .limit(1)
    .skip(Math.floor(Math.random() * 500))
    .next()

This assumes you have at least 500 documents in the collection.
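The hard-coded 500 could be replaced by the actual document count. A minimal sketch, assuming the count is fetched separately (for example with db.coll.countDocuments({}) in mongosh); `randomSkipOffset` and its `rng` parameter are hypothetical names used only here.

```javascript
// Compute a skip offset between 0 and count - 1. `count` would come from
// the collection itself rather than being hard-coded.
function randomSkipOffset(count, rng = Math.random) {
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError("count must be a positive integer");
  }
  return Math.floor(rng() * count);
}

// In mongosh, the offset could then be used like this:
//   const n = db.coll.countDocuments({});
//   db.coll.find().skip(randomSkipOffset(n)).limit(1).next();
```

Note that skip still has to walk past the skipped documents, so on very large collections larger offsets get slower.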

2022-09-22 03:26:04
