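I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want this to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?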
Current answer
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command. This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
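To see why block-level sampling can produce clustering, here is a small Python sketch. It is a simplified model and not PostgreSQL's actual implementation: the function name block_level_sample, the 100-rows-per-block figure, and the "visit blocks in shuffled order" strategy are all assumptions made for illustration.

```python
import random

def block_level_sample(num_rows, rows_per_block, k, rng):
    """Simplified model (an assumption, not PostgreSQL's code): visit whole
    blocks in random order and take their rows until k rows are collected."""
    if k <= 0:
        return []
    num_blocks = (num_rows + rows_per_block - 1) // rows_per_block
    order = list(range(num_blocks))
    rng.shuffle(order)
    picked = []
    for block in order:
        start = block * rows_per_block
        end = min(start + rows_per_block, num_rows)
        for row_id in range(start, end):
            picked.append(row_id)
            if len(picked) == k:
                return picked
    return picked  # table has fewer than k rows: the whole table is returned

rng = random.Random(7)
block_sample = block_level_sample(num_rows=1_000_000, rows_per_block=100, k=1000, rng=rng)
row_sample = rng.sample(range(1_000_000), 1000)  # row-level sampling for comparison

# Count how many distinct blocks each sample draws from.
blocks_touched_by_block_sample = len({r // 100 for r in block_sample})
blocks_touched_by_row_sample = len({r // 100 for r in row_sample})
print(blocks_touched_by_block_sample, blocks_touched_by_row_sample)
```

In this model the block-level sample comes from only a handful of blocks, while the row-level sample is spread across many. If neighbouring rows are correlated (for example, inserted at the same time), that is where the clustering effect would show up.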
First install the extension
CREATE EXTENSION tsm_system_rows;
Then your query,
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
Other answers
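The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01; goes record by record and decides whether to randomly filter each one out. This is O(N), because it only needs to check each record once.
select * from table order by random() limit 1000; sorts the entire table and then picks the first 1000. Aside from any voodoo magic behind the scenes, the ORDER BY is O(N * log N).
The downside of the random() < 0.01 one is that you get a variable number of output records.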
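The difference between the two queries can be mimicked outside the database. The following Python sketch is only an illustration of the two strategies, not of how PostgreSQL executes them; the function names bernoulli_filter and order_by_random_limit are made up for this example.

```python
import random

def bernoulli_filter(rows, p, rng):
    """Like WHERE random() < p: keep each row independently with probability p.
    One pass over the data, O(N); the result size varies from run to run."""
    return [row for row in rows if rng.random() < p]

def order_by_random_limit(rows, k, rng):
    """Like ORDER BY random() LIMIT k: give every row a random key, sort by it,
    keep the first k. A full sort is O(N log N); the result size is fixed."""
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:k]]

rng = random.Random(42)
rows = list(range(100_000))

filter_sizes = [len(bernoulli_filter(rows, 0.01, rng)) for _ in range(3)]
fixed_sizes = [len(order_by_random_limit(rows, 1000, rng)) for _ in range(3)]
print(filter_sizes)  # counts near 1000, but generally different on each run
print(fixed_sizes)   # always exactly 1000
```

The filter does less work per row, at the price of not knowing in advance how many rows come back.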
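Note that there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.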
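For reference, here is a minimal sketch of a partial Fisher-Yates shuffle in Python rather than SQL. It assumes the data fits in memory, which would not hold for a 500-million-row table, so treat it as an illustration of the algorithm only. The function name fisher_yates_sample is hypothetical.

```python
import random

def fisher_yates_sample(items, k, rng):
    """Return k items chosen uniformly without replacement by running only
    the first k steps of a Fisher-Yates shuffle."""
    pool = list(items)            # O(N) copy so the caller's data is untouched
    n = len(pool)
    k = max(0, min(k, n))         # cannot take more items than exist
    for i in range(k):
        j = rng.randrange(i, n)   # pick from the not-yet-fixed tail, i included
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

rng = random.Random(0)
sample = fisher_yates_sample(range(1_000_000), 1000, rng)
print(len(sample), len(set(sample)))
```

Stopping after k swaps is enough when only k rows are wanted. Running the loop over the whole range gives the complete O(N) shuffle the answer refers to.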
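My experience tells me:
offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I expected the offset approach to be faster because it should save the time spent sorting in Postgres. It turned out not to be.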
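I know I'm a little late to the party, but I just found this great tool called pg_sample:
pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.
I tried it on a database with 350M rows and it was really fast. I can't speak to the randomness.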
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db
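Starting with PostgreSQL 9.5, there is a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you roughly 5% of the elements from mytable.
See the documentation for more explanation: http://www.postgresql.org/docs/current/static/sql-select.html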