假设我有两个表,l1和l2。我想执行l1 - l2,返回l1中不在l2中的所有元素。
我可以想出一个简单的循环方法来做这个,但那真的很低效。python式的高效方法是什么?
举个例子,如果l1 = [1,2,6,8], l2 = [2,3,5,8], l1 - l2应该返回[1,6]
假设我有两个表,l1和l2。我想执行l1 - l2,返回l1中不在l2中的所有元素。
我可以想出一个简单的循环方法来做这个,但那真的很低效。python式的高效方法是什么?
举个例子,如果l1 = [1,2,6,8], l2 = [2,3,5,8], l1 - l2应该返回[1,6]
Python有一个称为列表推导式的语言特性,它非常适合使这类事情变得极其简单。下面的语句完全是你想要的,并将结果存储在l3中:
l3 = [x for x in l1 if x not in l2]
L3将包含[1,6]。
一种方法是使用集合:
>>> set([1,2,6,8]) - set([2,3,5,8])
set([1, 6])
但是请注意,集合不会保留元素的顺序,并且会删除任何重复的元素。元素也需要是可哈希的。如果这些限制是可以容忍的,那么这通常是最简单和性能最高的选项。
使用Python set类型。这是最Pythonic的。:)
此外,由于它是原生的,它也应该是最优化的方法。
See:
http://docs.python.org/library/stdtypes.html#set
http://docs.python.org/library/sets.htm(适用于较旧的python)
# Using Python 2.7 set literal format.
# Otherwise, use: l1 = set([1,2,6,8])
#
l1 = {1,2,6,8}
l2 = {2,3,5,8}
l3 = l1 - l2
扩展Donut的答案和这里的其他答案,通过使用生成器推导式而不是列表推导式,以及使用集合数据结构(因为in操作符在列表上是O(n),而在集合上是O(1)),您可以得到更好的结果。
这里有一个函数适合你:
def filter_list(full_list, excludes):
s = set(excludes)
return (x for x in full_list if x not in s)
结果将是一个可迭代对象,它将惰性地获取过滤后的列表。如果你需要一个真正的列表对象(例如,如果你需要对结果执行len()),那么你可以很容易地像这样构建一个列表:
filtered_list = list(filter_list(full_list, excludes))
性能比较
在Python 3.9.1和Python 2.7.16上比较这里提到的所有答案的性能。
Python 3.9.1
答案按表现顺序列出:
Arkku's set difference using subtraction "-" operation - (91.3 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 5000000 loops, best of 5: 91.3 nsec per loop Moinuddin Quadri's using set().difference()- (133 nsec per loop) mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 2000000 loops, best of 5: 133 nsec per loop Moinuddin Quadri's list comprehension with set based lookup- (366 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 5: 366 nsec per loop Donut's list comprehension on plain list - (489 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 500000 loops, best of 5: 489 nsec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (583 nsec per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 500000 loops, best of 5: 583 nsec per loop Moinuddin Quadri's using filter() and explicitly type-casting to list (need to explicitly type-cast as in Python 3.x, it returns iterator) - (681 nsec per loop) mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filter(lambda x: x not in l2, l1))" 500000 loops, best of 5: 681 nsec per loop Akshay Hazari's using combination of functools.reduce + filter -(3.36 usec per loop) : Explicitly type-casting to list as from Python 3.x it started returned returning iterator. Also we need to import functools to use reduce in Python 3.x mquadri$ python3 -m timeit "from functools import reduce; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2))" 100000 loops, best of 5: 3.36 usec per loop
Python 2.7.16
答案按表现顺序列出:
Arkku's set difference using subtraction "-" operation - (0.0783 usec per loop) mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2" 10000000 loops, best of 3: 0.0783 usec per loop Moinuddin Quadri's using set().difference()- (0.117 usec per loop) mquadri$ mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)" 10000000 loops, best of 3: 0.117 usec per loop Moinuddin Quadri's list comprehension with set based lookup- (0.246 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.246 usec per loop Donut's list comprehension on plain list - (0.372 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]" 1000000 loops, best of 3: 0.372 usec per loop Moinuddin Quadri's using filter() - (0.593 usec per loop) mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "filter(lambda x: x not in l2, l1)" 1000000 loops, best of 3: 0.593 usec per loop Daniel Pryden's generator expression with set based lookup and type-casting to list - (0.964 per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup. mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)" 1000000 loops, best of 3: 0.964 usec per loop Akshay Hazari's using combination of functools.reduce + filter -(2.78 usec per loop) mquadri$ python -m timeit "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2)" 100000 loops, best of 3: 2.78 usec per loop
使用Set推导式{x for x in l2}或Set (l2)来获取Set,然后使用List推导式来获取List
l2set = set(l2)
l3 = [x for x in l1 if x not in l2set]
基准测试代码:
import time
l1 = list(range(1000*10 * 3))
l2 = list(range(1000*10 * 2))
l2set = {x for x in l2}
tic = time.time()
l3 = [x for x in l1 if x not in l2set]
toc = time.time()
diffset = toc-tic
print(diffset)
tic = time.time()
l3 = [x for x in l1 if x not in l2]
toc = time.time()
difflist = toc-tic
print(difflist)
print("speedup %fx"%(difflist/diffset))
基准测试结果:
0.0015058517456054688
3.968189239501953
speedup 2635.179227x
Python 3.8上的集合和列表理解基准
(加起来就是Moinuddin Quadri的基准)
tldr:使用Arkku的集合解决方案,相比之下比承诺的还要快!
根据列表检查现有文件
在我的例子中,我发现使用Arkku的集合解决方案要比python的列表理解方法快40倍(!)。
列表理解:
%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
[i for i in wanted if i not in existing]
墙壁时间:28.2秒
Sets
%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
set(wanted) - set(existing)
壁时间:689毫秒
使用set.difference ():
你可以使用set.difference()来获取新的set,其中包含集合中不存在于其他集合中的元素。即set(A).difference(B)将返回包含A中存在而B中不存在的项的set。例如:
>>> set([1,2,6,8]).difference([2,3,5,8])
{1, 6}
它是Arkku的答案中提到的一种函数方法(它使用算术减法-运算符来获得集差)。
因为集合是无序的,你将失去初始列表中元素的顺序。(如果你想保持元素的顺序,请继续阅读下一节)
使用列表理解和基于集的查找
如果你想保持从初始列表的顺序,那么甜甜圈的基于列表理解的答案将做的技巧。但是,通过在内部使用set来检查元素是否存在于其他列表中,可以从接受的答案中获得更好的性能。例如:
l1, l2 = [1,2,6,8], [2,3,5,8]
s2 = set(l2) # Type-cast `l2` to `set`
l3 = [x for x in l1 if x not in s2]
# ^ Doing membership checking on `set` s2
如果你有兴趣了解为什么与列表相比,set的成员资格检查更快,请阅读这篇文章:是什么使得set比list更快?
使用filter()和lambda表达式
下面是使用filter()和lambda表达式的另一种替代方法。在这里添加它只是为了参考,但它不是有效的性能:
>>> l1 = [1,2,6,8]
>>> l2 = set([2,3,5,8])
# v `filter` returns the a iterator object. Here I'm type-casting
# v it to `list` in order to display the resultant value
>>> list(filter(lambda x: x not in l2, l1))
[1, 6]
试试这个:
l1=[1,2,6,8]
l2=[2,3,5,8]
r=[]
for x in l1:
if x in l2:
continue
r=r+[x]
print(r)
使用filterfalse而不使用lambda-expression
When using functions like filter or filterfalse and similar from itertools you can usually save performance by avoiding lambda-expressions and using already existing functions. Instances of list and set defines a __contains__-method to use for containment checks. The in-operator calls this method under the hood, so using x in l2 can be replaced by l2.__contains__(x). Usually this replacement is not really prettier but in this specific case it allows us to gain better performance than using a lambda-expression, when used in combination with filterfalse:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = [2, 3, 5, 8]
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
Filterfalse创建一个迭代器,该迭代器生成的所有元素在用作12.2 .__contains__的参数时返回false。
Sets有一个更快的__contains__实现,所以更好的是:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = set([2, 3, 5, 8])
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
性能
使用列表:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
500000 loops, best of 5: 522 nsec per loop
使用设置:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
1000000 loops, best of 5: 359 nsec per loop
通过利用字典的有序属性来维持顺序(Python 3.7+)
注意:Python 3.6中字典的参考实现按照插入顺序维护键,但规范不保证这一点。对于3.7及更高版本,添加了这个保证。
字典的键作为一种集合;重复项被隐式过滤掉,由于散列,查找是高效的。因此,我们可以通过使用l1作为键来构建字典,然后删除出现在l2中的任何键来实现“set difference”。这维持了秩序并使用了一种快速的算法,但会产生相当数量的常量开销。
d = dict.fromkeys(l1)
for i in l2:
try:
del d[i]
except KeyError:
pass
l3 = list(d.keys())
如果你想要那种行为,集合方法是最好的。如果您不想删除列表l1中仅在l2中存在过一次的元素的所有实例,那么这些set操作将导致错误的结果。假设你在l1中有重复的元素,甚至在l2中也有重复的元素,并且想要两个列表l1 - l2的实际差值,同时保持其余元素的顺序:
l1 = [1, 2, 3, 4, 5, 5, 6, 5, 5, 2]
l2 = [1, 2, 2, 5]
_ = [l1.remove(item) for item in l2 if item in l1] # discard return value
print(l1) # [3, 4, 5, 6, 5, 5]
注意,这将明显比设置操作慢,只在用例需要时使用它 如果你不想修改原来的列表-先创建一个列表的副本