计算列表差值

在Python中,计算两个列表之间的差值的最佳方法是什么?

例子

A = [1,2,3,4]
B = [2,5]


A - B = [1,3,4]
B - A = [5]
316939 次浏览

你可能想要使用set而不是list

如果顺序不重要,你可以简单地计算集差:

>>> set([1,2,3,4]) - set([2,5])
set([1, 4, 3])
>>> set([2,5]) - set([1,2,3,4])
set([5])

你可以做一个

list(set(A)-set(B))

而且

list(set(B)-set(A))

如果你不关心项目的顺序或重复,请使用set。如果需要,请使用列表理解:

>>> def diff(first, second):
second = set(second)
return [item for item in first if item not in second]


>>> diff(A, B)
[1, 3, 4]
>>> diff(B, A)
[5]
>>>

一个衬套:

diff = lambda l1,l2: [x for x in l1 if x not in l2]
diff(A,B)
diff(B,A)

或者:

diff = lambda l1,l2: filter(lambda x: x not in l2, l1)
diff(A,B)
diff(B,A)

上面的例子使计算差异的问题变得微不足道。假设排序或重复数据删除确实使计算差异变得更容易,但如果您的比较无法承担这些假设,那么您将需要一个diff算法的重要实现。请参阅python标准库中的difflib。

#! /usr/bin/python2
from difflib import SequenceMatcher


A = [1,2,3,4]
B = [2,5]


squeeze=SequenceMatcher( None, A, B )


print "A - B = [%s]"%( reduce( lambda p,q: p+q,
map( lambda t: squeeze.a[t[1]:t[2]],
filter(lambda x:x[0]!='equal',
squeeze.get_opcodes() ) ) ) )

或Python3…

#! /usr/bin/python3
from difflib import SequenceMatcher
from functools import reduce


A = [1,2,3,4]
B = [2,5]


squeeze=SequenceMatcher( None, A, B )


print( "A - B = [%s]"%( reduce( lambda p,q: p+q,
map( lambda t: squeeze.a[t[1]:t[2]],
filter(lambda x:x[0]!='equal',
squeeze.get_opcodes() ) ) ) ) )


输出:

A - B = [[1, 3, 4]]

Python 2.7.3(默认,2014年2月27日,19:58:35)- IPython 1.1.0 - timeit: (github要点)

def diff(a, b):
b = set(b)
return [aa for aa in a if aa not in b]


def set_diff(a, b):
return list(set(a) - set(b))


diff_lamb_hension = lambda l1,l2: [x for x in l1 if x not in l2]


diff_lamb_filter = lambda l1,l2: filter(lambda x: x not in l2, l1)


from difflib import SequenceMatcher
def squeezer(a, b):
squeeze = SequenceMatcher(None, a, b)
return reduce(lambda p,q: p+q, map(
lambda t: squeeze.a[t[1]:t[2]],
filter(lambda x:x[0]!='equal',
squeeze.get_opcodes())))

结果:

# Small
a = range(10)
b = range(10/2)


timeit[diff(a, b)]
100000 loops, best of 3: 1.97 µs per loop


timeit[set_diff(a, b)]
100000 loops, best of 3: 2.71 µs per loop


timeit[diff_lamb_hension(a, b)]
100000 loops, best of 3: 2.1 µs per loop


timeit[diff_lamb_filter(a, b)]
100000 loops, best of 3: 3.58 µs per loop


timeit[squeezer(a, b)]
10000 loops, best of 3: 36 µs per loop


# Medium
a = range(10**4)
b = range(10**4/2)


timeit[diff(a, b)]
1000 loops, best of 3: 1.17 ms per loop


timeit[set_diff(a, b)]
1000 loops, best of 3: 1.27 ms per loop


timeit[diff_lamb_hension(a, b)]
1 loops, best of 3: 736 ms per loop


timeit[diff_lamb_filter(a, b)]
1 loops, best of 3: 732 ms per loop


timeit[squeezer(a, b)]
100 loops, best of 3: 12.8 ms per loop


# Big
a = xrange(10**7)
b = xrange(10**7/2)


timeit[diff(a, b)]
1 loops, best of 3: 1.74 s per loop


timeit[set_diff(a, b)]
1 loops, best of 3: 2.57 s per loop


timeit[diff_lamb_filter(a, b)]
# too long to wait for


timeit[diff_lamb_filter(a, b)]
# too long to wait for


timeit[diff_lamb_filter(a, b)]
# TypeError: sequence index must be integer, not 'slice'

@roman-bodnarchuk列表推导函数Def diff(a, b)似乎更快。

如果你想要递归地深入到列表的项目中,我已经为python写了一个包:https://github.com/erasmose/deepdiff

安装

从PyPi安装:

pip install deepdiff

如果你是Python3,你还需要安装:

pip install future six

示例使用

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function

同一对象返回空

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
{}

项目类型发生变化

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
{'type_changes': ["root[2]: 2=<type 'int'> vs. 2=<type 'str'>"]}

某项的值已更改

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
{'values_changed': ['root[2]: 2 ====>> 4']}

项目添加和/或删除

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes)
{'dic_item_added': ['root[5, 6]'],
'dic_item_removed': ['root[4]'],
'values_changed': ['root[2]: 2 ====>> 4']}

字符串的区别

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ 'values_changed': [ 'root[2]: 2 ====>> 4',
"root[4]['b']:\n--- \n+++ \n@@ -1 +1 @@\n-world\n+world!"]}
>>>
>>> print (ddiff.changes['values_changed'][1])
root[4]['b']:
---
+++
@@ -1 +1 @@
-world
+world!

字符串差2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ 'values_changed': [ "root[4]['b']:\n--- \n+++ \n@@ -1,5 +1,4 @@\n-world!\n-Goodbye!\n+world\n 1\n 2\n End"]}
>>>
>>> print (ddiff.changes['values_changed'][0])
root[4]['b']:
---
+++
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
1
2
End

类型变化

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ 'type_changes': [ "root[4]['b']: [1, 2, 3]=<type 'list'> vs. world\n\n\nEnd=<type 'str'>"]}

列表的区别

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ 'list_removed': ["root[4]['b']: [3]"]}

区别2:注意它不考虑顺序

>>> # Note that it DOES NOT take order into account
... t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ }

包含字典的列表:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
'values_changed': ["root[4]['b'][2][1]: 1 ====>> 3"]}
A = [1,2,3,4]
B = [2,5]


#A - B
x = list(set(A) - set(B))
#B - A
y = list(set(B) - set(A))


print x
print y

当查看in -operator的TimeComplexity时,在最坏的情况下,它与O(n)一起工作。即使是集合。

因此,当比较两个数组时,最好情况下的TimeComplexity为O(n),最坏情况下为O(n²)。

另一种(但不幸的是更复杂)解决方案,在最好和最坏的情况下都适用于O(n):

# Compares the difference of list a and b
# uses a callback function to compare items
def diff(a, b, callback):
a_missing_in_b = []
ai = 0
bi = 0


a = sorted(a, callback)
b = sorted(b, callback)


while (ai < len(a)) and (bi < len(b)):


cmp = callback(a[ai], b[bi])
if cmp < 0:
a_missing_in_b.append(a[ai])
ai += 1
elif cmp > 0:
# Item b is missing in a
bi += 1
else:
# a and b intersecting on this item
ai += 1
bi += 1


# if a and b are not of same length, we need to add the remaining items
for ai in xrange(ai, len(a)):
a_missing_in_b.append(a[ai])




return a_missing_in_b

如。

>>> a=[1,2,3]
>>> b=[2,4,6]
>>> diff(a, b, cmp)
[1, 3]

字典列表的情况下,当set解决方案引发时,完整的列表推导解决方案工作

TypeError: unhashable type: 'dict'

测试用例

def diff(a, b):
return [aa for aa in a if aa not in b]


d1 = {"a":1, "b":1}
d2 = {"a":2, "b":2}
d3 = {"a":3, "b":3}


>>> diff([d1, d2, d3], [d2, d3])
[{'a': 1, 'b': 1}]
>>> diff([d1, d2, d3], [d1])
[{'a': 2, 'b': 2}, {'a': 3, 'b': 3}]

最简单的方法,

使用设置().difference(设置())

list_a = [1,2,3]
list_b = [2,3]
print set(list_a).difference(set(list_b))

答案是set([1])

简单的代码,让你与多个项目的差异,如果你想要:

a=[1,2,3,3,4]
b=[2,4]
tmp = copy.deepcopy(a)
for k in b:
if k in tmp:
tmp.remove(k)
print(tmp)

添加一个答案来处理我们想要严格区别重复的情况,也就是说,在第一个列表中有我们想要保留在结果中的重复。例如,得到,

[1, 1, 1, 2] - [1, 1] --> [1, 2]

我们可以用一个额外的计数器来得到一个优雅的差分函数。

from collections import Counter


def diff(first, second):
secondCntr = Counter(second)
second = set(second)
res = []
for i in first:
if i not in second:
res.append(i)
elif i in secondCntr:
if secondCntr[i] > 0:
secondCntr[i] -= 1
else:
res.append(i)
return res

在这个线程中,我没有看到保留a中的重复的解决方案。当a中的一个元素与B中的一个元素匹配时,这个元素必须在B中删除,这样当相同的元素在a中再次出现时,如果这个元素在B中只出现一次,那么它必须出现在差异中。

def diff(first, second):
l2 = list(second)
l3 = []
for el in first:
if el in l2:
l2.remove(el)
else:
l3 += [el]
return l3


l1 = [1, 2, 1, 3, 4]
l2 = [1, 2, 3, 3]
diff(l1, l2)
>>> [1, 4]

有三种选择,其中两种是可以接受的,另一种不应该这样做。

在较高的级别上,这3个选项是:

  1. 减去两个集合最好(有时)
  2. 检查每个列表项是否存在于(大多数时候最好)集合中
  3. 检查每个列表项是否存在于列表(不做)

选项3)永远不应该超过选项2)。根据应用程序的需要,您可能更喜欢选项1)或2),而在大多数用例中,2)可能是首选方法。2)与1)的性能非常相似,因为两者都具有O(m + n)时间复杂度。相比之下,2)在空间复杂度上比1)有边际优势,并且既保持了原始列表的顺序,又保持了原始列表中的任何重复。

如果你想删除重复,不关心顺序,那么1)可能是最适合你的。

import time


def fun1(l1, l2):
# Order and duplications in l1 are both lost, O(m) + O(n)
return set(l1) - set(l2)


def fun2(l1, l2):
# Order and duplications in l1 are both preserved, O(m) + O(n)
l2_set = set(l2)
return [item for item in l1 if item not in l2_set]


def fun3(l1, l2):
# Order and duplications in l1 are both preserved, O(m * n)
# Don't do
return [item for item in l1 if item not in l2]


A = list(range(7500))
B = list(range(5000, 10000))


loops = 100


start = time.time()
for _ in range(loops):
fun1(A, B)
print(f"fun1 time: {time.time() - start}")


start = time.time()
for _ in range(loops):
fun2(A, B)
print(f"fun2 time: {time.time() - start}")


start = time.time()
for _ in range(loops):
fun3(A, B)
print(f"fun3 time: {time.time() - start}")
fun1 time: 0.03749704360961914
fun2 time: 0.04109621047973633
fun3 time: 32.55076885223389

如果你的顺序不重要这两个集合都可以被散列,你可以在集合中使用对称差分

这将返回集合A或集合B中出现的值,但不会同时出现。

例如,这个问题显示了在列表 A和列表 B上执行的差值的返回值。

如果我们要(将两个列表转换为集合并)执行对称差分,我们将在一次操作中得到两者的合并结果。

A = [1,2,3,4]
B = [2,5]
print(set(A) ^ set(B)


# {1, 3, 4, 5}

添加这个答案,因为我还没有看到现有答案中提供的对称差异