我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
我有一个字符串列表,我想执行一个自然的字母排序。
例如,下面的列表是自然排序(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
下面是上面列表的“排序”版本(我使用sorted()得到的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
我在寻找一个排序函数它的行为和第一个一样。
当前回答
现在是一些更优雅的东西(pythonic)
-只要碰一下
有很多实现,虽然有些已经接近,但没有一个能完全捕获现代python所提供的优雅。
使用python测试(3.5.1) 包含了一个额外的列表,以演示当 数字在字符串中间 没有测试,但是,我假设如果您的列表是相当大的,那么事先编译正则表达式会更有效 如果这是一个错误的假设,我相信有人会纠正我
罢工
from re import compile, split
dre = compile(r'(\d+)')
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
完整代码
#!/usr/bin/python3
# coding=utf-8
"""
Natural-Sort Test
"""
from re import compile, split
dre = compile(r'(\d+)')
mylist = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13', 'elm']
mylist2 = ['e0lm', 'e1lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm', 'e01lm']
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
mylist2.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
print(mylist)
# ['elm', 'elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
print(mylist2)
# ['e0lm', 'e1lm', 'e01lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm']
使用时注意
从操作系统。路径导入分割 您需要区分导入
灵感来自
Python文档-如何排序 人类排序:自然排序顺序 人的分类 这篇文章的贡献者/评论员和引用帖子
其他回答
本职位的价值
我的观点是提供一个可以普遍应用的非正则表达式解决方案。 我将创建三个函数:
find_first_digit,这是我从@AnuragUniyal借来的。它将查找字符串中第一个数字或非数字的位置。 Split_digits是一个生成器,它将字符串分成数字块和非数字块。当它是数字时,它也会产生整数。 Natural_key只是将split_digits包装成一个元组。这是我们用来排序,最大,最小的键。
功能
def find_first_digit(s, non=False):
for i, x in enumerate(s):
if x.isdigit() ^ non:
return i
return -1
def split_digits(s, case=False):
non = True
while s:
i = find_first_digit(s, non)
if i == 0:
non = not non
elif i == -1:
yield int(s) if s.isdigit() else s if case else s.lower()
s = ''
else:
x, s = s[:i], s[i:]
yield int(x) if x.isdigit() else x if case else x.lower()
def natural_key(s, *args, **kwargs):
return tuple(split_digits(s, *args, **kwargs))
我们可以看到它是一般的,因为我们可以有多个数字块:
# Note that the key has lower case letters
natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh')
('asl;dkfdfkj:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')
或保留大小写敏感:
natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh', True)
('asl;dkfDFKJ:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')
我们可以看到它以适当的顺序对OP的列表进行排序
sorted(
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13'],
key=natural_key
)
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
但它也可以处理更复杂的列表:
sorted(
['f_1', 'e_1', 'a_2', 'g_0', 'd_0_12:2', 'd_0_1_:2'],
key=natural_key
)
['a_2', 'd_0_1_:2', 'd_0_12:2', 'e_1', 'f_1', 'g_0']
我的正则表达式等价于
def int_maybe(x):
return int(x) if str(x).isdigit() else x
def split_digits_re(s, case=False):
parts = re.findall('\d+|\D+', s)
if not case:
return map(int_maybe, (x.lower() for x in parts))
else:
return map(int_maybe, parts)
def natural_key_re(s, *args, **kwargs):
return tuple(split_digits_re(s, *args, **kwargs))
下面是马克·拜尔斯的另一个版本的回答。这个版本演示了如何传入一个属性名,该属性名将用于计算列表中的对象。
def natural_sort(l, attrib):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key.__dict__[attrib])]
return sorted(l, key=alphanum_key)
results = natural_sort(albums, 'albumid')
其中albums是一个Album实例列表,albumid是一个字符串属性,名义上包含数字。
一个紧凑的解决方案,基于将字符串转换为List[Tuple(str, int)]。
Code
def string_to_pairs(s, pairs=re.compile(r"(\D*)(\d*)").findall):
return [(text.lower(), int(digits or 0)) for (text, digits) in pairs(s)[:-1]]
示范
sorted(['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9'], key=string_to_pairs)
输出:
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
测试
转换
assert string_to_pairs("") == []
assert string_to_pairs("123") == [("", 123)]
assert string_to_pairs("abc") == [("abc", 0)]
assert string_to_pairs("123abc") == [("", 123), ("abc", 0)]
assert string_to_pairs("abc123") == [("abc", 123)]
assert string_to_pairs("123abc456") == [("", 123), ("abc", 456)]
assert string_to_pairs("abc123efg") == [("abc", 123), ("efg", 0)]
排序
# Some extracts from the test suite of the natsort library. Permalink:
# https://github.com/SethMMorton/natsort/blob/e3c32f5638bf3a0e9a23633495269bea0e75d379/tests/test_natsorted.py
sort_data = [
( # same as test_natsorted_can_sort_as_unsigned_ints_which_is_default()
["a50", "a51.", "a50.31", "a-50", "a50.4", "a5.034e1", "a50.300"],
["a5.034e1", "a50", "a50.4", "a50.31", "a50.300", "a51.", "a-50"],
),
( # same as test_natsorted_numbers_in_ascending_order()
["a2", "a5", "a9", "a1", "a4", "a10", "a6"],
["a1", "a2", "a4", "a5", "a6", "a9", "a10"],
),
( # same as test_natsorted_can_sort_as_version_numbers()
["1.9.9a", "1.11", "1.9.9b", "1.11.4", "1.10.1"],
["1.9.9a", "1.9.9b", "1.10.1", "1.11", "1.11.4"],
),
( # different from test_natsorted_handles_filesystem_paths()
[
"/p/Folder (10)/file.tar.gz",
"/p/Folder (1)/file (1).tar.gz",
"/p/Folder/file.x1.9.tar.gz",
"/p/Folder (1)/file.tar.gz",
"/p/Folder/file.x1.10.tar.gz",
],
[
"/p/Folder (1)/file (1).tar.gz",
"/p/Folder (1)/file.tar.gz",
"/p/Folder (10)/file.tar.gz",
"/p/Folder/file.x1.9.tar.gz",
"/p/Folder/file.x1.10.tar.gz",
],
),
( # same as test_natsorted_path_extensions_heuristic()
[
"Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
"Try.Me.Bug - 07 - One.Two.5.[text].mkv",
"Try.Me.Bug - 08 - One.Two.Three[text].mkv",
],
[
"Try.Me.Bug - 07 - One.Two.5.[text].mkv",
"Try.Me.Bug - 08 - One.Two.Three[text].mkv",
"Try.Me.Bug - 09 - One.Two.Three.[text].mkv",
],
),
( # same as ns.IGNORECASE for test_natsorted_supports_case_handling()
["Apple", "corn", "Corn", "Banana", "apple", "banana"],
["Apple", "apple", "Banana", "banana", "corn", "Corn"],
),
]
for (given, expected) in sort_data:
assert sorted(given, key=string_to_pairs) == expected
奖金
如果字符串混合了非ascii文本和数字,您可能会对将string_to_pairs()与我在其他地方给出的函数remove_diacritics()组合感兴趣。
一种选择是将字符串转换为元组,并使用展开形式http://wiki.answers.com/Q/What_does_expanded_form_mean替换数字
这样a90就会变成("a",90,0)而a1就会变成("a",1)
下面是一些示例代码(这不是很有效,因为它从数字中删除前导0的方式)
alist=["something1",
"something12",
"something17",
"something2",
"something25and_then_33",
"something25and_then_34",
"something29",
"beta1.1",
"beta2.3.0",
"beta2.33.1",
"a001",
"a2",
"z002",
"z1"]
def key(k):
nums=set(list("0123456789"))
chars=set(list(k))
chars=chars-nums
for i in range(len(k)):
for c in chars:
k=k.replace(c+"0",c)
l=list(k)
base=10
j=0
for i in range(len(l)-1,-1,-1):
try:
l[i]=int(l[i])*base**j
j+=1
except:
j=0
l=tuple(l)
print l
return l
print sorted(alist,key=key)
输出:
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']
上面的答案对于上面给出的具体例子是有用的,但对于更普遍的自然排序问题,却遗漏了几个有用的例子。我刚刚被其中一个案例咬了一口,所以想出了一个更彻底的解决方案:
def natural_sort_key(string_or_number):
"""
by Scott S. Lawton <scott@ProductArchitect.com> 2014-12-11; public domain and/or CC0 license
handles cases where simple 'int' approach fails, e.g.
['0.501', '0.55'] floating point with different number of significant digits
[0.01, 0.1, 1] already numeric so regex and other string functions won't work (and aren't required)
['elm1', 'Elm2'] ASCII vs. letters (not case sensitive)
"""
def try_float(astring):
try:
return float(astring)
except:
return astring
if isinstance(string_or_number, basestring):
string_or_number = string_or_number.lower()
if len(re.findall('[.]\d', string_or_number)) <= 1:
# assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
# '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
else:
# assume distinct fields, e.g. IP address, phone number with '.', etc.
# caveat: might want to first split by whitespace
# TBD: for unicode, replace isdigit with isdecimal
return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
else:
# consider: add code to recurse for lists/tuples and perhaps other iterables
return string_or_number
测试代码和几个链接(在StackOverflow上和关闭)在这里: http://productarchitect.com/code/better-natural-sort.py
欢迎您的反馈。这并不是一个明确的解决方案;只是向前迈出了一步。