Truncate a long string in Python

How do I truncate a string to 75 characters in Python?

This is how it is done in JavaScript:

var data="saddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsaddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsadddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
var info = (data.length > 75) ? data.substring(0, 75) + '..' : data;

info = (data[:75] + '..') if len(data) > 75 else data

More concisely:

data = data[:75]

If the string is shorter than 75 characters, nothing changes.

Even shorter:

info = data[:75] + (data[75:] and '..')

>>> info = lambda data: len(data)>10 and data[:10]+'...' or data
>>> info('sdfsdfsdfsdfsdfsdfsdfsdfsdfsdfsdf')
'sdfsdfsdfs...'
>>> info('sdfsdf')
'sdfsdf'
>>>
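The slicing one-liners above can be wrapped in a small reusable helper. This is only a sketch: the name `truncate` and its `limit` and `suffix` parameters are made up for illustration. Like the one-liners, it appends the suffix after the cut, so a truncated result is `limit + len(suffix)` characters long.

```python
def truncate(text, limit=75, suffix=".."):
    """Return text cut to `limit` characters, adding `suffix` only if it was cut."""
    if len(text) <= limit:
        # Short strings are returned unchanged.
        return text
    # Slicing never raises, even when limit exceeds the string length.
    return text[:limit] + suffix


print(truncate("a" * 80))   # 75 a's followed by ..
print(truncate("short"))    # short
```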

With a regular expression:

re.sub(r'^(.{75}).*$', r'\g<1>...', data)

Long strings are truncated:

>>> import re
>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'111111111122222222223333333333444444444455555555556666666666777777777788888...'

Shorter strings are never truncated:

>>> data="11111111112222222222333333"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'11111111112222222222333333'

This way you can also "cut out" the middle part of the string, which is nicer in some cases:

re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)

>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)
'11111...88888'
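One caveat with the regex approach: by default `.` does not match a newline, so the pattern silently leaves multi-line strings untouched if a newline occurs within the first 75 characters. A small sketch of the difference, using made-up sample data and the standard `re.DOTALL` flag:

```python
import re

# Sample data (an assumption for illustration): a newline at position 50.
data = "a" * 50 + "\n" + "b" * 50

# Without DOTALL the pattern cannot match across the newline, so nothing is replaced.
plain = re.sub(r'^(.{75}).*$', r'\g<1>...', data)

# With DOTALL, "." also matches "\n" and the string is truncated as intended.
dotall = re.sub(r'^(.{75}).*$', r'\g<1>...', data, flags=re.DOTALL)

print(plain == data)                   # True: the string was not truncated
print(dotall == data[:75] + "...")     # True
```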
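You can't actually "truncate" a Python string the way you can a dynamically allocated C string. Strings in Python are immutable. What you can do is slice a string, as described in the other answers, which yields a new string containing only the characters defined by the slice offsets and step. In some (non-practical) cases this can be a little annoying, such as when you choose Python as your interview language and the interviewer asks you to remove duplicate characters from a string in place. Doh.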
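A short sketch of what immutability means in practice: slicing builds a new string and leaves the original alone, item assignment raises `TypeError`, and a task such as removing duplicate characters has to build a new string as well. The helper name `remove_duplicates` is hypothetical, and relying on `dict` insertion order assumes Python 3.7 or later.

```python
data = "hello world"

# Slicing returns a new string; the original is unchanged.
short = data[:5]

# In-place modification is not possible.
try:
    data[0] = "H"
except TypeError as exc:
    error = str(exc)


def remove_duplicates(text):
    # dict.fromkeys keeps the first occurrence of each character, in order.
    return "".join(dict.fromkeys(text))


print(short, "|", data)
print(error)
print(remove_duplicates("banana"))
```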

For a Django solution (which was not mentioned in the question):

from django.utils.text import Truncator
value = Truncator(value).chars(75)

Take a look at Truncator's source code to appreciate the problem: https://github.com/django/django/blob/master/django/utils/text.py#L66

On truncation with Django, see: Django HTML truncation

Here is another solution. With True and False as dictionary keys, you get a bit of feedback about the test at the end.

data = {True: data[:75] + '..', False: data}[len(data) > 75]

You don't need a regular expression, but you do want to use string formatting rather than the string concatenation in the accepted answer.

This is probably the most canonical, Pythonic way to truncate the string data to 75 characters.

>>> data = "11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> info = "{}..".format(data[:75]) if len(data) > 75 else data
>>> info
'111111111122222222223333333333444444444455555555556666666666777777777788888..'

This approach doesn't use any if:

data[:75] + bool(data[75:]) * '..'

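If you are using Python 3.4+, you can use textwrap.shorten from the standard library:

Collapse and truncate the given text to fit in the given width.

First the whitespace in text is collapsed (all whitespace is replaced by single spaces). If the result fits in the width, it is returned as-is. Otherwise, enough words are dropped from the end so that the remaining words plus the placeholder fit within width: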

>>> textwrap.shorten("Hello  world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello  world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'
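Because `textwrap.shorten` drops whole words, it behaves quite differently from slicing when the text contains no spaces. The following sketch, with made-up sample strings, contrasts the two cases:

```python
import textwrap

sentence = "The quick brown fox jumps over the lazy dog"
one_word = "x" * 100  # no whitespace at all

# Whole words are dropped from the end until the result fits.
by_words = textwrap.shorten(sentence, width=20, placeholder="...")

# A single over-long word cannot be kept, so only the placeholder remains.
no_spaces = textwrap.shorten(one_word, width=20, placeholder="...")

print(by_words)    # The quick brown...
print(no_spaces)   # ...
```

If you need a hard character limit regardless of word boundaries, plain slicing is the more predictable choice.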

This just in:

n = 8
s = '123'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '12345678'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '123456789'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '123456789012345'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])


123
12345678
12345...
12345...

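Here is a function I wrote as part of a new String class. It can append a suffix when the string is actually trimmed and the maximum length is longer than the suffix; enforcing the absolute maximum length is optional.

I was in the middle of changing a few things, so there is some useless logic cost left over (the later `_truncate` checks, for example) that is no longer needed because there is an early return at the top.

Still, it is a good function for truncating data.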

##
## Truncate characters of a string after the _max_len'th char, if necessary... If _max_len is less than 0, don't truncate anything...
## Note: If you attach a suffix, and you enable absolute max length then the suffix length is subtracted from max length...
## Note: If the suffix is at least as long as the max length then no suffix is used...
##
## Usage: Where _text = 'Testing', _width = 4
##      _data = String.Truncate( _text, _width )                        == Test
##      _data = String.Truncate( _text, _width, '..', True )            == Te..
##
## Equivalent Alternates: Where _text = 'Testing', _width = 4
##      _data = String.SubStr( _text, 0, _width )                       == Test
##      _data = _text[  : _width ]                                      == Test
##      _data = ( _text )[  : _width ]                                  == Test
##
def Truncate( _text, _max_len = -1, _suffix = False, _absolute_max_len = True ):
    ## Length of the string we are considering for truncation
    _len            = len( _text )

    ## Whether or not we have to truncate... A negative _max_len disables truncation
    _truncate       = ( False, True )[ 0 <= _max_len < _len ]

    ## Note: If we don't need to truncate, there's no point in proceeding...
    if ( not _truncate ):
        return _text

    ## The suffix in string form
    _suffix_str     = ( '',  str( _suffix ) )[ _truncate and _suffix != False ]

    ## The suffix length
    _len_suffix     = len( _suffix_str )

    ## Whether or not we add the suffix
    _add_suffix     = ( False, True )[ _truncate and _suffix != False and _max_len > _len_suffix ]

    ## Suffix Offset
    _suffix_offset  = _max_len - _len_suffix
    _suffix_offset  = ( _max_len, _suffix_offset )[ _add_suffix and _absolute_max_len != False and _suffix_offset > 0 ]

    ## The truncate point.... If we do need to truncate, then the length depends on whether we add the suffix and offset the length of the suffix or not...
    ## Note: It may be easier ( less logic cost ) to simply add the suffix to the calculated point, then truncate - if point is negative then the suffix will be destroyed anyway.
    _len_truncate   = ( _len, _max_len )[ _truncate ]
    _len_truncate   = ( _len_truncate, _max_len )[ _len_truncate <= _max_len ]

    ## Without an absolute max length the suffix sits past _max_len, so leave room for it
    if ( _add_suffix and _absolute_max_len == False ):
        _len_truncate += _len_suffix

    ## If we add the suffix, add it... Suffix won't be added if the suffix is the same length as the text being output...
    if ( _add_suffix ):
        _text = _text[ 0 : _suffix_offset ] + _suffix_str + _text[ _suffix_offset: ]

    ## Return the text after truncating...
    return _text[ : _len_truncate ]
limit = 75
info = data[:limit] + '..' * (len(data) > limit)

info = data[:75] + ('..' if len(data) > 75 else '')

info = data[:min(len(data), 75)]

A simple and short helper function:

def truncate_string(value, max_length=255, suffix='...'):
    string_value = str(value)
    # Return short values unchanged so no characters are dropped without a suffix
    if len(string_value) <= max_length:
        return string_value
    # Reserve room for the suffix so the result is at most max_length characters
    return string_value[:max(max_length - len(suffix), 0)] + suffix

Usage examples:

# Example 1 (default):
long_string = ""
for number in range(1, 1000):
    long_string += str(number) + ','

result = truncate_string(long_string)
print(result)

# Example 2 (custom length):
short_string = 'Hello world'
result = truncate_string(short_string, 8)
print(result) # > Hello...

# Example 3 (not truncated):
short_string = 'Hello world'
result = truncate_string(short_string)
print(result) # > Hello world


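Coming in late: I want to add my solution, which trims text at the character level and also handles whitespace properly.

def trim_string(s: str, limit: int, ellipsis='…') -> str:
    s = s.strip()
    if len(s) > limit:
        return s[:limit-1].strip() + ellipsis
    return s

Simple, but it makes sure that hello world with limit=6 results in hello… rather than an ugly hello … with a space before the ellipsis.

It also removes leading and trailing whitespace, but not whitespace inside the string. If you also want to remove the inner whitespace, check out this stackoverflow article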
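One common way to deal with inner whitespace is to collapse every run of it into a single space with `" ".join(s.split())` before trimming. The sketch below combines that with the same trimming idea; the name `collapse_and_trim` is hypothetical and not taken from the linked article.

```python
def collapse_and_trim(s: str, limit: int, ellipsis: str = "…") -> str:
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining collapses everything.
    s = " ".join(s.split())
    if len(s) > limit:
        # Keep limit-1 characters so the ellipsis fits, and avoid "word …".
        return s[:limit - 1].rstrip() + ellipsis
    return s


print(collapse_and_trim("  hello \t\n  world  ", 20))  # hello world
print(collapse_and_trim("hello     world", 7))         # hello…
```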

Here I use textwrap.shorten and handle a few more edge cases. It also includes part of the last word in case that word takes up more than 50% of the maximum width.

import textwrap


def shorten(text: str, width=30, placeholder="..."):
    """Collapse and truncate the given text to fit in the given width.

    The text first has its whitespace collapsed. If it then fits in the *width*, it is returned as is.
    Otherwise, as many words as possible are joined and then the placeholder is appended.
    """
    if not text or not isinstance(text, str):
        return str(text)
    t = text.strip()
    if len(t) <= width:
        return t

    # textwrap.shorten also throws ValueError if placeholder too large for max width
    shorten_words = textwrap.shorten(t, width=width, placeholder=placeholder)

    # textwrap.shorten doesn't split words, so if the text contains a long word without spaces, the result may be too short without this word.
    # Here we use a different way to include the start of this word in case shorten_words is less than 50% of `width`
    if len(shorten_words) - len(placeholder) < (width - len(placeholder)) * 0.5:
        return t[:width - len(placeholder)].strip() + placeholder
    return shorten_words

Tests:

>>> shorten("123 456", width=7, placeholder="...")
'123 456'
>>> shorten("1 23 45 678 9", width=12, placeholder="...")
'1 23 45...'
>>> shorten("1 23 45 678 9", width=10, placeholder="...")
'1 23 45...'
>>> shorten("01 23456789", width=10, placeholder="...")
'01 2345...'
>>> shorten("012 3 45678901234567", width=17, placeholder="...")
'012 3 45678901...'
>>> shorten("1 23 45 678 9", width=9, placeholder="...")
'1 23...'
>>> shorten("1 23456", width=5, placeholder="...")
'1...'
>>> shorten("123 456", width=5, placeholder="...")
'12...'
>>> shorten("123 456", width=6, placeholder="...")
'123...'
>>> shorten("12 3456789", width=9, placeholder="...")
'12 345...'
>>> shorten("   12 3456789    ", width=9, placeholder="...")
'12 345...'
>>> shorten('123 45', width=4, placeholder="...")
'1...'
>>> shorten('123 45', width=3, placeholder="...")
'...'
>>> shorten("123456", width=3, placeholder="...")
'...'
>>> shorten([1], width=9, placeholder="...")
'[1]'
>>> shorten(None, width=5, placeholder="...")
'None'
>>> shorten("", width=9, placeholder="...")
''

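Suppose that stryng is the string we wish to truncate and that nchars is the number of characters desired in the output string.

stryng = "sadddddddddddddddddddddddddddddddddddddddddddddddddd"
nchars = 10

We can truncate the string as follows:

def truncate(stryng:str, nchars:int):
    return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]

The results for some test cases are shown below: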

s = "sadddddddddddddddddddddddddddddd!"
s = "sa" + 30*"d" + "!"


truncate(s, 2)                ==  sa
truncate(s, 4)                ==  sadd
truncate(s, 10)               ==  sadd [...]
truncate(s, len(s)//2)        ==  sadddddddd [...]

My solution produces reasonable results for the test cases above.

However, some pathological cases are shown below:

truncate(s, len(s) - 3)         ==  sadddddddddddddddddddddd [...]
truncate(s, len(s) - 2)         ==  saddddddddddddddddddddddd [...]
truncate(s, len(s) - 1)         ==  sadddddddddddddddddddddddd [...]
truncate(s, len(s) + 0)         ==  saddddddddddddddddddddddddd [...]
truncate(s, len(s) + 1)         ==  sadddddddddddddddddddddddddd [...
truncate(s, len(s) + 2)         ==  saddddddddddddddddddddddddddd [..
truncate(s, len(s) + 3)         ==  sadddddddddddddddddddddddddddd [.
truncate(s, len(s) + 4)         ==  saddddddddddddddddddddddddddddd [
truncate(s, len(s) + 5)         ==  sadddddddddddddddddddddddddddddd
truncate(s, len(s) + 6)         ==  sadddddddddddddddddddddddddddddd!
truncate(s, len(s) + 7)         ==  sadddddddddddddddddddddddddddddd!
truncate(s, 9999)               ==  sadddddddddddddddddddddddddddddd!

Notably,

  • Problems can arise when the string contains newline characters (\n).
  • When nchars > len(s), we should print the string s as-is rather than trying to print the " [...]" placeholder.
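One possible way to avoid the pathological cases is to return the string unchanged whenever it already fits, and to add the placeholder only when there is room for it. This is a sketch rather than the answer's own fix; the `placeholder` parameter is an assumption added for illustration.

```python
def truncate(stryng: str, nchars: int, placeholder: str = " [...]") -> str:
    # The string already fits: return it untouched, never a partial placeholder.
    if len(stryng) <= nchars:
        return stryng
    # Too little room for the placeholder: fall back to a plain cut.
    if nchars <= len(placeholder):
        return stryng[:max(nchars, 0)]
    # Otherwise reserve space so the result is exactly nchars long.
    return stryng[:nchars - len(placeholder)] + placeholder


s = "sa" + 30 * "d" + "!"
print(truncate(s, 10))           # sadd [...]
print(truncate(s, len(s) + 1))   # the full string, unchanged
```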

Here is some more code:

import io


class truncate:
    r"""
    Example of code which uses truncate:

        s = "\r<class\n 'builtin_function_or_method'>"
        s = truncate(s, 10)()
        print(s)

    Examples of inputs and outputs:

        truncate(s, 2)()   ==  \r
        truncate(s, 4)()   ==  \r<c
        truncate(s, 10)()  ==  \r<c [...]
        truncate(s, 20)()  ==  \r<class\n 'bu [...]
        truncate(s, 999)() ==  \r<class\n 'builtin_function_or_method'>

    Other notes:
        Returns a modified copy of the string input
        Does not modify the original string
    """
    def __init__(self, x_stryng: str, x_nchars: int) -> None:
        """
        This initializer mostly exists to sanitize function inputs
        """
        try:
            # repr() escapes control characters such as \r and \n
            stryng = repr("".join(str(ch) for ch in x_stryng))[1:-1]
            nchars = int(str(x_nchars))
        except Exception:
            invalid_stryng = str(x_stryng)
            invalid_stryng_truncated = repr(type(self)(invalid_stryng, 20)())

            invalid_x_nchars = str(x_nchars)
            invalid_x_nchars_truncated = repr(type(self)(invalid_x_nchars, 20)())

            strm = io.StringIO()
            print("Invalid Function Inputs", file=strm)
            print(type(self).__name__, "(",
                  invalid_stryng_truncated,
                  ", ",
                  invalid_x_nchars_truncated, ")", sep="", file=strm)
            msg = strm.getvalue()

            raise ValueError(msg) from None

        self._stryng = stryng
        self._nchars = nchars

    def __call__(self) -> str:
        stryng = self._stryng
        nchars = self._nchars
        return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]

Here is a simple function that truncates a given string from either side:

def truncate(string, length=75, beginning=True, insert='..'):
    '''Shorten the given string to the given length.
    An ellipsis will be added to the section trimmed.

    :Parameters:
        length (int) = The maximum allowed length before truncating.
        beginning (bool) = Trim starting chars, else; ending.
        insert (str) = Chars to add at the trimmed area. (default: ellipsis)

    :Return:
        (str)

    ex. call: truncate('12345678', 4)
        returns: '..5678'
    '''
    if len(string) > length:
        if beginning:  # trim starting chars.
            string = insert + string[-length:]
        else:  # trim ending chars.
            string = string[:length] + insert
    return string

If you want to do some more sophisticated string truncation, you can adopt the sklearn approach, as implemented in:

sklearn.base.BaseEstimator.__repr__ (see the original full code: https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/base.py#L262)

It adds benefits such as avoiding truncation in the middle of a word.

import re


def truncate_string(data, N_CHAR_MAX=70):
    # N_CHAR_MAX is the (approximate) maximum number of non-blank
    # characters to render. We pass it as an optional parameter to ease
    # the tests.

    # Nothing to do if the string is short enough; this also guarantees
    # that the regex below always matches.
    if len("".join(data.split())) <= N_CHAR_MAX:
        return data

    lim = N_CHAR_MAX // 2  # apprx number of chars to keep on both ends
    regex = r"^(\s*\S){%d}" % lim
    # The regex '^(\s*\S){%d}' % n
    # matches from the start of the string until the nth non-blank
    # character:
    # - ^ matches the start of string
    # - (pattern){n} matches n repetitions of pattern
    # - \s*\S matches a non-blank char following zero or more blanks
    left_lim = re.match(regex, data).end()
    right_lim = re.match(regex, data[::-1]).end()
    if "\n" in data[left_lim:-right_lim]:
        # The left side and right side aren't on the same line.
        # To avoid weird cuts, e.g.:
        # categoric...ore',
        # we need to start the right side with an appropriate newline
        # character so that it renders properly as:
        # categoric...
        # handle_unknown='ignore',
        # so we add [^\n]*\n which matches until the next \n
        regex += r"[^\n]*\n"
        right_lim = re.match(regex, data[::-1]).end()
    ellipsis = "..."
    if left_lim + len(ellipsis) < len(data) - right_lim:
        # Only add ellipsis if it results in a shorter repr
        data = data[:left_lim] + "..." + data[-right_lim:]
    return data