How to remove ^[ and all escape sequences from a file using a Linux shell script

小开

Are you looking for ansifilter?

There are two things you can do. The first is to enter the literal escape character (in bash).

Using keyboard entry:

sed 's/Ctrl-vEsc//g'

Alternatively:

sed 's/Ctrl-vCtrl-[//g'

Or you can use character escapes:

sed 's/\x1b//g'

or for all control characters:

sed 's/[\x01-\x1F\x7F]//g' # NOTE: zaps TAB character too!
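
If TAB characters need to survive, one possible variation is to delete the control characters with tr and leave TAB (octal 011) and newline (octal 012) out of the ranges. This is only a sketch; the function name strip_ctrl_keep_tab is made up for illustration.

```shell
# Hypothetical helper: delete control characters from stdin, but keep
# TAB (octal 011) and newline (octal 012).
strip_ctrl_keep_tab() {
    LC_ALL=C tr -d '\001-\010\013-\037\177'
}

# Example: the ESC byte is dropped, the TAB stays.
printf 'a\tb\033c\n' | strip_ctrl_keep_tab
```

Note that, like the sed command above, this only drops the ESC byte itself; the printable rest of an escape sequence (for example "[0m") is left in place.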

小开

I stumbled upon this post while looking for a way to strip extra formatting from man pages. ansifilter did it, but the output was far from the desired result (for example, all previously bold characters were duplicated, like SSYYNNOOPPSSIISS).

For that task the correct command would be col -bx, for example:

groff -man -Tascii fopen.3 | col -bx > fopen.3.txt

(source)

Why this works: (in response to a comment by @AttRigh)

groff produces bold characters the way a typewriter would: print a letter, move one character back with backspace (you can't erase text on a typewriter), then print the same letter again to make the character more pronounced. Simply dropping the backspaces therefore produces "SSYYNNOOPPSSIISS". col -b fixes this by interpreting the backspaces correctly. To quote the manual:

-b Do not output any backspaces, printing only the last character written to each column position.
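
Where col is not installed, the same idea can be roughly approximated with sed by deleting every "character followed by backspace" pair. This is a sketch only: the helper name strip_overstrike is hypothetical, and it handles simple overstrike (bold, underline), not the full column tracking that col performs.

```shell
# Hypothetical helper: rough stand-in for "col -b" on simple overstrike.
# Deletes each "any character + backspace" pair, so only the last
# character written to a position survives.
strip_overstrike() {
    sed "s/.$(printf '\b')//g"
}

# Example: typewriter-style bold "SYN" as groff would emit it.
printf 'S\bSY\bYN\bN\n' | strip_overstrike
```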

小开

I managed with the following for my purposes, but this doesn't include all possible ANSI escapes:

sed -r 's/\x1b\[[0-9;]*m?//g'

This removes m commands, but for all escapes (as commented by @lethalman) use:

sed -r 's/\x1b\[[^@-~]*[@-~]//g'
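
The \x1b escape is a GNU sed extension. A sketch of the same pattern that avoids it by generating the ESC byte with printf is shown here; the function name strip_csi is made up for illustration, and LC_ALL=C is set on the assumption that the @-~ range should be interpreted as plain byte values.

```shell
# Hypothetical helper: remove CSI sequences (ESC [ parameters final-byte)
# from stdin. The ESC byte comes from printf, so no \x1b support is needed.
strip_csi() {
    LC_ALL=C sed "s/$(printf '\033')\[[^@-~]*[@-~]//g"
}

# Example: a colour sequence, a reset and an erase-to-end-of-line.
printf '\033[1;31mred\033[0m\033[K\n' | strip_csi
```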

Also see "https://stackoverflow.com/questions/7857352/python-regex-to-match-vt100-escape-sequences".

There is also a table of common escape sequences.

小开

Just a note; let's say you have a file like this (such line endings are generated by git remote reports):

echo -e "remote: * 27625a8 (HEAD, master) 1st git commit\x1b[K
remote: \x1b[K
remote: \x1b[K
remote: \x1b[K
remote: \x1b[K
remote: \x1b[K
remote: Current branch master is up to date.\x1b[K" > chartest.txt

As a hex dump, the file looks like this:

$ cat chartest.txt | hexdump -C
00000000  72 65 6d 6f 74 65 3a 20  2a 20 32 37 36 32 35 61  |remote: * 27625a|
00000010  38 20 28 48 45 41 44 2c  20 6d 61 73 74 65 72 29  |8 (HEAD, master)|
00000020  20 31 73 74 20 67 69 74  20 63 6f 6d 6d 69 74 1b  | 1st git commit.|
00000030  5b 4b 0a 72 65 6d 6f 74  65 3a 20 1b 5b 4b 0a 72  |[K.remote: .[K.r|
00000040  65 6d 6f 74 65 3a 20 1b  5b 4b 0a 72 65 6d 6f 74  |emote: .[K.remot|
00000050  65 3a 20 1b 5b 4b 0a 72  65 6d 6f 74 65 3a 20 1b  |e: .[K.remote: .|
00000060  5b 4b 0a 72 65 6d 6f 74  65 3a 20 1b 5b 4b 0a 72  |[K.remote: .[K.r|
00000070  65 6d 6f 74 65 3a 20 43  75 72 72 65 6e 74 20 62  |emote: Current b|
00000080  72 61 6e 63 68 20 6d 61  73 74 65 72 20 69 73 20  |ranch master is |
00000090  75 70 20 74 6f 20 64 61  74 65 2e 1b 5b 4b 0a     |up to date..[K.|
0000009f

The dump shows that git adds the sequence 0x1b 0x5b 0x4b before each line ending (0x0a).

Note that while you can match 0x1b with the literal notation \x1b in sed, you CANNOT do the same for 0x5b, which represents the left square bracket [:

$ cat chartest.txt | sed 's/\x1b\x5b//g' | hexdump -C
sed: -e expression #1, char 13: Invalid regular expression

You might think you can escape the representation with an extra backslash, which gives \\x5b. That "passes", but it doesn't match anything as intended:

$ cat chartest.txt | sed 's/\x1b\\x5b//g' | hexdump -C
00000000  72 65 6d 6f 74 65 3a 20  2a 20 32 37 36 32 35 61  |remote: * 27625a|
00000010  38 20 28 48 45 41 44 2c  20 6d 61 73 74 65 72 29  |8 (HEAD, master)|
00000020  20 31 73 74 20 67 69 74  20 63 6f 6d 6d 69 74 1b  | 1st git commit.|
00000030  5b 4b 0a 72 65 6d 6f 74  65 3a 20 1b 5b 4b 0a 72  |[K.remote: .[K.r|
00000040  65 6d 6f 74 65 3a 20 1b  5b 4b 0a 72 65 6d 6f 74  |emote: .[K.remot|
...

So if you want to match this character, apparently you must write it as an escaped left square bracket, that is \[. The rest of the values can then be entered with the escaped \x notation:

$ cat chartest.txt | sed 's/\x1b\[\x4b//g' | hexdump -C
00000000  72 65 6d 6f 74 65 3a 20  2a 20 32 37 36 32 35 61  |remote: * 27625a|
00000010  38 20 28 48 45 41 44 2c  20 6d 61 73 74 65 72 29  |8 (HEAD, master)|
00000020  20 31 73 74 20 67 69 74  20 63 6f 6d 6d 69 74 0a  | 1st git commit.|
00000030  72 65 6d 6f 74 65 3a 20  0a 72 65 6d 6f 74 65 3a  |remote: .remote:|
00000040  20 0a 72 65 6d 6f 74 65  3a 20 0a 72 65 6d 6f 74  | .remote: .remot|
00000050  65 3a 20 0a 72 65 6d 6f  74 65 3a 20 0a 72 65 6d  |e: .remote: .rem|
00000060  6f 74 65 3a 20 43 75 72  72 65 6e 74 20 62 72 61  |ote: Current bra|
00000070  6e 63 68 20 6d 61 73 74  65 72 20 69 73 20 75 70  |nch master is up|
00000080  20 74 6f 20 64 61 74 65  2e 0a                    | to date..|
0000008a
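
To check a result without reading a hex dump by eye, a small test for leftover ESC bytes can help. This is a sketch; has_escape is a hypothetical helper name, not an existing command.

```shell
# Hypothetical helper: exit status 0 if stdin still contains an ESC byte (0x1b).
has_escape() {
    LC_ALL=C grep -q "$(printf '\033')"
}

# Example: report whether the stripped output is clean.
if printf 'remote: \033[K\n' | sed 's/\x1b\[\x4b//g' | has_escape; then
    echo "escape bytes remain"
else
    echo "clean"
fi
```

The sed call in the example is the one from this answer and therefore assumes GNU sed.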

小开

The ansi2txt command (part of the kbtin package) seems to do the job perfectly on Ubuntu.

小开

commandlinefu gives the correct answer which strips ANSI colours as well as movement commands:

sed "s,\x1B\[[0-9;]*[a-zA-Z],,g"

小开

I built vtclean for this. It strips escape sequences using these regular expressions in order (explained in regex.txt):

// handles long-form RGB codes
^\033](\d+);([^\033]+)\033\\

// excludes non-movement/color codes
^\033(\[[^a-zA-Z0-9@\?]+|[\(\)]).

// parses movement and color codes
^\033([\[\]]([\d\?]+)?(;[\d\?]+)*)?(.)

It additionally does basic line-edit emulation, so backspace and other movement characters (like left arrow key) are parsed.
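
For comparison with the sed answers in this thread, the first of those patterns (ESC ], digits, a semicolon, a payload, then ESC \) could be approximated in sed roughly as follows. This is a loose translation for illustration, not vtclean's own code; strip_osc is a hypothetical name, and unlike the anchored original it removes matches anywhere in the line.

```shell
# Hypothetical helper: remove "ESC ] digits ; payload ESC \" sequences from stdin.
strip_osc() {
    esc=$(printf '\033')
    # \\\\ inside double quotes reaches sed as \\, i.e. one literal backslash.
    LC_ALL=C sed "s/${esc}][0-9][0-9]*;[^${esc}][^${esc}]*${esc}\\\\//g"
}

# Example: a window-title sequence followed by ordinary text.
printf '\033]0;title\033\\text\n' | strip_osc
```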

小开

I don't have enough reputation to comment on the answer given by Luke H, but I wanted to share the regular expression that I've been using to eliminate all of the ANSI escape sequences.

sed -r 's~\x01?(\x1B\(B)?\x1B\[([0-9;]*)?[JKmsu]\x02?~~g'

小开

You can remove all non printable characters with this:

sed 's/[^[:print:]]//g'
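
One caveat worth seeing in practice: this removes the ESC byte, but the rest of each sequence consists of printable characters and stays behind (TAB is removed as well). A sketch that strips the sequences first and the remaining non-printables second is shown here; the helper name strip_seq_then_nonprint is made up, and the sequence pattern is the commandlinefu one quoted earlier in this thread.

```shell
# ESC is removed, but the printable remainder of each sequence is left over.
printf '\033[31mred\033[0m\n' | LC_ALL=C sed 's/[^[:print:]]//g'
# prints: [31mred[0m

# Hypothetical helper: strip CSI sequences first, then other non-printables.
strip_seq_then_nonprint() {
    LC_ALL=C sed -e "s/$(printf '\033')\[[0-9;]*[a-zA-Z]//g" \
                 -e 's/[^[:print:]]//g'
}

printf '\033[31mred\033[0m\n' | strip_seq_then_nonprint
```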

小开

Tom Hale's answer left unwanted codes, but was a good base to work from. Adding additional filtering cleared out leftover, unwanted codes:

sed -e "s,^[[[(][0-9;?]*[a-zA-Z],,g" \
-e "s/^[[[][0-9][0-9]*[@]//" \
-e "s/^[[=0-9]<[^>]*>//" \
-e "s/^[[)][0-9]//" \
-e "s/.^H//g" \
-e "s/^M//g" \
-e "s/^^H//" \
file.dirty > file.clean

As this was done on a non-GNU version of sed, where you see ^[, ^H, and ^M, I used Ctrl-V <Esc>, Ctrl-V Ctrl-H, and Ctrl-V Ctrl-M respectively. The ^> is literally a caret (^) and a greater-than character, not a control character.

TERM=xterm was in use at the time.

小开

A bash snippet I've been using for stripping out (at least some) ANSI colors:

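shopt -s extglob   # needed for the *( ... ) pattern below
while IFS='' read -r line; do
    # printf instead of echo, so lines such as "-n" are printed unchanged
    printf '%s\n' "${line//$'\x1b'\[*([0-9;])[Km]/}"
done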

小开

A sed-based approach that does not need the extended regular expressions enabled by -r:

sed 's/\x1B\[[0-9;]*[JKmsu]//g'

小开

My answer to "What are these weird ha:// URLs jenkins fills our logs with?" removes all ANSI escape sequences from Jenkins console log files effectively (it also deals with Jenkins-specific URLs, which wouldn't be relevant here).

I acknowledge and appreciate the contributions of Marius Gedminas and pyjama from this thread in formulating the ultimate solution.

小开

This simple awk solution worked for me:

str="happy $(tput setaf 1)new$(tput sgr0) year!" # colored text
echo "$str" | awk '{gsub("(.\\[[0-9]+m|.\\(..\\[m)","",$0)}1' # remove ANSI colors