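How to decode a URL-encoded string in shell?

I have a file containing a list of user agents that are URL-encoded. For example:

Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

I want a shell script that reads this file and writes a new file with the decoded strings:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I have been trying to get started with this example, but so far it has not worked:

$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\\x/g')"

My script looks like this: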

#!/bin/bash
for f in *.log; do
echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done

As @barti_ddu said in the comments, \x "should be [double-]escaped".

% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
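
For the question's loop itself, a minimal sketch of one way to apply this fix per file follows. The helper names decode_stream and decode_logs and the .decoded suffix are made up for illustration, and printf '%b' is assumed to expand \xHH, which bash's builtin does. Unlike the original loop, each log gets its own output file, so nothing is overwritten by > y.log on every iteration.

```bash
#!/usr/bin/env bash

# decode_stream (hypothetical name): URL-decode stdin, one entry per line.
decode_stream() {
    # y/+/ / turns '+' into a space; s/%/\\x/g turns '%' into a literal \x
    # (the backslash is doubled so that sed emits a single one).
    # printf '%b' then expands every \xHH escape.
    printf '%b\n' "$(sed 'y/+/ /; s/%/\\x/g')"
}

# decode_logs (hypothetical name): write DIR/name.decoded for each DIR/name.log.
decode_logs() {
    local dir=$1 f
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue      # no match: the glob stays literal, skip it
        decode_stream < "$f" > "${f%.log}.decoded"
    done
}

printf '%s\n' '%31+%32%0A%33+%34' | decode_stream
```

Calling decode_logs . would then process every .log file in the current directory.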

Rather than mixing up Bash and sed, I would do this all in Python. Here's a rough cut of how:

#!/usr/bin/env python3
# Python 3: unquote lives in urllib.parse (it was urllib.unquote in Python 2).

import glob
import os
from urllib.parse import unquote

for logfile in glob.glob(os.path.join('.', '*.log')):
    new_log_filename = logfile + '.new'
    with open(logfile) as current, open(new_log_filename, 'w') as new_log_file:
        for url in current:
            # Decode one URL-encoded entry per line.
            new_log_file.write(unquote(url.strip()) + '\n')

This is what seems to be working for me.

#!/bin/bash
urldecode() {
    echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}

for f in /opt/logs/*.log; do
    name=${f##/*/}
    urldecode < "$f" > "/opt/logs/processed/$HOSTNAME.$name"
done

Simply replacing each '+' with a space and each % sign with '\x', then letting echo -e interpret the \x escapes, did not work for me. For some reason, cat was printing each % sign in its own encoded form, %25, so sed was turning %25 into \x25. echo -e then evaluated \x25 back to %, and the output was the same as the original.

Trace:

Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en

echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

Fix: Basically ignore the 2 characters after the % in sed.

sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en

echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I have not tested this extensively, so I am not sure what complications it might cause, but it works for now.
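
To illustrate why matching exactly two hex digits matters, here is a sketch (urldecode_strict is a hypothetical name; sed -E and bash's printf '%b' are assumed). It only rewrites % when two hex digits follow, and it doubles any backslash already present in the input so that printf does not treat text such as \n as an escape.

```bash
#!/usr/bin/env bash

urldecode_strict() {
    # 1. '+' becomes a space.
    # 2. Existing backslashes are doubled so printf '%b' prints them as-is.
    # 3. Only %HH (two hex digits) becomes \xHH; a stray '%' is left alone.
    printf '%b\n' "$(sed -E 's/\+/ /g; s/\\/\\\\/g; s/%([0-9A-Fa-f]{2})/\\x\1/g')"
}

printf '%s\n' 'rate%3A+100%+a\nb%41' | urldecode_strict
```

With this input, the lone % after 100 and the literal \n should pass through unchanged, while %3A and %41 are decoded.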

Bash script for doing it in native Bash (original source):

LANG=C

urlencode() {
    local l=${#1} i c
    for (( i = 0; i < l; i++ )); do
        c=${1:i:1}
        case "$c" in
            [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;
            ' ') printf + ;;
            *) printf '%%%.2X' "'$c" ;;
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    # The doubled backslash yields a literal \x for printf '%b' to expand.
    printf '%b' "${data//%/\\x}"
}

If you want to urldecode file content, just put the file content as an argument.
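
For example, a file's content can be passed with $(< file). This is only a sketch: the temporary file stands in for a real input file, the function repeats the urldecode above with the backslash written as \\x, and command substitution drops trailing newlines from the file before decoding.

```bash
#!/usr/bin/env bash

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\\x}"
}

tmp=$(mktemp)                               # stand-in for a real input file
printf '%s\n' 'a%20b' 'c%2Fd+e' > "$tmp"
decoded=$(urldecode "$(< "$tmp")")          # whole file passed as one argument
printf '%s\n' "$decoded"
rm -f "$tmp"
```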

Here's a test that halts if the decoded output differs from the original file content (if it keeps running for a few seconds, the script probably works correctly):

while true; do
    tr -d '\0' < /dev/urandom | head -c1000 > /tmp/tmp
    A="$(cat /tmp/tmp; printf x)"   # the trailing x protects trailing newlines
    A=${A%x}
    A=$(urlencode "$A")
    urldecode "$A" > /tmp/tmp2
    cmp /tmp/tmp /tmp/tmp2 || break
done

If you have PHP installed on your server, you can very easily "cat" or even "tail" any file containing URL-encoded strings.

tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'

If you are a Python developer, this may be preferable:

For Python 3.x (default):

echo -n "%21%20" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"

For Python 2.x (deprecated):

echo -n "%21%20" | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());"

urllib is really good at handling URL parsing.

Here is a solution done in bash (with a call to sed) where input and output are bash variables. It decodes '+' as a space and handles '%20' as well as other %-encoded characters.

#!/bin/bash
# Here is text that contains both '+' for spaces and a %20
text="hello+space+1%202"
decoded=$(echo -e "$(echo "$text" | sed 's/+/ /g;s/%/\\x/g;')")
echo "decoded=$decoded"

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log

The -i option updates the files in place (some sed implementations have borrowed it from perl), with .back as the backup extension.

s/x/y/e substitutes x with the evaluation of the y perl code.

The perl code in this case uses pack to pack the hex number captured in $1 (first parentheses pair in the regexp) as the corresponding character.

An alternative to pack is to use chr(hex($1)):

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log

If available, you could also use uri_unescape() from URI::Escape:

perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log

With GNU awk:

LC_ALL=C gawk -vRS='%[[:xdigit:]]{2}' '
  RT {RT = sprintf("%c",strtonum("0x" substr(RT, 2)))}
  {gsub(/\+/," ");printf "%s", $0 RT}'

This takes URI-encoded input on stdin and prints the decoded output on stdout.

We set the record separator to a regexp that matches a %XX sequence. In GNU awk, the input that matched it is stored in the RT special variable. We extract the hex digits from there and append them to "0x" so that strtonum() turns them into a number, which is in turn passed to sprintf("%c"), which in the C locale converts it to the corresponding byte value.

Just wanted to share this other solution, pure bash:

encoded_string="Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en"
printf -v decoded_string "%b" "${encoded_string//\%/\\x}"
echo $decoded_string
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

A slightly modified version of the Python answer that accepts an input and output file in a one-liner (shown here in its Python 3 form).

python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));" < inputfile.txt > outputfile.txt

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(printf "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$
$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(echo -e "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

Here is a simple one-line solution.

$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

It may look like perl :) but it is pure bash. No awk, no sed ... no overhead. It uses the : builtin, special parameters, pattern substitution and the echo builtin's -e option to translate hex codes into characters. See bash's manpage for further details. You can use this function as a separate command:

$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash

or in variable assignments, like so:

$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash

With Bash, to read percent-encoded URLs from standard input and decode them:

while read; do echo -e ${REPLY//%/\\x}; done

Press CTRL-D to signal the end of file(EOF) and quit gracefully.

You can decode the contents of a file by redirecting the file to standard input:

while read; do echo -e ${REPLY//%/\\x}; done < file

You can decode input from a pipe as well, for example:

echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done

  • The read builtin reads standard input until it sees a line feed character. It sets a variable called REPLY to the line of text it just read.
  • ${REPLY//%/\\x} replaces all instances of '%' with '\x'.
  • echo -e interprets \xNN as the byte with hexadecimal value NN.
  • while repeats this loop until the read command fails, e.g. because EOF has been reached.

The above does not change '+' to ' '. To change '+' to ' ' also, like guest's answer:

while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done

  • : is a Bash builtin command. Here it just takes in a single argument and does nothing with it.
  • The double quotes make everything inside one single parameter.
  • _ is a special parameter that is equal to the last argument of the previous command, after argument expansion. This is the value of REPLY with all instances of '%' replaced with '\x'.
  • ${_//+/ } replaces all instances of '+' with ' '.

This uses only BASH and doesn't start any other process, similar to guest's answer.
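
Putting the pieces above together, a slightly more defensive version of the loop could look like the following sketch. decode_lines is a hypothetical name; IFS= read -r is used so that leading whitespace and backslashes in the input survive the read, and a quoted printf '%b' stands in for the unquoted echo -e so that runs of spaces are not collapsed by word splitting.

```bash
#!/usr/bin/env bash

decode_lines() {
    local line
    # '|| [[ -n $line ]]' also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        : "${line//%/\\x}"            # $_ is now the line with % replaced by \x
        printf '%b\n' "${_//+/ }"     # '+' -> ' ', then expand the \xHH escapes
    done
}

printf '%s\n' 'a%21b' 'c++d%2Be' | decode_lines
```

Because % is rewritten before + is, an encoded plus sign (%2B) is still decoded to a literal + rather than to a space.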

Expanding on https://stackoverflow.com/a/37840948/8142470 to work with HTML entities:

$ htmldecode() { : "${*//+/ }"; echo -e "${_//&#x/\\x}" | tr -d ';'; }
$ htmldecode "http&#x3A;&#x2F;&#x2F;google.com&#x2F;search&&#x3F;q&#x3D;urldecode&#x2B;bash"
http://google.com/search&?q=urldecode+bash

(argument must be quoted)

Updating Jay's answer for Python 3.5+:

echo "%31+%32%0A%33+%34" | python -c "import sys; from urllib.parse import unquote ; print(unquote(sys.stdin.read()))"

Still, brendan's bash solution with explanation seems more direct and elegant.

Building upon some of the other answers, but for the POSIX world, could use the following function:

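url_decode() {
    printf '%b\n' "$(sed -E -e 's/\+/ /g' -e 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}

It uses printf '%b\n' because echo -e is not portable, and splits the sed call into two expressions to make it easier to read. The -E option enables extended regular expressions, so the unescaped ( ) group and {2} work, and the pattern requires that what follows % is two hex digits. Note that POSIX does not require printf '%b' to understand \xHH (only octal \0ddd), so this still depends on the shell's printf; dash, for example, prints \x escapes literally.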
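
POSIX only guarantees octal \0ddd escapes for printf '%b', so a sketch that avoids \x altogether is given here. url_decode_portable is a hypothetical name; the approach (walking the string and converting each %HH to a three-digit octal escape) is an illustration rather than anything proposed in this thread, and it assumes the shell's arithmetic accepts 0x constants.

```sh
#!/bin/sh

# The body runs in a subshell ( ... ) so its variables do not leak.
url_decode_portable() (
    s=$1
    out=
    while [ -n "$s" ]; do
        case $s in
            %[0-9A-Fa-f][0-9A-Fa-f]*)
                hex=${s#?}                  # drop the leading '%'
                hex=${hex%"${hex#??}"}      # keep only the two hex digits
                # Emit \0ddd with exactly three octal digits.
                out=$out$(printf '\\0%03o' "$((0x$hex))")
                s=${s#???}
                ;;
            +*)
                out="$out "
                s=${s#?}
                ;;
            \\*)
                out="$out\\\\"              # keep a literal backslash literal
                s=${s#?}
                ;;
            *)
                out=$out${s%"${s#?}"}       # copy the first character
                s=${s#?}
                ;;
        esac
    done
    printf '%b\n' "$out"
)

url_decode_portable 'Mozilla%2F5.0+%28Macintosh%3B'
```

The fixed width of three octal digits matters: without it, a digit that follows an escape (as in %201) would be absorbed into the escape.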

With the zsh shell (instead of bash), the only shell whose variables can hold any byte value including NUL (encoded as %00):

set -o extendedglob +o multibyte
string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
decoded=${${string//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}

  • ${var//pattern/replacement}: ksh-style parameter expansion operator that expands to the value of $var with every string matching pattern replaced with replacement.
  • (#b): activates back references so that every part inside parentheses in the pattern can be accessed as the corresponding $match[n] in the replacement.
  • (#c2): equivalent of ERE {2}
  • ${(#)param-expansion}: parameter expansion where the # flag causes the result to be interpreted as an arithmetic expression and the corresponding byte value to be returned.
  • ${var:-value}: expands to value if $var is empty, here applied to no variable at all, so we can just specify an arbitrary string as the subject of a parameter expansion.

To make it a function that decodes the contents of a variable in-place:

uridecode_var() {
  emulate -L zsh
  set -o extendedglob +o multibyte
  eval $1='${${'$1'//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}'
}

$ string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
$ uridecode_var string
$ print -r -- $string
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

With sed:

#!/bin/bash
URL_DECODE="$(echo "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g')"
echo -e "$URL_DECODE"

  • s/%([0-9a-fA-F]{2})/\\x\1/g replaces the % of each %HH sequence with \x, turning the URL encoding into hexadecimal escapes
  • s/\+/ /g replaces + with a space ' ', in case + is used in the query string

Just save it to decodeurl.sh and make it executable with chmod +x decodeurl.sh

If you need a way to encode too, this complete code will help:

#!/bin/bash
#
# URL encoding and decoding with sed
#
# By Daniel Cambría
# daniel.cambria@bureau-it.com
#
# Jul/2021

function url_decode() {
    echo "$@" \
        | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g'
}

function url_encode() {
    # Per RFC 3986; % is encoded first so later output is not re-encoded
    echo "$@" \
        | sed \
            -e 's/%/%25/g' \
            -e 's/ /%20/g' \
            -e 's/:/%3A/g' \
            -e 's/,/%2C/g' \
            -e 's/[?]/%3F/g' \
            -e 's/#/%23/g' \
            -e 's/\[/%5B/g' \
            -e 's/\]/%5D/g' \
            -e 's/@/%40/g' \
            -e 's/!/%21/g' \
            -e 's/\$/%24/g' \
            -e 's/&/%26/g' \
            -e "s/'/%27/g" \
            -e 's/(/%28/g' \
            -e 's/)/%29/g' \
            -e 's/\*/%2A/g' \
            -e 's/[+]/%2B/g' \
            -e 's/;/%3B/g' \
            -e 's/=/%3D/g'
}

echo -e "URL decode: " $(url_decode "$1")
echo -e "URL encode: " $(url_encode "$1")

Python, for zshrc:

# Usage: decodeUrl %3A%2F%2F
function decodeUrl(){
    printf '%s' "$1" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"
}

# Usage: encodeUrl https://google.com/search?q=urldecode+bash
#          return: https%3A//google.com/search%3Fq%3Durldecode%2Bbash
function encodeUrl(){
    # printf '%s' avoids the trailing newline that echo would add (and quote would encode as %0A)
    printf '%s' "$1" | python3 -c "import sys; from urllib.parse import quote; print(quote(sys.stdin.read()));"
}

bash idiom for url-decoding

Here is a bash idiom for url-decoding a string held in variable x and assigning the result to variable y:

: "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"

Unlike the accepted answer, it preserves trailing newlines during assignment. (Try assigning the result of url-decoding v%0A%0A%0A to a variable.)

It is also fast: about 67 times faster than the accepted answer at assigning the result of url-decoding to a variable.

Caveat: It is not possible for a bash variable to contain a NUL. For example, any bash solution attempting to decode %00 and assign the result to a variable will not work.
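
The trailing-newline difference can be checked with a short sketch (the variable names via_subst and via_printf are only for illustration). Command substitution strips trailing newlines, whereas printf -v assigns the decoded bytes directly.

```bash
#!/usr/bin/env bash

x='v%0A%0A%0A'

# Decoding inside $(...): the three trailing newlines are stripped.
via_subst=$(: "${x//+/ }"; printf '%b' "${_//%/\\x}")

# Decoding with printf -v: the newlines are kept.
: "${x//+/ }"; printf -v via_printf '%b' "${_//%/\\x}"

printf 'command substitution: %d characters\n' "${#via_subst}"
printf 'printf -v:            %d characters\n' "${#via_printf}"
```

The first line should report 1 character (just the v) and the second 4 (the v plus three newlines).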

Benchmark details

function.sh

#!/bin/bash
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
x=%21%20
for (( i=0; i<5000; i++ )); do
    y=$(urldecode "$x")
done

idiom.sh

#!/bin/bash
x=%21%20
for (( i=0; i<5000; i++ )); do
    : "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"
done

$ hyperfine --warmup 5 ./function.sh ./idiom.sh
Benchmark #1: ./function.sh
  Time (mean ± σ):      2.844 s ±  0.036 s    [User: 1.728 s, System: 1.494 s]
  Range (min … max):    2.801 s …  2.907 s    10 runs

Benchmark #2: ./idiom.sh
  Time (mean ± σ):      42.4 ms ±   1.0 ms    [User: 40.7 ms, System: 1.1 ms]
  Range (min … max):    40.5 ms …  44.8 ms    64 runs

Summary
  './idiom.sh' ran
   67.06 ± 1.76 times faster than './function.sh'

If you really want a function ...

If you really want a function, say for readability reasons, I suggest the following:

# urldecode [-v var ] argument
#
#   Urldecode the argument and print the result.
#   It replaces '+' with SPACE and then percent decodes.
#   The output is consistent with https://meyerweb.com/eric/tools/dencoder/
#
# Options:
#   -v var    assign the output to shell variable VAR rather than
#             print it to standard output
#
urldecode() {
    local assign_to_var=
    local OPTIND opt
    while getopts ':v:' opt; do
        case $opt in
            v)
                local var=$OPTARG
                assign_to_var=Y
                ;;
            \?)
                echo "$FUNCNAME: error: -$OPTARG: invalid option" >&2
                return 1
                ;;
            :)
                echo "$FUNCNAME: error: -$OPTARG: this option requires an argument" >&2
                return 1
                ;;
            *)
                echo "$FUNCNAME: error: an unexpected execution path has occurred." >&2
                return 1
                ;;
        esac
    done
    shift "$((OPTIND - 1))"
    # Convert all '+' to ' '
    : "${1//+/ }"
    # We exploit that the $_ variable (last argument to the previous command
    # after expansion) contains the result of the parameter expansion
    if [[ $assign_to_var ]]; then
        printf -v "$var" %b "${_//%/\\x}"
    else
        printf %b "${_//%/\\x}"
    fi
}

Example 1: Printing the result to stdout

x='v%0A%0A%0A'
urldecode "$x" | od -An -tx1

Result:

 76 0a 0a 0a

Example 2: Assigning the result of decoding to a shell variable:

x='v%0A%0A%0A'
urldecode -v y "$x"
echo -n "$y" | od -An -tx1

(same result)

This function, while not as fast as the idiom above, is still 1300% faster than the accepted answer at doing assignments due to no subshell being involved. In addition, as shown in the example's output, it preserves trailing newlines due to no command substitution being involved.

Using gridsite-clients:

1. yum install gridsite-clients (or apt-get install gridsite-clients)
2. grep -a 'http' access.log | xargs urlencode -d

Just a quick hint for others who are searching for a busybox-compatible solution. In the busybox shell you can use

httpd -d "$ENCODED_URL"

Example use case for busybox:

Download a file with wget and save it under the original decoded filename:

wget --no-check-certificate "$ENCODED_URL" -O "$(basename "$(httpd -d "$ENCODED_URL")")"

If you prefer gawk, there is no need to force LC_ALL=C or gawk -b just to decode URL-encoded data.

Here is a proof of concept showing how gawk in Unicode mode can directly decode purely binary files, such as MP3 audio or MP4 video files, that were URL-encoded, and get back the exact same file, as confirmed by hashing.

It uses FS and OFS to handle the spaces that were encoded as +, similar to quote_plus in Python 3's urllib:

( fg && fg && fg ) 2>/dev/null;
gls8x "${f}"
echo
pvE0 < "${f}" | xxh128sum | lgp3
echo ; echo
pvE0 < "${f}" | urlencodeAWKchk \
\
| gawk -ne '
BEGIN {
RS="[%][[:xdigit:]]{2}";
FS="[+]"
_=(4^5)*54  # if this offset doesn-t
# work, try
#           8^7
#               instead
  

} (NF+="_"*(ORS = sprintf("%.*s", RT != "",
sprintf("%c",\
_+("0x"  \
substr( RT, 2 ))))))~""' |pvE9|xxh128sum|lgp3


1 -rwxrwxrwx 1 5555 staff 9290187 May 27  2021 genieaudio_16277926_.lossless.mp3*

in0: 8.86MiB 0:00:00 [3.56GiB/s] [3.56GiB/s] [=================>] 100%
5d43c221bf6c85abac80eea8dbb412a1  stdin

in0: 8.86MiB 0:00:00 [3.47GiB/s] [3.47GiB/s] [=================>] 100%
out9: 8.86MiB 0:00:05 [1.72MiB/s] [1.72MiB/s] [ <=>  ]
5d43c221bf6c85abac80eea8dbb412a1  stdin

1  -rw-r--r-- 1 5555 staff 215098877 Feb  8 17:30 vg3.mp4

in0:  205MiB 0:00:00 [2.66GiB/s] [2.66GiB/s] [=================>] 100%
2778670450b08cee694dcefc23cd4d93  stdin

in0:  205MiB 0:00:00 [3.31GiB/s] [3.31GiB/s] [=================>] 100%
out9:  205MiB 0:02:01 [1.69MiB/s] [1.69MiB/s] [ <=> ]
2778670450b08cee694dcefc23cd4d93  stdin

Minimalistic uridecode [-v varname] function:

Coming late to this SO question (11 years on), I see:

  • The first answer suggesting the use of printf -v varname %b... was offered by jamp, nearly 3 years after the question was asked.
  • The first answer offering a function for doing this was posted 10 years and 6 months after the question, by Robin A. Meade.

Here is my smaller function:

uridecode() {
    if [[ $1 == -v ]];then local -n _res="$2"; shift 2; else local _res; fi
    : "${*//+/ }"; printf -v _res %b "${_//%/\\x}"
    [[ ${_res@A} == _res=* ]] && echo "$_res"
}

Or less condensed:

uridecode() {
    if [[ $1 == -v ]];then           # If 1st argument is ``-v''
        local -n _res="$2"           # _res is a nameref to ``$2''
        shift 2                      # drop 1st two arguments
    else
        local _res                   # _res is a local variable
    fi
    : "${*//+/ }"                    # _ holds arguments having ``+'' replaced by spaces
    printf -v _res %b "${_//%/\\x}"  # store in _res rendered string
    [[ ${_res@A} == _res=* ]] &&     # print _res if local
        echo "$_res"
}

Usage:

uridecode Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en


uridecode -v myvar Hell%6f w%6Frld%21
echo $myvar
Hello world!

As I use $* instead of $1, and because URI doesn't hold special characters, there is no need to quote arguments.