Adding a follow-up to Bill's recommendation for using libcurl: great suggestion, and to be updated:
after 3 years, the curl_escape function is deprecated, so for future use it's better to use curl_easy_escape.
I ended up on this question when searching for an api to decode url in a win32 c++ app. Since the question doesn't quite specify platform assuming windows isn't a bad thing.
InternetCanonicalizeUrl is the API for windows programs. More info here
LPTSTR lpOutputBuffer = new TCHAR[1];
DWORD dwSize = 1;
BOOL fRes = ::InternetCanonicalizeUrl(strUrl, lpOutputBuffer, &dwSize, ICU_DECODE | ICU_NO_ENCODE);
DWORD dwError = ::GetLastError();
if (!fRes && dwError == ERROR_INSUFFICIENT_BUFFER)
{
delete lpOutputBuffer;
lpOutputBuffer = new TCHAR[dwSize];
fRes = ::InternetCanonicalizeUrl(strUrl, lpOutputBuffer, &dwSize, ICU_DECODE | ICU_NO_ENCODE);
if (fRes)
{
//lpOutputBuffer has decoded url
}
else
{
//failed to decode
}
if (lpOutputBuffer !=NULL)
{
delete [] lpOutputBuffer;
lpOutputBuffer = NULL;
}
}
else
{
//some other error OR the input string url is just 1 char and was successfully decoded
}
InternetCrackUrl (here) also seems to have flags to specify whether to decode url
I faced the encoding half of this problem the other day. Unhappy with the available options, and after taking a look at this C sample code, i decided to roll my own C++ url-encode function:
#include <cctype>
#include <iomanip>
#include <sstream>
#include <string>
using namespace std;
string url_encode(const string &value) {
ostringstream escaped;
escaped.fill('0');
escaped << hex;
for (string::const_iterator i = value.begin(), n = value.end(); i != n; ++i) {
string::value_type c = (*i);
// Keep alphanumeric and other accepted characters intact
if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
escaped << c;
continue;
}
// Any other characters are percent-encoded
escaped << uppercase;
escaped << '%' << setw(2) << int((unsigned char) c);
escaped << nouppercase;
}
return escaped.str();
}
The implementation of the decode function is left as an exercise to the reader. :P
#include <stddef.h>
#include <ctype.h>
/**
* decode a percent-encoded C string with optional path normalization
*
* The buffer pointed to by @dst must be at least strlen(@src) bytes.
* Decoding stops at the first character from @src that decodes to null.
* Path normalization will remove redundant slashes and slash+dot sequences,
* as well as removing path components when slash+dot+dot is found. It will
* keep the root slash (if one was present) and will stop normalization
* at the first questionmark found (so query parameters won't be normalized).
*
* @param dst destination buffer
* @param src source buffer
* @param normalize perform path normalization if nonzero
* @return number of valid characters in @dst
* @author Johan Lindh <johan@linkdata.se>
* @legalese BSD licensed (http://opensource.org/licenses/BSD-2-Clause)
*/
ptrdiff_t urldecode(char* dst, const char* src, int normalize)
{
char* org_dst = dst;
int slash_dot_dot = 0;
char ch, a, b;
do {
ch = *src++;
if (ch == '%' && isxdigit(a = src[0]) && isxdigit(b = src[1])) {
if (a < 'A') a -= '0';
else if(a < 'a') a -= 'A' - 10;
else a -= 'a' - 10;
if (b < 'A') b -= '0';
else if(b < 'a') b -= 'A' - 10;
else b -= 'a' - 10;
ch = 16 * a + b;
src += 2;
}
if (normalize) {
switch (ch) {
case '/':
if (slash_dot_dot < 3) {
/* compress consecutive slashes and remove slash-dot */
dst -= slash_dot_dot;
slash_dot_dot = 1;
break;
}
/* fall-through */
case '?':
/* at start of query, stop normalizing */
if (ch == '?')
normalize = 0;
/* fall-through */
case '\0':
if (slash_dot_dot > 1) {
/* remove trailing slash-dot-(dot) */
dst -= slash_dot_dot;
/* remove parent directory if it was two dots */
if (slash_dot_dot == 3)
while (dst > org_dst && *--dst != '/')
/* empty body */;
slash_dot_dot = (ch == '/') ? 1 : 0;
/* keep the root slash if any */
if (!slash_dot_dot && dst == org_dst && *dst == '/')
++dst;
}
break;
case '.':
if (slash_dot_dot == 1 || slash_dot_dot == 2) {
++slash_dot_dot;
break;
}
/* fall-through */
default:
slash_dot_dot = 0;
}
}
*dst++ = ch;
} while(ch);
return (dst - org_dst) - 1;
}
[Necromancer mode on]
Stumbled upon this question when was looking for fast, modern, platform independent and elegant solution. Didnt like any of above, cpp-netlib would be the winner but it has horrific memory vulnerability in "decoded" function. So I came up with boost's spirit qi/karma solution.
Ordinarily adding '%' to the int value of a char will not work when encoding, the value is supposed to the the hex equivalent. e.g '/' is '%2F' not '%47'.
I think this is the best and concise solutions for both url encoding and decoding (No much header dependencies).
string urlEncode(string str){
string new_str = "";
char c;
int ic;
const char* chars = str.c_str();
char bufHex[10];
int len = strlen(chars);
for(int i=0;i<len;i++){
c = chars[i];
ic = c;
// uncomment this if you want to encode spaces with +
/*if (c==' ') new_str += '+';
else */if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') new_str += c;
else {
sprintf(bufHex,"%X",c);
if(ic < 16)
new_str += "%0";
else
new_str += "%";
new_str += bufHex;
}
}
return new_str;
}
string urlDecode(string str){
string ret;
char ch;
int i, ii, len = str.length();
for (i=0; i < len; i++){
if(str[i] != '%'){
if(str[i] == '+')
ret += ' ';
else
ret += str[i];
}else{
sscanf(str.substr(i + 1, 2).c_str(), "%x", &ii);
ch = static_cast<char>(ii);
ret += ch;
i = i + 2;
}
}
return ret;
}
I know the question asks for a C++ method, but for those who might need it, I came up with a very short function in plain C to encode a string. It doesn't create a new string, rather it alters the existing one, meaning that it must have enough size to hold the new string. Very easy to keep up.
void urlEncode(char *string)
{
char charToEncode;
int posToEncode;
while (((posToEncode=strspn(string,"1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"))!=0) &&(posToEncode<strlen(string)))
{
charToEncode=string[posToEncode];
memmove(string+posToEncode+3,string+posToEncode+1,strlen(string+posToEncode));
string[posToEncode]='%';
string[posToEncode+1]="0123456789ABCDEF"[charToEncode>>4];
string[posToEncode+2]="0123456789ABCDEF"[charToEncode&0xf];
string+=posToEncode+3;
}
}
I couldn't find a URI decode/unescape here that also decodes 2 and 3 byte sequences. Contributing my own version, that on-the-fly converts the c sting input to a wstring: