如何创建URL缩短器？

小开

为什么要使用哈希？

您可以简单地将自动增量值转换为字母数字值。您可以通过使用一些基数转换轻松做到这一点。假设您的字符空间（A-Z、a-z、0-9等）有62个字符，将id转换为基数40的数字并使用这些字符作为数字。

小开

为什么不直接将id转换为字符串呢？你只需要一个函数将0和61之间的数字映射到单个字母（大写/小写）或数字。然后应用它来创建，比如，4个字母的代码，你已经涵盖了1470万URL。

小开

最佳答案

我会继续你的“将数字转换为字符串”的方法。然而，你会意识到，如果你的ID是质数且大于52，你提出的算法会失败。

理论背景

您需要一个双射函数f。这是必要的，以便您可以为您的f（123）='abc'函数找到一个逆函数g（'abc'）=123。这意味着：

必须没有x1， x2（x1≤x2）会使f（x1）=f（x2），
对于每个y，您必须能够找到一个x，以便f（x）=y。

如何将ID转换为缩短的URL

想想我们想要使用的字母表。在你的情况下，那是[a-zA-Z0-9]。它包含62封信。
取一个自动生成的唯一数字键（例如MySQL表的自动递增id）。

对于这个例子，我将使用125₁₀（125，基数为10）。
现在您必须将125₁₀转换为X₆₂（基数62）。

125₁₀=2×62¹+1×62=[2,1]

这需要使用整数除法和取模。伪代码示例：
```
digits = []


while num > 0
remainder = modulo(num, 62)
digits.push(remainder)
num = divide(num, 62)


digits = digits.reverse
```
现在将指数2和1映射到您的字母表。这是您的映射（例如数组）的样子：
```
0  → a
1  → b
...
25 → z
...
52 → 0
61 → 9
```
使用2→c和1→b，您将收到cb₆₂作为缩短的URL。
```
http://shor.ty/cb
```

How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

e9a₆₂ will be resolved to "4th, 61st, and 0th letter in the alphabet".

e9a₆₂ = [4,61,0] = 4×62² + 61×62¹ + 0×62⁰ = 19158₁₀
Now find your database-record with WHERE id = 19158 and do the redirect.

Example implementations (provided by commenters)

小开

如果你不想重新发明轮子…http://lilurl.sourceforge.net/

小开

不是你问题的答案，但我不会使用区分大小写的缩短URL。它们很难记住，通常无法阅读（许多字体呈现1和l，0和O以及其他非常相似的字符，它们几乎不可能区分）并且容易出错。尝试仅使用小写或大写。

另外，尝试一种将数字和字符以预定义形式混合的格式。有研究表明，人们往往比其他人更好地记住一种形式（想想电话号码，其中数字以特定形式分组）。尝试像num-char-char-num-char-char这样的东西。我知道这会降低组合，特别是如果你没有大小写，但它会更有用，因此更有用。

小开

我的方法：使用数据库ID，然后Base36对其进行编码。我不会同时使用大写和小写字母，因为这使得通过电话传输这些URL成为一场噩梦，但您当然可以轻松地将功能扩展为基本62 en/解码器。

小开

alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))


def lookup(k, a=alphabet):
if type(k) == int:
return a[k]
elif type(k) == str:
return a.index(k)




def encode(i, a=alphabet):
'''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
try:
i = int(i)
except Exception:
raise TypeError("Input must be an integer.")


def incode(i=i, p=1, a=a):
# Here to protect p.
if i <= 61:
return lookup(i)


else:
pval = pow(62,p)
nval = i/pval
remainder = i % pval
if nval <= 61:
return lookup(nval) + incode(i % pval)
else:
return incode(i, p+1)


return incode()






def decode(s, a=alphabet):
'''Takes a base 62 string in our alphabet and returns it in base10.'''
try:
s = str(s)
except Exception:
raise TypeError("Input must be a string.")


return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])a

这是我的版本，任何人都需要它。

小开

您可以散列整个URL，但如果您只想缩短id，请按照marcel的建议进行。我写了这个Python实现：

https://gist.github.com/778542

小开

这就是我使用的：

# Generate a [0-9a-zA-Z] string
ALPHABET = map(str,range(0, 10)) + map(chr, range(97, 123) + range(65, 91))


def encode_id(id_number, alphabet=ALPHABET):
"""Convert an integer to a string."""
if id_number == 0:
return alphabet[0]


alphabet_len = len(alphabet) # Cache


result = ''
while id_number > 0:
id_number, mod = divmod(id_number, alphabet_len)
result = alphabet[mod] + result


return result


def decode_id(id_string, alphabet=ALPHABET):
"""Convert a string to an integer."""
alphabet_len = len(alphabet) # Cache
return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])

它非常快，可能需要很长的整数。

小开

对于一个类似的项目，为了获得一个新键，我围绕随机串发生器创建了一个包装函数，它调用生成器，直到我得到一个还没有在我的哈希表中使用的字符串。一旦你的名称空间开始满，这个方法就会变慢，但正如你所说，即使只有6个字符，你也有很多名称空间可以使用。

小开

这是我的PHP 5类。

<?php
class Bijective
{
public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";


public function __construct()
{
$this->dictionary = str_split($this->dictionary);
}


public function encode($i)
{
if ($i == 0)
return $this->dictionary[0];


$result = '';
$base = count($this->dictionary);


while ($i > 0)
{
$result[] = $this->dictionary[($i % $base)];
$i = floor($i / $base);
}


$result = array_reverse($result);


return join("", $result);
}


public function decode($input)
{
$i = 0;
$base = count($this->dictionary);


$input = str_split($input);


foreach($input as $char)
{
$pos = array_search($char, $this->dictionary);


$i = $i * $base + $pos;
}


return $i;
}
}

小开

// simple approach


$original_id = 56789;


$shortened_id = base_convert($original_id, 10, 36);


$un_shortened_id = base_convert($shortened_id, 36, 10);

小开

这是一个不错的PHP URL编码函数…

// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
$str = '';
do {
$i = fmod($val, $base);
$str = $chars[$i] . $str;
$val = ($val - $i) / $base;
} while($val > 0);
return $str;
}

小开

不知道是否有人会发现这很有用-它更像是一个“hack n slash”方法，但如果你只想要特定的字符，它很简单，效果很好。

$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);


// Encode
$str_id = '';
$base = count($dictionary);


while($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $dictionary[$rem];
}




// Decode
$id_ar = str_split($str_id);
$id = 0;


for($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
}

小开

你是故意省略O，0和i的吗？

我刚刚根据Ryan的解决方案创建了一个PHP类。

<?php


$shorty = new App_Shorty();


echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));




/**
* A nice shorting class based on Ryan Charmley's suggestion see the link on Stack Overflow below.
* @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
* @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
*/
class App_Shorty {
/**
* Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
* dictating this over the phone might be tough.
* @var string
*/
private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
private $dictionary_array = array();


public function __construct() {
$this->dictionary_array = str_split($this->dictionary);
}


/**
* Gets ID and converts it into a string.
* @param int $id
*/
public function encode($id) {
$str_id = '';
$base = count($this->dictionary_array);


while ($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $this->dictionary_array[$rem];
}


return $str_id;
}


/**
* Converts /abc into an integer ID
* @param string
* @return int $id
*/
public function decode($str_id) {
$id = 0;
$id_ar = str_split($str_id);
$base = count($this->dictionary_array);


for ($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
}
return $id;
}
}
?>

小开

public class UrlShortener {
private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static final int    BASE     = ALPHABET.length();


public static String encode(int num) {
StringBuilder sb = new StringBuilder();
while ( num > 0 ) {
sb.append( ALPHABET.charAt( num % BASE ) );
num /= BASE;
}
return sb.reverse().toString();
}


public static int decode(String str) {
int num = 0;
for ( int i = 0; i < str.length(); i++ )
num = num * BASE + ALPHABET.indexOf(str.charAt(i));
return num;
}
}

小开

C#版本：

public class UrlShortener
{
private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static int    BASE     = 62;


public static String encode(int num)
{
StringBuilder sb = new StringBuilder();


while ( num > 0 )
{
sb.Append( ALPHABET[( num % BASE )] );
num /= BASE;
}


StringBuilder builder = new StringBuilder();
for (int i = sb.Length - 1; i >= 0; i--)
{
builder.Append(sb[i]);
}
return builder.ToString();
}


public static int decode(String str)
{
int num = 0;


for ( int i = 0, len = str.Length; i < len; i++ )
{
num = num * BASE + ALPHABET.IndexOf( str[(i)] );
}


return num;
}
}

小开

我有一个变体的问题，因为我存储来自许多不同作者的网页，并且需要防止通过猜测发现页面。所以我的短URL在Base-62字符串中为页码添加了几个额外的数字。这些额外的数字是根据页面记录本身的信息生成的，它们确保3844个URL中只有1个是有效的（假设2位Base-62）。您可以在http://mgscan.com/MBWL处看到大纲描述。

小开

很好的回答，我已经创建了bjf的Golang实现：

package bjf


import (
"math"
"strings"
"strconv"
)


const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


func Encode(num string) string {
n, _ := strconv.ParseUint(num, 10, 64)
t := make([]byte, 0)


/* Special case */
if n == 0 {
return string(alphabet[0])
}


/* Map */
for n > 0 {
r := n % uint64(len(alphabet))
t = append(t, alphabet[r])
n = n / uint64(len(alphabet))
}


/* Reverse */
for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
t[i], t[j] = t[j], t[i]
}


return string(t)
}


func Decode(token string) int {
r := int(0)
p := float64(len(token)) - 1


for i := 0; i < len(token); i++ {
r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
p--
}


return r
}

托管在github：https://github.com/xor-gate/go-bjf

小开

public class TinyUrl {
    

private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private final int charBase = characterMap.length();
    

public String covertToCharacter(int num){
StringBuilder sb = new StringBuilder();
    

while (num > 0){
sb.append(characterMap.charAt(num % charBase));
num /= charBase;
}
    

return sb.reverse().toString();
}
    

public int covertToInteger(String str){
int num = 0;
for(int i = 0 ; i< str.length(); i++)
num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase , (str.length() - (i + 1)));
    

return num;
}
}
    

class TinyUrlTest{
    

public static void main(String[] args) {
TinyUrl tinyUrl = new TinyUrl();
int num = 122312215;
String url = tinyUrl.covertToCharacter(num);
System.out.println("Tiny url:  " + url);
System.out.println("Id: " + tinyUrl.covertToInteger(url));
}
}

小开

我的Python 3版本

base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)


def encode(num: int):
result = []
if num == 0:
result.append(base_list[0])


while num > 0:
result.append(base_list[num % base])
num //= base


print("".join(reversed(result)))


def decode(code: str):
num = 0
code_list = list(code)
for index, code in enumerate(reversed(code_list)):
num += base_list.index(code) * base ** index
print(num)


if __name__ == '__main__':
encode(341413134141)
decode("60FoItT")

小开

这是一个Node.js实现，可能bit.ly.生成一个高度随机的七字符字符串。

它使用Node.js加密来生成一个高度随机的25字符集，而不是随机选择7个字符。

var crypto = require("crypto");
exports.shortURL = new function () {
this.getShortURL = function () {
var sURL = '',
_rand = crypto.randomBytes(25).toString('hex'),
_base = _rand.length;
for (var i = 0; i < 7; i++)
sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
return sURL;
};
}

小开

有关高质量的Node.js/JavaScript解决方案，请参阅id缩短器模块，该模块经过全面测试并已在生产中使用数月。

它提供了一个有效的id/URL缩短器，由默认为Redis的可插拔存储支持，您甚至可以自定义您的短id字符集以及缩短是否为幂等。这是一个重要的区别，并非所有URL缩短器都考虑到。

关于这里的其他答案，这个模块实现了上面Marcel Jackwerth的优秀公认答案。

该解决方案的核心由以下Redis Lua代码段提供：

local sequence = redis.call('incr', KEYS[1])


local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''


while (remaining > 0) do
local d = (remaining % 60)
local character = string.sub(chars, d + 1, d + 1)


slug = character .. slug
remaining = (remaining - d) / 60
end


redis.call('hset', KEYS[2], slug, ARGV[1])


return slug

小开

在Scala中实现：

class Encoder(alphabet: String) extends (Long => String) {


val Base = alphabet.size


override def apply(number: Long) = {
def encode(current: Long): List[Int] = {
if (current == 0) Nil
else (current % Base).toInt :: encode(current / Base)
}
encode(number).reverse
.map(current => alphabet.charAt(current)).mkString
}
}


class Decoder(alphabet: String) extends (String => Long) {


val Base = alphabet.size


override def apply(string: String) = {
def decode(current: Long, encodedPart: String): Long = {
if (encodedPart.size == 0) current
else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
}
decode(0,string)
}
}

使用Scala测试的测试示例：

import org.scalatest.{FlatSpec, Matchers}


class DecoderAndEncoderTest extends FlatSpec with Matchers {


val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


"A number with base 10" should "be correctly encoded into base 62 string" in {
val encoder = new Encoder(Alphabet)
encoder(127) should be ("cd")
encoder(543513414) should be ("KWGPy")
}


"A base 62 string" should "be correctly decoded into a number with base 10" in {
val decoder = new Decoder(Alphabet)
decoder("cd") should be (127)
decoder("KWGPy") should be (543513414)
}


}

小开

Node.js和MongoDB解决方案

因为我们知道MongoDB用来创建一个12字节的新ObjectId的格式。

一个4字节的值，表示自Unix纪元以来的秒数，
一个3字节的机器标识符，
2字节进程ID
一个3字节计数器（在您的机器中），从随机值开始。

示例（我选择一个随机序列） a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 1 k 2 l 3

a1b2c3d4表示自Unix纪元以来的秒数，
4e5f6g7表示机器标识符，
h8i9表示进程id
j1k2l3表示计数器，从随机值开始。

由于计数器将是唯一的，如果我们将数据存储在同一台机器中，我们可以毫无疑问地获得它，它将是重复的。

所以短URL将是计数器，这是一个代码片段，假设您的服务器正常运行。

const mongoose = require('mongoose');
const Schema = mongoose.Schema;


// Create a schema
const shortUrl = new Schema({
long_url: { type: String, required: true },
short_url: { type: String, required: true, unique: true },
});
const ShortUrl = mongoose.model('ShortUrl', shortUrl);


// The user can request to get a short URL by providing a long URL using a form


app.post('/shorten', function(req ,res){
// Create a new shortUrl */
// The submit form has an input with longURL as its name attribute.
const longUrl = req.body["longURL"];
const newUrl = ShortUrl({
long_url : longUrl,
short_url : "",
});
const shortUrl = newUrl._id.toString().slice(-6);
newUrl.short_url = shortUrl;
console.log(newUrl);
newUrl.save(function(err){
console.log("the new URL is added");
})
});

小开

我一直在数据库中为每个域递增一个整数序列，并使用Hashids将整数编码为URL路径。

static hashids = Hashids(salt = "my app rocks", minSize = 6)

我运行了一个脚本，看看它需要多长时间才能耗尽字符长度。对于六个字符，它可以执行164,916,224链接，然后最多可以执行七个字符。Bitly使用七个字符。五个字符以下的字符在我看来很奇怪。

Hashids可以将URL路径解码回整数，但更简单的解决方案是使用整个短链接sho.rt/ka8ds3作为主键。

以下是完整的概念：

function addDomain(domain) {
table("domains").insert("domain", domain, "seq", 0)
}


function addURL(domain, longURL) {
seq = table("domains").where("domain = ?", domain).increment("seq")
shortURL = domain + "/" + hashids.encode(seq)
table("links").insert("short", shortURL, "long", longURL)
return shortURL
}


// GET /:hashcode
function handleRequest(req, res) {
shortURL = req.host + "/" + req.param("hashcode")
longURL = table("links").where("short = ?", shortURL).get("long")
res.redirect(301, longURL)
}

小开

基于XeonCross类的函数

function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
$result = [];
while($input > 0){
$result[] = $dictionary[($input % $base)];
$input = floor($input / $base);
}
return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
$pos = array_search($char, $dictionary);
$i = $i * $base + $pos;
}
return $i;
}

小开

为什么不直接生成一个随机字符串并将其附加到基本URL？这是在c#中执行此操作的非常简化的版本。

static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";


private static string RandomString(int length)
{
char[] s = new char[length];
Random rnd = new Random();
for (int x = 0; x < length; x++)
{
s[x] = chars[rnd.Next(chars.Length)];
}
Thread.Sleep(10);


return new String(s);
}

然后只需将随机字符串附加到base URL：

string tinyURL = baseUrl + RandomString(5);

请记住，这是这样做的一个非常简化的版本，并且R的方法可能会创建重复的字符串。在生产中，您需要考虑到重复的字符串，以确保您始终拥有唯一的URL。我有一些代码，通过查询数据库表来考虑重复字符串，如果有人感兴趣，我可以分享。

小开

这是我的初步想法，还可以做更多的思考，或者做一些模拟，看看效果好不好或者有什么需要改进的地方：

我的答案是记住数据库中的长URL，并使用ID0到9999999999999999（或者需要多大的数字）。

但是ID 0到9999999999999999可能是一个问题，因为

如果我们使用十六进制，它可以更短，甚至可以使用base 62或base 64。（就像YouTube使用A-Za-z0-9_和-一样）
如果它从0统一增加到9999999999999999，那么黑客可以按照这个顺序访问他们，并知道人们互相发送的URL，所以这可能是一个隐私问题

我们可以这样做：

让一台服务器将0到999分配给一台服务器，服务器A，所以现在服务器A有1000个这样的ID。因此，如果有20或200台服务器不断想要新ID，它不必一直要求每个新ID，而是要求一次1000个ID
例如，对于ID 1，将位反转。因此000...00000001变为10000...000，因此当转换为base 64时，每次都会非均匀地增加ID。
使用异或来翻转最终ID的位。例如，使用0xD5AA96...2373进行异或（类似于密钥），并且某些位将被翻转。（每当密钥打开1位时，它将翻转ID的位）。这将使ID更难以猜测并且看起来更随机

按照这种方案，分配ID的单个服务器可以形成ID，请求分配ID的20或200台服务器也可以形成ID。分配服务器必须使用锁/信号量来防止两个请求服务器获得同一批（或者如果它一次接受一个连接，这已经解决了问题）。因此，我们不希望等待分配的线太长。这就是为什么一次分配1000或10000可以解决问题。

小开

看看https://hashids.org/，它是开源的，有多种语言。

他们的页面概述了其他方法的一些陷阱。