C # 中的 Slugify 算法? ?

因此,我搜索并浏览了 SO 上的 鼻涕虫标签,只发现了两个引人注目的解决方案:

这只是问题的部分解决方案。我可以自己手动编写代码,但我感到惊讶的是,目前还没有一个解决方案。

那么,在 C # 和/或中有一个缓慢的算法实现吗。NET,适当地处理拉丁字符,Unicode 和各种其他语言的问题?

47532 次浏览

http://predicatet.blogspot.com/2009/04/improved-c-slug-generator-or-how-to.html

public static string GenerateSlug(this string phrase)
{
string str = phrase.RemoveAccent().ToLower();
// invalid chars
str = Regex.Replace(str, @"[^a-z0-9\s-]", "");
// convert multiple spaces into one space
str = Regex.Replace(str, @"\s+", " ").Trim();
// cut and trim
str = str.Substring(0, str.Length <= 45 ? str.Length : 45).Trim();
str = Regex.Replace(str, @"\s", "-"); // hyphens
return str;
}


public static string RemoveAccent(this string txt)
{
byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
return System.Text.Encoding.ASCII.GetString(bytes);
}

One problem I've had with slugification (new word!) is collisions. If I have a blog post, for instance, called "Stack-Overflow" and one called "Stack Overflow", the slugs of those two titles are the same. Therefore, my slug generator usually has to involve the database in some way. This might be why you don't see more generic solutions out there.

Here you find a way to generate url slug in c#. This function remove all accents(Marcel's answer), replace spaces, remove invalid chars, trim dashes from end and replace double occurences of "-" or "_"

Code:

public static string ToUrlSlug(string value){


//First to lower case
value = value.ToLowerInvariant();


//Remove all accents
var bytes = Encoding.GetEncoding("Cyrillic").GetBytes(value);
value = Encoding.ASCII.GetString(bytes);


//Replace spaces
value = Regex.Replace(value, @"\s", "-", RegexOptions.Compiled);


//Remove invalid chars
value = Regex.Replace(value, @"[^a-z0-9\s-_]", "",RegexOptions.Compiled);


//Trim dashes from end
value = value.Trim('-', '_');


//Replace double occurences of - or _
value = Regex.Replace(value, @"([-_]){2,}", "$1", RegexOptions.Compiled);


return value ;
}

Here is my rendition, based Joan's and Marcel's answers. The changes I made are as follows:

  • Use a widely accepted method to remove accents.
  • Explicit Regex caching for modest speed improvements.
  • More word separators recognized and normalized to hyphens.

Here is the code:

public class UrlSlugger
{
// white space, em-dash, en-dash, underscore
static readonly Regex WordDelimiters = new Regex(@"[\s—–_]", RegexOptions.Compiled);


// characters that are not valid
static readonly Regex InvalidChars = new Regex(@"[^a-z0-9\-]", RegexOptions.Compiled);


// multiple hyphens
static readonly Regex MultipleHyphens = new Regex(@"-{2,}", RegexOptions.Compiled);


public static string ToUrlSlug(string value)
{
// convert to lower case
value = value.ToLowerInvariant();


// remove diacritics (accents)
value = RemoveDiacritics(value);


// ensure all word delimiters are hyphens
value = WordDelimiters.Replace(value, "-");


// strip out invalid characters
value = InvalidChars.Replace(value, "");


// replace multiple hyphens (-) with a single hyphen
value = MultipleHyphens.Replace(value, "-");


// trim hyphens (-) from ends
return value.Trim('-');
}


/// See: http://www.siao2.com/2007/05/14/2629747.aspx
private static string RemoveDiacritics(string stIn)
{
string stFormD = stIn.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();


for (int ich = 0; ich < stFormD.Length; ich++)
{
UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
if (uc != UnicodeCategory.NonSpacingMark)
{
sb.Append(stFormD[ich]);
}
}


return (sb.ToString().Normalize(NormalizationForm.FormC));
}
}

This still does not solve the non-latin character issue. A completely alternative solution would be to use Uri.EscapeDataString to convert the the string its hex representation:

string original = "测试公司";


// %E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8
string converted = Uri.EscapeDataString(original);

Then use the data to generate a hyperlink:

<a href="http://www.example.com/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8">
测试公司
</a>

Many browsers will display Chinese characters in the address bar (see below), but based on my limited testing, it is not completely supported.

address bar with Chinese characters

NOTE: In order for Uri.EscapeDataString to work this way, iriParsing must be enabled.


EDIT

For those looking to generate URL Slugs in C#, I recommend checking out this related question:

How does Stack Overflow generate its SEO-friendly URLs?

It is what I ended up using for my project.

Here is my shot at it. It supports:

  • removal of diacritics (so we don't just remove "invalid" characters)
  • max length for the result (or before removal of diacritics - "early truncate")
  • custom separator between normalized chunks
  • the result can be forced to uppercase or lowercase
  • configurable list of supported unicode categories
  • configurable list of ranges of allowed characters
  • supports framework 2.0

Code:

/// <summary>
/// Defines a set of utilities for creating slug urls.
/// </summary>
public static class Slug
{
/// <summary>
/// Creates a slug from the specified text.
/// </summary>
/// <param name="text">The text. If null if specified, null will be returned.</param>
/// <returns>
/// A slugged text.
/// </returns>
public static string Create(string text)
{
return Create(text, (SlugOptions)null);
}


/// <summary>
/// Creates a slug from the specified text.
/// </summary>
/// <param name="text">The text. If null if specified, null will be returned.</param>
/// <param name="options">The options. May be null.</param>
/// <returns>A slugged text.</returns>
public static string Create(string text, SlugOptions options)
{
if (text == null)
return null;


if (options == null)
{
options = new SlugOptions();
}


string normalised;
if (options.EarlyTruncate && options.MaximumLength > 0 && text.Length > options.MaximumLength)
{
normalised = text.Substring(0, options.MaximumLength).Normalize(NormalizationForm.FormD);
}
else
{
normalised = text.Normalize(NormalizationForm.FormD);
}
int max = options.MaximumLength > 0 ? Math.Min(normalised.Length, options.MaximumLength) : normalised.Length;
StringBuilder sb = new StringBuilder(max);
for (int i = 0; i < normalised.Length; i++)
{
char c = normalised[i];
UnicodeCategory uc = char.GetUnicodeCategory(c);
if (options.AllowedUnicodeCategories.Contains(uc) && options.IsAllowed(c))
{
switch (uc)
{
case UnicodeCategory.UppercaseLetter:
if (options.ToLower)
{
c = options.Culture != null ? char.ToLower(c, options.Culture) : char.ToLowerInvariant(c);
}
sb.Append(options.Replace(c));
break;


case UnicodeCategory.LowercaseLetter:
if (options.ToUpper)
{
c = options.Culture != null ? char.ToUpper(c, options.Culture) : char.ToUpperInvariant(c);
}
sb.Append(options.Replace(c));
break;


default:
sb.Append(options.Replace(c));
break;
}
}
else if (uc == UnicodeCategory.NonSpacingMark)
{
// don't add a separator
}
else
{
if (options.Separator != null && !EndsWith(sb, options.Separator))
{
sb.Append(options.Separator);
}
}


if (options.MaximumLength > 0 && sb.Length >= options.MaximumLength)
break;
}


string result = sb.ToString();


if (options.MaximumLength > 0 && result.Length > options.MaximumLength)
{
result = result.Substring(0, options.MaximumLength);
}


if (!options.CanEndWithSeparator && options.Separator != null && result.EndsWith(options.Separator))
{
result = result.Substring(0, result.Length - options.Separator.Length);
}


return result.Normalize(NormalizationForm.FormC);
}


private static bool EndsWith(StringBuilder sb, string text)
{
if (sb.Length < text.Length)
return false;


for (int i = 0; i < text.Length; i++)
{
if (sb[sb.Length - 1 - i] != text[text.Length - 1 - i])
return false;
}
return true;
}
}


/// <summary>
/// Defines options for the Slug utility class.
/// </summary>
public class SlugOptions
{
/// <summary>
/// Defines the default maximum length. Currently equal to 80.
/// </summary>
public const int DefaultMaximumLength = 80;


/// <summary>
/// Defines the default separator. Currently equal to "-".
/// </summary>
public const string DefaultSeparator = "-";


private bool _toLower;
private bool _toUpper;


/// <summary>
/// Initializes a new instance of the <see cref="SlugOptions"/> class.
/// </summary>
public SlugOptions()
{
MaximumLength = DefaultMaximumLength;
Separator = DefaultSeparator;
AllowedUnicodeCategories = new List<UnicodeCategory>();
AllowedUnicodeCategories.Add(UnicodeCategory.UppercaseLetter);
AllowedUnicodeCategories.Add(UnicodeCategory.LowercaseLetter);
AllowedUnicodeCategories.Add(UnicodeCategory.DecimalDigitNumber);
AllowedRanges = new List<KeyValuePair<short, short>>();
AllowedRanges.Add(new KeyValuePair<short, short>((short)'a', (short)'z'));
AllowedRanges.Add(new KeyValuePair<short, short>((short)'A', (short)'Z'));
AllowedRanges.Add(new KeyValuePair<short, short>((short)'0', (short)'9'));
}


/// <summary>
/// Gets the allowed unicode categories list.
/// </summary>
/// <value>
/// The allowed unicode categories list.
/// </value>
public virtual IList<UnicodeCategory> AllowedUnicodeCategories { get; private set; }


/// <summary>
/// Gets the allowed ranges list.
/// </summary>
/// <value>
/// The allowed ranges list.
/// </value>
public virtual IList<KeyValuePair<short, short>> AllowedRanges { get; private set; }


/// <summary>
/// Gets or sets the maximum length.
/// </summary>
/// <value>
/// The maximum length.
/// </value>
public virtual int MaximumLength { get; set; }


/// <summary>
/// Gets or sets the separator.
/// </summary>
/// <value>
/// The separator.
/// </value>
public virtual string Separator { get; set; }


/// <summary>
/// Gets or sets the culture for case conversion.
/// </summary>
/// <value>
/// The culture.
/// </value>
public virtual CultureInfo Culture { get; set; }


/// <summary>
/// Gets or sets a value indicating whether the string can end with a separator string.
/// </summary>
/// <value>
///   <c>true</c> if the string can end with a separator string; otherwise, <c>false</c>.
/// </value>
public virtual bool CanEndWithSeparator { get; set; }


/// <summary>
/// Gets or sets a value indicating whether the string is truncated before normalization.
/// </summary>
/// <value>
///   <c>true</c> if the string is truncated before normalization; otherwise, <c>false</c>.
/// </value>
public virtual bool EarlyTruncate { get; set; }


/// <summary>
/// Gets or sets a value indicating whether to lowercase the resulting string.
/// </summary>
/// <value>
///   <c>true</c> if the resulting string must be lowercased; otherwise, <c>false</c>.
/// </value>
public virtual bool ToLower
{
get
{
return _toLower;
}
set
{
_toLower = value;
if (_toLower)
{
_toUpper = false;
}
}
}


/// <summary>
/// Gets or sets a value indicating whether to uppercase the resulting string.
/// </summary>
/// <value>
///   <c>true</c> if the resulting string must be uppercased; otherwise, <c>false</c>.
/// </value>
public virtual bool ToUpper
{
get
{
return _toUpper;
}
set
{
_toUpper = value;
if (_toUpper)
{
_toLower = false;
}
}
}


/// <summary>
/// Determines whether the specified character is allowed.
/// </summary>
/// <param name="character">The character.</param>
/// <returns>true if the character is allowed; false otherwise.</returns>
public virtual bool IsAllowed(char character)
{
foreach (var p in AllowedRanges)
{
if (character >= p.Key && character <= p.Value)
return true;
}
return false;
}


/// <summary>
/// Replaces the specified character by a given string.
/// </summary>
/// <param name="character">The character to replace.</param>
/// <returns>a string.</returns>
public virtual string Replace(char character)
{
return character.ToString();
}
}