Try James Tauber's Python Unicode Collation Algorithm. It may not do exactly as you want, but seems well worth a look. For a bit more information about the issues, see this post by Christopher Lenz.
IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.
Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.
The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.
I see the answers have already done an excellent job, just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a unicode string s, it uses the code:
spec_dict = {'Å':'A', 'Ä':'A'}
def spec_order(s):
return ''.join([spec_dict.get(ch, ch) for ch in s])
Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings -- the analogous method for byte strings has a different and somewhat less helpful specification!-):
spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)
def spec_order(s):
return s.translate(spec_dict)
The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it's easy to specify "ignore a certain character for sorting purposes", "map ä to ae for sorting purposes", and the like).
In Python 3, you can get the "rebuilding" step more simply, e.g.:
spec_dict = ''.maketrans(spec_dict)
See the docs for other ways you can use this maketrans static method in Python 3.
It is far from a complete solution for your use case, but you could take a look at the unaccent.py script from effbot.org. What it basically does is remove all accents from a text. You can use that 'sanitized' text to sort alphabetically. (For a better description see this page.)
I don't see this in the answers. My Application sorts according to the locale using python's standard library. It is pretty easy.
# python2.5 code below
# corpus is our unicode() strings collection as a list
corpus = [u"Art", u"Älg", u"Ved", u"Wasa"]
import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# alternatively, (but it's bad to hardcode)
# locale.setlocale(locale.LC_ALL, "sv_SE.UTF-8")
corpus.sort(cmp=locale.strcoll)
# in python2.x, locale.strxfrm is broken and does not work for unicode strings
# in python3.x however:
# corpus.sort(key=locale.strxfrm)
Question to Lennart and other answerers: Doesn't anyone know 'locale' or is it not up to this task?
locale.strcoll under Python 2, and locale.strxfrm will in fact solve the problem, and does a good job, assuming that you have the locale in question installed. I tested it under Windows too, where the locale names confusingly are different, but on the other hand it seems to have all locales that are supported installed by default.
ICU doesn't necessarily do this better in practice, it however does way more. Most notably it has support for splitters that can split texts in different languages into words. This is very useful for languages that doesn't have word separators. You'll need to have a corpus of words to use as a base for the splitting, because that's not included, though.
It also has long names for the locales so you can get pretty display names for the locale, support for other calendars than Gregorian (although I'm not sure the Python interface supports that) and tons and tons of other more or less obscure locale supports.
So all in all: If you want to sort alphabetically and locale-dependent, you can use the locale module, unless you have special requirements, or also need more locale dependent functionality, like words splitter.
The simplest, easiest, and most straightforward way to do this it to make a callout to the Perl library module, Unicode::Collate::Locale, which is a subclass of the standard Unicode::Collate module. All you need do is pass the constructor a locale value of "xv" for Sweden.
(You may not neccesarily appreciate this for Swedish text, but because Perl uses abstract characters, you can use any Unicode code point you please — no matter the platform or build! Few languages offer such convenience. I mention it because I’ve fighting a losing battle with Java a lot over this maddening problem lately.)
The problem is that I do not know how to access a Perl module from Python — apart, that is, from using a shell callout or two-sided pipe. To that end, I have therefore provided you with a complete working script called ucsort that you can call to do exactly what you have asked for with perfect ease.
This script is 100% compliant with the full Unicode Collation Algorithm, with all tailoring options supported!! And if you have an optional module installed or run Perl 5.13 or better, then you have full access to easy-to-use CLDR locales. See below.
Demonstration
Imagine an input set ordered this way:
b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q
A default sort by code point yields:
a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö
which is incorrect by everybody’s book. Using my script, which uses the Unicode Collation Algorithm, you get this order:
% perl ucsort /tmp/swedish_alphabet | fmt
a å ä b c d e f g h i j k l m n o ö p q r s t u v x y z
That is the default UCA sort. To get the Swedish locale, call ucsort this way:
% perl ucsort --locale=sv /tmp/swedish_alphabet | fmt
a b c d e f g h i j k l m n o p q r s t u v x y z å ä ö
Here is a better input demo. First, the input set:
You can do many other things with ucsort. For example, here is how to sort titles in English:
% ucsort --preprocess='s/^(an?|the)\s+//i' /tmp/titles
Anathem
The Book of Skulls
A Civil Campaign
The Claw of the Conciliator
The Demolished Man
Dune
An Early Dawn
The Faded Sun: Kesrith
The Fall of Hyperion
A Feast for Crows
Flowers for Algernon
The Forbidden Tower
Foundation and Empire
Foundation’s Edge
The Goblin Reservation
The High Crusade
Jack of Shadows
The Man in the High Castle
The Ringworld Engineers
The Robots of Dawn
A Storm of Swords
Stranger in a Strange Land
There Will Be Time
The White Dragon
You will need Perl 5.10.1 or better to run the script in general. For locale support, you must either install the optional CPAN module Unicode::Collate::Locale. Alternately, you can install a development versions of Perl, 5.13+, which include that module standardly.
Calling Conventions
This is a rapid prototype, so ucsort is mostly un(der)documented. But this is its SYNOPSIS of what switches/options it accepts on the command line:
# standard options
--help|?
--man|m
--debug|d
# collator constructor options
--backwards-levels=i
--collation-level|level|l=i
--katakana-before-hiragana
--normalization|n=s
--override-CJK=s
--override-Hangul=s
--preprocess|P=s
--upper-before-lower|u
--variable=s
# program specific options
--case-insensitive|insensitive|i
--input-encoding|e=s
--locale|L=s
--paragraph|p
--reverse-fields|last
--reverse-output|r
--right-to-left|reverse-input
Yeah, ok: that’s really the argument list I use for the call to Getopt::Long, but you get the idea. :)
If you can figure out how to call Perl library modules from Python directly without calling a Perl script, by all means do so. I just don’t know how myself. I’d love to learn how.
In the meantime, I believe this script will do what you need done in all its particular — and more! I now use this for all of text sorting. It finally does what I’ve needed for a long, long time.
The only downside is that --locale argument causes performance to go down the tubes, although it’s plenty fast enough for regular, non-locale but still 100% UCA compliant sorting. Since it loads everything in memory, you probably don’t want to use this on gigabyte documents. I use it many times a day, and it sure it great having sane text sorting at last.
Though it is certainly not the most exact way, it is a very simple way to at least get it somewhat right. It also beats locale in a webapp as locale is not threadsafe and sets the language settings process-wide. It also easier to set up than PyICU which relies on an external C library.
I uploaded the script to github as the original was down at the time of this writing and I had to resort to web caches to get it: