Sorting strings with regard to unicode

I have a list that I want to sort alphabetically, but with regard to Unicode:
iex(2)> ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"] |> Enum.sort
["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"]
# the above is wrong, it should be:
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
How can I achieve that in Elixir? Usage of some Hex packages is acceptable.
3 Answers
The proper way to handle sorting would be to bring all the characters to the decomposed Unicode form (NFD) and then sort. The issue is that, for some reason, "ł" is not considered a composed form:
# letters is the Polish alphabet list used in the other answers
letters
|> Enum.map(&:unicode.characters_to_nfd_binary/1)
|> Enum.map(&String.codepoints/1)
#⇒ [
# ["a"],
# ["a", "̨"],
# ["b"],
# ["c"],
# ["c", "́"],
# ["d"],
# ["e"],
# ["e", "̨"],
# ["f"],
# ["g"],
# ["h"],
# ["i"],
# ["j"],
# ["k"],
# ["l"],
# ["ł"],
# ["m"],
# ["n"],
# ["n", "́"],
# ["o"],
# ["o", "́"],
# ["p"],
# ["q"],
# ["r"],
# ["s"],
# ["s", "́"],
# ["t"],
# ["u"],
# ["w"],
# ["y"],
# ["z"],
# ["z", "́"],
# ["z", "̇"]
# ]
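To see the difference at the code point level, here is a minimal sketch. The decompose helper is a name made up for illustration, and the snippet assumes OTP 20 or later, where :unicode.characters_to_nfd_binary/1 is available:

```elixir
# Hypothetical helper: NFD-normalize a string and return its code points.
decompose = fn s ->
  s |> :unicode.characters_to_nfd_binary() |> String.to_charlist()
end

# "ó" (U+00F3) has a canonical decomposition; "ł" (U+0142) has none.
o_acute = decompose.("\u00F3")
l_stroke = decompose.("\u0142")

IO.inspect(o_acute, charlists: :as_lists)   # [111, 769], i.e. "o" + U+0301
IO.inspect(l_stroke, charlists: :as_lists)  # [322], still a single code point
```

So after NFD every Polish diacritic except "ł" becomes a base letter plus a combining mark, which is why "ł" needs separate treatment.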
I have no idea why "ł" is not declared as a composed letter; I would consider this a bug in the Unicode Consortium's specification. Anyway, we can fool the sorter:
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
|> Enum.map(&:unicode.characters_to_nfd_binary/1)
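# "l" plus a high sentinel code point (U+FFFD here) sorts after "l" followed by any ordinary letter, and before "m"
|> Enum.map(&String.replace(&1, "ł", "l\u{FFFD}"))
|> Enum.sort()
|> Enum.map(&String.replace(&1, "l\u{FFFD}", "ł"))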
#⇒ ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
Now it’s working with any input, both composed and decomposed.
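The same trick can be wrapped in a function. What follows is only a sketch under a few assumptions: the module name PolishSort and the choice of U+FFFD as the sentinel are mine, only lowercase "ł" is handled, and the result is recomposed to NFC at the end so callers get back composed strings:

```elixir
defmodule PolishSort do
  # Hypothetical helper module. The sentinel is "l" followed by U+FFFD,
  # which compares greater than "l" followed by any ordinary letter.
  @sentinel "l\u{FFFD}"

  def sort(words) do
    words
    |> Enum.map(&String.normalize(&1, :nfd))
    |> Enum.map(&String.replace(&1, "ł", @sentinel))
    |> Enum.sort()
    |> Enum.map(&String.replace(&1, @sentinel, "ł"))
    # Recompose, so "o" + U+0301 comes back as a single "ó".
    |> Enum.map(&String.normalize(&1, :nfc))
  end
end

# "\u0142o\u0301dzkie" is "łódzkie" typed with a decomposed "ó".
words = ["zachodniopomorskie", "\u0142o\u0301dzkie", "mazowieckie", "lubelskie"]
sorted = PolishSort.sort(words)
IO.inspect(sorted)
```

This relies on plain binary comparison of the NFD forms, so it is not a full collation algorithm; it just happens to match the Polish alphabet order for lowercase input.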
"Far from perfect, but works."
It doesn't work for me:
my.exs:
defmodule Stuff do
  def numeric_for_sort(string) do
    letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
               "m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
    String.graphemes(string)
    |> Enum.map(fn(x) -> Enum.find_index(letters, fn(y) -> x == y end) end)
  end
end
~/elixir_programs$ iex my.exs
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Enum.sort(["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"], &(Stuff.numeric_for_sort(&1["name"]) <= Stuff.numeric_for_sort(&2["name"])))
** (FunctionClauseError) no function clause matching in Access.get/3
The following arguments were given to Access.get/3:
# 1
"lubelskie"
# 2
"name"
# 3
nil
(elixir) lib/access.ex:306: Access.get/3
(stdlib) erl_eval.erl:670: :erl_eval.do_apply/6
(stdlib) erl_eval.erl:878: :erl_eval.expr_list/6
(stdlib) erl_eval.erl:404: :erl_eval.expr/5
(stdlib) erl_eval.erl:469: :erl_eval.expr/5
(stdlib) lists.erl:969: :lists.sort/2
In short: (FunctionClauseError) no function clause matching in Access.get/3.
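The trace points at the &1["name"] part: the list holds plain strings, and the bracket syntax goes through Access, which supports maps and keyword lists but not binaries. A small sketch of the difference, where by_name, regions and result are names invented for this example:

```elixir
# The comparator from the failing call, reduced to the Access part.
by_name = fn a, b -> a["name"] <= b["name"] end

# Works: Access supports maps, so a["name"] is a key lookup.
regions = [%{"name" => "mazowieckie"}, %{"name" => "lubelskie"}]
sorted_maps = Enum.sort(regions, by_name)

# Fails: there is no Access.get/3 clause for a binary such as "lubelskie".
result =
  try do
    Enum.sort(["mazowieckie", "lubelskie"], by_name)
  rescue
    e -> {:error, e.__struct__}
  end

IO.inspect(sorted_maps)
IO.inspect(result)
```

So the comparator as posted presumably came from code that sorted maps with a "name" key; for a list of strings the ["name"] lookups have to go.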
Also, I don't think you want to use a list for the letters, because then you have to traverse the list again for every letter you look up. That is what maps are for. (Edit: Well, what do I know: small maps are ordered lists when the map has <= 31 entries.) So, something like this:
defmodule Stuff do
  def numeric_for_sort(string) do
    letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
               "m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
    # Map each letter to its position in the alphabet.
    letter_rank = letters |> Enum.with_index() |> Map.new()
    String.graphemes(string)
    |> Enum.map(fn(x) -> letter_rank[x] end)
  end
end
Then:
iex(1)> names = ["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
iex(2)> Enum.sort_by names, &Stuff.numeric_for_sort/1
["lubelskie", "łódzkie", "mazowieckie", "zachodniopomorskie"]
According to the Enum.sort_by/3 docs:
sort_by/3 differs from sort/2 in that it only calculates the comparison value for each element in the enumerable once instead of once for each element in each comparison. If the same function is being called on both elements, it's also more compact to use sort_by/3.
There are many comparisons done while sorting, and it's obviously not ideal to calculate the numeric list for each name over and over again for every comparison done by the sort algorithm.
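A rough way to observe this is to count how often the key function runs. In the sketch below the Agent-based counter and the key function are made up for the demonstration, and String.length/1 merely stands in for a more expensive key such as numeric_for_sort/1:

```elixir
# Counter process used only to count calls to the key function.
{:ok, counter} = Agent.start_link(fn -> 0 end)

key = fn name ->
  Agent.update(counter, &(&1 + 1))
  String.length(name)
end

names = ["lubelskie", "mazowieckie", "zachodniopomorskie", "pomorskie", "opolskie"]

# sort/2: the key is computed twice per comparison.
Enum.sort(names, fn a, b -> key.(a) <= key.(b) end)
calls_with_sort = Agent.get_and_update(counter, fn n -> {n, 0} end)

# sort_by/2: the key is computed once per element.
Enum.sort_by(names, key)
calls_with_sort_by = Agent.get(counter, & &1)

IO.inspect({calls_with_sort, calls_with_sort_by})
```

With five names, sort_by calls the key function five times, while the sort/2 variant calls it at least eight times (two calls for each of the minimum four comparisons), and the gap widens as the list grows.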
Note that even though this line:
Enum.sort_by names, &Stuff.numeric_for_sort/1
looks like it is calling sort_by/2, it is actually calling sort_by/3 with a default third argument of &<=/2.
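For illustration, passing the third argument explicitly looks like this. It is a sketch that uses String.length/1 as a stand-in key and assumes the sort_by/3 signature of the Elixir 1.6 session shown earlier in this answer:

```elixir
names = ["lubelskie", "mazowieckie", "zachodniopomorskie"]

# These two calls are equivalent: &<=/2 is what the two-argument form uses.
ascending = Enum.sort_by(names, &String.length/1)
explicit = Enum.sort_by(names, &String.length/1, &<=/2)

# A different sorter reverses the order.
descending = Enum.sort_by(names, &String.length/1, &>=/2)

IO.inspect(ascending)
IO.inspect(descending)
```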
Try to copy-paste ["lubelskie", "łódzkie"] and get surprised. One cannot avoid dealing with combined diacritics.
– mudasobwa
Aug 6 at 7:17
So far, since the alphabet which is used is well-defined, I ended up creating my own sorting function:
defp numeric_for_sort(string) do
  letters = ["a", "ą", "b", "c", "ć", "d", "e", "ę", "f", "g", "h", "i", "j", "k", "l", "ł",
             "m", "n", "ń", "o", "ó", "p", "q", "r", "s", "ś", "t", "u", "w", "y", "z", "ź", "ż"]
  String.graphemes(string)
  |> Enum.map(fn(x) -> Enum.find_index(letters, fn(y) -> x == y end) end)
end
And then
Enum.sort(["lubelskie", "mazowieckie", "zachodniopomorskie", "łódzkie"], &(numeric_for_sort(&1["name"]) <= numeric_for_sort(&2["name"])))
Far from perfect, but works.
It does not work. You completely ignore combined diacritics.
– mudasobwa
Aug 6 at 6:50
@mudasobwa I'm not sure what you mean by that
– katafrakt
Aug 6 at 9:50
If the input comes as combined diacritics, namely o followed by an accent, your sorter will fail.
– mudasobwa
Aug 6 at 9:54
but it works... paste.org/94426 Maybe won't work in 100% cases - is that what you mean?
– katafrakt
Aug 6 at 10:07
With combined diacritics there are 8 characters, copy-paste my comment to another answer.
– mudasobwa
Aug 6 at 10:10
What the heck does "with regard to Unicode" mean? Is it by code-unit (which charset?), code-point, some language-specific mapping, what?
– Deduplicator
Aug 5 at 22:44