Does PostgreSQL support “accent insensitive” collations?

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

SELECT * FROM users WHERE name LIKE 'João'

to find a row with a Joao name.

I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

See this answer for creating an FTS dictionary with unaccent: stackoverflow.com/a/50595181/124486
– Evan Carroll
May 30 at 1:38

Do you want case-sensitivity or case insensitive searches?
– Evan Carroll
May 30 at 1:57

3 Answers

Use the unaccent module for that - which is completely different from what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic
signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

If you get an error like:

ERROR: could not open extension control file
"/usr/share/postgresql/9.x/extension/unaccent.control": No such file
or directory

Install the contrib package on your database server as instructed in this related answer:

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT * FROM users WHERE unaccent(name) = unaccent('João');

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.

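Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

1. It depends on the behavior of a dictionary.
2. There is no hard-wired connection to this dictionary.
3. It therefore also depends on the current search_path, which can change easily.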
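To see how your own installation classifies the function, you can look it up in the system catalog. This is only a small sketch, assuming the extension is already installed; in pg_proc.provolatile, 'i' stands for IMMUTABLE, 's' for STABLE and 'v' for VOLATILE:

```sql
-- List every overload of unaccent() together with its volatility flag.
-- provolatile: 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE
SELECT p.oid::regprocedure AS function_signature
     , p.provolatile
FROM   pg_proc p
WHERE  p.proname = 'unaccent';
```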

Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:

I propose an approach that is at least as efficient as other solutions floating around, but safer:
Create a wrapper function with the two-parameter form and "hard-wire" the schema for function and dictionary:

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE;

public being the schema where you installed the extension (public is the default).

Previously, I had added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too, which is currently (pg 10) not documented. This version is a bit shorter and around twice as fast in my tests on pg 9.5 and pg 10.

The updated version still cannot be inlined, because a function declared IMMUTABLE must not call non-immutable functions in its body for inlining to be allowed. That hardly matters for performance as long as we use an expression index on this IMMUTABLE function:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary as demonstrated when used in any indexes. See:

Adapt your queries to match the index (so the query planner can use it):

SELECT * FROM users WHERE f_unaccent(name) = f_unaccent('João');

You don't need the function in the right expression. You can supply unaccented strings like 'Joao' directly.

In Postgres 9.5 or older, ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S

You will love this update to unaccent in Postgres 9.6:

Extend contrib/unaccent's standard unaccent.rules file to handle all
diacritics known to Unicode, and expand ligatures correctly (Thomas
Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users USING gin (f_unaccent(name) gin_trgm_ops);

Can be used for queries like:

SELECT * FROM users WHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GIST indexes are more expensive to maintain than plain btree:

There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
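One such simpler option for left-anchored patterns, sketched here under the assumption that the f_unaccent() wrapper and the users table from above exist, is a plain btree expression index with the text_pattern_ops operator class. The index name is made up for illustration:

```sql
-- Hypothetical index name; text_pattern_ops lets a btree index serve
-- left-anchored LIKE patterns even when the database collation is not "C".
CREATE INDEX users_unaccent_name_pattern_idx
    ON users (public.f_unaccent(name) text_pattern_ops);

-- The pattern is anchored at the start, so the btree index above can be used.
SELECT *
FROM   users
WHERE  f_unaccent(name) LIKE 'Joao%';
```

This index is cheaper to maintain than a trigram index, but it does not help with patterns that start with a wildcard.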

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
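A rough sketch of how these operators might be combined with the wrapper function, assuming pg_trgm is installed and f_unaccent() exists as defined above. Note that a GIN trigram index can support the % filter, while index-assisted ordering by <-> needs a GiST trigram index:

```sql
-- Find names that are "similar" to the search term after stripping accents,
-- most similar first. The threshold used by % is a pg_trgm setting.
SELECT name
     , similarity(f_unaccent(name), 'Joao') AS sml
FROM   users
WHERE  f_unaccent(name) % 'Joao'           -- similar enough per current threshold
ORDER  BY f_unaccent(name) <-> 'Joao'      -- smallest distance first
LIMIT  10;
```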

Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:
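For illustration, queries of the following shape can use the trigram expression index shown earlier. This is a sketch that assumes the users table and the f_unaccent() wrapper from above; index support for regular expressions requires Postgres 9.3 or later:

```sql
-- Case-insensitive, accent-insensitive substring match
SELECT *
FROM   users
WHERE  f_unaccent(name) ILIKE '%joao%';

-- Simple regular expression on the unaccented value
SELECT *
FROM   users
WHERE  f_unaccent(name) ~ '(Joao|Joana)';
```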

In your solution, are indexes used, or would I need to create an index on unaccent(name)?
– Daniel Serodio
Jun 13 '12 at 14:45

@ErwinBrandstetter In psql 9.1.4, I get "functions in index expression must be marked IMMUTABLE", because the unaccent function is STABLE instead of IMMUTABLE. What do you recommend?
– e3matheus
Jun 4 '13 at 18:17

@e3matheus: Feeling guilty for not having tested the previous solution I provided, I investigated and updated my answer with a new and better (IMHO) solution for the problem than what is floating around so far.
– Erwin Brandstetter
Jun 5 '13 at 1:10

Isn't the collation utf8_general_ci the answer for this kind of issue?
– Med
Apr 1 '14 at 14:36

Your answers are as good as the Postgres documentation: phenomenal!
– electrotype
Oct 22 '17 at 20:58

I'm pretty sure PostgreSQL relies on the underlying operating system for collation. It does support creating new collations, and customizing collations. I'm not sure how much work that might be for you, though. (Could be quite a lot.)

New collation support is currently basically limited to wrappers and aliases for operating system locales. It's very basic. There's no support for filter functions, custom comparators, or any of what you'd need for true custom collations.
– Craig Ringer
Sep 7 '15 at 4:16

No, PostgreSQL does not support collations in that sense

PostgreSQL does not support collations like that (accent insensitive or not) because no comparison can return equal unless things are binary-equal. This is because internally it would introduce a lot of complexities for things like a hash index. For this reason collations in their strictest sense only affect ordering and not equality.

Workarounds

For FTS, you can define your own dictionary using unaccent,

CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, simple;

Which you can then index with a functional index,

-- Just some sample data...
CREATE TABLE myTable ( myCol )
  AS VALUES ('fóó bar baz'),('qux quz');

-- No index required, but feel free to create one
CREATE INDEX ON myTable
  USING GIST (to_tsvector('mydict', myCol));

You can now query it very simply

SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar';

    mycol
-------------
 fóó bar baz
(1 row)
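If you want to check what the dictionary and the configuration do to a given string, something along these lines may help. It is only a sketch and assumes the unaccent extension and the mydict configuration from above are in place:

```sql
-- What the unaccent dictionary alone returns for a single token
SELECT ts_lexize('unaccent', 'fóó');

-- What the whole configuration produces for a document
SELECT to_tsvector('mydict', 'fóó bar baz');

-- The same match as above, with the query parsed by the same configuration
SELECT to_tsvector('mydict', 'fóó bar baz') @@ to_tsquery('mydict', 'foo & bar');
```

Parsing the query with to_tsquery('mydict', ...) means accented search terms are normalized the same way as the stored text.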
