Database Administration
postgresql update functions regular-expression
Updated Fri, 20 May 2022 09:08:28 GMT

How to map similar camelcased, pascalcased, underscored, lowerscored words to the same id?


I'm working on PostgreSQL and I have a table like this called words_table:
Table table

I need all equivalent words (like software_design and software-design) to have its map value = to its word's equivalent id_word. In other words, I want all the words softwaredesign, software_design, software.design... and so on having the same number on its map column (in this specific case, a number from 1 to 10). The same applies to civil_engineering and IndustrialDesign.

I know this involves some regular expressions and case insensitive comparisons but i'm stuck at the SQL logic. I know these expressions could be useful:

regexp_replace(word, '(\.|:|,|&|-)','','g')

To handle the separators

lower(something)

To handle the uppercase lowercase matching or

UPDATE words_table SET a.map = b.id_word WHERE word ILIKE something
... WHERE word ~* something

for case insensitive matching.

Should I create new columns with regexp_replaced words and do the mapping after that with some joins? Or maybe something with CamelCase and underscore matching? Maybe functions? Which is the optimal solution?




Solution

Database design

Either you already have the look-up table and forgot to mention it, or you should create one.

CREATE TABLE map (
  map_id int PRIMARY KEY
 ,map    text UNIQUE NOT NULL
);
-- And that's how I would shape your original table:
CREATE TABLE word (
  word_id serial PRIMARY KEY
 ,map_id  int REFERENCES map(map_id)
 ,word    text
);

Here and here is why I normally prefer text over varchar(n).
Here is why I put the two integer columns first in table word.

INSERT INTO map(map_id, map) VALUES
  (1, softwaredesign)
 ,(2, civilengineering)
 ,(3, industrialdesign)
 , ... ;

Map data

Define a function like @Daniel already suggested.

Depending on your exact requirements, I would use pre-defined character classes where possible in a regular expression instead of rolling my own. Postgres uses locale information from your OS to identify characters, digits etc.

[:alnum:] includes all numbers and digits, depending on your locale. Unlike [a-zA-Z0-9] this also identifies or as characters - depending on your locale.

[^[:alnum:]]
is the negation, i.e. all other characters.
There is also the similar class shorthand \W, but that would include the underscore _.

CREATE FUNCTION map_word(text) RETURNS text AS
$$SELECT lower(regexp_replace($1, '[^[:alnum:]]', '', 'g'))$$
LANGUAGE sql IMMUTABLE;

This update would be most efficient then:

UPDATE word w
SET    map_id = m.map_id
FROM   map m
WHERE  m.map = map_word(w.word)
AND    w.map_id IS DISTINCT FROM m.map_id -- avoid empty updates