I'm working on PostgreSQL and I have a table like this called words_table:
I need all equivalent words (like software_design and software-design) to have its map value = to its word's equivalent id_word. In other words, I want all the words softwaredesign, software_design, software.design... and so on having the same number on its map column (in this specific case, a number from 1 to 10). The same applies to civil_engineering and IndustrialDesign.
I know this involves some regular expressions and case insensitive comparisons but i'm stuck at the SQL logic. I know these expressions could be useful:
regexp_replace(word, '(\.|:|,|&|-)','','g')
To handle the separators
lower(something)
To handle the uppercase lowercase matching or
UPDATE words_table SET a.map = b.id_word WHERE word ILIKE something
... WHERE word ~* something
for case insensitive matching.
Should I create new columns with regexp_replaced words and do the mapping after that with some joins? Or maybe something with CamelCase and underscore matching? Maybe functions? Which is the optimal solution?
Either you already have the look-up table and forgot to mention it, or you should create one.
CREATE TABLE map (
map_id int PRIMARY KEY
,map text UNIQUE NOT NULL
);
-- And that's how I would shape your original table:
CREATE TABLE word (
word_id serial PRIMARY KEY
,map_id int REFERENCES map(map_id)
,word text
);
Here and here is why I normally prefer text
over varchar(n)
.
Here is why I put the two integer
columns first in table word
.
INSERT INTO map(map_id, map) VALUES
(1, softwaredesign)
,(2, civilengineering)
,(3, industrialdesign)
, ... ;
Define a function like @Daniel already suggested.
Depending on your exact requirements, I would use pre-defined character classes where possible in a regular expression instead of rolling my own. Postgres uses locale information from your OS to identify characters, digits etc.
[:alnum:]
includes all numbers and digits, depending on your locale. Unlike [a-zA-Z0-9]
this also identifies or
as characters - depending on your locale.
[^[:alnum:]]
is the negation, i.e. all other characters.\W
, but that would include the underscore _
.
CREATE FUNCTION map_word(text) RETURNS text AS
$$SELECT lower(regexp_replace($1, '[^[:alnum:]]', '', 'g'))$$
LANGUAGE sql IMMUTABLE;
This update would be most efficient then:
UPDATE word w
SET map_id = m.map_id
FROM map m
WHERE m.map = map_word(w.word)
AND w.map_id IS DISTINCT FROM m.map_id -- avoid empty updates
External links referenced by this document:
Local articles referenced by this article: