Database Administration
mariadb collation storage character-set string-manipulation
Updated Sun, 17 Jul 2022 15:07:32 GMT

Determine Ideal Collation Set for correct data storage


Alright so I have a MariaDB DB and I've just noticed some weird storage issues, like the following:

  • My JSON columns (which in MariaDB are LONGTEXT) have an automatically set collation of utf8mb4_bin. I've just noticed that this actually messes up all of my apostrophes, by storing them as ', also stores as \u00e9, etc.

  • Some other columns hold french / spanish / portuguese etc. letters, and display as indeed. The collation it uses is utf8mb4_unicode_ci.

  • Another column which holds strings which may include HTML markup uses utf8mb4_unicode_ci, and there are no issues with the markup whatsoever. But, it for some reason escapes single quotes, so it stores ' as \', which has to be avoided.

So my question is, which collation would you ideally use to not run into any issues with the storage of HTML markup, special characters as , , , , and not escape apostrophes, in MariaDB ?

UPDATE

Actually, I really don't get it why this happens:

  • In my sample DB table, I have two columns, A and B. The Character set of the table is utf8mb4, and no column has a character set specifically assigned.

  • The table uses the default collation utf8mb4_unicode_ci

  • Column A uses collation utf8mb4_unicode_ci

  • Column B used collation utf8mb4_bin

  • Column A correctly stores letters as , , etc.

  • Column B stores them using their unicode, e.g. \u00e9 for .

I've now changed column B's collation using:

ALTER TABLE sample_table MODIFY COLUMN column_b LONGTEXT COLLATE utf8mb4_unicode_ci.

So that column A and column B now use the exact same character set + collation. Data inserted into column_b is always a JSON_OBJECT.

Still, no matter the changes, only in column B but not in A, letters like are still stored in their unicode-encoded format, and apostrophes are also stored as '. Any special letters like are correctly retrieved when querying the data (if you query the data holding sth like l\u00e9ger, you correctly get lger).

BUT, if you query sth like l'\u00e9l\u00e9phant, you don't get l'lphant, but l'lphant.

I'm getting the data in PHP, and assuring that the character set of the connection is also utf8mb4, using mysqli_set_charset( $connection, 'utf8mb4' );.

I know I could theoretically simply code a search and replace on the data retrieved; but why the heck is this apostrophe conversion still happening?

UPDATE 2

Figured out a solution for the ' problem (see my posted answer), but now I'm trying to figure out a way for the other mentioned problem: When storing "Hello, I'm James" in a LONGTEXT data field, MariaDB stores Hello, I\'m James (it escapes the single quote in it, guess for safety reasons). Currently, when I then retrieve the data, I get

Hello, I\'m James

But I wanna get

Hello, I'm James

Even if the data is stored as

Hello, I\'m James

Of course you could again do search-replace stuff in PHP, but I just feel that there must be a standard approach in MariaDB, on the DB side, for this..?

UPDATE 3

Thanks to @Rick James' hint that no data character set conversion should occur in this case on the DB server side, I've been checking my server-side code, and indeed found the first problem concerning the unicode character issue (\u00e9 instead of , etc.). The reason was that, before inserting the JSON into my DB, I've converted in from a PHP array into a JSON string using json_encode. The problem is that this escapes my unicode characters as \uXXXX by default, which should not be done in this case. To avoid this, instead of:

json_encode( $data )

Use:

json_encode( $data, JSON_UNESCAPED_UNICODE )

So that solved everything related to the unicode encoding errors. The problem concerning the weird apostrophe encoding to ' / the escaping to \' still remains unsolved.

UPDATE 4

Ok, also found the source of the ' and the \' problem, they were caused by the same problem. It was because I sanitized the string data used for the insert with:

filter_var(
  $my_string,
  FILTER_SANITIZE_STRING
);

instead of:

filter_var(
  $my_string,
  FILTER_SANITIZE_STRING,
  FILTER_FLAG_NO_ENCODE_QUOTES
);

This never showed the ' encoding when you echo the string out, probably because it's converted back to a single quote when echoed, but that's only an assumption. And yeah, doing the same where \' instead of ' was stored also solved that problem. So I guess that's it.




Solution

Under no conditions will MySQL generate these 6 characters: \u00e9 from a single character.

', also stores as \u00e9

I think that happened in your client, not MySQL.

So that column A and column B now use the exact same character set + collation

There is no problem with different columns having diff charset and/or collation. A performance problem can occur when you compare columns with different collations (especially in JOIN...ON).

l'\u00e9l\u00e9phant, you don't get l'lphant, but l'lphant

That is inconclusive. Note that display products, especially HTML, will "clean up" things for you. To really see what is in the table, use SELECT HEX(col)...

'  -- "HTML entity"
\u00e9 -- "unicode" encoding

l'lphant encoded in UTF-8 and displayed in hex (with spaces added to separate characters):

Double encoding:  6C 27 C383C2A9 6C C383C2A9 70 68 61 6E 74  
UTF-8 encoding:   6C 27   C3A9   6C   C3A9   70 68 61 6E 74 
latin1 encoding:  6C 27    E9    6C    E9    70 68 61 6E 74
text:              l  '          l          p  h  a  n  t

I'm getting the data in PHP, and ...

But where is the data coming from? mysqli_set_charset is stating that it is utf8mb4-encoded, but is it really?

code a search and replace

If you rush into this, you could be making things worse. First let's find out what is really there, where it came from, etc.

I'm

That is proper in either of these string literals:

'I\'m'
"I\'m"

The language (PHP/MySQL/etc) will remove the backslash as it parses the string. But it is 'wrong' in other contexts.

it escapes the single quote

What escapes it?? prepare+execute? real_escape? addslashes? Something else? As implied above, you do need to escape it. But we need to know what did the escaping -- to avoid messing things up further.

Even if the data is stored as Hello, I'm James ...

You should not let it store that way. That just adds to the confusion later. Ditto for ' and \u00e9. The MySQL table should contain l'lphant. I repeat, the only way to see if that is what it stored is via SELECT HEX(col) .... And expect "6C 27 C3A9 6C C3A9 70 68 61 6E 74" (minus spaces).

A test:

mysql> INSERT INTO try_json (j) VALUES ('["I\'m"]');
mysql> INSERT INTO try_json (j) VALUES ('["l\'lphant"]');
mysql> SELECT j, HEX(j), JSON_EXTRACT(j, '$[0]'), HEX(JSON_EXTRACT(j, '$[0]')) FROM try_json;
+------------------+----------------------------------+-------------------------+------------------------------+
| j                | HEX(j)                           | JSON_EXTRACT(j, '$[0]') | HEX(JSON_EXTRACT(j, '$[0]')) |
+------------------+----------------------------------+-------------------------+------------------------------+
| ["I'm"]          | 5B2249276D225D                   | "I'm"                   | 2249276D22                   |
| ["l'lphant"]   | 5B226C27C3A96CC3A97068616E74225D | "l'lphant"            | 226C27C3A96CC3A97068616E7422 |
+------------------+----------------------------------+-------------------------+------------------------------+

Usually you want this; without it, you bet the \unnnn codes:

json_encode($a, JSON_UNESCAPED_UNICODE)

Use urlencode() when you are going to put the string in a URL. That may be where %7C comes from.

PHP's htmlentities() can generate things like < and é. That last one is equivalent to '

In MySQL 8.0, you may need this technique:

select cast(unhex('224D6173746572262333393B7322') as char);

which yields "Master's" (including the quotes).

PHP and its output:

echo "<pre>";
$s = '"Master's"'; // with html entity
echo strlen($s), ' ', $s, ' ', bin2hex($s), " s - with html entity \n";
$t = '"Master's"'; // backslash and apostrophe
echo strlen($t), ' ', $t, ' ', bin2hex($t), " t - with backslash and apostrophe \n";
echo "</pre>";

14 "Master's" 224d6173746572262333393b7322 s - with html entity
10 "Master's" 224d6173746572277322 t - with backslash and apostrophe





Comments (5)

  • +0 – always a pleasure to learn from your answers. So I kept debugging a bit and indeed found out the source of the unicode encoding characters added: It's at the moment of calling json_encode() in PHP. Weirdly, I can (and need to) json_decode( $data, true ) when I receive the input on the server, and this does not let's say "unicode-corrupt" the data. I'll see if I can find out why this happens, but let me know if you may know a potential reason. And yeah, for the reason of the &#39; conversion; no luck yet. — Dec 16, 2021 at 23:19  
  • +0 – ok unicode problem solved, see updates in my answer. Next stop: The apostrophe problem. — Dec 16, 2021 at 23:34  
  • +0 – But bad news concerning the apostrophe conversion to &#39;: When checking the according json before and after the related mysqli_stmt_execute(), I get the string with the completely normal apostrophe "'". But when it's stored, I get json with all apostrophes converted to &#39;. Any idea? I'll do the HEX thing for checkup purposes as you explained, one sec. — Dec 16, 2021 at 23:40  
  • +0 – When I do HEX( col ) on a field that holds the value Master's stored as Master&#39;s, I get 224D6173746572262333393B7322. Does that help you? And when I do HEX("Master's"), I get 4D61737465722773. Any idea? To me this means that the apostrophe is being converted to &#39; at the moment of its insertion, as any moment before the insertion in php echoes out the correct apostrophes. Any clue what could be the reason for the conversion?? — Dec 16, 2021 at 23:48  
  • +1 – I do mean before, not during, the INSERT. echo will work correctly if you are using a browser -- because the HTML will turn that into a '. Do not trust debugging with echo to a web page. — Dec 17, 2021 at 02:07  


External Links

External links referenced by this document: