Adventures with UTF-8 encoding in a MySQL database

I had some technical difficulties today and I wanted to thank 2019 Dagmar for helping out 2024 Dagmar. This is a repost (and update) of an answer I added to Quora back then.

Getting MySQL to play nicely with UTF-8 can be a challenge because there are so many places that you have to make sure the encoding is correct. You need to look at the following:

  • The encoding of the database
  • The encoding of the table
  • The encoding of the column
  • The encoding of the database connection

Then you also need to consider if the system can display the characters, so does the OS or browser have the correct fonts installed?

Whew. Let’s jump in.

Okay, encodings, but what is a collation?

Let’s use an analogy to explain the difference:

  • character set is the alphabet of available letters and symbols.
  • collation is the set of rules for how to sort and compare those letters.

A character set is the defined list of characters that the database knows – each character is assigned a unique number. Examples of character sets are latin1, utf8mb3 (also known as utf8) and utf8mb4. The difference between utf8mb3 and utf8mb4 is that the latter stores up to 4 bytes per character which allows it to store every character in the Unicode standard.

A collation works with a character set (it cannot exist in isolation) and it answers questions like “is A the same as a” and “is café the same as cafe”, and “does a come before Z”. Examples of collations are latin1_swedish_ci, utf8mb4_unicode_ci, or utf8mb4_0900_ai_ci.

Here’s a quick reference of the defaults used in MySQL:

MySQL VersionDefault Character SetDefault Collation
MySQL 5.7latin1latin1_swedish_ci
MySQL 8.0utf8mb4utf8mb4_0900_ai_ci

Note: when I first wrote this article, the best practice was utf8mb4_unicode_ci but now it is utf8mb4_0900_ai_ci.

Here’s a breakdown of the key differences:

Featureutf8mb4_0900_ai_ciutf8mb4_unicode_ci
MySQL VersionRecommended for MySQL 8.0+legacy, used in MySQL 5.7 and earlier
Unicode StandardBased on Unicode 9.0.0 (more modern)Based on Unicode 4.0.0 (older)
Accent SensitivityAccent-Insensitive (e.g., cafe = café)Accent-Sensitive (e.g., cafe ≠ café)
Case SensitivityCase-Insensitive (e.g., A = a)Case-Insensitive (e.g., A = a)
PerformanceGenerally faster and more optimised.Slower than the new collations.

Before we start, let’s check your current setup

Before you start changing things, it’s a good idea to see how your MySQL is currently configured. To do this, use this command:

mysql> SHOW VARIABLES WHERE variable_name LIKE 'character_set_%' OR variable_name LIKE 'collation%';

This will give you the character set and collation defaults for your database – what is used when they are not specified. The variables are grouped into four main categories: server defaults, database defaults, connection settings, and system internals.

1. Server defaults:

VariableWhat it controlsNormal setting (MySQL 8.0+)Ideal setting (MySQL 5.7)
character_set_serverThe default character set for any new database.utf8mb4utf8mb4
collation_serverThe default collation for any new database.utf8mb4_0900_ai_ciutf8mb4_unicode_ci

2. Database defaults:

VariableWhat it controlsNormal setting (MySQL 8.0+)Ideal setting (MySQL 5.7)
character_set_databaseThe default for any new table in the current database.utf8mb4utf8mb4
collation_databaseThe default collation for any new table.utf8mb4_0900_ai_ciutf8mb4_unicode_ci

3. Connection defaults:

VariableWhat it controlsNormal setting (MySQL 8.0+)Ideal setting (MySQL 5.7)
character_set_clientThe encoding of SQL queries you send to the server.utf8mb4utf8mb4
character_set_resultsThe encoding of results the server sends back to you.utf8mb4utf8mb4
character_set_connectionThe encoding the server uses for calculations/comparisons.utf8mb4utf8mb4

4. System internals, generally should not be modified:

VariableWhat it controlsNormal setting (MySQL 8.0+)Ideal setting (MySQL 5.7)
character_set_systemUsed for MySQL’s internal metadata.utf8mb3utf8
character_set_filesystemUsed for filenames (e.g., LOAD DATA INFILE).binarybinary

Generally, the out of the box settings for MySQL 8.0+ should be perfect and don’t require you to make any changes, but if you have a MySQL 5.7 server then you will need to modify these in order to get UTF-8 support.

Here is the ideal my.cnf configuration file for MySQL 5.7:

# This section applies to the standard MySQL command-line clients
[client]
default-character-set=utf8mb4
 
# This section applies specifically to the interactive MySQL command-line client
[mysql]
default-character-set=utf8mb4

# This section configures the MySQL Server (daemon) itself
[mysqld]
collation_server=utf8mb4_unicode_ci
character_set_server=utf8mb4

Note – if you change the above config then you must restart the MySQL server before they will be applied.

It is possible to modify the variables using your SQL client, which applies new variables from your next session until the server restarts. Use this syntax:

mysql> SET GLOBAL variable_name = 'value';

If you want to apply the variables just for your session (so when you logout the value you set will be reverted), remove the GLOBAL keyword, like so:

mysql> SET variable_name = 'value';

Now let’s get into more details

But important – use the right collation!

Please make sure that you use the right collation – the examples below use utf8mb4_unicode_ci but if you are using MySQL 8.0+ then you should probably be using utf8mb4_0900_ai_ci

1st check – encoding of the database

Firstly, you need a database or table that has been created with the correct encoding.

This is how to create a new database and specify the encoding:

mysql> CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

If you already have a database, you will need to check the encoding of your database by running this command:

mysql> show create database mydb; 

Which should return output that looks like:

+--------------+-----------------------------------------------------------------------------------------------+ 
| Database     | Create Database                                                                               | 
+--------------+-----------------------------------------------------------------------------------------------+ 
| mydb         | CREATE DATABASE `mydb` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ | 
+--------------+-----------------------------------------------------------------------------------------------+ 
1 row in set (0,01 sec) 

If you see the correct encoding, you database has created correctly and you now need to check the encoding of the tables and columns that contain characters.

2nd check – encoding of the table and columns

By default the tables and columns take on the encoding from the database and table definition, but because it’s possible to override them, you must check all.

This is how you can check the encoding for a table and columns:

mysql> show create table users; 

Which should return output that looks like:

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
| Table  | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                               | 
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
| groups | CREATE TABLE `users` ( 
  `id` int(11) NOT NULL AUTO_INCREMENT, 
  `name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 
  `email` text COLLATE utf8mb4_unicode_ci, 
  `created_at` datetime NOT NULL, 
  `updated_at` datetime NOT NULL, 
  PRIMARY KEY (`id`) 
) ENGINE=InnoDB AUTO_INCREMENT=55 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci | 
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
1 row in set (0,00 sec) 

It is OK if you only see an encoding set on the table and not on the individual columns, but if the encoding for the column you are inserting UTF8 data into is anything but utf8mb4_unicode_ci then you need to update your database.

How to update an existing database

If you find that your database, tables or columns have the wrong encoding, you need to update them, one by one.

Fixing the database:

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

Fixing a table:

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

Fixing a column:

ALTER TABLE users MODIFY username varchar(255) CHARACTER SET  utf8mb4 COLLATE utf8mb4_unicode_ci; 

There are handy scripts that you can use to upgrade your database & tables, which I found on this stackoverflow article.

http://stackoverflow.com/questions/6115612/how-to-convert-an-entire-mysql-database-characterset-and-collation-to-utf-8

Here is my version:

#!/bin/bash 
set -e 
printf "### Convert MySQL db encoding to utf8mb4/utf8mb4_unicode_ci ###\n\n" 
# Get the MySQL username 
printf "Enter mysql username: " 
read -r USERNAME 
# Get the MySQL password 
printf "Enter mysql password for user %s: " "$USERNAME" 
read -rs PASSWORD 
printf "\n" 
# Get the MySQL database 
printf "Enter mysql database name: " "$DB" 
read -r DB 
( 
  echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;' 
  mysql "$DB" -u"$USERNAME" -p"$PASSWORD" -e "SHOW TABLES" --batch --skip-column-names \ 
  | xargs -I{} echo 'SET foreign_key_checks = 0; ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;' 
) \ 
| mysql "$DB" -u"$USERNAME" -p"$PASSWORD" \ 
&& echo "$DB database successfully updated ..." 
exit 

3rd check – encoding of the database connection

The next thing that can go wrong is that your connection to the database is using the incorrect encoding.

If you are connecting with a Java application, you may need to update your JDBC connection string to something like this:

jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=utf8mb4 

If you are connecting via another client (e.g. MySQL Workbench) or application (e.g. your Python script), then you need to check that the encoding is not being set incorrectly by that client or tool.

4th check – the fonts

Whew! If you’re still with me, there’s one last gotcha – and that’s to do with fonts.

Fonts are used to display characters (letters, numbers, symbols and emojis). Each font supports a specific range. Operating Systems come installed with a set of fonts which may not include newer UTF-8 characters. Characters are being added to UTF-8. When a character is encoded in UTF-8 but the font does not support it you may see a question mark or a box. You may also see an encoding like `\U+1F90C`.

You may also find that some UTF-8 characters are displayed in one application but not another, which could be down to different fonts being used. For example 🤌 [pinching hand] displays in my browser but not in my terminal window on my Mac.

You need to ensure that the correct font is installed in your browser, application and on your computer or server.

Testing

OK!! Once you have done the above steps and verified that your database, table and column has the correct encoding and your connection supports UTF-8, you can now try to insert data.

If you would like to try an easy unicode character, we could use this:

mysql> update users set name = 'Unicode is cool д' where id=1; 

To verify the insert happened correctly:

mysql> select name from users where id=1; 
+-----------------------+ 
| name                  | 
+-----------------------+ 
| Unicode is cool д   | 
+-----------------------+ 
1 row in set (0,07 sec) 

Bonus! UTF-8 in MySQL Docker containers

This was a new world of pain for me today.

I was running MySQL 8 in a Docker container and was struggling to get UTF-8 characters to work in the MySQL command line.

The documentation for the standard mysql:8 docker image says you can set the character set and collation when you create the Docker container, like so:

$ docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

This didn’t solve the problem, and the character_set_client was still latin1 when I checked the MySQL variables for character set and encoding.

So I tried creating a new Docker container and setting a my.cnf configuration file when(as mentioned above)

$ docker run --name some-mysql -v /my/custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag

However, I still could not copy and paste UTF-8 characters into the command line:

mysql> INSERT INTO your_table (name) VALUES ('🤌');

Becomes

mysql> INSERT INTO your_table (name) VALUES ('');

I asked a question on Stackoverflow, but in the end I answered it myself.

The problem was two fold:

  1. The OS image used for the MySQL Docker containers does not have UTF-8 fonts installed
  2. The Docker container does not have the UTF-8 locale set

I created a custom MySQL 8 image using this Dockerfile, which used a Debian build to copy the systems fonts:

# Stage 1: Use Debian to install locales and fonts
FROM debian:latest as build

# Install locales and fonts
RUN apt-get update && \
    apt-get install -y locales fonts-dejavu-core && \
    locale-gen C.UTF-8 || true && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    update-locale LANG=C.UTF-8

# Stage 2: Use official MySQL image as the base image
FROM mysql:8

# Set environment variables for UTF-8 locale
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

# Copy locales and fonts from the build stage
COPY --from=build /usr/share/fonts /usr/share/fonts
COPY --from=build /usr/lib/locale /usr/lib/locale
COPY --from=build /usr/bin/locale /usr/bin/locale
COPY --from=build /usr/sbin/locale-gen /usr/sbin/locale-gen
COPY --from=build /etc/locale.gen /etc/locale.gen
COPY --from=build /usr/bin/localedef /usr/bin/localedef

# Expose MySQL port
EXPOSE 3306

# Set default command
CMD ["mysqld"]

I built the custom Docker image with this command:

docker build -t custom-mysql:8 .

I created a new Docker container using this image:

docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=root -p 9306:3306 -d custom-mysql:8

When I logged into the MySQL command line tool, I was finally able to copy and paste UTF-8 characters:

% docker exec -it some-mysql mysql -u root -proot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.33 MySQL Community Server - GPL

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> INSERT INTO your_table (name) VALUES ('🤌');

I hope my waffling can help someone else, but at least I can be pretty sure that future me will be grateful.

Prototype using the web APIs for speech

This is a test that will demonstrate how the standard web APIs can be used to recognise voice from audio input and to read out text.

Speech Recognition API test

Speech Synthesis API test

Read more here:

  1. SpeechSynthesis API documentation
  2. SpeechRecognition API documentation
  3. Can I use: SpeechRecognition API

Code

This code was generated for me by ChatGPT. Ironically I believe the SpeechSynthesis code is actually my own StackOverflow code from when I was creating my Google Docs Add-on Ready Steady Spell.

<!DOCTYPE html>
<html>
<head>
  <title>Speech Recognition and Synthesis</title>
  <style>
    button {
      font-size: 18px;
    }
    button:disabled {
      opacity: 0.5;
      cursor: not-allowed;
    }
  </style>
</head>
<body>
  <h1>Prototype using the web APIs for speech</h1>
  <p>This is a test that will demonstrate how the standard webkit APIs can be used to recognise voice from audio input,
 and to read out text.</p>
  <ul>
    <li><a href="https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis">SpeechSynthesis API documentation</a></li>
    <li><a href="https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition">SpeechRecognition API documentation </a></li>
    <li><a href="https://caniuse.com/speech-recognition">Can I use: SpeechRecognition API</a></li>
  </ul>

  <h2>Speech Recognition API test</h2>
  <button id="startButton">Start Recording</button>
  <button id="stopButton" disabled>Stop Recording</button>
  <div id="result" style="margin-top: 20px;font-size: 18px;"></div>

  <h2>Speech Synthesis API test</h2>
  <form id="textToSpeechForm">
    <label for="textToSpeechInput">Text:</label>
    <textarea id="textToSpeechInput"></textarea>
    <label for="voiceSelect">Voice:</label>
    <select id="voiceSelect"></select>
    <button type="submit">Speak</button>
  </form>

  <script>
    // Check browser support for SpeechRecognition and SpeechSynthesis APIs
    if ('SpeechRecognition' in window || 'webkitSpeechRecognition' in window) {
      const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
      const recognition = new SpeechRecognition();
      const synthesis = window.speechSynthesis;
      const resultDiv = document.getElementById('result');
      const textToSpeechForm = document.getElementById('textToSpeechForm');
      const textToSpeechInput = document.getElementById('textToSpeechInput');
      const voiceSelect = document.getElementById('voiceSelect');
      let finalTranscript = '';

      recognition.continuous = true;

      // Event handler for when speech is recognized
      recognition.onresult = function(event) {
        let interimTranscript = '';
        for (let i = event.resultIndex; i < event.results.length; i++) {
          const transcript = event.results[i][0].transcript;
          if (event.results[i].isFinal) {
            finalTranscript += transcript;
          } else {
            interimTranscript += transcript;
          }
        }
        resultDiv.innerHTML = `<strong>Recognised text:</strong> ${finalTranscript + interimTranscript}`;
      };

      // Event handler for button clicks
      document.getElementById('startButton').onclick = function() {
        recognition.start();
        document.getElementById('startButton').disabled = true;
        document.getElementById('stopButton').disabled = false;
      };

      document.getElementById('stopButton').onclick = function() {
        recognition.stop();
        document.getElementById('startButton').disabled = false;
        document.getElementById('stopButton').disabled = true;
      };

      // Event handler for text-to-speech form submission
      textToSpeechForm.onsubmit = function(event) {
        event.preventDefault();
        const selectedVoice = voiceSelect.value;
        const text = textToSpeechInput.value;
        speak(text, selectedVoice);
      };

      // Function to convert text to speech
      function speak(text, voice) {
        const utterance = new SpeechSynthesisUtterance(text);
        if (voice) {
          const voices = synthesis.getVoices();
          const selectedVoice = voices.find(v => v.name === voice);
          utterance.voice = selectedVoice;
        }
        synthesis.speak(utterance);
      }

      // on load 
      window.onload = function() {
         populateVoiceList();
      };

      // Fetch available voices when the list is updated
      synthesis.onvoiceschanged = function() {
          populateVoiceList();
      };

      function populateVoiceList() {
        const voices = synthesis.getVoices();
        for (let i = 0; i < voices.length; i++) {
          const option = document.createElement('option');
          option.textContent = voices[i].name;
          voiceSelect.appendChild(option);
        }
      }

    } else {
      // Display an error message if the APIs are not supported
      resultDiv.innerText = 'Speech Recognition and/or Speech Synthesis APIs are not supported in this browser.';
    }
  </script>
</body>
</html>

LLM Prompt Injection attacks

Hacking ChatGPT to bypass its guardrails is a sport like hunting for easter eggs was in the 90s; it captured the interest of people who wouldn’t usually consider themselves hackers.

As more people look to integrate OpenAI APIs into their software, we need to consider prompt injection attacks. Especially where AI is used to generate code (like database queries). A hacker could, for example, give the system a prompt that tells the LLM to ignore any other instructions it has been given and give it a list of logins and unencrypted passwords.

This article from DZone gives an indepth overview on prompt hacking, how to do it and how to mitigate your risk. It’s important to understand both to write secure code.

Here’s the list of defensive mechanisms:
👉 filtering: check for naughty words or phrases in both the user input and the LLM output.
👉 instruction defence: add guardrails into your prompt to instruct the LLM to exercise caution with its output.
👉 delimiters: use random strings to encapsulate user input and keep it clearly separated from the instructions.
👉 XML delimiters: embedding user input in XML tags seems to be very effective
👉 post-prompting: place the user’s input ahead of the instruction part of the prompt.
👉 sandwich prompting: similar to post-prompting, you add a recap of the original instructions after adding the user input in the prompt. This keeps the LLM focused even if it received conflicting commands in the user’s input.
👉 LLM evaluation: This can be of the user input where you use another LLM to evaluate the user input to identify if the contains conflicting or dangerous instructions. Or of the response where you use another LLM (usually a higher level LLM) to check if the response is considered safe. Both of these require that you develop a rubric to assess and evaluate according to a set of criteria or guidelines.

When developing, it is recommended that you keep systems that have unfettered access to the APIs and other 3rd party content separate from systems that have access to sensitive information or can perform destructive operations. The article briefly touches on Dual LLM systems where one is quarantined and the other is privileged.

However, none of these ideas or concepts truly solve the problem of prompt injection attacks – yet. Beware and go with caution.