Adventures with UTF-8 encoding in a MySQL database

I had some technical difficulties today and I wanted to thank 2019 Dagmar for helping out 2024 Dagmar. This is a repost (and update) of an answer I added to Quora back then.

Getting MySQL to play nicely with UTF-8 can be a challenge because there are so many places that you have to make sure the encoding is correct. You need to look at the following:

  • The encoding of the database
  • The encoding of the table
  • The encoding of the column
  • The encoding of the database connection
  • OS fonts

Whew. Let’s jump in.

Encodings in depth

1. Encoding of the database

Firstly, you need a database or table that has been created with the correct encoding. MySQL 5.7 default encoding is latin1 but thankfully MySQL 8 (and later) is utf8mb4.

This is how to create a new database and specify the encoding:

mysql> CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

If you already have a database, you will need to check the encoding of your database by running this command:

mysql> show create database mydb; 

Which should return output that looks like:

+--------------+-----------------------------------------------------------------------------------------------+ 
| Database     | Create Database                                                                               | 
+--------------+-----------------------------------------------------------------------------------------------+ 
| mydb         | CREATE DATABASE `mydb` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ | 
+--------------+-----------------------------------------------------------------------------------------------+ 
1 row in set (0,01 sec) 

If you see the correct encoding, you database has created correctly and you now need to check the encoding of the tables and columns that contain characters (which needs to be done for each table in the database).

2. Encoding of the table and columns

You can now check the encoding of the table

mysql> show create table users; 

Which should return output that looks like:

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
| Table  | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                               | 
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
| groups | CREATE TABLE `users` ( 
  `id` int(11) NOT NULL AUTO_INCREMENT, 
  `name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 
  `email` text COLLATE utf8mb4_unicode_ci, 
  `created_at` datetime NOT NULL, 
  `updated_at` datetime NOT NULL, 
  PRIMARY KEY (`id`) 
) ENGINE=InnoDB AUTO_INCREMENT=55 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci | 
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 
1 row in set (0,00 sec) 

It is OK if you only see an encoding set on the table and not on the individual columns, but if the encoding for the column you are inserting UTF8 data into is anything but utf8mb4_unicode_ci then you need to upgrade your database.

3. Upgrading an existing database

If you find that your database, tables or columns have the wrong encoding, you need to update them, one by one.

Fixing the database:

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

Fixing a table:

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

Fixing a column:

ALTER TABLE users MODIFY username varchar(255) CHARACTER SET  utf8mb4 COLLATE utf8mb4_unicode_ci; 

There are handy scripts that you can use to upgrade your database & tables, which I found on this stackoverflow article.

http://stackoverflow.com/questions/6115612/how-to-convert-an-entire-mysql-database-characterset-and-collation-to-utf-8

Here is my version:

#!/bin/bash 
set -e 
printf "### Convert MySQL db encoding to utf8mb4/utf8mb4_unicode_ci ###\n\n" 
# Get the MySQL username 
printf "Enter mysql username: " 
read -r USERNAME 
# Get the MySQL password 
printf "Enter mysql password for user %s: " "$USERNAME" 
read -rs PASSWORD 
printf "\n" 
# Get the MySQL database 
printf "Enter mysql database name: " "$DB" 
read -r DB 
( 
  echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;' 
  mysql "$DB" -u"$USERNAME" -p"$PASSWORD" -e "SHOW TABLES" --batch --skip-column-names \ 
  | xargs -I{} echo 'SET foreign_key_checks = 0; ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;' 
) \ 
| mysql "$DB" -u"$USERNAME" -p"$PASSWORD" \ 
&& echo "$DB database successfully updated ..." 
exit 

4. Encoding of the database connection

The next thing that can go wrong is that your connection to the database is using the incorrect encoding.

Generally, since the upgrade to MySQL 8, this should no longer be required, but you may need to update your connection string to something like this:

jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=utf8 

If you are connecting via the MySQL command line, then you will need to set some variables. You can either do that for the session by modifying them in the MySQL command line, or for your whole database in your MySQL configuration files.

I used the following settings in the my.cnf settings file to get my MySQL client to work for UTF-8 characters:

[client] 
default-character-set = utf8mb4 
 
[mysql] 
default-character-set = utf8mb4 
 
[mysqld] 
collation_server=utf8mb4_unicode_ci 
character_set_server=utf8mb4

5. Handling OS Fonts

Whew! If you’re still with me, there’s one last gotcha – and that’s to do with fonts.

Fonts are used to display characters (letters, numbers, symbols and emojis). Each font supports a specific range. Operating Systems come installed with a set of fonts which may not include newer UTF-8 characters. Characters are being added to UTF-8. When a character is encoded in UTF-8 but the font does not support it you may see a question mark or a box. You may also see an encoding like `\U+1F90C`.

You may also find that some UTF-8 characters are displayed in one application but not another, which could be down to different fonts being used. For example 🤌 [pinching hand] displays in my browser but not in my terminal window on my Mac.

Testing and Troubleshooting

OK!! Once you have done the above steps and verified that your database, table and column has the correct encoding and your connection supports UTF-8, you can now try to insert data.

If you would like to try another easier unicode character, we could use this:

mysql> update users set name = 'Unicode is cool д' where id=1; 

To verify the insert happened correctly:

mysql> select name from users where id=1; 
+-----------------------+ 
| name                  | 
+-----------------------+ 
| Unicode is cool д   | 
+-----------------------+ 
1 row in set (0,07 sec) 

If you’re still struggling, this command is pretty handy to verify the various encoding related variables:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%'; 
+--------------------------+--------------------+ 
| Variable_name            | Value              | 
+--------------------------+--------------------+ 
| character_set_client     | utf8mb4            | 
| character_set_connection | utf8mb4            | 
| character_set_database   | utf8               | 
| character_set_filesystem | binary             | 
| character_set_results    | utf8mb4            | 
| character_set_server     | utf8mb4            | 
| character_set_system     | utf8               | 
| collation_connection     | utf8mb4_unicode_ci | 
| collation_database       | utf8_unicode_ci    | 
| collation_server         | utf8mb4_unicode_ci | 
+--------------------------+--------------------+ 
10 rows in set (0,05 sec) 

You can use the following syntax to set the above variables directly in your MySQL command line client (or before executing a statement from your application even):

globally:

mysql> set global character_set_client='utf8mb4'; 

or just for the session (so when you logout the value you set will be reverted):

mysql> set character_set_client='utf8mb4'; 

UTF-8 in MySQL Docker containers

This was a new world of pain for me today.

I am running MySQL 8 in a Docker container and I was struggling to get UTF-8 characters to work in the MySQL command line.

The documentation for the standard mysql:8 docker image says you can set the character set and collation when you create the Docker container, like so:

$ docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

This didn’t solve the problem, and the character_set_client was still latin1 when I checked the MySQL variables for character set and encoding.

So I tried creating a new Docker container and setting a my.cnf configuration file when(as mentioned above)

$ docker run --name some-mysql -v /my/custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag

However, I still could not copy and paste UTF-8 characters into the command line:

mysql> INSERT INTO your_table (name) VALUES ('🤌');

Becomes

mysql> INSERT INTO your_table (name) VALUES ('');

I asked a question on Stackoverflow, but in the end I answered it myself.

The problem was two fold:

  1. The OS image used for the MySQL Docker containers does not have UTF-8 fonts installed
  2. The Docker container does not have the UTF-8 locale set

I created a custom MySQL 8 image using this Dockerfile, which used a Debian build to copy the systems fonts:

# Stage 1: Use Debian to install locales and fonts
FROM debian:latest as build

# Install locales and fonts
RUN apt-get update && \
    apt-get install -y locales fonts-dejavu-core && \
    locale-gen C.UTF-8 || true && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    update-locale LANG=C.UTF-8

# Stage 2: Use official MySQL image as the base image
FROM mysql:8

# Set environment variables for UTF-8 locale
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

# Copy locales and fonts from the build stage
COPY --from=build /usr/share/fonts /usr/share/fonts
COPY --from=build /usr/lib/locale /usr/lib/locale
COPY --from=build /usr/bin/locale /usr/bin/locale
COPY --from=build /usr/sbin/locale-gen /usr/sbin/locale-gen
COPY --from=build /etc/locale.gen /etc/locale.gen
COPY --from=build /usr/bin/localedef /usr/bin/localedef

# Expose MySQL port
EXPOSE 3306

# Set default command
CMD ["mysqld"]

I built the custom Docker image with this command:

docker build -t custom-mysql:8 .

I created a new Docker container using this image:

docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=root -p 9306:3306 -d custom-mysql:8

When I logged into the MySQL command line tool, I was finally able to copy and paste UTF-8 characters:

% docker exec -it some-mysql mysql -u root -proot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.33 MySQL Community Server - GPL

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> INSERT INTO your_table (name) VALUES ('🤌');

I hope my waffling can help someone else, but at least I can be pretty sure that future me will be grateful.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.