I had some technical difficulties today and I wanted to thank 2019 Dagmar for helping out 2024 Dagmar. This is a repost (and update) of an answer I added to Quora back then.
Getting MySQL to play nicely with UTF-8 can be a challenge because there are so many places that you have to make sure the encoding is correct. You need to look at the following:
- The encoding of the database
- The encoding of the table
- The encoding of the column
- The encoding of the database connection
Then you also need to consider if the system can display the characters, so does the OS or browser have the correct fonts installed?
Whew. Let’s jump in.
Okay, encodings, but what is a collation?
Let’s use an analogy to explain the difference:
- A character set is the alphabet of available letters and symbols.
- A collation is the set of rules for how to sort and compare those letters.
A character set is the defined list of characters that the database knows – each character is assigned a unique number. Examples of character sets are latin1, utf8mb3 (also known as utf8) and utf8mb4. The difference between utf8mb3 and utf8mb4 is that the latter stores up to 4 bytes per character which allows it to store every character in the Unicode standard.
A collation works with a character set (it cannot exist in isolation) and it answers questions like “is A the same as a” and “is café the same as cafe”, and “does a come before Z”. Examples of collations are latin1_swedish_ci, utf8mb4_unicode_ci, or utf8mb4_0900_ai_ci.
Here’s a quick reference of the defaults used in MySQL:
| MySQL Version | Default Character Set | Default Collation |
|---|---|---|
| MySQL 5.7 | latin1 | latin1_swedish_ci |
| MySQL 8.0 | utf8mb4 | utf8mb4_0900_ai_ci |
Note: when I first wrote this article, the best practice was utf8mb4_unicode_ci but now it is utf8mb4_0900_ai_ci.
Here’s a breakdown of the key differences:
| Feature | utf8mb4_0900_ai_ci | utf8mb4_unicode_ci |
|---|---|---|
| MySQL Version | Recommended for MySQL 8.0+ | legacy, used in MySQL 5.7 and earlier |
| Unicode Standard | Based on Unicode 9.0.0 (more modern) | Based on Unicode 4.0.0 (older) |
| Accent Sensitivity | Accent-Insensitive (e.g., cafe = café) | Accent-Sensitive (e.g., cafe ≠ café) |
| Case Sensitivity | Case-Insensitive (e.g., A = a) | Case-Insensitive (e.g., A = a) |
| Performance | Generally faster and more optimised. | Slower than the new collations. |
Before we start, let’s check your current setup
Before you start changing things, it’s a good idea to see how your MySQL is currently configured. To do this, use this command:
mysql> SHOW VARIABLES WHERE variable_name LIKE 'character_set_%' OR variable_name LIKE 'collation%';
This will give you the character set and collation defaults for your database – what is used when they are not specified. The variables are grouped into four main categories: server defaults, database defaults, connection settings, and system internals.
1. Server defaults:
| Variable | What it controls | Normal setting (MySQL 8.0+) | Ideal setting (MySQL 5.7) |
|---|---|---|---|
| character_set_server | The default character set for any new database. | utf8mb4 | utf8mb4 |
| collation_server | The default collation for any new database. | utf8mb4_0900_ai_ci | utf8mb4_unicode_ci |
2. Database defaults:
| Variable | What it controls | Normal setting (MySQL 8.0+) | Ideal setting (MySQL 5.7) |
|---|---|---|---|
| character_set_database | The default for any new table in the current database. | utf8mb4 | utf8mb4 |
| collation_database | The default collation for any new table. | utf8mb4_0900_ai_ci | utf8mb4_unicode_ci |
3. Connection defaults:
| Variable | What it controls | Normal setting (MySQL 8.0+) | Ideal setting (MySQL 5.7) |
|---|---|---|---|
| character_set_client | The encoding of SQL queries you send to the server. | utf8mb4 | utf8mb4 |
| character_set_results | The encoding of results the server sends back to you. | utf8mb4 | utf8mb4 |
| character_set_connection | The encoding the server uses for calculations/comparisons. | utf8mb4 | utf8mb4 |
4. System internals, generally should not be modified:
| Variable | What it controls | Normal setting (MySQL 8.0+) | Ideal setting (MySQL 5.7) |
|---|---|---|---|
| character_set_system | Used for MySQL’s internal metadata. | utf8mb3 | utf8 |
| character_set_filesystem | Used for filenames (e.g., LOAD DATA INFILE). | binary | binary |
Generally, the out of the box settings for MySQL 8.0+ should be perfect and don’t require you to make any changes, but if you have a MySQL 5.7 server then you will need to modify these in order to get UTF-8 support.
Here is the ideal my.cnf configuration file for MySQL 5.7:
# This section applies to the standard MySQL command-line clients
[client]
default-character-set=utf8mb4
# This section applies specifically to the interactive MySQL command-line client
[mysql]
default-character-set=utf8mb4
# This section configures the MySQL Server (daemon) itself
[mysqld]
collation_server=utf8mb4_unicode_ci
character_set_server=utf8mb4
Note – if you change the above config then you must restart the MySQL server before they will be applied.
It is possible to modify the variables using your SQL client, which applies new variables from your next session until the server restarts. Use this syntax:
mysql> SET GLOBAL variable_name = 'value';
If you want to apply the variables just for your session (so when you logout the value you set will be reverted), remove the GLOBAL keyword, like so:
mysql> SET variable_name = 'value';
Now let’s get into more details
But important – use the right collation!
Please make sure that you use the right collation – the examples below use utf8mb4_unicode_ci but if you are using MySQL 8.0+ then you should probably be using utf8mb4_0900_ai_ci
1st check – encoding of the database
Firstly, you need a database or table that has been created with the correct encoding.
This is how to create a new database and specify the encoding:
mysql> CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
If you already have a database, you will need to check the encoding of your database by running this command:
mysql> show create database mydb;
Which should return output that looks like:
+--------------+-----------------------------------------------------------------------------------------------+
| Database | Create Database |
+--------------+-----------------------------------------------------------------------------------------------+
| mydb | CREATE DATABASE `mydb` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
+--------------+-----------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)
If you see the correct encoding, you database has created correctly and you now need to check the encoding of the tables and columns that contain characters.
2nd check – encoding of the table and columns
By default the tables and columns take on the encoding from the database and table definition, but because it’s possible to override them, you must check all.
This is how you can check the encoding for a table and columns:
mysql> show create table users;
Which should return output that looks like:
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| groups | CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`email` text COLLATE utf8mb4_unicode_ci,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=55 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0,00 sec)
It is OK if you only see an encoding set on the table and not on the individual columns, but if the encoding for the column you are inserting UTF8 data into is anything but utf8mb4_unicode_ci then you need to update your database.
How to update an existing database
If you find that your database, tables or columns have the wrong encoding, you need to update them, one by one.
Fixing the database:
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Fixing a table:
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Fixing a column:
ALTER TABLE users MODIFY username varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
There are handy scripts that you can use to upgrade your database & tables, which I found on this stackoverflow article.
Here is my version:
#!/bin/bash
set -e
printf "### Convert MySQL db encoding to utf8mb4/utf8mb4_unicode_ci ###\n\n"
# Get the MySQL username
printf "Enter mysql username: "
read -r USERNAME
# Get the MySQL password
printf "Enter mysql password for user %s: " "$USERNAME"
read -rs PASSWORD
printf "\n"
# Get the MySQL database
printf "Enter mysql database name: " "$DB"
read -r DB
(
echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
mysql "$DB" -u"$USERNAME" -p"$PASSWORD" -e "SHOW TABLES" --batch --skip-column-names \
| xargs -I{} echo 'SET foreign_key_checks = 0; ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
) \
| mysql "$DB" -u"$USERNAME" -p"$PASSWORD" \
&& echo "$DB database successfully updated ..."
exit
3rd check – encoding of the database connection
The next thing that can go wrong is that your connection to the database is using the incorrect encoding.
If you are connecting with a Java application, you may need to update your JDBC connection string to something like this:
jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=utf8mb4
If you are connecting via another client (e.g. MySQL Workbench) or application (e.g. your Python script), then you need to check that the encoding is not being set incorrectly by that client or tool.
4th check – the fonts
Whew! If you’re still with me, there’s one last gotcha – and that’s to do with fonts.
Fonts are used to display characters (letters, numbers, symbols and emojis). Each font supports a specific range. Operating Systems come installed with a set of fonts which may not include newer UTF-8 characters. Characters are being added to UTF-8. When a character is encoded in UTF-8 but the font does not support it you may see a question mark or a box. You may also see an encoding like `\U+1F90C`.
You may also find that some UTF-8 characters are displayed in one application but not another, which could be down to different fonts being used. For example 🤌 [pinching hand] displays in my browser but not in my terminal window on my Mac.
You need to ensure that the correct font is installed in your browser, application and on your computer or server.
Testing
OK!! Once you have done the above steps and verified that your database, table and column has the correct encoding and your connection supports UTF-8, you can now try to insert data.
If you would like to try an easy unicode character, we could use this:
mysql> update users set name = 'Unicode is cool д' where id=1;
To verify the insert happened correctly:
mysql> select name from users where id=1;
+-----------------------+
| name |
+-----------------------+
| Unicode is cool д |
+-----------------------+
1 row in set (0,07 sec)
Bonus! UTF-8 in MySQL Docker containers
This was a new world of pain for me today.
I was running MySQL 8 in a Docker container and was struggling to get UTF-8 characters to work in the MySQL command line.
The documentation for the standard mysql:8 docker image says you can set the character set and collation when you create the Docker container, like so:
$ docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
This didn’t solve the problem, and the character_set_client was still latin1 when I checked the MySQL variables for character set and encoding.
So I tried creating a new Docker container and setting a my.cnf configuration file when(as mentioned above)
$ docker run --name some-mysql -v /my/custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag
However, I still could not copy and paste UTF-8 characters into the command line:
mysql> INSERT INTO your_table (name) VALUES ('🤌');
Becomes
mysql> INSERT INTO your_table (name) VALUES ('');
I asked a question on Stackoverflow, but in the end I answered it myself.
The problem was two fold:
- The OS image used for the MySQL Docker containers does not have UTF-8 fonts installed
- The Docker container does not have the UTF-8 locale set
I created a custom MySQL 8 image using this Dockerfile, which used a Debian build to copy the systems fonts:
# Stage 1: Use Debian to install locales and fonts
FROM debian:latest as build
# Install locales and fonts
RUN apt-get update && \
apt-get install -y locales fonts-dejavu-core && \
locale-gen C.UTF-8 || true && \
dpkg-reconfigure --frontend=noninteractive locales && \
update-locale LANG=C.UTF-8
# Stage 2: Use official MySQL image as the base image
FROM mysql:8
# Set environment variables for UTF-8 locale
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
# Copy locales and fonts from the build stage
COPY --from=build /usr/share/fonts /usr/share/fonts
COPY --from=build /usr/lib/locale /usr/lib/locale
COPY --from=build /usr/bin/locale /usr/bin/locale
COPY --from=build /usr/sbin/locale-gen /usr/sbin/locale-gen
COPY --from=build /etc/locale.gen /etc/locale.gen
COPY --from=build /usr/bin/localedef /usr/bin/localedef
# Expose MySQL port
EXPOSE 3306
# Set default command
CMD ["mysqld"]
I built the custom Docker image with this command:
docker build -t custom-mysql:8 .
I created a new Docker container using this image:
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=root -p 9306:3306 -d custom-mysql:8
When I logged into the MySQL command line tool, I was finally able to copy and paste UTF-8 characters:
% docker exec -it some-mysql mysql -u root -proot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.33 MySQL Community Server - GPL
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> INSERT INTO your_table (name) VALUES ('🤌');
I hope my waffling can help someone else, but at least I can be pretty sure that future me will be grateful.

