Loading Half a Billion Rows Into MySQL
Background
We have a legacy system in our production environment that keeps track of when a user takes an action on Causes.com (joins a Cause, recruits a friend, etc.). I say legacy, but I really mean a prematurely-optimized system that I’d like to make less smart. This 500m-record database is split across monthly sharded tables. That seems like a great solution to scaling (and it is), except that we don’t need it. And based on our usage pattern (e.g. to count a user’s total number of actions, we need to query N tables), it leads to pretty severe performance degradation. Even with a memcache layer sitting in front of the old month tables, new features keep discovering new N-query performance problems. Noticing that we have another database happily chugging along with 900 million records, I decided to migrate the existing system into a single-table setup. The goals were:
- Reduce complexity. Querying one table is simpler than N tables.
- Push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
- Increase performance. This is also related to querying one table being simpler than querying N.
Alternative Proposed Solutions
MySQL Partitioning: This was the most similar to our existing setup, since MySQL internally stores the data in different tables. We decided against it because it seemed likely that it wouldn’t be much faster than our current solution (although MySQL can internally do some optimizations to make sure you only look at partitions that could possibly have the data you want). And it’s still the same complexity we were looking to reduce (it would also be the only database setup in our system using partitioning).
Redis: Not really proposed as an alternative, because the full dataset won’t fit into memory, but something we’re considering for a subset of the data, to answer queries we make a lot and that MySQL isn’t particularly good at (e.g. ‘which of my friends have taken an action’ is quick using Redis’s built-in set union). The new MySQL table might be performant enough that it doesn’t make sense to build a fast Redis version, so we’re avoiding this as possible premature optimization, especially with a technology we’re not as familiar with.
Dumping the old data
MySQL provides the mysqldump utility to allow quick dumping to disk.
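The exact flags aren’t critical; the important one is --tab, which writes a schema file plus a tab-separated data file for each table into the given directory (the user, password, and paths below are placeholders):

```bash
# --tab makes mysqldump write the data via SELECT ... INTO OUTFILE,
# so the target directory must be writable by the mysqld process
mysqldump --user=root --password --quick --single-transaction \
  --tab=/path/to/dump mydatabase
```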
This will produce a TSV file for each table in the database, and this is the format that LOAD DATA INFILE will be able to quickly load later on.
Installing Percona 5.5
We’ll be building the new system with the latest-and-greatest in Percona databases on CentOS 6.2:
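```bash
# roughly what this looks like, assuming Percona's yum repository is already
# configured on the box; the package names are the 5.5-era ones
yum install Percona-Server-shared-compat
yum install Percona-Server-server-55 Percona-Server-client-55
```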
[Open bug with the compat package: https://bugs.launchpad.net/percona-server/+bug/908620]
Specify a directory for the InnoDB data
This isn’t exactly a performance tip, but I had to do some digging to get MySQL to store data on a different partition. The first step is to make sure your my.cnf contains a datadir = /path/to/data directive. Then make sure /path/to/data is owned by mysql:mysql:
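```bash
# /path/to/data is the same placeholder path as in my.cnf
chown -R mysql:mysql /path/to/data
```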
and run:
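```bash
# a minimal invocation; --datadir should match the datadir in my.cnf
mysql_install_db --user=mysql --datadir=/path/to/data
```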
This will set up the directory structures that InnoDB uses to store data. This is also useful if you’re aborting a failed data load and want to wipe the slate clean (if you don’t specify a directory, /var/lib/mysql is used by default). Just rm -rf * the directory and run the mysql_install_db command.
[mysql_install_db reference: http://dev.mysql.com/doc/refman/5.5/en/mysql-install-db.html]
SQL Commands to Speed up the LOAD DATA
You can tell MySQL to not enforce foreign key and uniqueness constraints:
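```sql
SET foreign_key_checks = 0;
SET unique_checks = 0;
```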
and drop the transaction isolation guarantee to UNCOMMITTED:
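```sql
-- SESSION is enough if the load runs in a single connection; use GLOBAL otherwise
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
```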
and turn off the binlog with:
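```sql
-- sql_log_bin defaults to session scope, so this only affects the loading connection
SET sql_log_bin = 0;
```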
And when you’re done, don’t forget to turn it back on with:
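```sql
SET foreign_key_checks = 1;
SET unique_checks = 1;
SET sql_log_bin = 1;
```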
It’s worth noting that a lot of resources will tell you to use the “DISABLE KEYS” directive and have the indices all built once all the data has been loaded into the table. Unfortunately, InnoDB does not support this. I tried it, and while it took only a few hours to load 500m rows, the data was unusable without any indices. You could drop the indices completely and add them later, but with a table size this big I didn’t think it would help much.
Another red herring was turning off autocommit and committing after each LOAD DATA statement. This was effectively the same thing as autocommitting, and manually committing led to LOAD DATA slowdowns a quarter of the way in.
- http://dev.mysql.com/doc/refman/5.1/en/alter-table.html (search for ‘DISABLE KEYS’)
- http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
Performance adjustments made to my.cnf
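The original file isn’t reproduced here, but the InnoDB knobs that matter for a bulk load look something like the following; the values are illustrative, so size the buffer pool and log files for your own hardware:

```ini
[mysqld]
datadir                        = /path/to/data
innodb_buffer_pool_size        = 24G      # as much RAM as you can spare
innodb_log_file_size           = 1G       # bigger logs mean fewer checkpoints mid-load
innodb_log_buffer_size         = 64M
innodb_flush_log_at_trx_commit = 2        # don't fsync the log on every commit
innodb_doublewrite             = 0        # skip the doublewrite buffer during the load
innodb_flush_method            = O_DIRECT
innodb_file_per_table          = 1
```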
Use LOAD DATA INFILE
This is the most optimized path toward bulk loading structured data into MySQL.
The MySQL manual section 8.2.2.1, Speed of INSERT Statements, predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement). See also section 8.5.4, Bulk Data Loading for InnoDB Tables, for a few more tips.
Not only is it faster, but in my experience with this migration, the INSERT method slows down faster than it can load data and effectively never finishes (the last estimate I made was 60 days, and it was still slowing down).
The INFILE must be in the directory where InnoDB is storing that database’s information. If MySQL’s data lives in /var/lib/mysql, then mydatabase would be in /var/lib/mysql/mydatabase. If you don’t have access to that directory on the server, you can use LOAD DATA LOCAL INFILE. In my testing, putting the file in the proper place and using LOAD DATA INFILE increased load performance by about 20%.
[http://dev.mysql.com/doc/refman/5.5/en/load-data.html]
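For reference, the two forms look like this (table and file names are placeholders):

```sql
-- a bare filename is looked up in the default database's directory,
-- e.g. /var/lib/mysql/mydatabase/actions.tsv
LOAD DATA INFILE 'actions.tsv' INTO TABLE actions;

-- if the file lives on the client machine rather than the server:
LOAD DATA LOCAL INFILE '/home/me/actions.tsv' INTO TABLE actions;
```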
Perform your data transformation directly in MySQL
Our old actioncredit system was unique on (MONTH(created_at), id), but the new system is going to generate new autoincrementing IDs for each record as it’s loaded in chronological order. The problem was that my 50 GB of TSV data didn’t match up to the new schema, and the scripts I had that used Ruby to transform the old rows into new rows were laughably slow. I did some digging and found out that you can tell MySQL to (quickly) throw away the data you don’t want in the load statement itself, using parameter binding:
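```sql
-- 'actions' and the columns other than created_at/action are illustrative names;
-- @throwaway soaks up the TSV column we don't want to keep
LOAD DATA INFILE 'data.csv'
INTO TABLE actions
  (@throwaway, user_id, action, created_at);
```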
This statement tells MySQL which fields are represented in data.csv. @throwaway is a binding parameter, and in this case we want to discard it, so we’re not going to bind it. If we wanted to insert a prefix, we could execute:
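```sql
-- same illustrative table and columns as above; @action is bound and transformed
LOAD DATA INFILE 'data.csv'
INTO TABLE actions
  (@throwaway, user_id, @action, created_at)
  SET action = CONCAT('prefix', @action);
```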
and every loaded row’s `action’ column will begin with the string ‘prefix’.
Checking progress without disrupting the import
If you’re loading large data files and want to check the progress, you definitely don’t want to use `SELECT COUNT(*) FROM table’. This query will degrade as the size of the table grows and slow down the LOAD process. Instead you can query:
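```sql
-- a sketch using InnoDB's row-count estimate from information_schema;
-- it's approximate, but cheap, and doesn't touch the table itself
-- ('mydatabase' and 'actions' are placeholder names)
SELECT table_rows
FROM   information_schema.TABLES
WHERE  table_schema = 'mydatabase'
  AND  table_name   = 'actions';
```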
If you want to watch/log the progress over time, you can craft a quick shell command to poll the number of rows:
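```bash
# a sketch; substitute your own database, table, log file, and polling interval
while true; do
  mysql mydatabase -e \
    "SELECT table_rows FROM information_schema.TABLES WHERE table_name = 'actions'\G" \
    | tee -a file.log
  sleep 30
done
```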
The tee will echo to STDOUT as well as to file.log, the \G formats the columns in the result set as rows, and the sleep gives it a pause between polls.
LOAD DATA chunking script
I quickly discovered that throwing a 50m row TSV file at LOAD DATA was a good way to have performance degrade to the point of not finishing. I settled on using `split’ to chunk data into one million rows per file:
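```bash
#!/bin/bash
# load.sh -- a sketch of the chunk-and-load loop described above;
# database, table, column, and file names are all illustrative.

DB=mydatabase
TABLE=actions
DATADIR=/path/to/data/$DB   # bare filenames given to LOAD DATA INFILE resolve here

# split the dump into one-million-row chunks: chunk_aa, chunk_ab, ...
# (the target directory needs to be writable by you and readable by mysqld)
split -l 1000000 actions.tsv "$DATADIR/chunk_"

for f in "$DATADIR"/chunk_*; do
  echo "$(date) loading $(basename "$f")"
  # the SET statements are per-session, so they ride along with each LOAD DATA
  mysql "$DB" -e "
    SET foreign_key_checks = 0;
    SET unique_checks = 0;
    SET sql_log_bin = 0;
    LOAD DATA INFILE '$f' INTO TABLE $TABLE
      (@throwaway, user_id, action, created_at);"
done
```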
Wrap-up
Over the duration of this script, I saw chunk load time increase from 1m40s to around an hour per million inserts. That is, however, better than not finishing at all, which I wasn’t able to achieve until making all the changes suggested in this post and using the aforementioned `load.sh’ script. Other tips:
- use as few indices as you can
- loading the data in sequential order not only makes the loading faster, but also makes the resulting table faster to query
- if you can load any of the data from MySQL (instead of a flat file intermediary), it will be much faster. You can use the `INSERT INTO .. SELECT’ statement to copy data between tables quickly.
UPDATE: Since writing this article, I’ve found even faster ways to load this kind of data. You can read more in the follow-up, ‘Even faster: loading half a billion rows in MySQL revisited’.
Thanks for reading drafts of this to Greg and Lann, two of my super-smart coworkers at Causes. Check out causes.com/jobs if this sort of work interests you!