How to import big spreadsheets
My adventure with importing big spreadsheets started in 2021. A client asked me to import a 30GB CSV file into a MySQL database. Compressed in a zip file it was still 5GB. Getting the file from the client to the server was the first hurdle. Unzipping the file and importing it was the second. It turned out that the hard disk wasn’t big enough to hold the unzipped file and the imported data at the same time. And then there was a problem with the line endings.
This post covers the lessons I learned from this adventure.
When you get a request like this, you first have to think about ways to reduce the amount of data. It’s just too much to handle. We managed to import it, but it took a lot of time. Luckily, the year after, a large part of the data could be retrieved from an API, but there is still a 5GB CSV file we have to import every year.
My first approach (it didn’t work)
PHPOffice/PHPSpreadsheet is my favorite library for working with spreadsheets. Reading and writing spreadsheets, no matter the format, is easy with this library. But it’s not the best solution for big spreadsheets. Its memory usage is high. Using setReadDataOnly(true) and iterators helps, but it’s still time-consuming and memory-intensive.
Removing PHP from the equation and using MySQL directly is a faster solution. Sadly, the default MySQL configuration doesn’t allow it. The secure_file_priv option should be set to a directory the MySQL server can read from. This is a security measure to prevent MySQL from importing files from anywhere on the server. When you’re not in a position to change this setting, you can’t use this approach.
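To see what your server allows, you can check the current value first (an empty value means no restriction, a directory name restricts imports to that directory, and NULL means file imports are disabled):
SHOW VARIABLES LIKE 'secure_file_priv';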
Add secure_file_priv to the my.cnf file on your server and restart MySQL:
# my.cnf
[mysqld]
secure_file_priv = '/path/to/directory/with/CSV_files'
Enforcing business rules is also a problem. When you use PHP, you can validate the data before inserting it into the database. This is not possible when you use MySQL directly. You can only validate the data after it’s inserted.
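A sketch of what that post-import validation can look like (my_table and column1 are just the placeholder names used in this post, and the NULL/empty check stands in for whatever business rules you have):
-- Hypothetical rule: column1 must not be empty
SELECT COUNT(*) FROM my_table WHERE column1 IS NULL OR column1 = '';
-- Or remove the offending rows, if that fits your workflow
DELETE FROM my_table WHERE column1 IS NULL OR column1 = '';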
MySQL can import CSV files
LOAD DATA (docs) is a MySQL command that can parse CSV files and insert the rows into a table. By skipping the PHP part and using MySQL directly, we saved a lot of time and avoided the memory problems.
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
(column1, column2, column3);
Easy, right? But there are some things to keep in mind.
How to handle a big file?
Never “open” the file like you would do with a small file. Everything takes longer to load and memory limits aren’t far away. So vi my-big-spreadsheet.csv is not an option.
To get a quick overview of the file you can use head and tail. These commands show the first and last lines of the file. You can use the -n flag to set the number of lines. For example, head -n 10 my-big-spreadsheet.csv shows the first 10 lines of the file.
You can split the CSV file into multiple files. With XML-based formats like ODS and XLSX you can’t do that. You can use the split command to split the file.
split -d -l 1000000 my-big-spreadsheet.csv my-big-spreadsheet-
This command splits the file into multiple files with 1 million lines each (the -d flag gives numeric suffixes). The files are named my-big-spreadsheet-00, my-big-spreadsheet-01, etc. The last argument is the prefix for the file names. Keep line endings in mind. More on that later.
Testing
Never take a 5GB file to test your query. Killing a query takes a lot of time. Test your query and the complete workflow with a subset of the data. You can take, for example, the first 3000 lines: head -n 3000 my-big-spreadsheet.csv > my-sample.csv and start with that. Next, take the first million lines to do some stress testing.
But if you do have to kill a query, use SHOW PROCESSLIST to see which queries are running and their process ids, and then use KILL QUERY <id> to kill the query.
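For example (42 stands in for whatever id SHOW PROCESSLIST reports for your import):
SHOW PROCESSLIST;
KILL QUERY 42;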
Skip the header row
Easy peasy. Just add IGNORE 1 LINES to the query.
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
IGNORE 1 LINES
(column1, column2, column3);
Line endings, column separator, and enclosing character
Let’s make it a bit more complicated. After importing a CSV file I noticed that only half of the rows were loaded. It turned out that the file used Windows line endings (\r\n). MySQL ran on a Linux server, so it expected Linux line endings (\n), not Windows line endings.
With LINES TERMINATED BY '\r\n' you can tell MySQL which line endings to use. In this case, Windows line endings.
But how do you determine which line endings are used? With bash, you can use cat -e. The -e flag tells cat to show end-of-line characters and non-printing characters.
echo "line 1\r\nline 2\n" | cat -e
# Output:
# line 1^M$
# line 2$
Line 1 ends with a Windows line ending, which consists of a carriage return \r (shown as ^M) and a newline \n (shown as $). And line 2 ends with a Linux line ending, shown as $.
For a 30GB CSV file, you don’t have to check the whole file. Line endings are usually consistent, so checking the first couple of lines is enough.
This command loads the first 10 lines of the file and shows the line endings:
head -n 10 my-big-spreadsheet.csv | cat -e
You can do the same with tail to check the last lines of the file: tail -n 10 my-big-spreadsheet.csv | cat -e.
Here is the query again, now with the correct line endings, and also the column separator and enclosing character.
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
COLUMNS TERMINATED BY ';'
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(column1, column2, column3);
Counting rows
After importing the file, you want to verify that every row made it into the table. With wc -l my-big-spreadsheet.csv you get the number of lines in the file; minus the header row you skipped, that’s the number of rows you expect in the table. This command can handle big files quite fast. It of course assumes that every line is a row. When that’s not the case you might run into problems with the query. But that’s a problem for another day.
Now you know the number of rows that should be in the table. You can use SELECT COUNT(*) FROM my_table for that, but that takes a lot of time on a big table. When you use an auto-incrementing primary key you can use SELECT MAX(id) FROM my_table to get the number of rows. That is a lot faster.
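Side by side, with the caveat that MAX(id) only equals the row count when the ids start at 1 and have no gaps:
-- Exact, but scans every row
SELECT COUNT(*) FROM my_table;
-- Fast, reads only the end of the primary key index
SELECT MAX(id) FROM my_table;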
Bonus tip: is the table empty?
I need an empty table before every import, so before each import I have to verify that the table is actually empty.
You can count the rows (see above), but when you don’t have an auto-incrementing primary key it’s time-consuming. By using the LIMIT clause you can write a fast query:
SELECT * FROM my_table LIMIT 1;
When the table is empty, the query returns nothing. When the table is not empty, the query returns the first row.
Now you know whether the table is empty or not.