I am interested in whether there is a simple way to import a mysqldump into Pandas.
I have a few small (~110MB) tables and I would like to have them as DataFrames.
I would like to avoid having to put the data back into a database, since that would require installing and connecting to such a database. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?
If versioning matters the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced in.
I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or JSON file from a third party and to do some analysis which would be given back. A new third party delivered all of their data in .sql format, which broke my workflow, since it would take a lot of overhead to get it into a format my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons we wanted to look for a workaround first.
Edit: Below is a sample mysqldump file with two tables.
/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `ztest`;
/*Table structure for table `food_in` */
DROP TABLE IF EXISTS `food_in`;
CREATE TABLE `food_in` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Cat` varchar(255) DEFAULT NULL,
`Item` varchar(255) DEFAULT NULL,
`price` decimal(10,4) DEFAULT NULL,
`quantity` decimal(10,0) DEFAULT NULL,
KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;
/*Data for the table `food_in` */
insert into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values
(2,'Liq','Beer','2.5000','300'),
(7,'Liq','Water','3.5000','230'),
(9,'Liq','Soda','3.5000','399');
/*Table structure for table `food_min` */
DROP TABLE IF EXISTS `food_min`;
CREATE TABLE `food_min` (
`Item` varchar(255) DEFAULT NULL,
`quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*Data for the table `food_min` */
insert into `food_min`(`Item`,`quantity`) values
('Pizza','300'),
('Hotdogs','200'),
('Beer','300'),
('Water','230'),
('Soda','399'),
('Soup','100');
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
Pandas has no native way of reading a mysqldump without it passing through a database.
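For reference, the route this question is trying to avoid would be to load the dump into a local MySQL server and query it from pandas. A minimal sketch, assuming a running MySQL server, the pymysql driver and placeholder credentials (the dump would first be loaded with something like mysql -u user -p < dump.sql):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; the dump creates and uses the `ztest` database.
engine = create_engine('mysql+pymysql://user:password@localhost/ztest')
df_food_in = pd.read_sql('SELECT * FROM food_in', engine)
df_food_min = pd.read_sql('SELECT * FROM food_min', engine)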
There is a possible workaround, but it is in my opinion a very bad idea.
Of course you could parse the data from the mysqldump file using a preprocessor.
mysqldump files often contain a lot of extra data we are not interested in when loading a pandas dataframe, so we need to preprocess the file, remove the noise, and reformat the lines so that they conform to CSV.
Using io.StringIO, we can read the file and reshape the data before it is fed to the pandas.read_csv function:
from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()
    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip ahead until we hit the INSERT statement for the target table.
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # The first parenthesised group is the column list on the INSERT
            # line, or a single row of values on each of the lines that follow.
            data = re.findall(r'\([^)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            # The statement ends with a semicolon; stop once this table is done.
            if line.endswith(';'):
                break
    sio.seek(0)
    return sio
Now that we have a function that reads and formats the data to look like a CSV file, we can read it with pandas.read_csv():
import pandas as pd
food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')
df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)
Results in:
Item quantity
0 'Pizza' '300'
1 'Hotdogs' '200'
2 'Beer' '300'
3 'Water' '230'
4 'Soda' '399'
5 'Soup' '100'
and
ID Cat Item price quantity
0 2 'Liq' 'Beer' '2.5000' '300'
1 7 'Liq' 'Water' '3.5000' '230'
2 9 'Liq' 'Soda' '3.5000' '399'
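One caveat: the string values keep the single quotes from the SQL literals, and the numeric columns come through as text. A small cleanup pass (a sketch with a hypothetical clean helper applied to the dataframes built above) can strip the quotes and cast the numbers:

def clean(df, numeric_cols):
    # Strip the SQL single quotes from all string columns, then cast the numeric ones.
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].str.strip("'")
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col])
    return df

df_food_in = clean(df_food_in, ['price', 'quantity'])
df_food_min = clean(df_food_min, ['quantity'])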
This approach is called stream processing and is very memory-efficient, since only one line is held in memory at a time. In general it is a good idea to use this approach to read CSV files into pandas more efficiently.
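A related, built-in option for large plain CSV files is pandas.read_csv's chunksize parameter, which yields the file in fixed-size chunks instead of loading it all at once; a brief sketch with a placeholder file and column name:

import pandas as pd

# Read 100,000 rows at a time and aggregate as we go.
total_quantity = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    total_quantity += chunk['quantity'].sum()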
It is the parsing of a mysqldump file that I advise against.