I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools
on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates huge mess since there are lots of files that need to be processed.
What I need is to extract particular sorted data from particular table. Like for example, I need the table called "Voltage". The table consists of numerous cycles and each cycle has several rows also. The cycles usually go in chronological order, but in some cases time stamp get recorded with delay. Like cycle's one first row can have later time than cycles 1 first row. I need to extract the latest row of the cycle based on time for the first or last five cycles. For example, in table below, I will need the second row.
Cycle# Time Data
1 100.59 34
1 101.34 54
1 98.78 45
2
2
2 ...........
Here is the script I use. I am using the command python extract.py table_files.mdb.
But I would like the script to just be invoked with ./extract.py. The path to filenames should be in the script itself.
import sys, subprocess, os
DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()
# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
if table != '':
filename = table.replace(" ","_") + ".csv"
file = open(filename, 'w')
print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", DATABASE, table],
stdout=subprocess.PIPE).communicate()[0]
file.write(contents)
file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools
, unixODBC
and pyodbc
to work together. As Pedro suggested in his comment, if you can get mdb-export
to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools
on the Linux platform.
A few suggestions:
Given the sheer number of .mdb
files (and hence .csv
files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal
(as opposed to float
) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
v1.SourceFile,
v1.Cycle,
v1.Time,
v1.Data
FROM
Voltage v1
INNER JOIN
(
SELECT
SourceFile,
Cycle,
MAX([Time]) AS MaxTime
FROM Voltage
GROUP BY SourceFile, Cycle
) v2
ON v1.SourceFile=v2.SourceFile
AND v1.Cycle=v2.Cycle
AND v1.Time=v2.MaxTime