Extract and sort data from .mdb file using mdbtools in Python

Pasha · Jun 15, 2013 · Viewed 11.9k times

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I have only been able to take an .mdb file and dump all of its tables to .csv, which creates a huge mess since there are so many files that need to be processed.

What I need is to extract particular sorted data from a particular table, for example the table called "Voltage". The table consists of numerous cycles, and each cycle has several rows. The cycles usually go in chronological order, but in some cases a timestamp gets recorded with a delay, so a row written earlier in a cycle can have a later time than a row written after it. I need to extract, for the first or last five cycles, the latest row of each cycle based on time. For example, in the table below I would need the second row (the largest Time value within cycle 1).

Cycle#    Time        Data
  1      100.59        34
  1      101.34        54
  1      98.78         45  
  2      
  2
  2   ........... 

Here is the script I use. I currently invoke it with the command python extract.py table_files.mdb, but I would like the script to be invoked with just ./extract.py, with the path to the filenames hard-coded in the script itself.

import sys
import subprocess

DATABASE = sys.argv[1]

# Print the schema for the database ("mysql" dialect)
subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.check_output(["mdb-tables", "-1", DATABASE],
                                      universal_newlines=True)
tables = table_names.splitlines()

print("BEGIN;")  # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ", "_") + ".csv"
        print("Dumping " + table)
        contents = subprocess.check_output(["mdb-export", DATABASE, table],
                                           universal_newlines=True)
        with open(filename, 'w') as f:
            f.write(contents)
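
I assume that to run it as ./extract.py I would need a shebang line (#!/usr/bin/env python) plus chmod +x, and something like the glob module to pick up the .mdb files from a hard-coded directory. A rough sketch of what I have in mind (the directory /path/to/mdb_files is just a placeholder):

#!/usr/bin/env python
import glob
import os

# Placeholder directory -- the real location of the ~2000 .mdb files goes here
MDB_DIR = "/path/to/mdb_files"

for database in sorted(glob.glob(os.path.join(MDB_DIR, "*.mdb"))):
    print("Processing " + database)
    # ... run mdb-tables / mdb-export on each file, as in the script above ...

But I am not sure how to combine this with extracting and sorting the rows I need.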

Answer

Gord Thompson · Jun 16, 2013

Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.

A few suggestions:

  1. Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename (see the sketch after this list). That will be much easier to manage than ~2000 separate tables.

  2. When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.

  3. At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
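
For illustration, a rough sketch of that import using Python's built-in csv and sqlite3 modules might look like the following. The database name (voltage.db), the CSV directory, and the assumption that there is one "Voltage" CSV per source .mdb file are all placeholders; adjust them to match how the files actually get dumped:

import csv
import glob
import os
import sqlite3

conn = sqlite3.connect("voltage.db")   # target SQLite database (assumed name)
conn.execute("""
    CREATE TABLE IF NOT EXISTS Voltage (
        SourceFile TEXT,
        Cycle      INTEGER,
        Time       DECIMAL(10, 4),
        Data       REAL
    )
""")

# Placeholder pattern -- assumes each exported CSV is named after its source .mdb
for csv_path in glob.glob("/path/to/csv_files/*.csv"):
    source = os.path.basename(csv_path)
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # "Cycle#" is the original column name; it is stored as plain "Cycle"
        rows = [(source, r["Cycle#"], r["Time"], r["Data"]) for r in reader]
    conn.executemany(
        "INSERT INTO Voltage (SourceFile, Cycle, Time, Data) VALUES (?, ?, ?, ?)",
        rows)

conn.commit()
conn.close()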

Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:

SELECT
    v1.SourceFile, 
    v1.Cycle,
    v1.Time, 
    v1.Data 
FROM 
    Voltage v1 
    INNER JOIN 
    (
        SELECT
            SourceFile, 
            Cycle, 
            MAX([Time]) AS MaxTime 
        FROM Voltage 
        GROUP BY SourceFile, Cycle
    ) v2 
        ON v1.SourceFile=v2.SourceFile 
           AND v1.Cycle=v2.Cycle 
           AND v1.Time=v2.MaxTime
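
If you go the SQLite route, that query can be run straight from Python with the standard sqlite3 module, e.g. (assuming the voltage.db file and schema from the import sketch above):

import sqlite3

conn = sqlite3.connect("voltage.db")   # assumed database from the import sketch
query = """
    SELECT v1.SourceFile, v1.Cycle, v1.Time, v1.Data
    FROM Voltage v1
    INNER JOIN (
        SELECT SourceFile, Cycle, MAX([Time]) AS MaxTime
        FROM Voltage
        GROUP BY SourceFile, Cycle
    ) v2
      ON v1.SourceFile = v2.SourceFile
     AND v1.Cycle = v2.Cycle
     AND v1.Time = v2.MaxTime
"""
for row in conn.execute(query):
    print(row)
conn.close()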