Best DataMining Database

Eric picture Eric · Apr 5, 2010 · Viewed 11.2k times · Source

I am an occasional Python programer who only have worked so far with MYSQL or SQLITE databases. I am the computer person for everything in a small company and I have been started a new project where I think it is about time to try new databases.

Sales department makes a CSV dump every week and I need to make a small scripting application that allow people form other departments mixing the information, mostly linking the records. I have all this solved, my problem is the speed, I am using just plain text files for all this and unsurprisingly it is very slow.

I thought about using mysql, but then I need installing mysql in every desktop, sqlite is easier, but it is very slow. I do not need a full relational database, just some way of play with big amounts of data in a decent time.

Update: I think I was not being very detailed about my database usage thus explaining my problem badly. I am working reading all the data ~900 Megas or more from a csv into a Python dictionary then working with it. My problem is storing and mostly reading the data quickly.

Many thanks!

Answer

Alfred picture Alfred · Apr 6, 2010

Quick Summary

  • You need enough memory(RAM) to solve your problem efficiently. I think you should upgrade memory?? When reading the excellent High Scalability Blog you will notice that for big sites to solve there problem efficiently they store the complete problem set in memory.
  • You do need a central database solution. I don't think hand doing this with python dictionary's only will get the job done.
  • How to solve "your problem" depends on your "query's". What I would try to do first is put your data in elastic-search(see below) and query the database(see how it performs). I think this is the easiest way to tackle your problem. But as you can read below there are a lot of ways to tackle your problem.

We know:

  • You used python as your program language.
  • Your database is ~900MB (I think that's pretty large, but absolute manageable).
  • You have loaded all the data in a python dictionary. Here I am assume the problem lays. Python tries to store the dictionary(also python dictionary's aren't the most memory friendly) in your memory, but you don't have enough memory(How much memory do you have????). When that happens you are going to have a lot of Virtual Memory. When you attempt to read the dictionary you are constantly swapping data from you disc into memory. This swapping causes "Trashing". I am assuming that your computer does not have enough Ram. If true then I would first upgrade your memory with at least 2 Gigabytes extra RAM. When your problem set is able to fit in memory solving the problem is going to be a lot faster. I opened my computer architecture book where it(The memory hierarchy) says that main memory access time is about 40-80ns while disc memory access time is 5 ms. That is a BIG difference.

Missing information

  • Do you have a central server. You should use/have a server.
  • What kind of architecture does your server have? Linux/Unix/Windows/Mac OSX? In my opinion your server should have linux/Unix/Mac OSX architecture.
  • How much memory does your server have?
  • Could you specify your data set(CSV) a little better.
  • What kind of data mining are you doing? Do you need full-text-search capabilities? I am not assuming you are doing any complicated (SQL) query's. Performing that task with only python dictionary's will be a complicated problem. Could you formalize the query's that you would like to perform? For example:
    • "get all users who work for departement x"
    • "get all sales from user x"

Database needed

I am the computer person for everything in a small company and I have been started a new project where I think it is about time to try new databases.

You are sure right that you need a database to solve your problem. Doing that yourself only using python dictionary's is difficult. Especially when your problem set can't fit in memory.

MySQL

I thought about using mysql, but then I need installing mysql in every desktop, sqlite is easier, but it is very slow. I do not need a full relational database, just some way of play with big amounts of data in a decent time.

A centralized(Client-server architecture) database is exactly what you need to solve your problem. Let all the users access the database from 1 PC which you manage. You can use MySQL to solve your problem.

Tokyo Tyrant

You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and it does not have to be stored in RAM. It handles getting data a more efficient(instead of using python dictionary's). However if your problem can completely fit in Memory I think you should have look at Redis(below).

Redis:

You could for example use Redis(quick start in 5 minutes)(Redis is extremely fast) to store all sales in memory. Redis is extremely powerful and can do this kind of queries insanely fast. The only problem with Redis is that it has to fit completely in RAM, but I believe he is working on that(nightly build already supports it). Also like I already said previously solving your problem set completely from memory is how big sites solve there problem in a timely manner.

Document stores

This article tries to evaluate kv-stores with document stores like couchdb/riak/mongodb. These stores are better capable of searching(a little slower then KV stores), but aren't good at full-text-search.

Full-text-search

If you want to do full-text-search queries you could like at:

  • elasticsearch(videos): When I saw the video demonstration of elasticsearch it looked pretty cool. You could try put(post simple json) your data in elasticsearch and see how fast it is. I am following elastissearch on github and the author is commiting a lot of new code to it.
  • solr(tutorial): A lot of big companies are using solr(github, digg) to power there search. They got a big boost going from MySQL full-text search to solr.