I'm dealing with a lot of data in a MySQL database and I'd like to use sharding to scale out. I understand the principles of sharding, and I even know how I want to shard my data.
When I look up database sharding, I cannot find any comprehensive examples on how to actually manage and query a sharded database.
Specifically, let's say I've split up my data into multiple tables/databases (shards); what is the best way to query that data? I don't think there is a way to have MySQL intelligently know which shard to use.
Is there third-party software that can manage the shards and my queries, or do I have to change my code (which is written in PHP) to interface with the sharded data?
For what it's worth, I've dealt with some larger systems where there was a custom in-house app that aggregated queries from the shard servers for use in general apps across the company.
e.g. select * from t1
was transformed to:
select * from db1.t1
union
select * from db2.t1
etc.
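If you end up rolling your own, the rewrite itself can be as simple as string-building the UNION. Here's a minimal PHP sketch, assuming all the shards live as separate databases on one MySQL server; db1/db2/db3, t1, and the credentials are placeholders:
// Hypothetical fan-out helper: rewrites a query on "t1" into a UNION ALL
// across every shard database. Assumes all shards are reachable via $mysqli.
$shards = ['db1', 'db2', 'db3'];   // placeholder shard database names

function fanOutQuery(array $shards, string $table, string $columns = '*'): string {
    $parts = [];
    foreach ($shards as $db) {
        $parts[] = "SELECT $columns FROM `$db`.`$table`";
    }
    // UNION ALL skips the duplicate-removal sort a plain UNION would do
    return implode("\nUNION ALL\n", $parts);
}

$mysqli = new mysqli('localhost', 'user', 'pass');
$result = $mysqli->query(fanOutQuery($shards, 't1'));
while ($row = $result->fetch_assoc()) {
    // process each row regardless of which shard it came from
}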
The main problem you run into is cross-server joins: on large, million-plus-row systems, they can hit the network pretty hard and take a long time to process.
Say, for instance, you're doing network analysis and need to join tables to determine 'links' between users' attributes.
You can end up with some odd queries, something like:
select a.name, a.boss, b.name, b.boss
from db1.user a
inner join db2.user b on a.name = b.name
(e.g. get a person's boss and that boss's boss, or a friend's friend, etc.)
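When the shards sit on separate servers, that kind of join usually ends up being done in application code, and that's exactly where the network pain comes from: every candidate row has to cross the wire. A rough PHP sketch of the idea (hostnames, credentials, and the user(name, boss) columns are all placeholders):
// Cross-server "join" done by hand: pull rows from shard 1, then fetch the
// matching rows from shard 2 and stitch them together in PHP. On million-plus
// row tables this moves a lot of data over the network.
$server1 = new mysqli('shard1.example.com', 'user', 'pass', 'db1');
$server2 = new mysqli('shard2.example.com', 'user', 'pass', 'db2');

$bossOnShard1 = [];
$res = $server1->query("SELECT name, boss FROM user");
while ($row = $res->fetch_assoc()) {
    $bossOnShard1[$row['name']] = $row['boss'];
}

$names  = array_map([$server2, 'real_escape_string'], array_keys($bossOnShard1));
$inList = "'" . implode("','", $names) . "'";
$res = $server2->query("SELECT name, boss FROM user WHERE name IN ($inList)");
while ($row = $res->fetch_assoc()) {
    // boss according to shard 1 vs. boss according to shard 2 for the same user
    echo $row['name'] . ': ' . $bossOnShard1[$row['name']] . ' / ' . $row['boss'] . "\n";
}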
This can be a tremendous PITA when you want good data for chained queries like that, but for simple stats like sums, averages, etc., what worked best for those guys was a nightly query that aggregated stats into a table on each server (e.g. nightlystats).
e.g. insert into nightlystats (statdate, dailyregistered, ...) select curdate(), sum(if(datecreated > curdate() - interval 1 day, 1, 0)), ... from user
This made the daily stats pretty trivial: for counts you just sum the per-server column; for averages you multiply each server's value by that server's count, add them up, and divide by the overall total; and so on. That gave a pretty quick high-level dashboard view.
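Rolling the per-server nightlystats rows up into global numbers is just that bit of arithmetic. A small PHP sketch, where the nightlystats columns (statdate, dailyregistered, avgsessionlen) and the connection details are made up for illustration:
// Totals just add up across shards; averages are weighted by each shard's
// count before dividing by the overall total.
$shardConnections = [
    new mysqli('shard1.example.com', 'user', 'pass', 'db1'),
    new mysqli('shard2.example.com', 'user', 'pass', 'db2'),
];

$totalRegistered = 0;
$weightedSum     = 0;
foreach ($shardConnections as $conn) {
    $row = $conn->query(
        "SELECT dailyregistered, avgsessionlen FROM nightlystats
         WHERE statdate = CURDATE() - INTERVAL 1 DAY"
    )->fetch_assoc();
    $totalRegistered += $row['dailyregistered'];
    $weightedSum     += $row['avgsessionlen'] * $row['dailyregistered'];
}
$globalAverage = $totalRegistered ? $weightedSum / $totalRegistered : 0;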
We ended up doing a lot of indexing and optimization, and tricks like keeping small local tables of commonly used info were helpful for speeding up queries.
For larger queries, the DB guy just dumped a complete copy of the system onto a backup server, and we'd process against that locally during the day so as not to hit the network too hard.
There are a few tricks that can reduce this, such as keeping shared copies of small tables (e.g. the main user tables and other rarely changing data) on every server, so you don't waste time gathering those over the network.
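A rough sketch of refreshing one of those shared small tables onto every shard server nightly (hosts, table, and column names are placeholders):
// Copy a small, rarely changing lookup table from the primary onto each shard
// so local queries never have to reach across the network for it.
$primary = new mysqli('primary.example.com', 'user', 'pass', 'maindb');
$shardConnections = [
    new mysqli('shard1.example.com', 'user', 'pass', 'db1'),
    new mysqli('shard2.example.com', 'user', 'pass', 'db2'),
];

$rows = $primary->query("SELECT id, name FROM lookup_user")->fetch_all(MYSQLI_ASSOC);
foreach ($shardConnections as $conn) {
    $conn->query("TRUNCATE TABLE lookup_user");
    $stmt = $conn->prepare("INSERT INTO lookup_user (id, name) VALUES (?, ?)");
    $stmt->bind_param('is', $id, $name);   // bound once; values set inside the loop
    foreach ($rows as $r) {
        $id   = (int) $r['id'];
        $name = $r['name'];
        $stmt->execute();
    }
}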
The other thing that's really helpful in practice is aggregating sums and totals for simple queries into nightly tables, as mentioned above.
One last thing of interest: the workaround for the bandwidth issue was a 'back-off' timeout programmed into the in-house 'query aggregator'. It timed the response of each record fetch, and if responses started to be delayed, it would ask for fewer records and add latency between the queries it issued (since this was for reporting and not time-sensitive, that worked okay).
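I don't have their code, but the back-off logic was roughly this shape: fetch in chunks, time each chunk, and when responses slow down, ask for fewer rows and space the requests out. A PHP sketch with made-up thresholds, hosts, and table names:
// Adaptive fetcher: shrink the chunk size and add delay when the shard server
// starts responding slowly. Fine for reporting jobs that aren't time-sensitive.
$conn      = new mysqli('shard1.example.com', 'user', 'pass', 'db1');
$chunkSize = 5000;
$delayMs   = 0;
$offset    = 0;
do {
    $start = microtime(true);
    $res   = $conn->query("SELECT * FROM user ORDER BY id LIMIT $offset, $chunkSize");
    $rows  = $res->fetch_all(MYSQLI_ASSOC);
    $elapsed = microtime(true) - $start;

    if ($elapsed > 2.0) {                           // responses getting slow: back off
        $chunkSize = max(500, intdiv($chunkSize, 2));
        $delayMs  += 100;
    }

    // ... hand $rows to the reporting code here ...

    $offset += count($rows);
    usleep($delayMs * 1000);
} while (count($rows) > 0);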
There are some SQL systems that autoscale, and I recently read an article about tools (not PHP, though) that will do some of this for you; I think they were related to cloud VM providers.
This thread also provides some tools and thoughts: MySQL sharding approaches?
If NoSQL is an option, you might want to survey all the DB systems out there before going that route.
The NoSQL approach might be easier to scale depending on what you're looking for though.