MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems

php mysql streaming http-headers blob

Quantico773 · Jan 11, 2011 · Viewed 16.8k times · Source

Versions I am running (basically latest of everything):
PHP: 5.3.1
MySQL: 5.1.41
Apache: 2.2.14
OS: CentOS (latest)

Here is the situation.

I have thousands of very important documents, ranging from customer contracts to voice signatures (recordings of customer authorisation for contracts), with file types including, but not limited to jpg, gif, png, tiff, doc, docx, xls, wav, mp3, pdf, etc.

All of these documents are currently stored on several servers including Windows 32 bit, CentOS and Mac, among others. Some files are also stored on employees desktop computers and laptops, and some are still hard copies stored in hundreds of boxes and filing cabinets.

Now because customers or lawyers could demand evidence of contracts at any time, my company has to be able to search and locate the correct document(s) effectively, for this reason ALL of these files have to be digitised (if not already) and correlated into some sort of order for searching and accessing.

As the programmer, I have created a full Customer Relations Management tool that the whole company uses. This includes Customer Profiles management, Order and job Tracking tools, Job/sale creation and management modules, etc, and at the moment any file that is needed at a customer profile level (drivers licence, credit authority, etc) or at a job/sale level (contracts, voice signatures, etc) can be uploaded to the server and sits in a parent/child hierarchy structure, just like Windows Explorer or any other typical file managment model.

The structure appears as such:

drivers_license
|- DL_123.jpg
voice_signatures
|- VS_123.wav
|- VS_4567.wav
contracts

So the files are uplaoded using PHP and Apache, and are stored in the file system of the OS. At the time of uploading, certain information about the file(s) is stored in a MySQL database. Some of the information stored is:

TABLE: FileUploads
FileID
CustomerID (the customer id that the file belongs to, they all have this.)
JobID/SaleID (the id of the job/sale associated, if any.)
FileSize
FileType
UploadedDateTime
UploadedBy
FilePath (the directory path the file is stored in.)
FileName (current file name of uploaded file, combination of CustomerID and JobID/SaleID if applicable.)
FileDescription
OriginalFileName (original name of the source file when uploaded, including extension.)

So as you can see, the file is linked to the database by the File Name. When I want to provide a customers' files for download to a user all I have to do is "SELECT * FROM FileUploads WHERE CustomerID = 123 OR JobID = 2345;" and this will output all the file details I require, and with the FilePath and FileName I can provide the link for download.

http... server / FilePath / FileName

There are a number of problems with this method:

Storing files in this "database unconcious" environment means data integrity is not kept. If a record is deleted, the file may not be deleted also, or vice versa.
Files are strewn all over the place, different servers, computers, etc.
The file name is the ONLY thing matching the binary to the database and customer profile and customer records.

etc, etc. There are so many reasons, some of which are described here: http://www.dreamwerx.net/site/article01 . Also there is an interesting article here too: sietch.net/ViewNewsItem.aspx?NewsItemID=124 .

SO, after much research I have pretty much decided I am going to store ALL of these files in the database, as a BLOB or LONGBLOB, but there are still many considerations before I do this.

I know that storing them in the database is a viable option, however there are a number of methods of storing them. I also know storing them is one thing; correlating and accessing them in a manageable way is another thing entirely.

The article provided at this link: dreamwerx.net/site/article01 describes a way of splitting the uploaded binary files into 64kb chunks and storing each chunk with the FileID, and then streaming the actual binary file to the client using headers. This is a really cool idea since it alleviates preassure on the servers memory; instead of loading an entire 100mb file into the RAM and then sending it to the client, it is doing it 64kb at a time. I have tried this (and updated his scripts) and this is totally successful, in a very small frame of testing.

So if you are in agreeance that this method is a viable, stable and robust long-term option to store moderately large files (1kb to couple hundred megs), and large quantities of these files, let me know what other considerations or ideas you have.

Also, I am considering getting a current "File Management" PHP script that gives an interface for managing files stored in the File System and converting it to manage files stored in the database. If there is already any software out there that does this, please let me know.

I guess there are many questions I could ask, and all the information is up there ^^ so please, discuss all aspects of this and we can pass ideas back and forth and teach each other.

Cheers,

Quantico773

Answer

I work on a large software system that has done both mechanisms for storing attachments and other content. The first iteration of the system stored all data in BLOBs in the DB. I cursed it at the time. As a programmer, I could write side scripts to immediately operate on the data and change it whenever I wanted to.

Advance about 10 years and I still manage the same software but the architecture has changed and it was written with filesystem pointers. I curse it now and wish it were back in the DB. I have the added benefit of several years and having worked this application in much greater capacity in many more and many larger situations, I feel my opinion now is better educated. Promotion or system migration of the application requires extensive scripting and copying of millions of files. On one occasion we changed the OS and all the file pointers had the wrong directory separator, or the server name changes where the file was and we had to write and schedule simple SQL update statements with the DBA on the weekend to fix. Another is that the filesystem and DB records get out of sync, why is uncertain but after thousands of days of operation, sometimes non-transactional systems (filesystem and DB don't share transactional contexts) simply become out of sync. Sometimes files mysteriously go missing.

When all this was in the DB, migration or environment promotion was a matter of dump and import the DB. Row changes could be properly audited, everything in sync and logs can be replayed to point-in-time if necessary. Sure the DB gets big, but it is 2011 and this stuff is simply not a challenge for databases.

For what it is worth we had some similar issues with large data buffers when streaming some data, but A) we could pump the data in byte buffers with the Input|OutputStreams in JDBC and B) when using other tools, we wrote a stored procedure that would chunk the BLOB into a temp table and iteratively serve the chunks from the temp table. Works great.

I don't care what the technical reason for not putting this stuff in the DB, but it is so much easier to manage in a consolidated location I could double and triple the hardware or grid the DB for the time wasted by consultants and customers just in a short period of time managing the disparate files.

Update: go easy on the commenters, they're just giving their opinion on the matter.

MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems

Answer

Related questions