Best way to store/retrieve millions of files when their meta-data is in a SQL Database

SqlRyan · Aug 10, 2009 · Viewed 7.6k times

I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (about 50 KB each), but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details:

  1. I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated.
  2. Once generated, the files will be available through a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
  3. At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.

Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle lookups in a directory that large very gracefully. I'm looking for some advice:

  1. Is a single folder okay? The files will never be listed - they'll only be retrieved via System.IO.File using a filename I've already determined.
  2. If I use a single folder, can I watch for new files with a System.IO.DirectoryWatcher, or will it start to become sluggish with that many files?
  3. Should they be stored as BLOBs in a SQL Server database instead? Since I'll need to retrieve them by a reference value, maybe this makes more sense.

Thank you for your thoughts!

Answer

Bravax · Aug 10, 2009

To answer your questions:

  1. I wouldn't store them in a single folder, as chances are that at some point you'll want to look at the actual files on the disk, rather than retrieving them some other way.
    Instead, why not store them in separate directories, split into batches of 1000? Possibly using the ID as a key.
  2. That many files will probably flood the DirectoryWatcher, so some will be lost. I've used it in the past, and found that past a certain point (a few hundred files) it starts to miss events. Possibly use a different directory for incoming files, and then process it every so often. That pass can then trigger a process to update the original.
  3. I wouldn't store the documents in a database, but I would definitely store the metadata in one.
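Points 1 and 2 above can be sketched roughly as follows. This is a minimal illustration in Python rather than .NET, and the function names, the `.pdf`-named-by-order-number convention, and the batch size of 1000 are assumptions for the example, not something prescribed by the answer:

```python
import os

BATCH_SIZE = 1000  # files per subdirectory, per the suggestion above (assumed value)

def shard_path(base_dir, order_number):
    """Map an order number to a batch subdirectory so that no single
    directory ever holds more than BATCH_SIZE files.
    Since the order number is unique per file, resubmitting an order
    maps to the same path and simply overwrites the original."""
    batch = (order_number // BATCH_SIZE) * BATCH_SIZE
    return os.path.join(base_dir, str(batch), "%d.pdf" % order_number)

def process_incoming(incoming_dir, store_dir):
    """One polling pass over the incoming directory (polling instead of
    a watcher, as the answer suggests): move each generated PDF into
    its sharded location, overwriting any earlier copy."""
    for name in os.listdir(incoming_dir):
        if not name.endswith(".pdf"):
            continue
        order_number = int(os.path.splitext(name)[0])
        dest = shard_path(store_dir, order_number)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        os.replace(os.path.join(incoming_dir, name), dest)

print(shard_path("/nas/pdfs", 123456))  # /nas/pdfs/123000/123456.pdf
print(shard_path("/nas/pdfs", 80))      # /nas/pdfs/0/80.pdf
```

A scheduled job (rather than a per-file watcher event) would call `process_incoming` every so often; lookup by order number is then just `shard_path` plus a file read, with no directory listing involved.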