Software Engineering
database architecture performance file-storage
Updated Thu, 28 Jul 2022 13:12:47 GMT

For performance-critical situations, is storing file metadata in a database better?

As per the title of this question, for extremely performance-critical situations, is storing a file's metadata (e.g. location, size, downloaded on, etc.) in a database going to allow for better performance than attempting to get it from the file system itself? Have there been any case studies into this problem?

To provide a bit more detail on a specific situation, the application needs to mirror terabytes of data (hundreds of files) with a remote site on a continual basis, and the current program architecture uses Unix commands (i.e. ls) to determine which files need to be updated. The files themselves are split between Isilon IQ clusters and Sun Thumper clusters, which I have been told have good throughput but poor metadata performance. As the application will be the only process with write permissions to the files, we aren't concerned with things getting out of sync, but we are concerned with performance, as it currently takes between six and ten hours to transfer the data.


For actually getting an individual file's metadata I would not expect much difference, and it would very much depend on which database went head to head with which file system and how well each was configured.
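To make the comparison concrete, here is a minimal sketch of the two single-file lookups side by side. It assumes SQLite as the database and a hypothetical `file_meta` table; neither comes from the question, and the names are only for illustration. It shows what the two code paths look like, not which one is faster on your hardware.

```python
import os
import sqlite3
import tempfile


def stat_metadata(path):
    """Ask the file system directly: one stat call per file."""
    st = os.stat(path)
    return st.st_size, st.st_mtime


def db_metadata(conn, path):
    """Look the same file up by primary key in the (hypothetical) file_meta table."""
    row = conn.execute(
        "SELECT size, mtime FROM file_meta WHERE path = ?", (path,)
    ).fetchone()
    return row  # None if the path was never recorded


def demo():
    """Create one file, record its metadata, and fetch it both ways."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_meta (path TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as fh:
            fh.write(b"x" * 1024)
        size, mtime = stat_metadata(path)
        conn.execute("INSERT INTO file_meta VALUES (?, ?, ?)", (path, size, mtime))
        return stat_metadata(path), db_metadata(conn, path)
```

Both calls are a single keyed lookup, which is why I would not expect a large gap here. To find out for your clusters, time both in a loop over a realistic set of paths.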

However, if you wanted to, say, search for files with a ".mp4" suffix or all movies > 1GB, then the database will win hands down. Even if the file system's index were organized to be efficiently searchable, the normally available POSIX APIs would limit you to searching sequentially through a directory. If you have distributed your data over several file systems, you would also need a separate search on each "leaf" directory.
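As a rough sketch of the aggregate case, the snippet below walks one or more directory trees once, records each file's metadata in SQLite, and then answers both example searches with a query instead of another directory walk. SQLite, the `file_meta` table, and the function names are all assumptions for illustration; any database with an index on the size column would behave similarly.

```python
import os
import sqlite3


def build_index(conn, roots):
    """Walk each root once and record every file's metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_meta ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, mtime REAL)"
    )
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                conn.execute(
                    "INSERT OR REPLACE INTO file_meta VALUES (?, ?, ?, ?)",
                    (full, name, st.st_size, st.st_mtime),
                )
    # An index on size lets range queries avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_file_meta_size ON file_meta(size)")
    conn.commit()


def find_by_suffix(conn, suffix):
    """All recorded paths whose file name ends with suffix.

    LIKE is case-insensitive for ASCII in SQLite by default, and a
    leading-wildcard pattern still scans the table, but it touches
    no directories at all.
    """
    rows = conn.execute(
        "SELECT path FROM file_meta WHERE name LIKE ? ORDER BY path",
        ("%" + suffix,),
    )
    return [row[0] for row in rows]


def find_larger_than(conn, min_bytes):
    """All recorded paths strictly larger than min_bytes."""
    rows = conn.execute(
        "SELECT path FROM file_meta WHERE size > ? ORDER BY path",
        (min_bytes,),
    )
    return [row[0] for row in rows]
```

Usage would be something like `build_index(conn, ["/mnt/a", "/mnt/b"])` followed by `find_larger_than(conn, 1024 ** 3)`, where the mount points are placeholders. Note that the index is only as fresh as the last walk, so this fits best when one process owns all writes and can update the table as it goes.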

However, this may not be the case for much longer, as there are several projects (including one from Google) which are actively working on file systems with searchable metadata.

Comments (1)

  • +0 – The metadata is going to be more relevant for aggregate searches over the files as opposed to doing a stat operation on a given file. — Oct 13, 2011 at 13:06