Ask HighScalability: How do I organize millions of images?
Wednesday, June 20, 2012 at 10:30PM
Does anyone have any advice or suggestions on how to store millions of images? Currently the images are stored in an MS SQL database, which isn't ideal performance-wise. We'd like to migrate the images over to a file system structure, but I assume we don't want to just dump millions of images into a single directory. Besides having to contend with naming collisions, the Windows filesystem might not perform optimally with that many files.
I'm assuming one approach may be to assign each user a unique CLSID, create a folder based on the CLSID, and then place that user's files in that particular folder. Even so, this could result in hundreds of thousands of folders. What's the best organizational scheme/hierarchy for doing this?
Reader Comments (37)
I think content-addressable storage is the way to go.
Rather than using a hexadecimal representation of the hash, use base32, so a two-character directory name gives 32 × 32 = 1024 possibilities.
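A minimal sketch of that idea in Python, assuming a local directory tree as the store (the root path is illustrative):

```python
import base64
import hashlib
import os
import shutil

STORE_ROOT = "/data/images"  # hypothetical storage root

def store(src_path):
    # The content hash is the name, so duplicates collapse into one file
    # and naming collisions are impossible by construction.
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    name = base64.b32encode(digest).decode("ascii").rstrip("=")
    # First two base32 characters -> 32 * 32 = 1024 top-level directories.
    directory = os.path.join(STORE_ROOT, name[:2])
    os.makedirs(directory, exist_ok=True)
    dst = os.path.join(directory, name)
    if not os.path.exists(dst):
        shutil.copyfile(src_path, dst)
    return dst
```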
The best option is not to save them in a relational database but to use blocking (break the images into blocks) and build a system to store & retrieve the blocks (or use an open source library). The advantages are better use of storage, flexibility in supporting various types/sizes, and a better ability to build redundancy/fault tolerance using replication at the block level. Dropbox, for example, blocks your files and stores them in S3. An extreme solution is what Facebook does for all its photo storage; check out this link: http://www.facebook.com/note.php?note_id=76191543919.
But you can get away with a simpler blocking solution based on your needs & requirements. Most of them revolve around querying, performance, and monitoring/stats.
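For the simpler do-it-yourself variant, here is a rough sketch of block-level storage in Python; the block size, paths, and manifest format are all assumptions, and a real system would replicate blocks across machines rather than write to one local directory:

```python
import hashlib
import os

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks; an arbitrary choice
BLOCK_ROOT = "/data/blocks"   # hypothetical block store

def put_file(path):
    """Split a file into blocks, store each block under its content hash,
    and return a manifest (ordered list of block keys) for reassembly."""
    os.makedirs(BLOCK_ROOT, exist_ok=True)
    manifest = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            key = hashlib.sha256(block).hexdigest()
            dst = os.path.join(BLOCK_ROOT, key)
            if not os.path.exists(dst):  # identical blocks are stored once
                with open(dst, "wb") as out:
                    out.write(block)
            manifest.append(key)
    return manifest

def get_file(manifest, out_path):
    # Reassemble the original file by concatenating its blocks in order.
    with open(out_path, "wb") as out:
        for key in manifest:
            with open(os.path.join(BLOCK_ROOT, key), "rb") as f:
                out.write(f.read())
```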
The more I think about it, Riak (a masterless, Dynamo-based NoSQL datastore) might be a great option to try. It supports HTTP & binary protobuf communication and can store arbitrary data. It is S3-like (Dynamo-based), supports buckets for "logical" separation, and can use paths to organize data. Here's their simple API to store an object: http://wiki.basho.com/HTTP-Store-Object.html
The advantages of Riak are that it is masterless (hence no SPOF) and horizontally scalable (add more servers as image storage grows), with great HTTP & protobuf APIs, secondary indexes, and MapReduce query support. It should just work for you; I don't see any reason why not.
You can choose different "backends" to suit your needs: one, for example, is memory-based, meaning all data is kept in memory for high performance; another is LevelDB, a disk-based nested hash storage. Great flexibility.
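A hedged sketch of that HTTP API from Python (the bucket name and host are assumptions; see the linked wiki page for the authoritative details):

```python
import requests

RIAK = "http://localhost:8098"  # Riak's default HTTP port

def put_image(key, data):
    # PUT /riak/<bucket>/<key> stores an object under the given bucket/key.
    r = requests.put(f"{RIAK}/riak/images/{key}",  # bucket "images" is made up
                     data=data,
                     headers={"Content-Type": "image/jpeg"})
    r.raise_for_status()

def get_image(key):
    r = requests.get(f"{RIAK}/riak/images/{key}")
    r.raise_for_status()
    return r.content
```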
Finally, you can go to the other extreme (I don't recommend it, but just to mention it): if you want this for "offline" use (not online), use HDFS and store the images in Hadoop. Hadoop will auto-replicate, so you will not lose data. This, however, is only for batch-processing-like needs. Performance is not great, but you get powerful processing and storage options, and it can grow to hold even billions of images. Again, this is for offline, batch use.
--
V
Alternatively to a high-scalability solution, you can use the ZFS file system on commodity hardware; it's cheap, and it can scale to a lot of disks with a simple JBOD. The images can be accessed simply over NFS or CIFS.
ZFS takes care of your data with checksums and snapshots, and you can do backups with send/receive, which is fast and efficient. You can use a master/slaves topology and scale reads with more servers. We run ZFS on Supermicro hardware; it holds 10 million images in a 30 TB dataset. Images are stored via a simple naming convention and a MySQL mapping database. MySQL holds the checksum, size, and resolution for consistency checking, though in practice ZFS takes care of consistency for us (scrubbing). Images are served over HTTP using nginx/Ruby/ImageMagick to enable resizing, tagging, etc., behind a proxy and a pool of servers, serving between 10 and 30 images/s. You can do a modulo 1000 on the CLSID, for example. As CIFS traversal is very slow with a heavy directory structure, you can also try a commercial NAS product: pricier, but with better performance and Windows integration (though I don't think you get consistency checking with that).
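A minimal sketch of that naming convention and consistency check in Python; the mount point, the stored checksum, and the .jpg naming are assumptions:

```python
import hashlib
import os

NFS_ROOT = "/mnt/images"  # hypothetical ZFS pool exported over NFS

def image_path(image_id):
    # id modulo 1000 spreads files over at most 1000 directories
    return os.path.join(NFS_ROOT, "%03d" % (image_id % 1000),
                        "%d.jpg" % image_id)

def verify(image_id, expected_md5):
    """Compare the on-disk file against the checksum stored in MySQL."""
    with open(image_path(image_id), "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5
```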
Hi, I use flat-file databases a lot, and the last time I thought about this I came up with the following approach.
If you take the first digits of the ID, you will have a problem later: big IDs such as 10015151514 all start with the same digits, so a first-level directory like 10 fills up while the next directory is never reached.
Instead of taking the first digits of the ID or computing hashes, just take the last few digits of the ID.
For example:
75925.jpg
will be in folder
25/59/75925.jpg
With this approach you spread all images evenly across directories.
Now that each directory holds an equal share of the images, you can shard them across multiple servers.
If you think your data will grow further, use a dynamic directory structure. For example,
it will start from
25_dir/925.jpg
25_dir/59_dir/25925.jpg
and if you reach bigger IDs, just create another directory inside 59_dir:
25_dir/59_dir/52_dir/155525925.jpg
Just try not to store more than a few thousand images in the same directory;
1000 per directory looks like an optimal amount.
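A sketch of the fixed two-level version of this scheme in Python (zero-padding so that small IDs still yield two trailing digit pairs; deeper levels for bigger IDs follow the same pattern):

```python
import os

def image_path(image_id):
    s = "%06d" % image_id  # pad so even small ids have four trailing digits
    level1 = s[-2:]        # last two digits:  75925 -> "25"
    level2 = s[-4:-2]      # next two digits:  75925 -> "59"
    return os.path.join(level1, level2, "%d.jpg" % image_id)

# image_path(75925) -> "25/59/75925.jpg"
# With IDs below 10 million, each leaf directory holds at most 1000 files.
```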
Use Weed FS !
http://weed-fs.googlecode.com
It is designed for this purpose! Actually, it is modeled after Facebook's Haystack paper.
It is functioning well, but I would like more brains to work on it to add a UI for managing the system, rather than just the command line.
I need a similar setup on Linux. The images are public, so they can be served directly from S3. Filenames are ordered, but I can change them for S3 to take advantage of automated splitting by prefix. I'm confused, though, about whether this advantage is only valid for private serving or not.
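As far as I know, the prefix advice concerns S3's internal request partitioning, not access control, so it applies to public serving too. A sketch of de-sequencing ordered filenames with a short hash prefix (the key format is just an illustration):

```python
import hashlib

def s3_key(image_id):
    # Sequential ids share long common prefixes and can pile onto one S3
    # partition; a few hash characters up front fan the keys out.
    prefix = hashlib.md5(str(image_id).encode()).hexdigest()[:4]
    return "%s/%d.jpg" % (prefix, image_id)
```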
@Chris
Congrats, man, nice job. I have two questions: first, why Go? And second, isn't your FS more like MogileFS than Haystack? Haystack is much more low-level, AFAIK.
I'll write more later. Thanks for all the people in this thread.
Compared to MogileFS's three-layer structure (DB node, tracker, storage node), WeedFS has only two layers (master node, volume node). WeedFS is basically a key→file store, dropping any other features that aren't necessary. Besides storing a lot of files, it also aims to serve files fast, mostly with only one disk read operation.
Go is simple to code in, similar to Python, but its speed is comparable to C, with very good support for concurrent programming.
The other benefit is deployment. Since it is compiled and statically linked, deploying it means copying a single file over and starting it; nothing else to install. If you choose MogileFS, you may need to install Perl, libraries, databases, etc.
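For reference, a hedged sketch of the WeedFS flow as I understand it from the project docs: ask the master for a file ID, then upload to the volume server it returns (ports are the project defaults; adjust for your deployment):

```python
import requests

MASTER = "http://localhost:9333"  # WeedFS master's default port

def upload(path):
    # Step 1: the master assigns a file id and a volume server to hold it.
    assign = requests.get(MASTER + "/dir/assign").json()
    fid, volume = assign["fid"], assign["url"]
    # Step 2: POST the file to the assigned volume server under that fid.
    with open(path, "rb") as f:
        requests.post("http://%s/%s" % (volume, fid), files={"file": f})
    return fid  # save this key; GET http://<volume>/<fid> serves the file
```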
I've read through all the comments and came up with a noob question:
Does S3 require any special folder structure to store millions of images?
May I just put them all inside one bucket, or should I create multiple buckets, each holding only a few images?
Hi,
I use an SQL database for storing metadata (original filename, extension, size, number of downloads, etc.), which is what SQL databases are good at, and I use the file system for storing the files, which is what file systems are good at. You could use NoSQL for the metadata, of course; I just don't need it in my case.
I do not recommend a directory layout based on any hash function, as it creates a large number of directories with just a couple of files in each. I discussed this with my server admin, and it makes sense: file systems need to create and keep information about all of these directories, and more directories mean more work when creating them, searching them, and so on.
So what I recommend is an ID-based directory layout. Use 1000 files per directory, which should be fine for all filesystems including NTFS. Derive the directory path from the ID: for file ID 123456, use the path /000/000/123/ with filename 123456. If you want to extend this storage, just create a "storage rule" like: for IDs from 1 to 500,000 use the storage mounted at C:/storage1/, for the next range D:/storage2/, etc. If you want to store more files per item, such as thumbnails, you can create a subdirectory, for example /000/000/123/123456.thumbs/, and save all the thumbnails there. There will be 1000 files and 1000 thumbnail directories in each directory. Because it's so simple, you can also store thumbnails (and other files) in a different directory tree with the same layout.
Files are easy to replicate and back up. You can store them on one or more servers, in one or many directories; it's up to you.
If you need to, move some of the files to another disk or server and set the range in the "storage rule": files from 1 to 500,000 are stored on server1, files from 500,001 to 1,000,000 on another one. It's not bulletproof, as files from one range can get much more traffic than files from another range; I haven't had to solve that yet.
It also provides a simple "API": you can upload or download files over FTP or any other protocol you wish, thanks to the straightforward directory layout and file naming; it's absolutely clear which directory a file belongs in. If you need metadata, query the database and get a simple XML/JSON object. The query should be fast, since the database stores only the metadata, not the files. The database can be on the same or another machine, and RAM goes toward the database's speed rather than toward caching file blobs, as it would if you stored the files in the database.
I use this for some (serialized) objects too, in an application I develop and maintain. I'm glad the flat files are stored outside the database rather than in it: it makes backup and restore fast, and there's no need to store files in a system (a database) running on top of a file system when the file system itself is capable of storing files well enough.
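A minimal sketch of this layout and the "storage rules" in Python, with an invented rule table:

```python
import os

STORAGE_RULES = [               # (first id, last id, storage root)
    (1,      500000,  "C:/storage1"),
    (500001, 1000000, "D:/storage2"),
]

def image_path(file_id):
    # Pick the storage root whose ID range covers this file.
    root = next(r for lo, hi, r in STORAGE_RULES if lo <= file_id <= hi)
    s = "%09d" % (file_id // 1000)  # 1000 files per directory
    return os.path.join(root, s[:3], s[3:6], s[6:9], str(file_id))

# image_path(123456) -> "C:/storage1/000/000/123/123456"
```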
Good luck and share your final decision, please :-)
Michal
Hi Chris, does MogileFS need one to organize the files into folders and subfolders, or should I just dump them in and Weed will take care of it all?
This implementation might be useful: http://github.com/acrobit/AcroFS
A simple and super-fast file-system-based storage library (C#).