Ask HighScalability: How do I organize millions of images?
Wednesday, June 20, 2012 at 10:30PM
Does anyone have any advice or suggestions on how to store millions of images? Currently the images are stored in an MS SQL database, which isn't ideal performance-wise. We'd like to migrate the images over to a file system structure, but I assume we don't want to just dump millions of images into a single directory. Besides having to contend with naming collisions, the Windows filesystem might not perform optimally with that many files.
I'm assuming one approach may be to assign each user a unique CLSID, create a folder based on the CLSID, and then place one user's files in that particular folder. Even so, this could result in hundreds of thousands of folders. What's the best organizational scheme/hierarchy for doing this?
Reader Comments (37)
Can we create a B-tree over multiple volumes? I assume the machines will have different disk layouts, and to utilize all of them while avoiding fragmentation we would have to create some form of B-tree for photo storage that spans multiple disk volumes (drives) on the box. The B-tree would also hold metadata mapping user IDs to folder locations.
Why not use the folder name trunc(picture_id / 1000)? Or trunc(user_id / 1000)/user_id/picture?
for example 15/15050/picture.jpg
It depends on the file system too.
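For illustration, a minimal Python sketch of the second variant (the file name and extension are placeholders):

```python
def picture_path(user_id: int, picture_id: int) -> str:
    # trunc(user_id / 1000) groups roughly 1000 users per top-level folder,
    # e.g. user 15050 -> 15/15050/<picture_id>.jpg
    return f"{user_id // 1000}/{user_id}/{picture_id}.jpg"
```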
Here's how Facebook does it: Finding a needle in Haystack: Facebook’s photo storage - http://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf
Here's one simple suggestion: compute the SHA-1 hash of the image, generate its hexadecimal form, and use the first two characters of the SHA-1 string as a first-level directory, the third and fourth characters as the second-level directory, and then place the file using the SHA-1 as the filename. SHA1 hashes give good distribution, even in the first few characters, so that will nicely distribute the files into a (relatively) balanced folder structure.
This simplistic approach will use no more than 256 folders at each level, but will result in a total of 65K second-level folders used to store the files. In an optimal world, storing 1M files means that each of these 65K second-level folders only has to store about 15 files.
Using the hexadecimal form of the image's SHA-1 has two very nice benefits: no name collisions, and any given file will only be stored once even if the same file is uploaded more than once.
The big disadvantage of this approach is that it mixes all of the images from all users into a single bucket. That may or may not be important for your use case.
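For illustration, a minimal Python sketch of this SHA-1 scheme (the root directory is a placeholder):

```python
import hashlib
import os
import shutil

def store_image(src_path: str, root: str = "/var/images") -> str:
    # Hash the file contents; identical uploads produce identical paths,
    # so duplicates collapse to a single stored file.
    with open(src_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    folder = os.path.join(root, digest[0:2], digest[2:4])
    os.makedirs(folder, exist_ok=True)
    dest = os.path.join(folder, digest)
    if not os.path.exists(dest):
        shutil.copyfile(src_path, dest)
    return dest
```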
Take a safe maximum number of users (say 1 billion), assign each an integer ID, and take the last 9 digits in chunks of three to create the path: i.e. your first user's files are in the \000\000\001\ folder; user 999,999,999's files are in the \999\999\999\ folder. This should keep the contents of any one folder to a fairly reasonable number.
Hope this helps.
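A minimal sketch of that layout in Python (forward slashes used for brevity):

```python
def user_folder(user_id: int) -> str:
    # Zero-pad to 9 digits and split into chunks of three:
    # user 1 -> 000/000/001, user 999999999 -> 999/999/999
    s = f"{user_id:09d}"
    return "/".join((s[0:3], s[3:6], s[6:9]))
```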
Hi
Have you considered MongoDB GridFS?
Forgot about Scale Indefinitely On S3 With These Secrets Of The S3 Masters http://highscalability.com/blog/2012/3/7/scale-indefinitely-on-s3-with-these-secrets-of-the-s3-master.html. It's a similar problem to avoiding hot partitions in a key value database and there are numerous techniques for that.
You might also consider adding a level of indirection, keeping a separate key to location map, like DNS, so that if you need to move a file for load balancing reasons or space reasons you can do so without breaking anything.
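A toy sketch of such an indirection layer (the structures and names here are made up; in practice the map would live in a small replicated datastore):

```python
# image_key -> (volume, relative_path)
image_locations = {}

def locate(image_key: str):
    return image_locations.get(image_key)

def move(image_key: str, new_volume: str, new_path: str) -> None:
    # Point the key at the file's new home; the file itself should already
    # have been copied there, so readers never see a broken link.
    image_locations[image_key] = (new_volume, new_path)
```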
Just store the images in S3 and add a couple of fields for bucket and key in your database. If the images aren't sensitive you can even serve them directly out of S3. If you're not comfortable with having them publicly available, then keep them private and just use S3 as a backend.
This is exactly the kind of problem that S3 was designed for.
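A minimal sketch with boto3 (the bucket name and key layout are assumptions):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"  # hypothetical bucket name

def save_image(image_id: str, data: bytes) -> str:
    key = f"images/{image_id}.jpg"
    s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType="image/jpeg")
    return key  # store (BUCKET, key) alongside the image row in the database

def load_image(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```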
Set up 3 linux machines, install Riak, and push all of your data into Riak.
IIRC, there was either a current or upcoming revision that allowed for S3-api work-alike access.
This solution has the benefit of being able to scale storage capacity and throughput by adding more machines, as well as being able to handle system/disk failures much better than single machines, RAID disks, etc.
Keep the metadata in the database, and store the images on Akamai/S3/etc. with a flat and meaningful naming convention.
You gain:
- essentially unlimited scale
- no need to worry about disk SPOFs, replication/redundancy, or disk/network access bottlenecks
- can connect to one of their CDNs internally (at least with Akamai you can)
- can access the files from multiple datacenters, including "offline" image processing that you might want to run from EC2, etc
You lose:
- Possibly performance, but you also might gain depending on your situation (user performance with CDN will be fine)
- $$ (maybe; then again, redundant distributed file systems are not easy or cheap to build yourself)
I don't know if it will help; it's not exactly the same case as yours.
In our case we have to store 10M pictures related to ~500K items on a Unix system.
So we store only the 500K items in the DB, and we compute an MD5 from the item's ID concatenated with the item's name.
(Our IDs aren't well distributed for modulo partitioning.)
And we do like Randall: we use a 2-level folder tree built from the first 2x2 characters of the MD5.
So the final path looks like: /a0/c1/id_item/[1..N].jpg
Our new system stores the hash of each file in the database to compute paths and do integrity checks later.
But we don't have enough data for the moment.
We've noticed that performance degrades once a folder holds more than about 1,000 files/folders, so 255 seems nice.
If you plan to use Unix, don't forget to tune the inodes on your partition, or you will eventually get "no space left on device" errors.
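A minimal Python sketch of the scheme described above, assuming the MD5 input is simply the ID concatenated with the name:

```python
import hashlib

def item_folder(item_id: int, item_name: str, root: str = "/var/pictures") -> str:
    digest = hashlib.md5(f"{item_id}{item_name}".encode("utf-8")).hexdigest()
    # The first 2x2 characters of the MD5 give the two folder levels,
    # e.g. /a0/c1/<item_id>/1.jpg ... N.jpg
    return f"{root}/{digest[0:2]}/{digest[2:4]}/{item_id}"
```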
Whatever you do, please don't store the images in the file system. Yes, you will get a FS to do what you want with some of the above tricks, but really, think about it. Can you recognize and remove duplicates easily? Can you reorganize the way the images are stored? Can you do a backup, or even a hot-standby system? Not very easily! If you don't want to do offsite storage, consider staying with a database system.
We store several million images in a MySQL MyISAM DB of about 200GB - identifier, image data, hash sum. You can add whatever metadata you want. Don't know what performance you're looking for, but with very ordinary hardware, even virtualized, we get one image in a few ms, parallel fetching increases the efficiency.
Yes, maybe NTFS handles this more efficiently than ext4, reiser, or xfs (which we tried), but problems still remain. With a database system, you can also use sharding or partitioning if your data becomes too large or if you need more performance. Key-value stores like Riak could also be an option, but I don't have the expertise to comment on that.
The problem I have with your question is that I don't know what you plan to do with the images after you have stored them.
If you plan to serve the images over the web, I would recommend using Akamai/S3/etc like Brandon and Andy recommend. They have scaling built into their platforms. If you're not serving them to remote machines, then a CDN isn't necessary.
Here are some questions that come to mind:
Will you be reading these images from disk frequently?
Are you just archiving them, or processing them repeatedly?
Will you be accessing a small subset (like the last 24 hours' worth of files) much more frequently than the others, or is it an even distribution?
Are they large files (10s or 100s of megabytes), or are they tiny images that are a few kb each?
Are there any security requirements for the storage of these images?
Do you need to name the files/directories in a way that makes sense to humans, so they can navigate the directory trees manually?
Regarding performance, do you need to be able to look up and read a single file really fast? Or be able to handle lots of concurrent file reads? Or is there some other performance requirement other than "it shouldn't break"?
The solution would likely vary depending on the answers to some of those questions.
I'd just use something like Swift or MogileFS and whatever ID you already have associated with the image.
I would suggest keeping the images on a filesystem instead of in a database; the comments about using Riak or another DB aren't going to solve the issue.
Using a file structure with SHA-1 and the like is fine, but it won't distribute the data the way you want at great scale.
GridFS or another distributed filesystem could help, but that's still not the performance you are looking for.
S3 isn't really the choice you want to make with millions of images. It's all right for startups and for scaling up in the beginning, but when you eventually want your own solution to reduce costs, AWS won't work forever.
"Don't reinvent the wheel".
I think you are looking for S3, Rackspace Cloudfiles without the extra cost and with some more flexibility.
For that I would say OpenStack Swift (used by Rackspace Cloudfiles) is the way to go.
Easily add new capacity.
Easy scale out.
Great decentralized structure (hashing ring).
No single point of failure.
Open source and easy to set up.
If you would start with some hundred images, you could start with Rackspace Cloudfiles and later switch to your own public or hybrid cloud solution.
Redundancy built in.
Deduplication will probably follow in next versions.
Have your service provide the URL etc. via a rewritable domain, which can first point at Rackspace and later switch to your own solution with OpenStack Swift, serving all images from there directly. If you really need some sort of caching for peaks and OpenStack Swift should give up (unlikely), you can use a CDN or additional caching software (custom-made, Varnish, etc.).
YouTube scaled with this approach. They used their own hardware/software solution for permanent storage and a CDN for peak loads. (Disclaimer: they didn't use Swift; it hadn't been invented yet.)
For further questions I'll be happy to help (I'll monitor the comments).
Criticism is welcome!
Disclaimer: I'm not affiliated with any of the services mentioned.
Cheers Michael
If the user has a unique ID you could use it (or a hash of it) as the key for your file structure, for example using the first 3 digits (zero-padded) for the parent directory and then the rest for the child directory:
user id 12 dir: 000/12
user id 123: 012/3
user 98769 dir: 987/69
user 299999 dir: 299/999
user 1234567 dir: 123/4567
and so on; with many more users you could also build a deeper hierarchy. I've also had good experiences with MogileFS https://github.com/mogilefs/ (http://danga.com/), also as a cheap HA solution.
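A minimal sketch of one reading of this scheme, assuming IDs are left-padded with zeros to at least five digits:

```python
def user_dir(user_id: int, pad_width: int = 5) -> str:
    # First three digits of the padded ID become the parent folder,
    # the remaining digits the child folder: 12 -> 000/12, 98769 -> 987/69.
    s = str(user_id).zfill(pad_width)
    return f"{s[:3]}/{s[3:]}"
```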
Keep the images in a database, the flexibility is superior.
Depending on your architecture, there's several steps you could make, but based on the info given I'd do this:
Front your image requests with code to build a cache on the host.
Once a resource (image) is requested, have that host cache it on filesystem.
Maintain a list to determine LRU (Least Recently Used).
You would then prune based off of this list to your liking: You could set up a 'TTL' of sorts, or calculate the cache size and maintain it at X level.
At a higher level of growth requirement, you could dedicate servers solely to maintain this cache and have a single list that indicates what cache server the image is on and direct requests accordingly in addition to above suggestion.
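A minimal in-process sketch of that LRU idea (the class name, size cap, and fetch callback are all made up):

```python
from collections import OrderedDict

class ImageCache:
    """Tiny LRU cache keyed by image ID; evicts the least recently used entry."""

    def __init__(self, max_items: int = 10000):
        self.max_items = max_items
        self._items = OrderedDict()  # image_id -> image bytes

    def get(self, image_id, fetch_from_origin):
        if image_id in self._items:
            self._items.move_to_end(image_id)  # mark as most recently used
            return self._items[image_id]
        data = fetch_from_origin(image_id)     # e.g. pull from the DB or object store
        self._items[image_id] = data
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)    # evict the least recently used entry
        return data
```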
What about storing the images in HDFS and the metadata in HBase? But we would need a layer to abstract the small files into a single large file in Hadoop... like what HBase does with Hadoop.
I started a long comment, but did it as a blog post instead, because it was getting too long. Summary: I like the hash approach, but also consider using random numbers for folder names, and keep your number of file system entries to around 1000 per level.
+1 for OpenStack Swift, ideal if you don't want to place this with a hosted provider and have a lot of disks floating around! The added benefit being the metadata structure that is built into the objects and API natively. If you have low to moderate metadata needs, you could exclude the DB completely. More extensive metadata would require a supplementary database.
It really depends on the use case.
I use this simple naming system for a "web-picture-upload service":
/ {YYYY} / {MM} / {DD} / {MD5}-{PX-SIZE}.jpg
example:
/2012/06/22/f65e75e37561064261895fec2a6a8532-100x100.jpg
/2012/06/22/f65e75e37561064261895fec2a6a8532-75x75.jpg
- I can easily make daily incremental backups
- I can (relatively) easily find duplicate files
Of course, the system can be made even better by splitting the MD5 into directories by its first 2, 3, or 4 characters, as someone mentioned above.
e.g.
/2012/06/22/f6/5e/75e37561064261895fec2a6a8532-100x100.jpg
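A minimal Python sketch of both variants (the function name and arguments are illustrative):

```python
import hashlib
from datetime import date

def upload_path(data: bytes, px_size: str, when=None, split_md5: bool = False) -> str:
    when = when or date.today()
    digest = hashlib.md5(data).hexdigest()
    if split_md5:
        # /2012/06/22/f6/5e/75e37561064261895fec2a6a8532-100x100.jpg
        name = f"{digest[:2]}/{digest[2:4]}/{digest[4:]}"
    else:
        # /2012/06/22/f65e75e37561064261895fec2a6a8532-100x100.jpg
        name = digest
    return f"/{when:%Y/%m/%d}/{name}-{px_size}.jpg"
```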
Cheers Dusan
We use a similar system to what has already been described but with a slight twist.
Say each image has an ID which is a long, e.g. 123456789.
It has been suggested to create a directory structure like 12/34/56/images-in-here.jpg.
However, if your ID is a sequence, the spread of images across folders is going to be very skewed. To avoid this we build the folders from the digit pairs in reverse order; in our case the images would be in folder 89/67/45/images-in-here.
The advantage of this system:
- Each folder will have at most 100 subfolders.
- Depending on the number of images you want to store you can estimate the number of images in the leaf folders easily and adjust the depth according to your needs
- Very simple!
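A minimal sketch of that reversal, assuming three levels of two-digit folders built from the least significant digits:

```python
def leaf_folder(image_id: int, levels: int = 3) -> str:
    # Take two-digit chunks starting from the least significant end,
    # so sequential IDs spread evenly: 123456789 -> 89/67/45
    s = str(image_id).zfill(2 * levels)
    chunks = [s[len(s) - 2 * (i + 1): len(s) - 2 * i] for i in range(levels)]
    return "/".join(chunks)
```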
What about MogileFS? It uses MySQL for storing metadata and the local filesystem for the files.
Be careful about recommending that images be stored in a database. My app has 100 million+ images taking up over 100TB of disk. My database is about 100GB. If I ever had to restore my database, I might as well just turn the site off, because it would be in maintenance mode for so long that my customers would just go away. A database is just not good at handling ridiculous amounts of unsearchable binary data.
We have recently migrated our image servers to MogileFS + Nginx and it works great. You can add nodes if you run out of space, and make MogileFS automatically store several copies of each file. It is backed by a MySQL database which you can replicate to another server to prevent failure. Give it a look.