Entries by Todd Hoff (380)

Wednesday
Oct032007

Save on a Load Balancer By Using Client Side Load Balancing

In Client Side Load Balancing for Web 2.0 Applications author Lei Zhu suggests a very interesting approach to load balancing: forget DNS round robin, toss your expensive load balancer, and make your client do the load balancing for you. Your client maintains a list of possible servers and cycles through them. All the details are explained in the article, but it's an intriguing idea, especially for the budget-conscious startup.
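
To make the idea concrete, here is a minimal sketch of that client-side rotation. The article itself does this in browser-side JavaScript; the class name, server list, and timeouts below are my own Java illustration, not code from the article:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Round-robins requests across a client-held server list and fails
    // over to the next server when one doesn't answer.
    public class ClientSideBalancer {
        private final List<String> servers;  // e.g. ["http://s1.example.com", ...]
        private final AtomicInteger next = new AtomicInteger(0);

        public ClientSideBalancer(List<String> servers) {
            this.servers = servers;
        }

        // Fetch a path, trying each server at most once before giving up.
        public String fetch(String path) throws IOException {
            IOException lastFailure = null;
            for (int attempt = 0; attempt < servers.size(); attempt++) {
                int i = Math.floorMod(next.getAndIncrement(), servers.size());
                try {
                    HttpURLConnection conn = (HttpURLConnection)
                            new URL(servers.get(i) + path).openConnection();
                    conn.setConnectTimeout(2000);  // fail fast, try the next server
                    conn.setReadTimeout(5000);
                    if (conn.getResponseCode() == 200) {
                        return new String(conn.getInputStream().readAllBytes());
                    }
                } catch (IOException e) {
                    lastFailure = e;  // remember it and rotate to the next server
                }
            }
            throw lastFailure != null ? lastFailure : new IOException("no servers available");
        }
    }

The appeal for a startup is that the failover and balancing logic ships with the client, so no load balancer sits in the request path at all.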

Click to read more ...

Wednesday
Oct032007

Paper: Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services

Abstract: When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.

Click to read more ...

Tuesday
Oct022007

Some Real Financial Numbers for Your Startup

If you are a startup you may find Guy Kawasaki's post Financial Models for Underachievers: Two Years of the Real Numbers of a Startup useful. Part of any business plan is a set of projected guesstimates. They are guesstimates because everyone keeps these numbers hidden like a Swiss bank account. But not Redfin. They've bravely shared their initial cost projections, their actual numbers from real life, and the lessons they've learned from the discrepancy between the two... You can find their model estimates and actuals for Rent, Per Employee, Per Month (model: $250, actual: $336); Initial Per-Employee Equipment Cost; Monthly Benefits, Per-Employee; Annual Payroll Tax; Quarterly Bonus Payout, as a % of the Total Possible; Annual Payroll Increase for Existing Employees; All-Company Meeting Cost, Per-Meeting, Per-Employee; Annual Accounting Costs; and a few more. There is also a great lessons section: Focus on headcount; Plan slow, run fast; Run top-down sanity-checks; Forget economies of scale; Admit that revenues are a mystery; Build from building blocks; Take out "hope"; Flag your assumptions; Hit $100 million in revenues within five years; Keep market-share under 20%. I find $100 million in revenues a surprisingly high number. That's a lot of money. And the underestimate for meeting costs is pretty funny. It's always those damn meetings!

Click to read more ...

Tuesday
Oct022007

Secrets to Fotolog's Scaling Success

Fotolog, a social blogging site centered around photos, grew from about 300 thousand users in 2004 to over 11 million users in 2007. Though they initially experienced the inevitable pains of rapid growth, they overcame their problems and now manage over 300 million photos, with 800,000 new photos added each day. Generating all that fabulous content are 20 million unique monthly visitors and a volunteer army of 30,000 new users each day. They did so well that a very impressed suitor bought them out for a cool $90 million. That's scale meets success by anyone's standards. How did they do it? Site: http://www.fotolog.com/

Information Sources

  • Scaling the World's Largest Photo Blogging Community
  • Congrats to Fotolog on $90mm sale to Hi-Media
  • Fotolog overtaking Flickr?
  • Fotolog Hits 11 Million Members and 300 Million Photos Posted
  • Site of the Week: Fotolog.com by PC Magazine
  • CEO John Borthwick's blog
  • DBA Frank Mash's blog
  • Fotolog, lessons learnt by John Borthwick

    The Platform

  • Java
  • PHP
  • Sun
  • Solaris 10
  • MySQL
  • Apache
  • Hibernate
  • Memcached
  • 3PAR (a simple, efficient and scalable tiered-storage array for utility computing)
  • IBRIX (a single namespace parallel file system, a scalable volume manager, high availability feature)
  • StrongMail
  • CDN: Akamai/Panther

    The Stats

  • Started in 2002. In 2004 they had around 300k to 400k members, 3 employees, no scalable infrastructure, and no revenue model.
  • Due to the rapid growth the site had frequent technical problems, and in 2005 they had to limit new free members to 1,000 a day.
  • In 2007 they had over 11 million users and were sold for $90 million to Hi-Media.
  • Members are from over 200 countries, with a majority in South America. Over 20% of page views are from Europe. They rejected a US-centric strategy, instead developing a global, engaged audience.
  • Generates over 3.5 billion page views and receives over 20 million unique visitors each month and has earned a top 20 Alexa ranking.
  • Manages over 300 million photos, and over 500,000 photos are uploaded each day.
  • Over 30,000 new members are added each day, and the site attracts more than 4.6 million daily users. They expanded with no marketing or member incentives.
  • Over 500 user-generated communities.
  • 20% of members visit the site daily and spend an average of 24 minutes.
  • 32 MySQL servers and a 30-server memcached cluster.

    The Architecture

  • Site originally written in PHP.
      - Their new "Fotolog memberpage" feature is written in Java, with a significant performance improvement. The page is cleaner with an improved response time.
      - They are now serving the site on less than half the boxes they were using.
      - Daily registrations are up over 35%, given the improved performance and a requirement to register to post a guest book message.
      - The new code base allows them to innovate much more on the member experience.
  • They have surpassed Flickr in popularity while being a firmly Web 1.0 application.
      - There are no tags, no APIs, no JavaScript widgets, no Ajax.
      - They have a Spanish language option which extends the site to a broad user base.
      - They use very little text. The site is mostly visual, so it is usable by a broad range of users.
      - Their interface is customizable, and many people like to express their individual identities.
      - Their unique visitors are 1MM less than Yahoo's, yet total minutes on the site are twice Yahoo's and page views are 3x.
  • Revenue model:
      - A gold camera membership for about $5/month means you can upload 6 photos a day instead of 1, have 200 comments per photo instead of 20, a custom title image for your profile, and a mini-thumbnail of your most recent photo displayed next to your name in guest books, plus the possibility of having your photo featured on the front page.
      - AdSense. Revenue lift from Google is trending up approximately 15% given additional contextual data from guest books.
      - They will move to peer-to-peer advertising among their members.
      - Members will have the ability to buy and sell real and virtual items using a micro-payment service.
  • They have a one-post-per-day rule where users can only post one photo a day. Rather than inhibit growth, this rule ensures quality and generates exceptional usage by increasing the chance of a photo being seen and by attracting positive comments. Whereas people usually run out of things to say on a blog, people can always find a picture to take, upload, and talk about.
  • Only photos less than 2,000 KB in size can be uploaded. These are automatically resized to a 500x500 format. Pages look cleaner and load faster.
  • Model is browsing over searching. Opportunistic serendipitous treasure hunting is encouraged.
  • Friends are added automatically without needing permission. This generates an audience for your photos.
  • Supports a browse by groups feature, which have categories like "Colors" and "Emotions."
  • The site is intentionally simple.
      - They have resisted the temptation to add feature after feature. Instead, their vision is to offer a handful of features, similar to Craigslist, with the focus on content and the conversations.
      - Pages need to be social.
      - Pages need to include not only your images, but also images from across the network, providing a visual navigation that today drives much of the time members spend on the site: a self-formed, organic distribution system, letting members see and be seen.
      - Complementing this social network of images are comments and guest book entries, making the experience one where media intersects with communications. Day in, day out, millions of images collide with billions of conversations.
  • Photobucket vs. Fotolog:
      - Photobucket stores image-based media, then distributes it to your page on social networking sites such as MySpace, Bebo, Piczo, Friendster, etc.
      - Fotolog is a destination.
      - The first generation of social-networking sites stressed self-publishing over connections (from GeoCities, to Tripod, to Blogger). The next generation focused mostly on connections (SixDegrees and Friendster are the classic examples here: tools to gather friends and connections, as social capital in theory accrues to the people with the most connections). The third and current generation of sites blends media with connections, each with a different emphasis.
  • Backup: Sun 6540 disk array
  • Their 32 SQL servers are divided into four clusters: user, GB (guest book), PH (photos), and FF (friends and favorites lists).
      - Uses non-persistent connections.
      - Connection pooling on the Java side.
      - InnoDB.
      - Partitioning is handled by the application layer.
  • Each cluster:
      - Is fronted by a set of application servers.
      - Is divided into a set of shards.
      - Each shard has a MySQL write-only master-master configuration feeding a few read-only slaves.
      - Application servers send their read requests to the slaves and their write requests to the masters.
      - Data are assigned to shards based on a cluster-specific partitioning key. Naive partitioning algorithms can lead to very uneven shard loads; you want a more balanced load on each shard. (See the routing sketch after this list.)
  • MySQL is used to store image metadata only. This seems pretty standard. Almost nobody seems to store important blobs in the database because it slows down database operations.
  • Photo storage uses 3PAR and IBRIX. A CDN is used for hot content.
  • The virtual storage system, though expensive, has worked very well.
  • As more selects are used, lock contention for auto-incremented keys grows.
  • Through database optimizations they've been able to grow from 4 million members to 11 million members on the same 32 database servers. This is also due to the efficiency of MySQL running on Solaris 10, growing the memcached cluster, porting to Java, and adding RAM.
  • Happy with memcached. (A read-through caching sketch follows this list.)
      - Created a distributed cluster of 50 memcached servers with a total cache size of approximately 150 gigabytes, supporting around 4 billion page views/month. Peak load times dropped from 10 seconds to 2 seconds.
      - Quote from the CTO:
    I have a new memcached user to add to your list: we here at Fotolog, the world's largest photo blogging community, now use it and we love it. I just rolled our first code to use it into production today and it has been a lifesaver. I can't wait to start using it in places where we had been relying on Berkeley databases to offload some database work. We are not some wimpy million page a day site, either. Fotolog is a billion+ pages/month site (35 to 40 million views/day is pretty typical for us). We had recently overcome some significant DB-related performance issues which allowed our site traffic to explode, and it started to bog down again under the heavy traffic load (getting back up towards 10 seconds for a page to load sometimes during the peak periods). The servers were churning away each recreating a list every time when it could easily be shared in the same form for at least 5 or 10 minutes. So we introduced memcache, creating a distributed 30-server cluster with 4 gigs available in total and made a very minor code mod to use memcache, and our peak period load times dropped back down to the 2 second or so range. It has allowed for continued growth and incredible efficiency. I can't say when I've ever been so pleased with something that worked so simply.
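
To picture the cluster layout described above, here is a sketch of application-layer shard routing: reads go to a rotating slave, writes to the shard's master. The class names and the modulo key mapping are my assumptions for illustration, not Fotolog's actual code:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.List;
    import javax.sql.DataSource;

    // One shard: a master for writes and several read-only slaves.
    class Shard {
        final DataSource master;
        final List<DataSource> slaves;
        int nextSlave = 0;  // simple rotation; not thread-safe, sketch only

        Shard(DataSource master, List<DataSource> slaves) {
            this.master = master;
            this.slaves = slaves;
        }
    }

    // Routes a query to the right shard based on the partitioning key.
    public class ShardRouter {
        private final List<Shard> shards;  // e.g. the guest book (GB) cluster

        public ShardRouter(List<Shard> shards) {
            this.shards = shards;
        }

        private Shard shardFor(long partitionKey) {
            // Naive modulo mapping; a production scheme needs something
            // more balanced, as noted above.
            return shards.get((int) Long.remainderUnsigned(partitionKey, shards.size()));
        }

        public Connection forWrite(long partitionKey) throws SQLException {
            return shardFor(partitionKey).master.getConnection();
        }

        public Connection forRead(long partitionKey) throws SQLException {
            Shard s = shardFor(partitionKey);
            s.nextSlave = (s.nextSlave + 1) % s.slaves.size();
            return s.slaves.get(s.nextSlave).getConnection();
        }
    }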
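
The caching win the CTO describes, recomputing a list once and sharing it for 5 or 10 minutes, is a classic read-through cache. A minimal sketch of that pattern using the spymemcached Java client (the key scheme and query are placeholders; Fotolog's actual client and code are not published):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.util.ArrayList;
    import java.util.List;
    import net.spy.memcached.MemcachedClient;

    // Serves an expensive-to-build list from memcached, hitting the
    // database only on a cache miss and sharing the result for 5 minutes.
    public class CachedListLoader {
        private final MemcachedClient cache;

        public CachedListLoader() throws IOException {
            cache = new MemcachedClient(new InetSocketAddress("memcached1", 11211));
        }

        @SuppressWarnings("unchecked")
        public List<String> recentPhotoIds(long userId) {
            String key = "recent-photos:" + userId;  // placeholder key scheme
            List<String> ids = (List<String>) cache.get(key);
            if (ids == null) {
                ids = loadFromDatabase(userId);  // the expensive query
                cache.set(key, 300, ids);        // expire after 5 minutes
            }
            return ids;
        }

        private List<String> loadFromDatabase(long userId) {
            // Placeholder for the real SELECT against the photo shard.
            return new ArrayList<>();
        }
    }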

    Lessons Learned

  • Popularity is driven by a base of active users, not a rich set of cool features.
  • The web is global and its tail is very long. By courting users outside the US with language- and culture-specific design you can compete with the big boys. Some of the hardest competition for Google, Yahoo, etc. comes from local startups with an ear to what the locals want.
  • If you want to get a lot of buzz, do whatever the alpha geeks want you to do. If you want a lot of happy users, do what they want you to do.
  • Constraints in web sites can, as in poetry, make something unexpectedly better. The rule that users are only allowed to post one photo per day creates an environment where people comment more on each other's photos, which creates a more engaged community. Who knew?
  • Protect your website with limits. Limit the size of pictures, comments, etc. so your resource usage doesn't grow outrageously.
  • Have a vision. Have a strong sense of what your site is supposed to be and why, then use that vision to decide what you should build and how you should build it. Their vision of a social site built around daily photographs led to a very different site than one whose goal is to store all your photos.
  • Revenue generation features can be added without destroying the integrity of your site. I really like how they give people a reasonable set of features for free and then charge for the resources they need to have more. Those features also serve to extend and reinforce the social vision of their site. It will be interesting to see how their new monetization strategies play out.
  • Don't be afraid to scale up and out. By adding more cache, more RAM, more CPUs, and more efficient CPUs you handle dramatically more load with the same number of machines. And that's a good thing from a datacenter space and power POV.
  • Making MySQL perform:
      - Find the source of the problem.
      - Mature systems are mostly disk bound.
      - The query cache may be hurting you.
      - Add RAM to help dodge the bullet.
      - Stripe your disks.
      - Restructure tables for optimal performance.
      - Use libumem.so to find memory leaks.
  • Things to remember:
      - Know the problem.
      - Know your application.
      - Know your storage engine.
      - Know your requirements.
      - Know your budget.
      - Use all this information to decide what parts of your system really require the investment of time, money, and testing to be highly available.

    Related Articles

  • Flickr Architecture
  • An Unorthodox Approach to Database Design : The Coming of the Shard

Click to read more ...

Monday
Oct012007

SmugMug Found their Perfect Storage Array

SmugMug's CEO & Chief Geek Don MacAskill smugly (hard to resist) gushes over finally finding, after a long and arduous quest, their "best bang-for-the-buck storage array." It's the Dell MD3000. His in-depth explanation of why he prefers the MD3000 should help anyone with their own painful storage deliberations. His key points are: The price is right; DAS via SAS; 15 spindles at 15K RPM each; 512MB of mirrored battery-backed write cache; you can disable read caching; you can disable read-ahead prefetching; the stripe sizes are configurable up to 512KB; the controller ignores host-based flush commands by default; and it supports an 'Enhanced JBOD' mode. His reasoning for the desirability of each option is astute, and he even gives you the settings for carrying out the configuration. This is not your average CEO. Don also speculates that a three-tier system using flash (system RAM + flash storage + RAID disks) is a possible future direction. Unfortunately, flash may not be the dream solution it has been thought to be. StorageMojo talks about this in Flash vs disk at DISKCON 2007.

Click to read more ...

Friday
Sep282007

Kosmos File System (KFS) is a New High End Google File System Option

There's a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS, and I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree, so there's a better than average chance this software will be worth considering. After you stop trying to turn KFS into "Kentucky Fried File System" in your mind, take a look at KFS' intriguing feature set:

  • Incremental scalability: New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
  • Availability: Replication is used to provide availability in the face of chunkserver failures. Typically, files are replicated 3-way.
  • Per-file degree of replication: The degree of replication is configurable on a per-file basis, with a maximum of 64.
  • Re-replication: Whenever the degree of replication for a file drops below the configured amount (for example, due to an extended chunkserver outage), the metaserver forces the block to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
  • Re-balancing: Periodically, the metaserver may rebalance the chunks amongst chunkservers. This is done to help balance disk space utilization amongst nodes.
  • Data integrity: To handle disk corruption of data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
  • File writes: The system follows the standard model. When an application creates a file, the filename becomes part of the filesystem namespace. For performance, writes are cached at the KFS client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also, applications can force data to be flushed to the chunkservers. In either case, once data is flushed to the server, it is available for reading.
  • Leases: KFS client library uses caching to improve performance. Leases are used to support cache consistency.
  • Chunk versioning: Versioning is used to detect stale chunks.
  • Client-side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail over to another chunkserver and continue the read. This fail-over is transparent to the application (a sketch of this pattern follows the lists below).
  • Language support: KFS client library can be accessed from C++, Java, and Python.
  • FUSE support on Linux: By mounting KFS via FUSE, existing Linux utilities (such as ls) can interface with KFS.
  • Tools: A shell binary is included in the set of tools. This allows users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunk/meta-servers are also provided.
  • Deploy scripts: To simplify launching KFS servers, a set of scripts to (1) install KFS binaries on a set of nodes and (2) start/stop KFS servers on a set of nodes is also provided.

KFS seems to compare very favorably to GFS and is targeted at:
  • Primarily write-once/read-many workloads
  • A few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size
  • Mostly sequential access

As Rich says, everyone needs to solve the "storage problem," and this looks like an exciting option to add to your bag of tricks. What we are still missing, though, is a Bigtable-like database on top of the file system for scaling structured data. If anyone is using KFS, please consider sharing your experiences.
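
Purely as an illustration of the client-side fail-over pattern in the feature list above (this is not KFS's real API; the ChunkServerClient type is hypothetical), a failover read amounts to trying each replica in turn:

    import java.io.IOException;
    import java.util.List;

    // Hypothetical chunkserver handle, standing in for whatever the
    // real KFS client library uses internally.
    interface ChunkServerClient {
        byte[] readChunk(String chunkId, long offset, int length) throws IOException;
    }

    public class FailoverReader {
        // Try each chunkserver holding a replica; only surface an error
        // if every one of them is unreachable. The application above
        // this layer never sees the individual failures.
        public byte[] read(List<ChunkServerClient> replicas, String chunkId,
                           long offset, int length) throws IOException {
            IOException lastFailure = null;
            for (ChunkServerClient replica : replicas) {
                try {
                    return replica.readChunk(chunkId, offset, length);
                } catch (IOException e) {
                    lastFailure = e;  // replica down; fail over to the next one
                }
            }
            throw new IOException("all replicas unreachable", lastFailure);
        }
    }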

    Related Articles

  • Hadoop
  • Google Architecture
  • You Can Now Store All Your Stuff on Your Own Google Like File System.

Click to read more ...

Thursday
Sep272007

Product: Sequoia Database Clustering Technology

Sequoia is a transparent middleware solution offering clustering, load balancing and failover services for any database. Sequoia is the continuation of the C-JDBC project. The database is distributed and replicated among several nodes and Sequoia balances the queries among these nodes. Sequoia handles node and network failures with transparent failover. It also provides support for hot recovery, online maintenance operations and online upgrades.

    Features in a nutshell

  • No modification of existing applications or databases.
  • Operational with any database providing a JDBC driver.
  • High availability provided by advanced RAIDb technology.
  • Transparent failover and recovery capabilities.
  • Performance scalability with unique load balancing and query result caching features.
  • Integrated JMX-based administration and monitoring.
  • 100% Java implementation allowing portability across platforms with a JRE 1.4 or greater.
  • Open source licensed under Apache v2 license.
  • Professional support, training and consulting provided by Continuent Inc.

Sequoia is the core technology providing database clustering capabilities. It is composed of a controller implementing the RAIDb (Redundant Array of Inexpensive Databases) technology. Sequoia controllers are replicated for HA and scalability purposes. Controllers use group communication to synchronize the cluster. Hedera is a group communication wrapper that can be plugged in to work with multiple group communication implementations such as Appia, JGroups or Spread. Sequoia comes with a JDBC driver with transparent failover capabilities for Java applications. Additional drivers for PHP, Perl, ODBC, MySQL native API and C/C++ applications are also provided through the Carob project.
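
Since Sequoia presents itself to the application as an ordinary JDBC driver, adopting it is mostly a connection-string change. A sketch of what that looks like; the driver class name, URL scheme, and controller port are as I recall them from the Sequoia documentation, so verify them against your version:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SequoiaExample {
        public static void main(String[] args) throws Exception {
            // Load Sequoia's driver instead of the database vendor's.
            Class.forName("org.continuent.sequoia.driver.Driver");

            // Listing both controllers lets the driver fail over
            // transparently if one of them goes down.
            Connection conn = DriverManager.getConnection(
                    "jdbc:sequoia://controller1:25322,controller2:25322/mydb",
                    "user", "secret");

            // Application code is unchanged from plain JDBC; the
            // controller balances queries across the backend databases.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
            conn.close();
        }
    }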

Click to read more ...

Thursday
Sep272007

Product: Ganglia Monitoring System

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Click to read more ...

Wednesday
Sep262007

Use a CDN to Instantly Improve Your Website's Performance by 20% or More

If you have a lot of static content to store and you aren't looking forward to setting up and maintaining your own giganto SAN, maybe you can push off a lot of the hard lifting to a CDN? Jesse Robbins at O'Reilly Radar posts that you have a lot more options now because the number of Content Distribution Networks has doubled since last year. In fact, Dan Rayburn says there are now 28 CDN providers in the market. Hopefully you can find reasonable pricing at one of them. Other than easing your burden, why might a CDN work for you? Because it makes your site faster and customers like that. How can a CDN so dramatically improve your site's performance? Steve Souders, author of High Performance Web Sites: Essential Knowledge for Front-End Engineers, lists using a CDN as one of his "Thirteen Simple Rules for Speeding Up Your Web Site." About CDNs Steve says:

    Remember that 80-90% of the end-user response time is spent downloading all the components in the page: images, stylesheets, scripts, Flash, etc. This is the Performance Golden Rule, as explained in The Importance of Front-End Performance. Rather than starting with the difficult task of redesigning your application architecture, it's better to first disperse your static content. This not only achieves a bigger reduction in response times, but it's easier thanks to content delivery networks. ... At Yahoo!, properties that moved static content off their application web servers to a CDN improved end-user response times by 20% or more. Switching to a CDN is a relatively easy code change that will dramatically improve the speed of your web site.

It's at least worth looking into if you're after a performance boost or are concerned about storing so many buckets of bits.
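
The "relatively easy code change" usually comes down to rewriting static asset URLs so they point at CDN hostnames. A toy sketch (the hostnames are invented) that hashes each asset path to a host, so a given asset always maps to the same URL and stays cacheable:

    // Maps static asset paths onto CDN hostnames. Hashing keeps the
    // mapping stable per asset (good for caching) while spreading
    // assets across hosts so browsers can download in parallel.
    public class CdnUrl {
        private static final String[] HOSTS = {
            "https://static1.example-cdn.com",  // invented hostnames
            "https://static2.example-cdn.com",
        };

        public static String forAsset(String path) {
            int bucket = Math.floorMod(path.hashCode(), HOSTS.length);
            return HOSTS[bucket] + path;
        }

        public static void main(String[] args) {
            // A template's <img src="..."> would call:
            System.out.println(CdnUrl.forAsset("/images/logo.png"));
        }
    }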

Click to read more ...

Tuesday
Sep182007

Session management in highly scalable web sites

Hi,

Every application server has its own session management implementation for supporting high scalability, but an application architect/developer has to design and implement the application to make the best use of it. What are the guiding principles and patterns for session state management? The WebSphere system management redbook mentions that "session management performance is optimum when session data per user is around 2 KB. It degrades if session data is more than that." I have the following questions.

1. How do you measure session data per user?

2. It is generally recommended that you keep all the session state in a database and keep only the keys in the HttpSession object. Then every time a web request is processed, session data is fetched from the database. This way the data stays in memory only while the request is processed, and the actual data in HttpSession is very small (only a few keys). What is the general practice? At what point should you switch from keeping data in HttpSession to the database? Do websites like Amazon or eBay follow this?

3. Is there any open source framework which helps you do session management in the way mentioned in point 2?

Thanks,
Unmesh
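
A minimal sketch of the pattern described in question 2: the session holds only a small key, and the bulky state lives in the database, loaded per request. The servlet and DAO names below are placeholders, not a recommendation of any specific framework:

    import java.io.IOException;
    import java.util.UUID;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    // Placeholder types standing in for real persistence code.
    class Cart { int itemCount() { return 0; } }
    class CartDao {
        String createCart() { return UUID.randomUUID().toString(); }
        Cart load(String id) { return new Cart(); }
    }

    public class CartServlet extends HttpServlet {
        private final CartDao cartDao = new CartDao();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            HttpSession session = req.getSession();
            String cartId = (String) session.getAttribute("cartId");  // key only
            if (cartId == null) {
                cartId = cartDao.createCart();
                session.setAttribute("cartId", cartId);  // well under 2 KB
            }
            // The bulky state is fetched fresh on each request and is
            // never stored in the session itself.
            Cart cart = cartDao.load(cartId);
            resp.getWriter().println("Items in cart: " + cart.itemCount());
        }
    }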

Click to read more ...