High Scalability -

Entries by HighScalability Team (1576)

Monday

Apr 19, 2010

Strategy: Order Two Mediums Instead of Two Smalls and the EC2 Buffet

Monday, April 19, 2010 at 12:04AM

In "Web serving in the cloud – our experiences with nginx and instance sizes", Vaibhav Puranik describes his team's experience trying to maximize traffic and minimize web serving costs on EC2. Initially they tested with two m1.small instances and then switched to two c1.medium instances. The m1s are the standard instance types and the c1s are the high-CPU instance types. Obviously the mediums have greater capability, but the cost difference was interesting:

Click to read more ...

HighScalability Team |

10 Comments |

Strategy,

amazon

Friday

Apr 16, 2010

Hot Scalability Links for April 16, 2010

Friday, April 16, 2010 at 7:15AM

Twitter gets a total of 3 billion requests a day via its API; 105,779,710 registered users; 300,000 new registered users a day; 180 million unique visitors a month; 55 million tweets a day.
Who has the most servers? Google 1 million+; Intel 100K; 1&1 Internet 70K; Facebook 30K; Akamai 61K; Rackspace 56k+.
Cloud Computing Economies of Scale. James Hamilton gives a fabulous talk breaking down where the costs are in the cloud. It's not where you may think. Higher utilization is the key. More here.
Erlang Factory: Andy Gross: Distributed Erlang Systems In Operation: Patterns and Pitfalls by Martin J. Logan. Great overview of architecting distributed systems in Erlang. Covers what you want and don't want in a distributed system and how to compromise those elements, what's common, system design, cluster membership, load balancing, upgrades, debugging, and more.
Extreme Scale Computing by Irving Wladawsky-Berger. “An exascale supercomputer capable of a million trillion calculations per second – dramatically increasing our ability to understand the world around us through simulation and slashing the time needed to design complex products such as therapeutics, advanced materials, and highly-efficient autos and aircraft.”

Click to read more ...

HighScalability Team |

Parallel Information Retrieval and Other Search Engine Goodness

Wednesday, April 14, 2010 at 8:19AM

Parallel Information Retrieval is a sample chapter in what appears to be a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher of Google Inc. and Charles L. A. Clarke and Gordon V. Cormack, both of the University of Waterloo. The full table of contents is online and looks really interesting: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects.

Currently available is the full text of chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty:

Click to read more ...

HighScalability Team |

2 Comments |

BigData,

Paper

Tuesday

Apr 13, 2010

Strategy: Saving Your Butt With Deferred Deletes

Tuesday, April 13, 2010 at 6:38AM

Deferred Deletes is a technique where deleted items are marked as deleted but not garbage collected until some days or, preferably, weeks later. James Hamilton describes this strategy in his classic On Designing and Deploying Internet-Scale Services:

Never delete anything. Just mark it deleted. When new data comes in, record the requests on the way. Keep a rolling two week (or more) history of all changes to help recover from software or administrative errors. If someone makes a mistake and forgets the where clause on a delete statement (it has happened before and it will again), all logical copies of the data are deleted. Neither RAID nor mirroring can protect against this form of error. The ability to recover the data can make the difference between a highly embarrassing issue or a minor, barely noticeable glitch. For those systems already doing off-line backups, this additional record of data coming into the service only needs to be since the last backup. But, being cautious, we recommend going farther back anyway.
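
To make the idea concrete, here is a minimal sketch of a deferred delete using Python's built-in sqlite3 module. It is not taken from Hamilton's paper: the table layout, the `deleted_at` column, the function names, and the 14-day retention window are all assumptions chosen for illustration. Deletes only mark a row, reads skip marked rows, and a separate purge step physically removes rows once the retention window has passed.

```python
import sqlite3

# Assumed retention window: rows stay recoverable for two weeks after deletion.
RETENTION_SECONDS = 14 * 24 * 3600


def connect():
    """Create an in-memory database with a hypothetical items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT, deleted_at REAL)"
    )
    return conn


def soft_delete(conn, item_id, now):
    """Mark a row as deleted at time `now` (epoch seconds); nothing is removed."""
    conn.execute(
        "UPDATE items SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, item_id),
    )


def undelete(conn, item_id):
    """Recover from a mistaken delete by clearing the marker."""
    conn.execute("UPDATE items SET deleted_at = NULL WHERE id = ?", (item_id,))


def live_items(conn):
    """Normal reads only see rows that are not marked deleted."""
    rows = conn.execute("SELECT id FROM items WHERE deleted_at IS NULL ORDER BY id")
    return [row[0] for row in rows]


def purge(conn, now):
    """Physically remove rows whose delete marker is older than the window."""
    cutoff = now - RETENTION_SECONDS
    cur = conn.execute(
        "DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount
```

With this layout, a delete statement that forgets its where clause would still mark every row, but `undelete` (or a bulk update clearing `deleted_at`) can bring the rows back as long as `purge` has not yet run past the retention window.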

Click to read more ...

HighScalability Team |

9 Comments |

Strategy

Monday

Apr 12, 2010

Poppen.de Architecture

Monday, April 12, 2010 at 7:53AM

This is a guest post by Alvaro Videla describing their architecture for Poppen.de, a popular German dating site. This site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to successfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

What is Poppen.de?

Poppen.de (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to get some scaling problems.

The Stats

2.000.000 users
20.000 concurrent users
300.000 private messages per day
250.000 logins per day
We have a team of eleven developers, two designers and two sysadmins for this project.

Click to read more ...

HighScalability Team |

21 Comments |

Example,

nosql

Friday

Apr 9, 2010

Vagrant - Build and Deploy Virtualized Development Environments Using Ruby

Friday, April 9, 2010 at 8:38AM

One of the cool things we are seeing is more tools and tool chains for performing very high level operations quite simply. Vagrant is such a tool for building and distributing virtualized development environments.

Web developers use virtual environments every day with their web applications. From EC2 and Rackspace Cloud to specialized solutions such as EngineYard and Heroku, virtualization is the tool of choice for easy deployment and infrastructure management. Vagrant aims to take those very same principles and put them to work in the heart of the application lifecycle. By providing easy to configure, lightweight, reproducible, and portable virtual machines targeted at development environments, Vagrant helps maximize your productivity and flexibility.

If you've created a build and deployment system before, Vagrant does a lot of the work for you:

Click to read more ...

HighScalability Team |

4 Comments |

Product,

automation

Thursday

Apr 8, 2010

Hot Scalability Links for April 8, 2010

Thursday, April 8, 2010 at 7:46AM

Scalability porn (SFW). Real-time meter for the number of ads being served by DoubleClick. Amazing. A constant ~390,000 impressions a second are being served, with 25 trillion served since 1996. Thanks to Mike Rhoads for the title idea.
Scalability? Don't worry. Application complexity? Worry by Joe McKendrick. The next challenge on enterprise agendas: application complexity. This is something that lots of hardware — whether from the cloud or internal data center — cannot fix
Leo Laporte and Steve Gibson talked about how the iPad was a denial of service attack on UPS delivery schedules. UPS trucks were filled with iPads.
Cassandra: Fact vs fiction. Jonathan Ellis puts the beatdown on Cassandra misinformation. Don't you dare say Cassandra can't work across datacenters!
JIT'd code calling conventions. Cliff Click Jr shows how Java’s calling convention can match compiled C code in speed, but allows for the flexibility of calling (code,slow) non-JIT'd code. Some assembly code required.
Stonebraker on CAP Theorem and Databases. James Hamilton: Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.

Click to read more ...

HighScalability Team |

3 Comments |

hot links

Tuesday

Apr 6, 2010

Strategy: Make it Really Fast vs Do the Work Up Front

Tuesday, April 6, 2010 at 10:58AM

In "Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby", Peter Neubauer not only does a fantastic job explaining a complicated routing algorithm using the graph database Neo4j, he also surfaces an interesting architectural conundrum: make the computation fast enough that the work can be done on the reads, or do all the work on the writes so the reads are really fast.

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The poster child for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.
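
As a rough illustration of the precompute approach, here is a minimal sketch that uses plain Python dicts in place of the two Cassandra clusters. None of it reflects SimpleGeo's actual schema: the class name, the one-degree grid cells, the `category` field, and the two key layouts are hypothetical. The point is only that every supported lookup has its key built at write time, so a read is a single key fetch, and anything without a precomputed key simply cannot be queried.

```python
import math
from collections import defaultdict


class PrecomputedGeoStore:
    """Toy stand-in for a records store plus a write-time index store."""

    def __init__(self):
        self.records = {}               # stands in for the records cluster
        self.index = defaultdict(set)   # stands in for the index cluster

    @staticmethod
    def _cell(lat, lon):
        # Hypothetical key component: a one-degree grid cell.
        return (math.floor(lat), math.floor(lon))

    def _index_keys(self, lat, lon, category):
        # One key per supported lookup scenario (assumed scenarios).
        yield ("category", category)
        yield ("cell+category", self._cell(lat, lon), category)

    def put(self, record_id, lat, lon, category):
        """All index work happens here, on the write path."""
        self.records[record_id] = {"lat": lat, "lon": lon, "category": category}
        for key in self._index_keys(lat, lon, category):
            self.index[key].add(record_id)

    def by_category(self, category):
        """Read path: one key lookup, no scanning."""
        return sorted(self.index.get(("category", category), ()))

    def by_cell_and_category(self, lat, lon, category):
        """Read path: one key lookup for the cell containing (lat, lon)."""
        key = ("cell+category", self._cell(lat, lon), category)
        return sorted(self.index.get(key, ()))
```

A query such as "all records within 500 meters, sorted by opening hours" has no key in this sketch, which mirrors the trade-off described above: reads are cheap, but only for the questions anticipated at write time.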

What I think Peter is saying is that because a graph database represents the problem in such a natural way and graph navigation is so fast, it becomes possible to run even large, complex queries in real time. No special infrastructure is needed.
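
For contrast, here is a minimal sketch of the compute-on-read approach: A* routing run at query time over an in-memory graph. Peter's article uses Neo4j from Ruby; this version swaps in plain Python dicts and `heapq`, and the tiny road network, its coordinates, and its edge weights are made up for illustration. The heuristic is straight-line distance, which is safe here because every assumed edge weight is at least the straight-line distance between its endpoints.

```python
import heapq
import math

# Hypothetical road network: node -> {neighbor: travel cost}.
ROADS = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"D": 1.5},
    "C": {"D": 1.0},
    "D": {},
}

# Hypothetical coordinates used only by the heuristic.
COORDS = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}


def a_star(graph, coords, start, goal):
    """Return (cost, path) for the cheapest route, or None if unreachable."""

    def heuristic(node):
        (x1, y1), (x2, y2) = coords[node], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)

    # Heap entries: (estimated total cost, cost so far, node, path so far).
    frontier = [(heuristic(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, math.inf):
            continue  # stale entry; a cheaper route to this node was found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                estimate = new_cost + heuristic(neighbor)
                heapq.heappush(
                    frontier, (estimate, new_cost, neighbor, path + [neighbor])
                )
    return None
```

Because nothing is precomputed, changing an edge weight (say, raising the cost of B to D to model congestion) changes the next answer immediately, which is the "adapt to the live conditions" property from the quote. The cost is that every read pays for a graph search.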

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Click to read more ...

HighScalability Team |

2 Comments |

Strategy,

graph,

nosql

Tuesday

Apr 6, 2010