Friday
Apr 30, 2010

Hot Scalability Links for April 30, 2010

  • I Want a New Data Store. Jeremy Zawodny of Craigslist wants a new database, one that can do what it should: perform alter table operations faster, query efficiently when most of the data is on disk rather than in RAM, and match their data, which now looks more document-oriented than relational (see the sketch after this list). A lot of people are willing to help.
  • Computer Science Unplugged. An extensive collection of free resources that teach principles of Computer Science, such as binary numbers, algorithms, and data compression, through engaging games and puzzles that use cards, string, crayons, and lots of running around. And it's free! There's also a fascinating interview with Tim Bell on teaching complex computing concepts, creating makers rather than just users, and how to change schools, from O'Reilly Radar.
  • Akamai’s Network Now Pushes Terabits of Data Every Second. Akamai handles 12 million requests per second, logs more than 500 billion requests for content per day, and sends 3.45 terabits per second of data.
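For readers wondering what "document oriented" means for classified-ad data, here's a toy sketch in Python (hypothetical field names, not Craigslist's actual schema): each posting carries only the attributes its category needs, which is exactly the shape that makes relational ALTER TABLE migrations painful.

```python
# Toy illustration (hypothetical fields, not Craigslist's actual schema)
# of why classified postings fit a document model: attributes vary by
# category, so a rigid relational table needs frequent ALTER TABLE
# migrations as categories evolve.

apartment = {
    "id": 12345,
    "category": "housing",
    "title": "Sunny 2BR near the park",
    "price_usd": 1850,
    "attrs": {"bedrooms": 2, "sqft": 900, "cats_ok": True},   # housing-only
}

car = {
    "id": 67890,
    "category": "cars+trucks",
    "title": "2004 Subaru Outback",
    "price_usd": 4500,
    "attrs": {"odometer": 112000, "transmission": "manual"},  # autos-only
}

# Each document carries only the fields it needs; adding an attribute
# is a write-time decision, not a schema migration.
for post in (apartment, car):
    print(post["title"], "->", post["attrs"])
```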

Click to read more ...

Friday
Apr 30, 2010

Behind the scenes of an online marketplace

In a presentation originally given at the 4th O2 Hosting Event in Hamburg, I spoke about the technology behind a large online marketplace in Germany called Hitmeister.

Some of the topics discussed include:

  • what technically makes up a marketplace?
  • system principles
  • development patterns
  • tools philosophy
  • data model
  • hardware

I am looking forward to comments and suggestions for both the presentation and our work.

Thursday
Apr 29, 2010

Product: SciDB - A Science-Oriented DBMS at 100 Petabytes

Scientists are doing it for themselves. Doing what? Databases. The idea is that most databases are designed to meet the needs of businesses, not science, so scientists are banding together at scidb.org to create their own domain-specific database for science. The goal is to be able to handle datasets in the 100PB range and larger.

SciDB, Inc. is building an open source database technology product designed specifically to satisfy the demands of data-intensive scientific problems. With the advice of the world's leading scientists across a variety of disciplines including astronomy, biology, physics, oceanography, atmospheric sciences, and climatology, our computer scientists are currently designing and prototyping this technology.

The scientists participating in our open source project believe that the SciDB database, when completed, will dramatically improve their ability to conduct experiments faster and more efficiently, and to further improve the quality of life on our planet, by enabling them to run experiments that were previously impossible due to the limitations of existing database systems and infrastructure. Many of the world's leading computer scientists with expertise in database systems have contributed to the design and architecture of the system to meet the needs of the world's scientists.

SciDB looks like a cool project and follows what might be considered a trend: instead of beating a general tool into submission, build a specialized tool that does what you need it to do. More details about SciDB can be found in the paper A Demonstration of SciDB: A Science-Oriented DBMS. A nice succinct poster summarizing the product is also available.
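The paper describes an array-oriented design; as a rough illustration of the chunked-array idea such a system builds on, here is a toy sketch (all names and numbers are hypothetical, not SciDB's actual implementation):

```python
# Toy sketch of the chunked-array idea an array DBMS builds on (names
# and numbers are hypothetical, not SciDB's implementation): a huge 2-D
# array is split into fixed-size chunks so chunks can be spread across
# nodes and a query only touches the chunks its window overlaps.

CHUNK = 1000  # chunk edge length; an illustrative choice

def chunk_id(row, col):
    """Map an array cell to the chunk that stores it."""
    return (row // CHUNK, col // CHUNK)

def node_for(cid, num_nodes=16):
    """Hash-place a chunk on one of the cluster's nodes."""
    return hash(cid) % num_nodes

def chunks_for_window(r0, r1, c0, c1):
    """Chunks overlapping the query window [r0..r1] x [c0..c1]."""
    return {(r, c)
            for r in range(r0 // CHUNK, r1 // CHUNK + 1)
            for c in range(c0 // CHUNK, c1 // CHUNK + 1)}

window = chunks_for_window(500, 2500, 0, 1500)   # touches 6 chunks
assert chunk_id(500, 0) in window
print({cid: node_for(cid) for cid in sorted(window)})
```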

Some interesting bits from the paper:

Click to read more ...

Wednesday
Apr 28, 2010

Elasticity for the Enterprise -- Ensuring Continuous High Availability in a Disaster Failure Scenario

Many enterprises' high-availability architecture is based on the assumption that you can prevent failure from happening by putting all your critical data in a centralized database, backing it up with expensive storage, and replicating it somehow between the sites. As I argued in one of my previous posts (Why Existing Databases (RAC) are So Breakable!), many of those assumptions are broken at their core: storage is doomed to fail just like any other device, expensive hardware doesn't make things any better, and database replication is often not enough.
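A back-of-the-envelope calculation shows why stacking more layers under a centralized database can't buy availability. With illustrative (assumed) per-component uptimes:

```python
# Back-of-the-envelope sketch of why more layers can't buy availability:
# components you depend on in series *multiply* their uptimes, so each
# layer lowers the total. Uptime figures are illustrative assumptions.

def serial_availability(*components):
    total = 1.0
    for availability in components:
        total *= availability
    return total

db, san, replication_link = 0.999, 0.9995, 0.9995   # assumed uptimes
total = serial_availability(db, san, replication_link)

print(f"combined availability: {total:.4%}")                  # ~99.80%
print(f"expected downtime: {(1 - total) * 8760:.1f} h/year")  # ~17.5 h
```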

Click to read more ...

Tuesday
Apr 27, 2010

Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure

Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do you ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it?

That's where Dapper comes in. Dapper is Google's tracing system, and it was originally created to understand the system behaviour behind a search request. Google's production clusters now generate more than 1 terabyte of sampled trace data per day. So how does Dapper do what Dapper does?
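As a conceptual sketch only (not Dapper's actual API), the core mechanism the paper describes looks roughly like this: every request carries a trace id, every unit of work records a span pointing at its parent, and sampling keeps the overhead and trace volume manageable:

```python
# Conceptual sketch only (not Dapper's actual API): each request gets a
# trace id; every unit of work is a span that records its parent span,
# so the call tree can be reassembled later. The sampling rate below is
# illustrative.

import random, time, uuid

SAMPLE_RATE = 1 / 1024  # trace only a fraction of requests

def start_trace():
    """Root of a request: decide once whether this request is sampled."""
    return {"trace_id": uuid.uuid4().hex,
            "sampled": random.random() < SAMPLE_RATE}

def span(ctx, parent_span_id, name):
    """A timed unit of work; child RPCs reuse trace_id and point at us."""
    return {"trace_id": ctx["trace_id"],
            "span_id": uuid.uuid4().hex[:16],
            "parent": parent_span_id,
            "name": name,
            "start": time.time()}

ctx = start_trace()
ctx["sampled"] = True  # forced on here so the demo always emits spans
if ctx["sampled"]:
    root = span(ctx, None, "web-search")
    child = span(ctx, root["span_id"], "spell-check-rpc")
    # A real system would log these locally and collect them out-of-band.
    print(root["trace_id"], child["parent"] == root["span_id"])
```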

Click to read more ...

Tuesday
Apr 20, 2010

Sponsored Post: Event - Social Developer Summit

Social Developer Summit - June 29, 2010 - San Francisco, CA

A meeting of the technically social - Building, scaling, and profiting in a social age

Whether it's social games, social news, social discovery, social search, or other forms of social solutions, developers today are facing new hurdles in building instantly scalable products. As new technologies emerge to address the challenges faced by social application developers, it's increasingly important to come together to share knowledge.

Click to read more ...

Monday
Apr 19, 2010

The cost of High Availability (HA) with Oracle 

What's the cost of downtime to your business? $100,000 per hour, $1,000,000, or more? The volcanic ash that recently grounded European flights is estimated to be costing the airlines $200M a day. In the IT world, High Availability (HA) architectures allow for disaster recovery as well as uninterrupted business continuity during system failure.

This post focuses on a customer's backend, comprised of a business application stack supported by a dozen Oracle databases. They wish to equip this infrastructure with HA features and ensure that outages do not cost them business. How do we address the challenge of pricing the complete solution, with hardware, software, services, and annual support?
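One simple way to frame the pricing question is a break-even calculation; every figure below is a hypothetical placeholder, not the customer's actual numbers:

```python
# Toy break-even framing for "what is HA worth?"; every figure below is
# a hypothetical placeholder, not the customer's actual numbers.

downtime_cost_per_hour = 100_000   # $ (the post's opening figure)
outage_hours_unprotected = 8       # assumed downtime per year without HA
outage_hours_with_ha = 0.5         # assumed residual downtime with HA

annual_benefit = downtime_cost_per_hour * (outage_hours_unprotected
                                           - outage_hours_with_ha)
print(f"HA pays for itself below ${annual_benefit:,.0f}/year "
      "in hardware, software, services, and support")
# -> HA pays for itself below $750,000/year ...
```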

Read more on BigDataMatters.com

Monday
Apr 19, 2010

Strategy: Order Two Mediums Instead of Two Smalls and the EC2 Buffet

Vaibhav Puranik in Web serving in the cloud – our experiences with nginx and instance sizes describes their experience trying to maximize traffic and minimize their web serving costs on EC2. Initially they tested with two m1.small instances and then switched to two c1.medium instances. The m1s are standard instance types and the c1s are high-CPU instance types. Obviously the mediums have greater capability, but the cost difference was interesting:
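The original numbers are behind the link, but the economics are easy to sketch. Using approximate 2010 us-east on-demand prices and ECU ratings (stated here as assumptions; check current pricing before relying on them):

```python
# Cost-per-compute comparison behind "order two mediums". Prices and
# ECU ratings are approximate 2010 us-east on-demand figures, stated
# here as assumptions.

instances = {
    # name: (hourly price in $, EC2 Compute Units)
    "m1.small":  (0.085, 1),
    "c1.medium": (0.17,  5),
}

for name, (price, ecu) in instances.items():
    print(f"{name}: ${price:.3f}/h, {ecu} ECU -> ${price / ecu:.3f}/ECU-hour")

# m1.small:  $0.085/h, 1 ECU -> $0.085/ECU-hour
# c1.medium: $0.170/h, 5 ECU -> $0.034/ECU-hour
# Twice the hourly price buys five times the CPU, so a CPU-bound web
# tier gets far more work per dollar out of the mediums.
```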

Click to read more ...

Friday
Apr 16, 2010

Hot Scalability Links for April 16, 2010

Wednesday
Apr 14, 2010

Parallel Information Retrieval and Other Search Engine Goodness

Parallel Information Retrieval is a sample chapter from what appears to be a book in progress, Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher (Google Inc.) and Charles L. A. Clarke and Gordon V. Cormack (both of the University of Waterloo). The full table of contents is online and looks really interesting: "Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects."

Currently available is the full text of five chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. The Parallel Information Retrieval chapter is really meaty:
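As a taste of the material, here is a toy inverted index with delta-encoded postings lists, touching two of the available chapters' topics (static inverted indices and index compression); this is illustrative code, not taken from the book:

```python
# Toy inverted index with delta-encoded postings, touching two of the
# available chapters (Static Inverted Indices, Index Compression).
# Illustrative code, not taken from the book.

from collections import defaultdict

docs = {1: "parallel information retrieval",
        3: "information retrieval evaluation",
        7: "parallel search engines"}

# Build: term -> sorted list of ids of the documents containing it.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def deltas(postings):
    """Store gaps between doc ids instead of the ids themselves; small
    gaps compress well under variable-byte or gamma codes."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

print(index["information"], "->", deltas(index["information"]))  # [1, 3] -> [1, 2]
```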

Click to read more ...