Entries in Paper (127)

Wednesday
Apr 14, 2010

Parallel Information Retrieval and Other Search Engine Goodness

Parallel Information Retrieval is a sample chapter from what appears to be a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher of Google Inc. and Charles L. A. Clarke and Gordon V. Cormack, both of the University of Waterloo. The full table of contents is online and looks really interesting: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects.

Currently available is the full text of the chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty.
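
The core problem a parallel IR system has to solve is splitting the inverted index across machines; the most common scheme is document partitioning, where each server indexes a subset of the documents and every query fans out to all servers. Here is a minimal toy sketch of that idea (my own illustration, not code from the book):

```python
# Document-partitioned retrieval sketch: each shard holds an inverted index
# over its own documents; a query fans out to every shard and results merge.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class IndexShard:
    def __init__(self, docs):
        # docs: {doc_id: text}; build a tiny in-memory inverted index
        self.postings = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.postings[term].add(doc_id)

    def search(self, terms):
        # Return doc ids containing all query terms (boolean AND)
        result = None
        for term in terms:
            hits = self.postings.get(term, set())
            result = hits if result is None else result & hits
        return result or set()


def parallel_search(shards, query):
    terms = query.lower().split()
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: s.search(terms), shards))
    return set().union(*partials)


shards = [IndexShard({1: "parallel information retrieval", 2: "index compression"}),
          IndexShard({3: "parallel query processing"})]
print(parallel_search(shards, "parallel"))   # {1, 3}
```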

Monday
Apr 5, 2010

Intercloud: How Will We Scale Across Multiple Clouds?

In A Brief History of the Internet it was revealed that the Internet was based on the idea that there would be multiple independent networks of rather arbitrary design. The Internet as we now know it embodies a key underlying technical idea, namely that of open architecture networking. In this approach, the choice of any individual network technology was not dictated by a particular network architecture but rather could be selected freely by a provider and made to interwork with the other networks through a meta-level "Internetworking Architecture".  

With the cloud we are in the same situation today, just a layer or two higher up the stack. We have independent clouds that we would like to connect and have work seamlessly together, preferably with the same ease with which we currently connect nodes to a network and networks to the Internet. This technology seems to be called the Intercloud: an interconnected global "cloud of clouds," as opposed to the Internet, which is a "network of networks."

Vint Cerf, the person most often called the father of the Internet, says in Cloud Computing and the Internet that we are ripe for an Intercloud in the same way we were once ripe for the Internet.

Thursday
Oct 29, 2009

Paper: No Relation: The Mixed Blessings of Non-Relational Databases

This excellent survey of the field was written by Ian Thomas Varley as part of his Master of Science in Engineering program.

The aim of this paper is to explore the conceptual design space of non-relational databases as compared to traditional relational databases. It is clear that the design needs of the two paradigms are different, but how fundamental are the differences, and what strategies can we use to transition our conceptual designs from one to the other?
There are a few things to like about this paper. A running example is used to show the different ways to model data depending on which type of solution you are targeting, covering in particular how many-to-many relationships are modeled, how data integrity is maintained, and how optional attributes are supported. There's also a brief survey of some of the major systems.
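
To make the modeling difference concrete, here's a toy sketch (mine, not the paper's running example) of a many-to-many relationship done relationally with a join table versus denormalized into documents, along with the write-path burden that denormalization shifts onto the application:

```python
# Relational style: normalized "tables" plus a join table for the
# many-to-many student/course relationship. Entity names are invented.
students = {1: {"name": "Ada"}, 2: {"name": "Alan"}}
courses = {10: {"title": "Databases"}}
enrollments = [(1, 10), (2, 10)]  # (student_id, course_id) join rows

# Non-relational style: embed the relationship on both sides and accept
# duplication; optional attributes simply appear on the documents that need them.
student_docs = {
    1: {"name": "Ada", "course_ids": [10], "nickname": "ada99"},  # optional attr
    2: {"name": "Alan", "course_ids": [10]},
}
course_docs = {10: {"title": "Databases", "student_ids": [1, 2]}}


def enroll(student_id, course_id):
    # Without multi-record transactions, these two writes must be ordered and
    # made idempotent so that a crash between them is repairable by the app.
    student_docs[student_id].setdefault("course_ids", []).append(course_id)
    course_docs[course_id].setdefault("student_ids", []).append(student_id)
```
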
The most interesting section of the report is where it tackles the problem of design for non-relational systems. The approach has two different phases: design questions and design strategies.
It then walks through the questions you should ask yourself about your problem.

Friday
Oct 16, 2009

Paper: Scaling Online Social Networks without Pains

We saw in Why are Facebook, Digg, and Twitter so hard to scale? that scaling social networks is a lot harder than you might think. This paper, Scaling Online Social Networks without Pains, from a team at Telefonica Research in Spain, hopes to meet the challenges of status distribution, user-generated content distribution, and managing the social graph through a technique they call One-Hop Replication (OHR). OHR abstracts and offloads the complexity of scaling up from the social network application. The abstract:
Online Social Networks (OSN) face serious scalability challenges due to their rapid growth and popularity. To address this issue we present a novel approach to scale up OSN called One Hop Replication (OHR). Our system combines partitioning and replication in a middleware to transparently scale up a centralized OSN design, and therefore, avoid the OSN application to undergo the costly transition to a fully distributed system to meet its scalability needs. OHR exploits some of the structural characteristics of Social Networks: 1) most of the information is one-hop away, and 2) the topology of the network of connections among people displays a strong community structure. We evaluate our system and its potential benefits and overheads using data from real OSNs: Twitter and Orkut. We show that OHR has the potential to provide out-of-the-box transparent scalability while maintaining the replication overhead costs in check.
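
A rough sketch of the one-hop idea from the abstract, with invented data structures: partition users across servers, then replicate each user's record onto the partitions that host its friends, so a user's one-hop neighborhood can always be read locally:

```python
# Minimal OHR-style sketch, assuming the core idea in the abstract; the
# structures and names are illustrative, not the paper's implementation.
N_PARTITIONS = 4

def partition_of(user_id):
    return hash(user_id) % N_PARTITIONS

primaries = [dict() for _ in range(N_PARTITIONS)]  # primary copy per partition
replicas = [dict() for _ in range(N_PARTITIONS)]   # replicated neighbor copies

def write_user(user_id, record, friends):
    primaries[partition_of(user_id)][user_id] = record
    # Push a replica to every partition hosting one of this user's friends,
    # so those friends can read this user's updates without a remote call.
    for friend in friends:
        p = partition_of(friend)
        if p != partition_of(user_id):
            replicas[p][user_id] = record

def read_neighborhood(user_id, friends):
    p = partition_of(user_id)
    local = {**primaries[p], **replicas[p]}
    return {f: local.get(f) for f in friends}  # all one hop away, all local

write_user("alice", {"status": "hi"}, friends=["bob"])
write_user("bob", {"status": "yo"}, friends=["alice"])
print(read_neighborhood("alice", ["bob"]))   # served from alice's partition
```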

Thursday
Oct 1, 2009

Moving Beyond End-to-End Path Information to Optimize CDN Performance

You go through the expense of installing CDN nodes all over the globe to make sure users always have a node close by, and then you notice something curious and furious: clients still experience poor latencies. What's up with that? What do you do to find the problem? If you are Google you build a tool (WhyHigh) to figure out what's up. This paper is about the tool and the unexpected problem of high latencies on CDNs. The main problems they found: inefficient routing to nearby nodes and packet queuing. But more useful is the architecture of WhyHigh and how it goes about identifying bottlenecks. And even more useful is the general belief in creating sophisticated tools to understand and improve your service. That's what professionals do. From the abstract:
Replicating content across a geographically distributed set of servers and redirecting clients to the closest server in terms of latency has emerged as a common paradigm for improving client performance. In this paper, we analyze latencies measured from servers in Google’s content distribution network (CDN) to clients all across the Internet to study the effectiveness of latency-based server selection. Our main result is that redirecting every client to the server with least latency does not suffice to optimize client latencies. First, even though most clients are served by a geographically nearby CDN node, a sizeable fraction of clients experience latencies several tens of milliseconds higher than other clients in the same region. Second, we find that queueing delays often override the benefits of a client interacting with a nearby server.
To help the administrators of Google’s CDN cope with these problems, we have built a system called WhyHigh. First, WhyHigh measures client latencies across all nodes in the CDN and correlates measurements to identify the prefixes affected by inflated latencies. Second, since clients in several thousand prefixes have poor latencies, WhyHigh prioritizes problems based on the impact that solving them would have, e.g., by identifying either an AS path common to several inflated prefixes or a CDN node where path inflation is widespread. Finally, WhyHigh diagnoses the causes for inflated latencies using active measurements such as traceroutes and pings, in combination with datasets such as BGP paths and flow records. Typical causes discovered include lack of peering, routing misconfigurations, and side-effects of traffic engineering. We have used WhyHigh to diagnose several instances of inflated latencies, and our efforts over the course of a year have significantly helped improve the performance offered to clients by Google’s CDN.
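
Here's a hedged sketch of the first step described above: group client latency measurements by routing prefix and flag prefixes that are inflated relative to their region. Thresholds and fields are invented for illustration; the real system folds in BGP paths, traceroutes, and flow records:

```python
# Group client RTT samples by (region, prefix) and flag prefixes whose median
# latency is well above the regional median -- the "inflated prefix" signal.
from collections import defaultdict
from statistics import median

# (region, prefix, rtt_ms) measurement tuples collected at CDN nodes
measurements = [
    ("us-west", "203.0.113.0/24", 35),
    ("us-west", "203.0.113.0/24", 38),
    ("us-west", "198.51.100.0/24", 140),   # suspiciously high for the region
    ("us-west", "198.51.100.0/24", 155),
]

def inflated_prefixes(samples, slack_ms=50):
    by_region = defaultdict(list)
    by_prefix = defaultdict(list)
    for region, prefix, rtt in samples:
        by_region[region].append(rtt)
        by_prefix[(region, prefix)].append(rtt)
    flagged = []
    for (region, prefix), rtts in by_prefix.items():
        if median(rtts) > median(by_region[region]) + slack_ms:
            flagged.append((region, prefix, median(rtts)))
    return flagged

print(inflated_prefixes(measurements))   # flags 198.51.100.0/24
```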

Related Articles

  • Product: Akamai

    Wednesday
    Sep 16, 2009

    Paper: A practical scalable distributed B-tree

    We've seen a lot of NoSQL action lately built around distributed hash tables. B-trees are getting jealous. B-trees, once the kings of the database world, want their throne back. Paul Buchheit surfaced a paper, A practical scalable distributed B-tree by Marcos K. Aguilera, Wojciech Golab, and Mehul A. Shah, that might help spark a revolution.

    From the Abstract:

    We propose a new algorithm for a practical, fault tolerant, and scalable B-tree distributed over a set of servers. Our algorithm supports practical features not present in prior work: transactions that allow atomic execution of multiple operations over multiple B-trees, online migration of B-tree nodes between servers, and dynamic addition and removal of servers. Moreover, our algorithm is conceptually simple: we use transactions to manipulate B-tree nodes so that clients need not use complicated concurrency and locking protocols used in prior work. To execute these transactions quickly, we rely on three techniques: (1) We use optimistic concurrency control, so that B-tree nodes are not locked during transaction execution, only during commit. This well-known technique works well because B-trees have little contention on update. (2) We replicate inner nodes at clients. These replicas are lazy, and hence lightweight, and they are very helpful to reduce client-server communication while traversing the B-tree. (3) We replicate version numbers of inner nodes across servers, so that clients can validate their transactions efficiently, without creating bottlenecks at the root node and other upper levels in the tree.

    Distributed hash tables are scalable because records are easily distributed across a cluster, which gives the golden ability to perform many writes in parallel. The problem is that keyed access is very limited.

    A lot of the time you want to iterate through records or search them in sorted order. Sorted could mean timestamp order or last-name order, for example.

    Access to data in sorted order is what B-trees are for. But we simply haven't seen distributed B-tree systems develop. Instead, you have to use some sort of map-reduce mechanism to scan all the records, or you have to maintain the information in some other way.
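
    Here's a tiny illustration of that gap (not from the paper): a hash table answers point lookups, but answering even a simple range query means having the keys in sorted order, which is exactly what a B-tree maintains at scale:

```python
# Point lookup vs. range scan over a toy key-value store of timestamped events.
import bisect

events = {"2009-10-16T12:00": "login", "2009-10-14T08:30": "signup",
          "2009-10-15T22:10": "post"}

# Point lookup: fine with a hash table.
print(events["2009-10-15T22:10"])

# Range scan "all events on Oct 15": needs the keys in sorted order.
keys = sorted(events)
lo = bisect.bisect_left(keys, "2009-10-15")
hi = bisect.bisect_left(keys, "2009-10-16")
print([(k, events[k]) for k in keys[lo:hi]])
```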

    This paper points the way to do some really cool things at a system level:

  • It's distributed so it can scale dynamically in size and handle writes in parallel.
  • It supports adding and dropping servers dynamically, which is an essential requirement for architectures based on elastic cloud infrastructures.
  • Data can be migrated to other nodes, which is essential for maintenance.
  • Multiple records can be involved in transactions, which is essential for the complex data manipulations that happen in real systems. This is accomplished via a version-number mechanism that looks something like MVCC.
  • Optimistic concurrency, that is, the ability to change data without explicit locking, makes the programmer's job a lot easier (a minimal sketch of the version-check commit follows below).

    These are the kind of features needed for systems in the field. Hopefully we'll start seeing more systems offering richer access structures while still maintaining scalability.
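
    Below is a minimal sketch of the optimistic, version-based commit idea mentioned in the list above; it's a generic illustration of the technique, not the paper's actual protocol or API:

```python
# Optimistic concurrency control sketch: read node versions without locking,
# do the work, then commit only if every version read is still current.
class Node:
    def __init__(self, keys):
        self.keys = keys
        self.version = 0

class VersionConflict(Exception):
    pass

def commit(read_set, writes):
    # read_set: {node: version observed during the transaction}
    # writes:   {node: new key list}
    for node, seen_version in read_set.items():
        if node.version != seen_version:
            raise VersionConflict("retry the transaction")
    for node, new_keys in writes.items():
        node.keys = new_keys
        node.version += 1   # invalidates concurrent readers' cached versions

leaf = Node([10, 20])
read_set = {leaf: leaf.version}     # optimistic read, no lock taken
writes = {leaf: [10, 15, 20]}       # insert key 15
commit(read_set, writes)            # succeeds if no one touched the leaf
```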

    Saturday
    Aug 8, 2009

    Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?

    So far every massively scalable database is a bundle of compromises. For some the weak guarantees of Amazon's eventual consistency model are too cold. For many the strong guarantees of standard RDBMS distributed transactions are too hot. Google App Engine tries to get it just right with entity groups. Yahoo! is also trying to get it just right by offering per-record timeline consistency, which hopes to serve up a heaping bowl of rich database functionality and low latency at massive scale:

    We describe PNUTS [Platform for Nimble Universal Table Storage], a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

    Some of the cool things about PNUTS are:

  • They actually talk about the hard problem of how to scale a system to 10 different data centers (each with 1,000+ servers) while supporting secondary indexes, materialized views, the ability to create multiple tables, and hash-organized tables. Multi-datacenter operation is so difficult it's usually ignored. PNUTS is designed specifically to operate in many datacenters with a strongish consistency model, which makes it a very interesting design point.
  • You can subscribe to a reliable ordered stream of updates on a table. This is massively convenient. For many applications numerous processes are tied to data changes and this is normally a pain to implement.
  • The consistency model is per-record timeline consistency: all replicas of a given record apply all updates to that record in the same order. This sits between the two extremes of serialized transactions and eventual consistency. Conflicting versions of a record can't exist at the same time, as Dynamo allows (see the sketch after this list).
  • Supports records, but accepts queries only on individual tables. There's no fixed schema for records, and columns can be typed or be blobs. Transactions exist only at the record level. This means that, as with other NoSQL databases, denormalization is the modeling strategy, since there's no way to have transactions across tables.
  • The degree of read consistency desired can be specified. Records are versioned. You can ask for the latest record version or allow for potentially stale records.
  • Asynchronous replication is used to ensure low write latency while providing geographic replication. Reads should be fast everywhere and may return older versions; writes should be fast locally (in the same datacenter). [3]
  • A message broker serves both as the replication mechanism and as the redo log of the database. The broker guarantees that no replica receives updates out of order because it provides a reliable, totally ordered message channel. This also means you can't have transactions across tables, consistent joins, or foreign keys. They chose this approach over a gossip mechanism (like Dynamo) because it can be optimized for geographically distant replicas and because replicas do not need to know the location of other replicas.
  • PNUTS is hosted and centrally managed. It was built to reduce the overhead of creating and maintaining new applications rather than every property creating their own system. Failover, adding capacity, resharding, performance isolation and supporting applications with different usage profiles are all completely automated.
  • Predicate queries are supported using a scatter-gather mechanism that sends the query to every relevant storage tablet at once. The results are then gathered and sent back to the client.
  • Performance is 1-10ms/request when caching layers are in place.
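
    Here's a small sketch of what per-record timeline consistency means in practice: every replica applies a record's updates in the order assigned by that record's master, and reads can either tolerate staleness or demand a minimum version. The code is an illustration of the model, not PNUTS's actual interface:

```python
# One record, one replica: updates carry master-assigned sequence numbers and
# are applied strictly in order; reads optionally pass a version floor.
class RecordReplica:
    def __init__(self):
        self.value, self.version = None, 0
        self.pending = {}   # updates that arrived ahead of their turn

    def apply(self, version, value):
        self.pending[version] = value
        while self.version + 1 in self.pending:
            self.version += 1
            self.value = self.pending.pop(self.version)

    def read(self, min_version=0):
        # "read-any" style tolerates staleness; a version floor mimics
        # "read-critical" semantics.
        if self.version < min_version:
            raise RuntimeError("replica too stale, retry at master")
        return self.value, self.version

replica = RecordReplica()
replica.apply(2, "v2")          # arrives out of order, buffered
replica.apply(1, "v1")          # now updates 1 and 2 apply, in order
print(replica.read())           # ('v2', 2)
print(replica.read(min_version=2))
```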

    From a system perspective PNUTS offers a lot of good things: it's hosted, reliable, lowish latency, automated, and scalable, it supports many application models, and there's a lot of room for improvement that all applications will be able to take advantage of as it becomes available.

    From a programmer perspective there are also a lot of good things: it's hosted so there are fewer worries, plus notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.

    Unfortunately Goldilocks still needs to keep searching for just right, though she may be getting closer. From a system perspective Yahoo!'s ideas are good, but they don't help you, as the system isn't available for you to use. From a programmer perspective the programmer's job is still way too hard. To be just right, programmers need low-latency aggregate operators, complex transactions, scalable counters, automatic relationship management, and all the other features that will help them just buy instant porridge and be done with it.

    Related Articles

  • Anti-RDBMS: A list of distributed key-value stores
  • Details on Yahoo's distributed database by Greg Linden
  • Thoughts on Yahoo's PNUTS distributed database by Marton Trencseni
  • Data Challenges at Yahoo! - Ricardo Baeza-Yates & Raghu Ramakrishnan
    Yahoo! Research
  • PNUTS: Yahoo!'s Hosted Data Serving Platform by Lucian
  • Yahoo’s PNUTS by Henry Robinson. A very thoughtful and informative overview of the paper.
  • How robust are gossip-based communication protocols? by Lorenzo Alvisi et al.
  • Trading consistency for scalability in distributed architectures.
  • Asynchronous View Maintenance for VLSD Databases by Parag Agrawal et al.
  • BigTable
  • Dynamo
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • The Story of Goldilocks and the Three Bears
  • PNUTS - Platform for Nimble Universal Table Storage

    Tuesday
    Jul 21, 2009

    Paper: Parallelizing the Web Browser

    There have been reports that software engineering is dead. Maybe, like the future, software engineering is simply not evenly distributed? When you read this paper I think you'll agree there is some real engineering going on; it's just that most of the things we need to build do not require it. Much like my old childhood tree fort, they can be patched together and still be "good enough." This brings to mind the old joke: if a software tree falls in the woods, would anyone hear it fall? Only if it tweeted on the way down...

    What this paper really showed me is we need not only to change programming practices and constructs, but we also need to design solutions that allow for deep parallelism to begin with. Grafting parallelism on later is difficult. Parallel execution requires knowing precisely how components are dependent on each other and that level of precision tends to go far beyond the human attention span.

    In particular this paper deals with how to parallelize the browser on cell phones. We are entering a multi-core, smartphone-dominated world. As network connections become faster, applications like the browser become CPU bound:

    On an equivalent network connection, the iPhone browser is 5 to 10 times slower than Firefox on a fast laptop. The browser is CPU-bound because it is a compiler (for HTML), a page layout engine (for CSS), and an interpreter (for JavaScript); all three tasks are on a user’s critical path.

    To speed up the browser they worked on offloading computation, removing the abstraction tax, and parallelizing the browser using energy-efficient data- and task-parallel approaches. The problem is that technologies like HTML, CSS, the DOM, JavaScript, events, and page layout were not designed to be parallel; they were designed to run on a single CPU. The paper goes to brilliant and heroic lengths to parallelize this part of the stack: new work-efficient FSM algorithms, speculative parallelization for flow layout, eliminating as much shared state as possible, callback dependency analysis, using actors to implement behaviors, and much more.

    What's clear though is their job would have been a heck of a lot easier if the stack would have been designed with parallelization in mind from the beginning.

    Leo Meyerovich, one of the authors of the paper, talks about the need for a more rigorous underpinning in the blog post The Point of Semantics:

    As part of the preparation for a paper submission, I'm finishing up my formalization of a subset of CSS 2.1 (blocks, inlines, inline-blocks, and floats) from last year. My first two, direct formalization approaches failed the smell test so Ras and I created a more orthogonal kernel language. It's small, and as the CSS spec is a scattered hodge-podge of prose and visual examples riddled with ambiguities, we phrase it as a total and deterministic attribute grammar that is easy to evaluate in parallel. 

    I asked Leo what rules we could follow to create more parallelizable constructs from the beginning and he said that's what he'll be working on for the next couple years :-) Some advice he had was:

  • Be clear on what you want to parallelize. Figuring out where the parallelism should be, at a conceptual level, is always the first step.
  • Understand how it should run in parallel.
  • Focus on making it easy to do just that (and worry about the rest later).
  • It's better to completely solve a problem for some folks than almost solve a problem for many: you can help more and more in the former, but with the latter, you might never end up helping anybody.

    Some things Leo will be working on are:
    I've been enjoying higher-order data flow models (Flapjax) and task parallelism (Cilk++) for awhile now and have been thinking about this, including support for controlled sharing (e.g., SharC for type qualifiers, and I'm still trying to figure out implicitly transactional flows for FRP). For a browser, I think it will remain as specialized libraries written in privileged languages, where good engineers can rock, put things together, and expose them in higher-level languages. Hopefully gradual typing will extend into lower levels to support this. The above hints at a layered framework with the bulk in the high level -- think parallel scripting. However, as a community, we don't know how to include performance guides in large software, so parallelism is a challenge. I prototyped one of my algorithms in a parallel python variant: the sequential C was magnitudes faster than the 20-core python. Of course, the parallel C++ was even faster :)

    Related Articles

  • Parallelizing the Web Browser by Christopher Grant Jones, Rose Liu, Leo Meyerovich, Krste Asanović, Rastislav Bodík.
  • Parallelizing the Web Browser: Browsing Web 3.0 on 3 Watts
  • Leo Meyerovich's Project Website
  • Flapjax - a new programming language designed around the demands of modern, client-based Web applications: event-driven, reactive evaluation, event-stream abstraction for communicating with web services, interfaces to external web services.
  • Bell's Law of Computer Classes
  • Leo Meyerovich's Blog

    Wednesday
    Jun 10, 2009

    Paper: Graph Databases and the Future of Large-Scale Knowledge Management

    Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and its extension to representing the World Wide Web of data.

    Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.
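
    As a tiny illustration of the data model the talk covers (my own toy example, not from the paper): vertices with labeled edges, queried by traversal rather than by joins:

```python
# Adjacency-list sketch of a property-graph: traverse labeled edges instead of
# joining tables. Vertex and edge names are invented for illustration.
graph = {
    "marko": {"knows": ["josh", "vadas"], "created": ["lop"]},
    "josh":  {"created": ["lop", "ripple"]},
}

def traverse(start, *edge_labels):
    frontier = {start}
    for label in edge_labels:
        frontier = {dst for v in frontier
                        for dst in graph.get(v, {}).get(label, [])}
    return frontier

# "What did the people marko knows create?" -- one traversal, no join table.
print(traverse("marko", "knows", "created"))   # {'lop', 'ripple'}
```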

    Wednesday
    May 20, 2009

    Paper: Flux: An Adaptive Partitioning Operator for Continuous Query Systems

    At the core of the new real-time web, which is really really old, are continuous queries. I like how this paper proposes to handle dynamic demand and dynamic resource availability by making the underlying system adaptable, which seems like a very cloudy kind of thing to do. Abstract:

    The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent, stateful operators – e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.
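
    A rough sketch of the idea from the abstract, with invented names: an exchange operator sits between producer and consumer stages, routes tuples to partitions of a stateful operator by key, and can hand a partition (and its state) to another instance while the dataflow keeps running:

```python
# Flux-style adaptive routing sketch: per-key state lives with a partition,
# and partitions can be reassigned to other operator instances at runtime.
from collections import defaultdict

class FluxRouter:
    def __init__(self, n_instances):
        # partition -> operator instance; start with round-robin placement
        self.placement = {p: p % n_instances for p in range(16)}
        self.state = defaultdict(dict)        # per-partition operator state

    def route(self, key, value):
        partition = hash(key) % 16
        instance = self.placement[partition]
        # stateful work (here: a running per-key count) lives with the partition
        self.state[partition][key] = self.state[partition].get(key, 0) + value
        return instance, partition

    def repartition(self, partition, new_instance):
        # adaptive step: when an instance is overloaded, hand one of its
        # partitions (and the state that goes with it) to another instance
        self.placement[partition] = new_instance

router = FluxRouter(n_instances=4)
router.route("user42", 1)
router.repartition(hash("user42") % 16, new_instance=3)
router.route("user42", 1)    # same partition state, now served by instance 3
```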
