Entries in Parallelism (7)

Sunday
Nov 01 2009

Squeeze more performance from Parallelism

In many posts, such as The Future of Parallelism and its Challenges, I mentioned that synchronizing access to shared resources is the major challenge in writing parallel code.

Synchronization and coordination consume a large share of the overall execution time, which reduces the benefits of parallelism; they also reduce scalability.

There are many forms of synchronization and coordination, such as:

  • Creating task objects in frameworks such as Microsoft TPL, Intel TBB, and the Parallel Runtime Library. Creating and enqueuing task objects requires synchronization, which takes a long time, especially in recursive work such as the Quicksort algorithm.
  • Synchronizing access to shared data.

But there are a few techniques to avoid these issues, such as shared-nothing architectures, the actor model, and hyperobjects (a.k.a. combinable objects). Simply put, if we reduce shared data by re-architecting our code, we gain huge benefits in performance and scalability, as the sketch below illustrates.
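To make this concrete, here is a minimal Java sketch (illustrative code, not from the post) of the combinable-object idea: each thread accumulates into its own private slot, and the partial results are combined once at the end, so the hot path needs no synchronization. In a real implementation the per-thread slots would also be padded to avoid false sharing.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    public class CombinableSketch {
        static final int THREADS = 8, ITERATIONS = 1_000_000;

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);

            // Contended: every increment synchronizes on one shared location.
            AtomicLong shared = new AtomicLong();
            // Combinable: one private counter per thread, no synchronization
            // on the hot path; values are merged after the workers finish.
            long[] perThread = new long[THREADS];

            CountDownLatch done = new CountDownLatch(THREADS);
            for (int t = 0; t < THREADS; t++) {
                final int id = t;
                pool.execute(() -> {
                    for (int i = 0; i < ITERATIONS; i++) {
                        shared.incrementAndGet();  // cache-line ping-pong between cores
                        perThread[id]++;           // private slot, contention-free
                    }
                    done.countDown();
                });
            }
            done.await();
            pool.shutdown();

            long combined = 0;
            for (long v : perThread) combined += v;  // the "combine" step
            System.out.println(shared.get() + " == " + combined);
        }
    }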

Continue >   http://www.hfadeel.com/Blog/?p=160

 

Tuesday
Jul 21 2009

Paper: Parallelizing the Web Browser

There have been reports that software engineering is dead. Maybe, like the future, software engineering is simply not evenly distributed? When you read this paper I think you'll agree there is some real engineering going on, it's just that most of the things we need to build do not require real engineering. Much like my old childhood tree fort could be patched together and was "good enough." This brings to mind the old joke: If a software tree falls in the woods would anyone hear it fall? Only if it tweeted on the way down...

What this paper really showed me is we need not only to change programming practices and constructs, but we also need to design solutions that allow for deep parallelism to begin with. Grafting parallelism on later is difficult. Parallel execution requires knowing precisely how components are dependent on each other and that level of precision tends to go far beyond the human attention span.

In particular this paper deals with how to parallelize the browser on cell phones. We are entering a multi-core smartphone dominated world. As network connections become faster, applications, like the browser, become CPU bound:

On an equivalent network connection, the iPhone browser is 5 to 10 times slower than Firefox on a fast laptop. The browser is CPU-bound because it is a compiler (for HTML), a page layout engine (for CSS), and an interpreter (for JavaScript); all three tasks are on a user’s critical path.

To speed up the browser they worked on: offloading computation, removing the abstraction tax, and parallelizing the browser using energy efficient data and task approaches. The problem is that technologies like HTML, CSS, DOM, Javascript, events, and page layout were not designed to be parallel. They were designed to run on a single CPU. And the paper goes to brilliant and heroic lengths to parallelize this part of the stack. They designed new work-efficient FSM algorithms, speculative parallelization for flow layouts, elimination of as much shared state as possible, callback dependency analysis, actors to implement behaviours, and more.

What's clear though is their job would have been a heck of a lot easier if the stack would have been designed with parallelization in mind from the beginning.

Leo Meyerovich, one of the authors of the paper, talks about the need for a more rigorous underpinning in the blog post The Point of Semantics:

As part of the preparation for a paper submission, I'm finishing up my formalization of a subset of CSS 2.1 (blocks, inlines, inline-blocks, and floats) from last year. My first two, direct formalization approaches failed the smell test so Ras and I created a more orthogonal kernel language. It's small, and as the CSS spec is a scattered hodge-podge of prose and visual examples riddled with ambiguities, we phrase it as a total and deterministic attribute grammar that is easy to evaluate in parallel. 

I asked Leo what rules we could follow to create more parallelizable constructs from the beginning and he said that's what he'll be working on for the next couple years :-) Some advice he had was:

  • Be clear on what you want to parallelize. Figuring out where the parallelism should be, at a conceptual level, is always the first step.
  • Understand how it should run in parallel.
  • Focus on making it easy to do just that (and worry about the rest later).
  • It's better to completely solve a problem for some folks than almost solve a problem for many: you can help more and more in the former, but with the latter, you might never end up helping anybody.

    Some things Leo will be working on are:
I've been enjoying higher-order data flow models (Flapjax) and task parallelism (Cilk++) for a while now and have been thinking about this, including support for controlled sharing (e.g., SharC for type qualifiers, and I'm still trying to figure out implicitly transactional flows for FRP). For a browser, I think it will remain as specialized libraries written in privileged languages where good engineers can rock, put together, and be exposed in higher-level languages. Hopefully gradual typing will extend into lower levels to support this. The above hints at a layered framework with the bulk in the high-level -- think parallel scripting. However, as a community, we don't know how to include performance guides in large software, so parallelism is a challenge. I prototyped one of my algorithms in a parallel python variant: the sequential C was magnitudes faster than the 20-core python. Of course, the parallel C++ was even faster :)

    Related Articles

  • Parallelizing the Web Browser by Christopher Grant Jones, Rose Liu, Leo Meyerovich, Krste Asanović, Rastislav Bodík.
  • Parallelizing the Web Browser: Browsing Web 3.0 on 3 Watts
  • Leo Meyerovich's Project Website
  • Flapjax - a new programming language designed around the demands of modern, client-based Web applications: event-driven, reactive evaluation, event-stream abstraction for communicating with web services, interfaces to external web services.
  • Bell's Law of Computer Classes
  • Leo Meyerovich's Blog
Wednesday
Jul 08 2009

    Art of Parallelism presentation

This presentation is about parallel computing, and it covers the following topics:

• What is parallelism?

• Why now?

• How does it work?

• What are the current options?

• The Parallel Runtime Library (for more information, go there).

Note: All of my presentations are open source, so feel free to copy them, use them, and redistribute them.
    Download

    Tuesday
Jun 23 2009

    Learn How to Exploit Multiple Cores for Better Performance and Scalability

InfoQ has this excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java, it's really more general than that, and there's a lot to learn here for everyone.

    Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.

Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation: why divide and conquer is the secret to multi-core bliss, how fork-join works, how the Java approach differs from map-reduce, and lots of other juicy topics.

    The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores and desktop chips could use 128 cores. Some disagree saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.

    An abstract of the talk follows though I would highly recommend watching the whole thing. Brian does a great job.

    Why is Parallelism More Important Now?

  • Coarse grain concurrency was all the rage for Java 5. The hardware reality has changed. The number of cores is increasing so applications must now search for fine grain parallelism (fork-join)
  • As hardware becomes more parallel, more and more cores, software has to look for techniques to find more and more parallelism to keep the hardware busy.
  • Clock rates have been increasing exponentially over the last 30 years or so. That allowed programmers to be lazy, because a faster processor would be released that saved your butt. There wasn't a need to tune programs.
  • That wait-for-a-faster-processor game is up. Around 2003 clock rates stopped increasing. We hit the power wall. Faster processors require more power. Thinner chip conductor lines were required, and the thinner lines can't dissipate the increased power without causing overheating, which affects the resistance characteristics of the conductors. So you can't keep increasing the clock rate.
  • The fastest Intel CPU 4 or 5 years ago was 3.2 GHz. Today it's about the same or even slower.
  • It's easier to build 2.6 GHz or 2.8 GHz chips. Moore's law wasn't repealed, so we can cram more transistors on each wafer. So more processing power can be put on a chip, which leads to putting more and more processing cores on a chip. This is multicore.
  • Multicore systems are the trend. The number of cores will grow at an exponential rate for the next 10 years: 4 cores at the low end, with 256-core (Sun) and 800-core (Azul) systems at the high end.
  • More cores per chip instead of faster chips. Moore's law has been redirected to multicore.
  • The problem is it's harder to make a program go faster on a multicore system. A faster chip will run your program faster. If you have 100 cores, your program won't go faster unless you explicitly design it to take advantage of those cores.
  • No free lunch anymore. You must now be able to partition your program so it can run faster by running on multiple cores. And you must be able to keep doing that as the number of cores keeps growing.
  • We need a way to specify programs so they can be made parallel as topologies change by adding more cores.
  • As hardware evolves, platforms must evolve to take advantage of the new hardware. We started off with coarse grain tasks, which was sufficient given the number of cores. This approach won't work as the number of cores increases.
  • Must find finer-grained parallelism. Example: sorting and searching data. The opportunities are around the data. The data for sorting can be chunked, sorted, and then brought together with a merge sort. Searching can be done in parallel by searching subregions of the data and merging the results (see the sketch after this list).
  • Parallel solutions use more CPU in aggregate because of the coordination needed and because the data needs to be handled more than once (the merge). But the result is faster because it's done in parallel. This adds business value. Faster is better for humans.
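As a concrete illustration of chunking data and merging partial results, here is a hedged Java sketch (the names and structure are my own, not code from the talk): each task scans one subregion of an array for a target value, and the per-chunk hits are merged at the end.

    import java.util.*;
    import java.util.concurrent.*;

    public class ChunkedSearch {
        static List<Integer> parallelIndexOfAll(final int[] data, final int target,
                                                int chunks) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(chunks);
            List<Future<List<Integer>>> parts = new ArrayList<>();
            int chunkSize = (data.length + chunks - 1) / chunks;

            for (int c = 0; c < chunks; c++) {
                final int lo = c * chunkSize;
                final int hi = Math.min(data.length, lo + chunkSize);
                parts.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (int i = lo; i < hi; i++)        // scan one subregion
                        if (data[i] == target) hits.add(i);
                    return hits;
                }));
            }

            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> f : parts)
                merged.addAll(f.get());                  // merge partial results
            pool.shutdown();
            return merged;
        }

        public static void main(String[] args) throws Exception {
            int[] data = {3, 1, 3, 7, 3};
            System.out.println(parallelIndexOfAll(data, 3, 2));  // prints [0, 2, 4]
        }
    }

Note the cost the bullet above mentions: the merge step is extra work the sequential version never does, but the scans run concurrently, so wall-clock time drops.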

    What has Java 7 Added to Support Parallelism?

  • The example problem is to find the max number in a list (a fork-join sketch of this example follows the list).
  • The coarse grained threading approach is to use a thread pool, divide up the numbers, and let the task pool compute the subproblems. A shared task pool is slow as the number of tasks increases, which forces the work to be more coarse grained. There is no way to load balance. The code is ugly and doesn't match the problem well. The runtime is dominated by how long it takes the longest subtask to run. You had to decide up front how many pieces to divide the problem into.
  • The solution uses divide and conquer: divide the set into pieces recursively until the problem is so small that the sequential solution is more efficient. Sort the pieces. Merge the results. O(n log n), but the problem is parallelizable. Scales well and can keep many CPUs busy.
  • Divide and conquer uses fork-join to fork off subtasks, wait for them to complete, and then join the results. A typical thread pool solution is not efficient: it creates too many threads, and creating threads is expensive and uses a lot of memory.
  • This approach is portable because it's abstract. It doesn't know how many processors are available; it's independent of the topology.
  • The fork-join pool is optimized for fine grained operations, whereas the thread pool is optimized for coarse grained operations. It's best used for problems without IO: just computations using CPU that tend to fork off subproblems. It allows data to be shared read-only and used across different computations without copying.
  • This approach scales nearly linearly with the number of hardware threads.
  • The goals for fork-join: avoid context switches; have as many threads as hardware threads and keep them all busy; minimize queue lock contention for data structures; avoid a common task queue.
  • The implementation uses work-stealing. Each thread has a work queue that is a double ended queue. Each thread pulls work from the head of its queue and processes it. When there's nothing to do, it steals work from the tail of another queue. There's no contention for the head because only one thread accesses it, and rarely contention on the tail because stealing is infrequent: the stolen work is large, which takes the thieves time to process. The process starts with one task. It breaks up the work. Other threads steal work and start the same process. It load balances without central coordination, with few context switches and little coordination.
  • The same approach also works for graph traversal, matrix operations, linear algebra, modeling, and generating moves and evaluating the results. Latent parallelism can be found in a lot of places once you start looking.
  • Supports higher level operations like ParallelArray. You can specify filtering, transformation, and aggregation options. It's not a generalized in-memory database, but it has a very transparent cost model. It's clear how many parallel operations are happening. You can look at the code and quickly know what's a parallel operation, so you will know the cost.
  • Looks like map reduce except this is scaling across a multicore system, one single JVM, whereas map reduce is across a cluster. The strategy is the same: divide and conquer.
  • Idea is to make specifying parallel operations so easy you wouldn't even think of the serial approach.
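Here is a small sketch of the talk's running example using Java 7's fork-join API (ForkJoinPool and RecursiveTask are the real classes; the cutoff value and the rest of the code are my own illustration): below the cutoff the task computes sequentially; above it, it forks subtasks that the work-stealing pool load balances across cores.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    public class MaxTask extends RecursiveTask<Integer> {
        private static final int CUTOFF = 10_000;  // where splitting stops paying off
        private final int[] data;
        private final int lo, hi;

        MaxTask(int[] data, int lo, int hi) {
            this.data = data; this.lo = lo; this.hi = hi;
        }

        @Override
        protected Integer compute() {
            if (hi - lo <= CUTOFF) {               // small enough: go sequential
                int max = Integer.MIN_VALUE;
                for (int i = lo; i < hi; i++) max = Math.max(max, data[i]);
                return max;
            }
            int mid = (lo + hi) >>> 1;
            MaxTask left = new MaxTask(data, lo, mid);
            left.fork();                            // queue left half for stealing
            int right = new MaxTask(data, mid, hi).compute();  // do right half ourselves
            return Math.max(left.join(), right);    // join the forked result
        }

        public static void main(String[] args) {
            int[] data = new int[1_000_000];
            for (int i = 0; i < data.length; i++)
                data[i] = (int) (Math.random() * 1_000_000);
            int max = new ForkJoinPool().invoke(new MaxTask(data, 0, data.length));
            System.out.println("max = " + max);
        }
    }

Notice there is no up-front decision about how many pieces to create: the recursion adapts to the data size, and the pool adapts to however many hardware threads exist, which is exactly the portability point above.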

    Related Articles

  • The Free Lunch Is Over - A Fundamental Turn Toward Concurrency in Software By Herb Sutter
  • Intuition, Performance, and Scale by Dan Pritchett
  • "Multi-core Mania": A Rebuttal by Ted Neward
  • CPU designers debate multi-core future by Rick Merritt
  • Multicore puts screws to parallel-programming models by Rick Merritt
  • Challenges in Multi-Core Era – Part 1 and Part 2 by Gaston Hillar.
  • Running multiple processes to understand multicore CPUs power.
  • Learning to Program all Over Again by Vineet Gupta
Sunday
May 31 2009

Parallel Programming for the Real World

    Multicore computers shift the burden of software performance from chip designers and architects to software developers.

What is parallel computing? What is the difference between multi-threading, concurrency, and parallelism? What is the difference between task and data parallelism? And how can we use them?

A fundamental article on parallel programming... (a minimal sketch contrasting the two styles follows below).
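As a minimal illustration of the distinction (my own Java sketch, not code from the article): task parallelism runs different operations concurrently, while data parallelism runs the same operation over partitions of the data. The workload methods are stand-ins.

    import java.util.concurrent.*;

    public class TaskVsData {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Task parallelism: two unrelated jobs execute concurrently.
            Future<Long> checksum = pool.submit(() -> computeChecksum());
            Future<String> report = pool.submit(() -> renderReport());
            System.out.println(checksum.get() + " / " + report.get());

            // Data parallelism: one job, with the data split across workers.
            final double[] values = new double[1_000_000];
            int workers = 4, chunk = values.length / workers;
            CountDownLatch done = new CountDownLatch(workers);
            for (int w = 0; w < workers; w++) {
                final int lo = w * chunk;
                final int hi = (w == workers - 1) ? values.length : lo + chunk;
                pool.execute(() -> {
                    for (int i = lo; i < hi; i++)
                        values[i] = Math.sqrt(i);   // same operation per element
                    done.countDown();
                });
            }
            done.await();
            pool.shutdown();
        }

        static long computeChecksum() { return 42L; }   // hypothetical workload
        static String renderReport()  { return "ok"; }  // hypothetical workload
    }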

    Wednesday
May 27 2009

The Future of Parallelism and its Challenges

Research and education in parallel computing technologies is more important than ever. Here I present a perspective on the past contributions, current status, and future direction of parallelism technologies. While machine power will grow impressively, increased parallelism, rather than clock rate, will be the driving force in computing for the foreseeable future. This ongoing shift toward parallel architectural paradigms is one of the greatest challenges for the microprocessor and software industries. In 2005, Justin Rattner, chief technology officer of Intel Corporation, said ‘We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require…’

    Key points:

• A little history
• Parallelism challenges
• Under the hood: parallelism challenges
  • Synchronization problems
  • CAS problems (a short illustration follows this list)
• The future of parallelism
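For readers unfamiliar with CAS, here is a short illustration (mine, not from the post) of the compare-and-swap retry loop behind most lock-free structures. Under heavy contention, threads can spin through many failed attempts, which is one flavor of the problems the post discusses.

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        public int increment() {
            while (true) {
                int current = value.get();
                int next = current + 1;
                // Succeeds only if value is still `current`; otherwise
                // another thread got in first and we must retry.
                if (value.compareAndSet(current, next)) return next;
            }
        }
    }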

    Click to read more ...

    Tuesday
Apr 21 2009

Thread Pool Engine in MS CLR 4 and the Work-Stealing Scheduling Algorithm

I just saw this article on the HFadeel blog that talks about parallelism in .NET Framework 4: how the thread pool works, and the most famous scheduling algorithm, work-stealing, with a presentation to see it in action. A toy sketch of the idea follows below.

    Click to read more ...
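As a rough illustration of the idea (a toy in Java, not the CLR's actual implementation): each worker owns a double ended queue, pops tasks from one end, and idle workers steal from the opposite end of a victim's queue, so owner and thief rarely collide. A real scheduler would park idle workers and pick victims randomly.

    import java.util.concurrent.ConcurrentLinkedDeque;

    public class StealingWorker implements Runnable {
        final ConcurrentLinkedDeque<Runnable> local = new ConcurrentLinkedDeque<>();
        final StealingWorker[] all;   // all workers, for choosing a victim

        StealingWorker(StealingWorker[] all) { this.all = all; }

        public void run() {
            while (true) {
                Runnable task = local.pollFirst();   // owner works the head
                if (task == null) task = steal();    // empty: try to steal
                if (task == null) break;             // nothing anywhere: stop (toy only)
                task.run();
            }
        }

        private Runnable steal() {
            for (StealingWorker victim : all) {
                if (victim == this) continue;
                Runnable t = victim.local.pollLast(); // thief takes the tail
                if (t != null) return t;
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            StealingWorker[] ws = new StealingWorker[2];
            ws[0] = new StealingWorker(ws);
            ws[1] = new StealingWorker(ws);
            for (int i = 0; i < 10; i++) {           // seed all work on worker 0
                final int n = i;
                ws[0].local.add(() -> System.out.println("task " + n));
            }
            Thread t0 = new Thread(ws[0]), t1 = new Thread(ws[1]);
            t0.start(); t1.start();                  // worker 1 steals from worker 0
            t0.join(); t1.join();
        }
    }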