Entries in Hadoop (20)

Monday
May 11, 2009

Facebook, Hadoop, and Hive

Facebook runs the second-largest installation of Hadoop (a software platform that lets you easily write and run applications that process vast amounts of data); only Yahoo's is larger.

Learn how they do it and what the challenges are over at the DBMS2 blog, a blog for people who care about database and analytic technologies.

Thursday
Mar 12, 2009

QCon London 2009: Database projects to watch closely

Geir Magnusson from 10gen presented a talk titled Cloud Data Persistence, or “We’re in a database renaissance - pay attention”, today at QCon London 2009. The main message of his talk was that “physical limitations of today’s technology combined with the computational complexity of conventional relational databases are driving databases into new exciting spaces”, or to put it more simply: the database landscape is changing and we should keep our eyes on it.

Click to read more ...

Friday
Nov 14, 2008

Paper: Pig Latin: A Not-So-Foreign Language for Data Processing

Yahoo has developed a new language called Pig Latin that fits in a sweet spot between high-level declarative querying in the spirit of SQL and low-level, procedural programming à la map-reduce, combining the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. References: Apache Pig Wiki

Click to read more ...

Wednesday
Oct 15, 2008

Need help with your Hadoop deployment? This company may help!

A group of top Silicon Valley engineers (ex-Yahoo, Facebook, Google) have come together to launch a new startup called Cloudera. The company hasn't launched yet, but it intends to help other companies adopt a promising software platform called Hadoop.

Hadoop is an open-source software project (written in Java) designed to let developers write and run applications that process huge amounts of data. While it could potentially improve a wide range of other software, the ecosystem supporting its implementation is still developing, which is where Cloudera hopes to make a place for itself.

More on Hadoop: It uses the Google-introduced MapReduce framework, which divides applications into small blocks of work and creates multiple replicas of data blocks that it places on various compute nodes.

It is already in use at large companies like Yahoo.

Read more about Cloudera here.

Click to read more ...

Saturday
Oct 4, 2008

Is MapReduce going mainstream?

Compares MapReduce to other parallel processing approaches and suggests a new paradigm for clouds and grids.

Click to read more ...

Sunday
Sep 28, 2008

Product: Happy = Hadoop + Python

Has a Java-only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off Hadoop's sharp edges and makes writing map-reduce programs a breeze. There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages. The more ways the better. From the website:

Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. The current release supports Hadoop 0.17.2.

Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing a map(records, task) and reduce(key, values, task) function. Then you create an instance of the class, set the job parameters (such as inputs and outputs) and call run().

When you call run(), Happy serializes your job instance and copies it and all accompanying libraries out to the Hadoop cluster. Then for each task in the Hadoop job, your job instance is de-serialized and map or reduce is called.

The task results are written out using a collector, but aggregate statistics and other roll-up information can be stored in the happy.results dictionary, which is returned from the run() call.

Jython modules and Java jar files that are being called by your code can be specified using the environment variable HAPPY_PATH. These are added to the Python path at startup, and are also automatically included when jobs are sent to Hadoop. The path is stored in happy.path and can be edited at runtime.
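
To make that concrete, here is a minimal word-count-style sketch based only on the description above. The happy.HappyJob base class, the map(records, task) and reduce(key, values, task) signatures, the run() call and the returned results dictionary come straight from the quoted docs; the task.collect(...) call, the record format, and the inputpaths/outputpath attribute names are assumptions, so check the Happy documentation before relying on them.

    import happy

    class WordCount(happy.HappyJob):
        def __init__(self, inputpath, outputpath):
            happy.HappyJob.__init__(self)
            # job parameter names below are a guess based on "set the job
            # parameters (such as inputs and outputs)"
            self.inputpaths = inputpath
            self.outputpath = outputpath

        def map(self, records, task):
            # assume each record arrives as a (key, text line) pair
            for key, line in records:
                for word in line.split():
                    task.collect(word, "1")   # collector call is an assumption

        def reduce(self, key, values, task):
            total = 0
            for value in values:
                total += int(value)
            task.collect(key, str(total))

    if __name__ == "__main__":
        job = WordCount("logs/input", "logs/wordcounts")
        results = job.run()   # per the docs, run() returns the happy.results dictionary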

Click to read more ...

Tuesday
Feb 19, 2008

Hadoop Getting Closer to 1.0 Release

Update: Yahoo! Launches World's Largest Hadoop Production Application. A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. Raw disk is at 5 petabytes. Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. Greg Linden thinks the Google cluster has way over 133,000 machines. From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress towards its 1.0 release. They've successfully reached a 1000 node cluster size, improved file system integrity, and jacked performance by 20x in the last year. How they are making progress could be a good model for anyone:

The speedup has been an aggregation of our work in the past few years, and has been accomplished mostly by trial-and-error. We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. We aim for performance to scale linearly as you increase the cluster size. We learn from this process and then increase the cluster size again. Each time you increase the cluster size reliability becomes a bigger challenge since the number and kind of failures increase.
It's tempting to say just jump to the end game and don't bother with all those errors and trials, but there's a lot of learning and experience that must be earned on the way to scaling anything.

Click to read more ...

Thursday
Feb 7, 2008

Looking for good business examples of companies using Hadoop

I have read the blog about Mailtrust/Rackspace, as well as the interesting things Google and Yahoo are doing. Who else is using Hadoop/MapReduce to solve business problems? TIA johnmwillis.com

Click to read more ...

Wednesday
Jan 30, 2008

How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text-file-stored-on-each-machine approach, to a Neanderthal'ic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding in caused a lot of index churn. Periodic bulk loading was the remedy for this problem, but the sheer size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data arrived, this solution broke down under a combination of load and operational problems. Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they figured it would only buy time until a more scalable solution would have to be created anyway. The future came a little early this year. The advantage of their new system is that they can now look at their data in any way they want:

  • Nightly MapReduce jobs collect statistics about their mail system, such as spam counts by domain, bytes transferred and number of logins (a rough sketch of such a job appears just below).
  • When they wanted to find out which part of the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. That's not really possible in your typical ETL system.

    This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of a complex question about our customers' usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff." In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system. Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report.
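
    To give a flavor of what such a job can look like, here is a hedged sketch of a "spam counts by domain" job written as a pair of Hadoop Streaming scripts in Python. This is not Rackspace's actual code, and the tab-separated log format with a recipient field and a spam verdict field is entirely hypothetical; it only illustrates the map (emit domain, 1) and reduce (sum per domain) split.

        #!/usr/bin/env python
        # mapper.py: read raw log lines on stdin, emit "domain<TAB>1" for spam hits.
        # The field layout is hypothetical.
        import sys

        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            recipient, verdict = fields[1], fields[2]
            if verdict == "SPAM" and "@" in recipient:
                print("%s\t1" % recipient.split("@", 1)[1].lower())

        #!/usr/bin/env python
        # reducer.py: streaming input arrives sorted by key, so sum runs of equal domains.
        import sys

        current, count = None, 0
        for line in sys.stdin:
            domain, n = line.rstrip("\n").split("\t")
            if domain != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = domain, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    A nightly cron entry could then submit the two scripts through Hadoop's streaming jar (passing them via the -mapper and -reducer options), with the output feeding whatever reporting the statistics require.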

    Information Sources

  • MapReduce at Rackspace
  • A document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). This post is a little different than normal because almost all the content past this point is by Bill; I've just organized it a little differently.

    The Platform

  • Hadoop
  • Hadoop Distributed File System (HDFS)
  • Lucene
  • Solr
  • Tomcat

    The Stats

  • Rackspace has more than 50K devices and 7 data centers.
  • The mail system and logging servers are currently in 3 of the Rackspace data centers.
  • The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.
  • Several hundred gigabytes of email log data is generated each day.

    Background on Mailtrust

  • Email hosting company
  • Founded in 1999, merged with Rackspace in 2007, previous name: Webmail.us
  • 80K business customers, 700K mailboxes.
  • 2 hosted mail products: Noteworthy, MS Exchange
  • The Noteworthy System: homegrown, Linux based; POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync; ~600 servers, commodity hardware, designed to work around frequent failures.
  • The MS Exchange System: MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync; ~100 servers, higher-end hardware, SAN & DAS storage.

    The Architecture

    The way the current Hadoop based system works is:
  • Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System ("HDFS") in real time.
  • MapReduce jobs are scheduled to run to index the new data using Apache Lucene and Solr.
  • Once the indexes have been built, they are compressed and stored away in HDFS.
  • Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.
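
    As a rough illustration of that last step, a support tool might hit one of those Solr instances over HTTP like the hedged sketch below. The host, port, handler path and field name are assumptions (the post doesn't give the schema); only the general pattern of querying Solr's select handler with a JSON response is standard.

        # Hypothetical query against one of the Solr search nodes described above.
        import json
        import urllib.parse
        import urllib.request

        def search_logs(sender, solr_url="http://datanode01:8983/solr/select"):
            params = urllib.parse.urlencode({
                "q": 'sender:"%s"' % sender,   # field name is a guess
                "rows": 50,
                "wt": "json",                  # Solr's standard JSON response writer
            })
            with urllib.request.urlopen("%s?%s" % (solr_url, params)) as resp:
                return json.load(resp)["response"]["docs"]

        # e.g. search_logs("user@example.com") returns matching log records as dicts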

    The System Evolution

    The Problem

    Mailtrust is a very customer service focused company. It is extremely important for our support techs to be able to examine mail logs in order to troubleshoot problems for our customers. Our support techs need to search the logs hundreds of times per day, so the tools that provide this functionality must be fast and accurate. With over 600 mail servers, and hundreds of gigabytes of raw log data produced each day, this can be tricky to manage. Here is a brief history of the Mailtrust logging architecture, the problems we faced, how we overcame them, and what the system looks like today...

    Logging v1.0

    Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days. Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog. Problems: Once we grew much past a dozen servers, this manual process of logging into each server became too time-consuming for our engineers.

    Logging v1.1

    Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server. An engineer could tell the script what type of mail server to search (inbound smtp, outbound smtp, backend mailbox). The script would look at /etc/hosts for a list of servers of that type, and then iterate through each server, ssh in, perform the grep and then output the results. The script could also search in the past via "gunzip -c /var/log/maillog.* | grep" Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.
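
    The v1.1 script isn't shown in the post, but the approach it describes (pick the hosts of a given type, ssh in, grep, print the results) amounts to something like this hedged reconstruction; the host-naming convention and the log path are assumptions, not Mailtrust's actual code:

        #!/usr/bin/env python
        # Rough reconstruction of a fan-out log search across mail servers.
        import subprocess
        import sys

        def hosts_of_type(server_type, hosts_file="/etc/hosts"):
            # assume hostnames embed the server type, e.g. "smtp-in-07"
            hosts = []
            with open(hosts_file) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) >= 2 and server_type in parts[1]:
                        hosts.append(parts[1])
            return hosts

        def search(server_type, pattern):
            for host in hosts_of_type(server_type):
                cmd = ["ssh", host, "grep", pattern, "/var/log/maillog"]
                out = subprocess.run(cmd, capture_output=True, text=True).stdout
                for hit in out.splitlines():
                    print("%s: %s" % (host, hit))

        if __name__ == "__main__":
            # e.g. ./logsearch.py backend-mailbox user@example.com
            search(sys.argv[1], sys.argv[2])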

    Logging v2.0

    We released a log search tool that the support techs could use directly, without involving the engineers. The support team was given a web-based tool where they could search the logs. It allowed searching by the sender or recipient's email address, domain name or IP address. All of these were indexed fields in a MySQL database. Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow. Each day's logs were stored in a separate table, so that we could clean up old data by simply dropping and recreating MySQL tables. This made cleanup really fast compared to running a conditional DELETE command on a large table. Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size. To get the logs into the database, each mail server initially wrote its log data to a local 16MB tmpfs partition. Logrotate was called via cron every 60 seconds to rotate the temporary log file and then preprocess the data before sending it on to the centralized log server. This preprocessing step reduced the volume of data that had to be transmitted over the network to the log server, and it also distributed the processing workload to avoid creating a bottleneck on the log server. After the data was processed locally, the script would send comma delimited log data back to syslog-ng on the local server, and syslog-ng would then send it over the network to the centralized log server. The log server was configured to receive data on 6 different ports, one for each type of log data... inbound smtp, outbound smtp, backend smtp, spam/virus filtering, POP3 and IMAP. As log data was received, the records were inserted one by one into the database via MySQL INSERT commands. Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.
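
    The schema isn't given in the post, but the per-day-table pattern it describes looks roughly like the sketch below. Table and column names are hypothetical, and a real script would run these statements through a MySQL driver (e.g. MySQLdb) via the cursor passed in here.

        import datetime

        DDL = """
        CREATE TABLE maillog_%s (
            ts        DATETIME     NOT NULL,
            sender    VARCHAR(255) NOT NULL,
            recipient VARCHAR(255) NOT NULL,
            domain    VARCHAR(255) NOT NULL,
            ip        VARCHAR(45)  NOT NULL,
            detail    TEXT,
            INDEX (sender), INDEX (recipient), INDEX (domain), INDEX (ip)
        )"""

        def table_for(day):
            return "maillog_%s" % day.strftime("%Y%m%d")

        def create_table_for_today(cursor):
            # searches hit the indexed columns with equality lookups; no LIKE scans
            cursor.execute(DDL % datetime.date.today().strftime("%Y%m%d"))

        def drop_expired_tables(cursor, keep_days=3):
            # cleanup is just DROP TABLE, far cheaper than a conditional DELETE
            for offset in range(keep_days, keep_days + 7):
                day = datetime.date.today() - datetime.timedelta(days=offset)
                cursor.execute("DROP TABLE IF EXISTS %s" % table_for(day))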

    Logging v2.1

    Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was orders of magnitude faster than inserting the log data one record at a time. Problems: The LOADs would get progressively slower as the database grew, because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.
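
    The rotate-then-load step described above comes down to something like this sketch. LOAD DATA INFILE is standard MySQL; the paths, table name, and how syslog-ng is told to reopen its output file are assumptions, and a real script would escape the filename instead of interpolating it.

        import os
        import time

        SPOOL = "/var/log/central/inbound_smtp.log"

        def rotate_and_load(cursor):
            batch = "%s.%d" % (SPOOL, int(time.time()))
            os.rename(SPOOL, batch)
            # (signal syslog-ng to reopen its output file here, e.g. via SIGHUP)
            cursor.execute(
                "LOAD DATA INFILE '%s' INTO TABLE maillog_inbound "
                "FIELDS TERMINATED BY ','" % batch
            )
            os.remove(batch)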

    Logging v2.2

    Introduced Merge Tables in order to speed up loading the log data into the database. With this version, every 10 minutes our script would create a new database table and then load the text logs into the empty table. This made the LOAD command extremely fast because there were no existing database indexes that could negatively affect performance. After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together. The web search tool was modified to allow searching within the following time ranges: all day, past 12-hours, past 6-hours, past 2-hours. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created. Problems: This version of the logging system worked reliably for about one year. But we began having problems with it as our support team, customer base and server count grew. When we reached about 100 servers the database LOAD operations would take 2-3 minutes to run, which was acceptable, but the server was now always under a heavy CPU and disk I/O load. Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy. Also, there were several occasions where our engineers performed a software upgrade to a particular application, which changed the log format in a way that broke the preprocessing script. Since our raw logs were deleted from the local mail servers every 60 seconds, we'd have no way to recover the missing logs when this occurred. Additionally, the log search tool was becoming ever more critical to our support team's daily operations; however, the logging system had no redundancy. There was no RAID, no backups, no failover system. We also did not have a good plan for scaling the log system beyond a single monolithic server. Incrementally patching problems and tweaking performance with the log system was taking up a lot of time and we needed something better. We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.
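
    Concretely, MySQL's MERGE storage engine lets a set of identically structured MyISAM tables be queried as one, and the 10-minute rotation described above maps onto it roughly like this sketch; the table names and the "last 12 chunks = past 2 hours" arithmetic are hypothetical, and the MERGE table itself would have been created up front with ENGINE=MERGE.

        def load_chunk_and_repoint(cursor, chunk_table, batch_file, recent_chunks):
            # 1. load into a brand-new table: no existing indexes to slow the LOAD down
            cursor.execute("CREATE TABLE %s LIKE maillog_template" % chunk_table)
            cursor.execute(
                "LOAD DATA INFILE '%s' INTO TABLE %s FIELDS TERMINATED BY ','"
                % (batch_file, chunk_table)
            )
            recent_chunks.append(chunk_table)
            # 2. repoint the MERGE table the search tool queries; with 10-minute
            #    chunks, the "past 2 hours" view is the union of the last 12 tables
            union = ", ".join(recent_chunks[-12:])
            cursor.execute("ALTER TABLE maillog_past_2h UNION=(%s)" % union)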

    Logging v3.0

    While designing v3.0, we looked at several commercial log processing applications. Splunk stood out, and did just about everything we wanted; however, we worried that using a vendor product like this might limit our ability to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out its workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So we chose to go with Hadoop. In about 3 months we built a fresh new log processing system using Hadoop, Lucene and Solr. The system is described here: http://blog.racklabs.com/?p=66 We believe this new system will be able to scale with us as our company grows. And there is a lot of momentum behind the Hadoop project, which gives us a lot of confidence that its scalability will continue to improve. Yahoo is one of the major contributors to the project and has built Hadoop clusters that contain thousands of servers, and they are aggressively working to get Hadoop to support tens of thousands of servers. Problems: To date, the only problems we have found have been our own bugs; and we fix those as we find them. We are actively running v3.0 today, but we're not going to stop here. We have a lot of plans for new features...

    The Future

    Version 3.1 is being coded currently. It includes new MapReduce jobs that support Microsoft Exchange log processing (currently we only process Noteworthy logs with this system). We plan to go live in March. In version 4.0 we plan to put the log search tool in the hands of our customers so that they can have the same troubleshooting power that our support team has. This will most likely require reorganizing the way we store log index shards so that they are grouped by user, rather than letting Solr randomly group them. Our resellers seem to be excited about this, because it should allow them to better support their customers. Who knows what we'll build after v4.0...

    Related Articles

  • Google Architecture
  • Database People Hating on MapReduce
  • Product: Hadoop
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3
  • Solr

    Click to read more ...

Thursday
Nov 15, 2007

Video: Dryad: A general-purpose distributed execution platform

Dryad is Microsoft's answer to Google's map-reduce. What's the question? How do you process really large amounts of data? My initial impression of Dryad is that it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars? Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense. Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. It's written in C++, but apparently few write directly to this layer; most people use higher layer interfaces. A Job Manager runs the program. It's a library you link in, and it loads and executes the graph. A daemon runs on each machine to run jobs. A name server provides access to cluster resources. The DAG is a multigraph, so you can have multiple edges between vertices. A DAG was chosen because it's not too cold, or too hot, the porridge is just right. Cycles are too hard. Simpler isn't as useful. DAGs support relational algebra and can split multiple inputs and outputs nicely. One interesting aspect is that a channel is a sequence of structured items that are C++ objects. This means pointers can be passed directly so you don't have to worry about serialization overhead. No restrictions are put on the data model. Graphs are dynamically changeable at runtime, which allows for a lot of optimizations. Several case studies were provided. It's probably just me, but I didn't really understand what was going on. Google's example is much better. Everyone can relate to counting words in a document. My thought while watching was that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. You have to manually optimize for available RAM and CPUs. The system should do all this work for you. But the graph approach is powerful. The programmer provides the bits of atomic behaviour and the system can then try various optimizations. The code doesn't have to change because the graph can be manipulated abstractly on its own. So you can write something like a SQL query. Then something like a query planner figures out how to execute the query on Dryad.
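
Dryad's C++ API isn't shown in the talk, so the following is purely a conceptual illustration of the execution model, not Dryad code: each vertex is a function, each edge carries one vertex's output to its consumers, and a scheduler walks the DAG in topological order.

    # Conceptual DAG runner, illustrative only; real Dryad handles scheduling,
    # distribution across machines, typed channels and fault tolerance.
    from collections import defaultdict, deque

    def run_dag(vertices, edges, source_inputs):
        # vertices: name -> function(list of inputs) -> output
        # edges: (producer, consumer) pairs forming a directed acyclic graph
        indegree = defaultdict(int)
        consumers = defaultdict(list)
        for src, dst in edges:
            indegree[dst] += 1
            consumers[src].append(dst)
        inputs = defaultdict(list, source_inputs)
        ready = deque(v for v in vertices if indegree[v] == 0)
        results = {}
        while ready:                        # topological-order "scheduling"
            v = ready.popleft()
            results[v] = vertices[v](inputs[v])
            for c in consumers[v]:
                inputs[c].append(results[v])
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
        return results

    # toy usage: read -> count -> pick the most frequent word
    stages = {
        "read":  lambda _: ["b", "a", "c", "a"],
        "count": lambda ins: dict((w, ins[0].count(w)) for w in set(ins[0])),
        "top":   lambda ins: max(ins[0], key=ins[0].get),
    }
    print(run_dag(stages, [("read", "count"), ("count", "top")], {}))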

Click to read more ...
