Tim's Weblog
Tim Strehle’s links and thoughts on Web apps, software development and Digital Asset Management, since 2002.
2008-01-31

How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

Todd Hoff and Bill Boebel at High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data:

"The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.

[...] For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out it's workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So chose to go with Hadoop. In about 3 months we build a fresh new log processing system using Hadoop, Lucene and Solr."