Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster

Many of our systems use Amazon's S3 as a backup repository for log data. Our data became too large to process with traditional techniques, so we started using Amazon's Elastic MapReduce (EMR) to run more expensive queries on the data stored in S3. The major advantage of EMR for us was the lack of operational overhead. With a simple API call, we could have a 20- or 40-node cluster running to crunch our data, which we shut down at the conclusion of the run. We had two systems interacting with EMR. The first consisted of shell scripts that started an EMR cluster, ran a Pig script, and loaded the output data from S3 into our data warehousing system. The second was a Java application that launched Pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.
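To make the "simple API call" concrete, here is a minimal present-day sketch in Python of a transient cluster that runs one Pig script and then shuts itself down. It is not the authors' shell scripts or Java application. It assumes the boto3 SDK and its `run_job_flow` call; the bucket paths, job name, instance type, release label, the `OUTPUT` Pig parameter, and the `command-runner.jar` step arguments are all illustrative assumptions to check against current EMR documentation.

```python
import json


def build_pig_job_flow(script_uri, output_uri, log_uri, release_label, node_count=20):
    """Build keyword arguments shaped like boto3's EMR run_job_flow request.

    Every concrete value here (names, instance types, roles) is a placeholder.
    """
    return {
        "Name": "log-crunch",  # hypothetical job name
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # False makes the cluster terminate once its steps finish,
            # mirroring "shut down at the conclusion of the run".
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-pig-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    # Assumed step form for running a Pig script on EMR.
                    "Jar": "command-runner.jar",
                    "Args": [
                        "pig-script", "--run-pig-script", "--args",
                        "-f", script_uri,
                        "-p", f"OUTPUT={output_uri}",  # hypothetical Pig parameter
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request to EMR. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder runs without it

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_pig_job_flow(
        script_uri="s3://example-bucket/scripts/report.pig",  # hypothetical paths
        output_uri="s3://example-bucket/output/",
        log_uri="s3://example-bucket/logs/",
        release_label="emr-6.15.0",  # assumed release label
        node_count=40,
    )
    # Print the request instead of launching, so nothing is billed by accident.
    print(json.dumps(request, indent=2))
```

Keeping the request builder separate from `launch` lets the payload be inspected or tested without touching AWS. Loading the output from S3 into a warehouse would be a separate step that this sketch does not cover.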

Reasons you might consider moving from Amazon Elastic MapReduce, the cloud version of MapReduce, to an on-premises Hadoop cluster:

  1. performance and tuning
  2. monitoring
  3. API access
  4. lack of latest features

Original title and link: Moving Away From Amazon's EMR Service to an In-House Hadoop Cluster (NoSQL database©myNoSQL)

Tags: data warehousing system, shell scripts, cluster performance, emr, amazon, java api, java application, mapreduce, hadoop, premise, pig, node, repository, queries, co
How to use JConsole to connect to a Cassandra cluster in Amazon EC2?
Mar 2, 2011
I want to use JConsole to look at Cassandra's MBean attributes (like Memtable stats). But since my Cassandra nodes are running on Amazon EC2 instances, I have to use an ssh login with a ppk authentication file (when connecting via Putty,…

Using Amazon Simple Email Service SES
Feb 15, 2011
Has anyone been able to use this service with commons email? I can't seem to wrap my head around how to do it. Here is a bit of their sample code.... PropertiesCredentials credentials = new PropertiesCredentials( …

Invalid sync error when reading Avro file (Amazon EMR Hadoop)
May 25, 2011
Getting this error when reading an Avro file on Amazon EMR Hadoop. It does not occur on any recent Apache Hadoop build. Exception org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!…

how I can do to configure/start a hadoop cluster(pseudo distributed) with the last hadoop trunk cod
Jun 3, 2010
All, I have followed the instructions on http://wiki.apache.org/hadoop/EclipseEnvironment to download the latest trunk source code and build the .jar files for common, hdfs and mapred. But how should I proceed to configure and start a hadoop cluster(psudo…

Chukwa moving out of Hadoop
Jul 16, 2010
All, As part of Chukwa's migration from Hadoop to incubator, I've moved the subversion over to incubator. Over the next couple weeks, we'll move the web pages, etc. I just didn't want anyone who isn't tracking the chukwa-dev list to get…

Hive should start moving to the new hadoop mapreduce api.
Jul 29, 2010
Hi all, In offline discussions while fixing HIVE-1492, we thought it may be good to start thinking about moving Hive to the new MapReduce context API, and also to start deprecating Hadoop-0.17.0 support in Hive. Basically the new MapReduce API gives…

Created: (HDFS-1488) hadoop will terminate Web service process when a hadoop mapreduce task is finis
Nov 5, 2010
hadoop will terminate Web service process when a hadoop mapreduce task is finished.

Total Space Available on Hadoop Cluster Or Hadoop version of "df".
Oct 1, 2010
Hi, I am using Hadoop version 0.20.2 for data processing, with a Hadoop cluster set up on two nodes, and I am continuously adding more space to the nodes. Can somebody let me know how to get the total space available on the hadoop cluster using…

Hadoop Cluster Configuration
Jul 28, 2010
Hi, While setting up a hadoop cluster, do the configuration files (conf/core-site.xml, conf/mapred-site.xml, conf/hdfs-site.xml) on every node (name node and data nodes) need to be configured in the same manner? How does configuration of name node…

Deploying my job jar on hadoop cluster
Aug 28, 2010
Hi, I want to deploy my map reduce job jar on the Hadoop cluster. I've always done that by doing the following - 1. Copying the job jar to all datanodes 2. Having the job jar on the hadoop classpath on all machines. Isn't hadoop capable of…