Best unofficial Apache Server developers community
Jul 4, 2010
Dan Harvey
Similar Threads
How to apply RDBMS table updates and deletes into Hadoop
To generate smart output from base data we need to copy some base tables from a relational database into Hadoop. Some of them are big. Dumping the entire table into Hadoop every day is not an option since there are 30+ tables and each would take several hours. Our approach is to get the entire table dump first. Then each day, or every 4-6 hours, we get only the inserts/updates/deletes since the last copy from the RDBMS (based on a date field in the table). Using Hive, we do an outer join + union of the new data with the existing data and write the result into a new file. For example, if there are 100 rows in Hadoop, and in the RDBMS 3 records were inserted, 2 records updated and 1 deleted since the last Hadoop copy, then the Hive query will get the 97 unchanged rows + 3 inserts + 2 updates and write them into a new file. Other applications like Pig or Hive will pick the most recent file to use when selecting/loading data from those base table data files. This logic is working fine in lower environments for small tables. With production data, for a table of about 30GB, the incremental re-generation of the file in Hadoop is still taking several hours. I tried using a zipped version and it took even longer. I am not convinced that this is the best we can do to handle updates and deletes, since we had to re-write the 29GB of unchanged data in the 30GB file into a new file. ...and this is not the biggest table. I think this should be a problem for many companies. What are the other approaches to apply updates and deletes on base tables to the Hadoop data files? We have 4 data nodes and are using version 20.3. Thanks!
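The merge described in the post above can be illustrated outside Hive with a minimal Python sketch. It is not the poster's Hive query; the function name `apply_changes`, the `id` key column and the in-memory lists are assumptions made for illustration only.

```python
def apply_changes(base, inserts, updates, deleted_keys, key="id"):
    """Build a new snapshot from the old one plus the changes since the last copy."""
    # Rows that replace or extend the old snapshot, indexed by primary key.
    changed = {row[key]: row for row in updates}
    changed.update({row[key]: row for row in inserts})
    deleted = set(deleted_keys)

    # Keep only rows that were neither deleted nor superseded by an update.
    snapshot = [row for row in base
                if row[key] not in deleted and row[key] not in changed]
    # Append the inserted and updated rows.
    snapshot.extend(changed.values())
    return snapshot


# The example from the post: 100 rows, then 3 inserts, 2 updates, 1 delete.
base = [{"id": i, "value": "old"} for i in range(1, 101)]
inserts = [{"id": i, "value": "new"} for i in (101, 102, 103)]
updates = [{"id": i, "value": "changed"} for i in (1, 2)]
snapshot = apply_changes(base, inserts, updates, deleted_keys=[3])
print(len(snapshot))
```

The result holds the 97 unchanged rows plus 3 inserts and 2 updates, 102 rows in total. Like the Hive job, this sketch still copies every unchanged row into the new snapshot, which is exactly the cost the post is asking how to avoid.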
Re: How to apply RDBMS table updates and deletes into Hadoop
Thank you for your response. I understand... Just a few points before I accept that this is too complicated :) The main idea is to keep different versions of data under the same table, similar to HBase, but this is row level and you don't have to make the other versions accessible from Hive, only the most recent one. You just need to create an access layer that works on the most recent version of the row. If you can think of a different way of uniquely identifying a row to know its versions, and a timestamp (or counter or version #??) to know the most recent one, it doesn't have to be the columns that I specified before. It can be a different file that you create in the background (which can also be the index file!!). Oracle has ROWID for the physical location of the row and locks it before the data manipulation. Hadoop has the advantage of storage and map-reduce. So why not use it and keep all versions of changed data and access the most recent one via map-reduce? Accessing the data can get slower over time when there are many versions. That can be fixed with a flush or full replication of the data from time to time in a maintenance window by the end user. Hive is a great tool to access and manipulate Hadoop files. You are doing an amazing job. I have no idea what complications you face each day. Just disregard this if I am talking nonsense to you. Keep up the good work! Cheers! Atreju, Your work is great. Personally I would not get too tied up in the transactional side of hive. Once you start dealing with locking and concurrency the problem becomes tricky. We hivers have a long-time tradition of 'punting' on complicated stuff we do not want to deal with. :) Thus we only have 'Insert Overwrite', no 'insert update' :) Again, I think you wrote a really cool application. It would make a great use case, blog post, or a stand-alone application. Call it HiveMysqlRsync or something :).
However, you mention several requirements that are specific to your application: timestamp and primary key. If you can abstract all your application-specific logic it could make its way into hive. But it might be a stand-alone program because hive to rdbms replication might be a little out of scope. Edward
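The row-versioning idea raised in this thread (keep every version, read only the most recent one, and flush old versions from time to time) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `(key, version, row)` record shape, the use of `None` as a delete marker, and the function names are hypothetical and not part of Hive or HBase.

```python
def newest_versions(versions):
    """Map each key to its (version, row) pair with the highest version number."""
    newest = {}
    for key, version, row in versions:
        if key not in newest or version > newest[key][0]:
            newest[key] = (version, row)
    return newest


def latest_rows(versions):
    """Access layer: the current row per key, hiding deleted keys (row is None)."""
    return {key: row
            for key, (_, row) in newest_versions(versions).items()
            if row is not None}


def compact(versions):
    """The 'flush': drop superseded versions and delete markers."""
    return [(key, version, row)
            for key, (version, row) in sorted(newest_versions(versions).items())
            if row is not None]


log = [
    ("a", 1, {"name": "alpha"}),
    ("b", 1, {"name": "beta"}),
    ("a", 2, {"name": "alpha-v2"}),  # update of "a"
    ("b", 2, None),                  # delete of "b" (assumed marker)
    ("c", 1, {"name": "gamma"}),
]
current = latest_rows(log)
compacted = compact(log)
```

Appending changes is cheap in this scheme, while reads scan every stored version; that is the slowdown the post expects to fix with a periodic flush.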
HBase on Hadoop 0.21
Hi, I've checked the Release Notes of HDFS 0.21 and saw that two fixes from hadoop-append are included and two others are not, but there are still some more that have to do with sync stuff. Is hadoop-append for HBase made obsolete by HDFS 0.21? Thank you, Thomas Koch, http://www.koch.ro
HBASE/HADOOP Examples
I've found examples using the older mapred interface but not the newer mapreduce interface. I want to write a mapper that is configured to pull out only specific rows (which are the mapper's keys) and a specific column's value (which is the mapper's value). Are there any examples of something like this available? James Kilbride
Re: Hadoop support for hbase
Hello folks,
I created a branch for doing the append/sync support for Hadoop 0.20. You can fetch the branch via http://svn.apache.org/repos/asf/hadoo...anch-0.20-append/
If you feel that there are some JIRAS that need to go into this branch, please update the fix-version of those JIRAS with the tag "0.20-append" </jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310942&fixfor=12315103>.
thanks,
dhruba
On Mon, May 10, 2010 at 11:35 PM, Dhruba Borthakur <dhru### @gmail.com> wrote:
@Allen: we are definitely behind 0.21 release. Tom White is guiding that release and most developers are committed to removing blockers for that release. Todd rightly mentions that the work being done for 0.20 benefits 0.21 as well.
@Jay: Thanks for summing it up so well. I completely agree with your viewpoint.
thanks
dhruba
On Mon, May 10, 2010 at 2:06 PM, Jay Booth <jayb### @gmail.com> wrote:
> Given that the 0.20-append branch pretty much already exists
> unofficially, via IRC, IM and email forwarded patchsets, it seems like
> giving it an official home is just recognizing the status quo.
> Especially since 0.21 probably won't be getting rolled out into
> production everywhere the first day it's officially released. If the
> work's going on anyways, I don't see how giving people a shared home
> hurts matters, if anything it gives them a better shared touchpoint
> for forward-porting bugfixes to 0.21.
>
> A case could be made that by making it more painful to run
> 0.20-append, more momentum is created towards 0.21 but since Tom is
> already on top of 21 and seemingly doing an excellent job, and since
> the HBase community will probably be some of the first people to move
> to 0.21 anyways, I don't see why having 0.20-append will damage 0.21's
> momentum at this point.
>
> On Mon, May 10, 2010 at 4:21 PM, Michael Segel
> <michae### @hotmail.com> wrote:
> >
> >> From: to### @cloudera.com
> >> Date: Mon, 10 May 2010 10:45:13 -0700
> >> Subject: Re: Hadoop support for hbase
> >> To: gene### @hadoop.apache.org
> >>
> >> > The above is a fallacious setup. How does a branch in 0.20 detract
> >> > from the 0.21 momentum (The append feature that we'd work on in 0.20
> >> > branch has little relation to how append works in 0.21).
> >>
> >> For what it's worth, though, the majority of the size of the 0.20
> >> append patch is made up of additional unit tests. I have started
> >> forward-porting these new tests to the trunk append and it's already
> >> exposed a number of bugs. So while it's tempting to say that the 0.20
> >> append is "wasted effort", it really is benefiting the entire
> >> community and the 0.21 release as well.
> >>
> >> -Todd
> >>
> >
> > Sometimes you have to slow down to go faster.
> >
NoClassDefFoundError: org/apache/hadoop/hbase/rest/Main
I am trying to start and stop stargate rest server. I get
ClassNotFoundException intermittently.
I did perform these steps :
- Place the Stargate jar in either the HBase installation root
directory or lib/ directories.
- Copy the jars from contrib/stargate/lib/ into the lib/ directory of
the HBase installation.
:/usr/local/hbase-0.20.3 hadoop$./bin/hbase
org.apache.hadoop.hbase.stargate.Main -p 8080
2010-07-03 04:32:39.593::INFO: Logging to STDERR via
org.mortbay.log.StdErrLog
2010-07-03 04:32:39.633::INFO: jetty-6.1.14
2010-07-03 04:32:39.908::INFO: Started SocketC### @0.0.0.0:8080
^Z
[1]+ Stopped ./bin/hbase
org.apache.hadoop.hbase.stargate.Main -p 8080
:/usr/local/hbase-0.20.3 hadoop$bg
[1]+ ./bin/hbase org.apache.hadoop.hbase.stargate.Main -p 8080 &
:/usr/local/hbase-0.20.3 hadoop$./bin/hbase-daemon.sh start
org.apache.hadoop.hbase.rest.Main -p 8080
starting org.apache.hadoop.hbase.rest.Main, logging to
/var/hbase/logs/hbase--org.apache.hadoop.hbase.rest.Main-phxradar03.out
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/rest/Main
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.rest.Main
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.hadoop.hbase.rest.Main. Program
will exit.
ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/mapreduce/TableInputFormat
Hi All,
This is my first mail in the apache mailing list... please bear with me as I am absolutely new to Hadoop and its family. This is my question... I have some data on my hdfs in the following form.
(number:int,word:chararray, word2:chararray,somethingelse:int)
I want to get this data into a neatly formed HBase Table. I chose the simpler way instead of writing my own udf. I wanted to do this....
register ../hbase/hbase-0.20.4.jar;
register ../hbase/hbase-0.20.4-test.jar;
A = Load '/some_data';
B = STORE A into 'hbase://something' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage;
dump B;
but this is the error I get when I do that
2010-07-22 16:38:35,041 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://MyMachine01:9000
2010-07-22 16:38:35,550 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: MyMachine01:9001
2010-07-22 16:38:35,868 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/mapreduce/TableInputFormat
I have checked my hbase-0.20.4.jar file and it does have a TableInputFormat class. I added the right path to hadoop-env.sh in the CLASSPATH field. I added the conf folder to the classpath and also the test jar. I don't know why it wouldn't work. My HBase installation went really smooth. I am able to check the status of the HBase in the hbase shell and still I get this error. I am totally lost at this point. I would really appreciate any help in this regard.
Thanks a bunch.
V.
How to specify HBase cluster end-points from HBase client code in HBase 0.20.0
Hello, In my current application environment, I need to have two HBase clusters running in two different racks, forming a fault-tolerant group that can tolerate a power failure. I also have an HBase client, sitting outside of these two clusters, that makes invocations to these two HBase clusters. In my previous work, I simply needed to use the “HTable” class and pass in an instance of HBaseConfiguration. To construct the HBaseConfiguration instance, I just needed to pass in the path of the “hbase-site.xml”. In the hbase-site.xml, there is only one parameter, called “hbase.rootdir”, that needs to be configured. Before HBase 0.20.0, there used to be a parameter called “hbase.master” that I could specify. But in HBase 0.20.0, I found that it does not work any more, likely because the HBase master is managed by Zookeeper and the master node is now dynamic. Could you show me which APIs I need to use in order to specify the end-point address of the HBase cluster for the HBase client invocation? Regards, Jun
rolling 2.2.16 tomorrow
Hi, I'll start the tagging + voting on 2.2.16 tomorrow unless everyone starts screaming...... Thanks, Paul
how I can do to configure/start a hadoop cluster(pseudo distributed) with the last hadoop trunk cod
All, I have followed the instructions on http://wiki.apache.org/hadoop/EclipseEnvironment to download the latest trunk source code and build the .jar files for common, hdfs and mapred. But how should I proceed to configure and start a hadoop cluster (pseudo distributed) with these latest .jar files? I know how to configure/start a hadoop cluster with the formal hadoop package (hadoop-*.tar.gz, with all the stuff for common, hdfs and mapred in it). I googled but didn't find the related information; most of what I found about the steps after compiling is about running the unit tests. Can anyone help? Thanks for the help. Best Regards, Fred
Why updates PUT only
Hello, I just learned about update handlers - really like them! But why are they only usable with the PUT verb? That means they cannot be used with HTML forms. Thank you for reading, regards, alux
Parent POM updates
I'd like to suggest the following updates to the HC parent POM:
1) add the following profile:
<!--
| Profile to allow testing of deploy phase
| e.g.
| mvn deploy -Ptest-deploy -Prelease -Dgpg.skip
-->
<profile>
<id>test-deploy</id>
<properties>
<altDeploymentRepository>id::default::file:target/deploy</altDeploymentRepository>
</properties>
</profile>
I found this very useful in BSF3 for checking the contents of Maven
artifacts
2) Change the <distributionManagement> section URLs to be the same
as
in the Apache parent POM v7.
This would mean that Maven deployment would be done via Nexus, rather
than directly to the live forge. A lot safer, as the artifacts cannot
be accidentally released.
It also allows the artifacts to be reviewed before release.
Also, Nexus performs checks on the sigs and hashes (e.g. I forgot to
publish my new key to the key server, and it would not let me upload).
The RM would need to login to Nexus in order to close the upload, and
then use Nexus again once the vote has completed, but this does not
take long.
Thoughts?
Live updates in Cassandra
Does anyone have any advice for me about how to do live updates (i.e. updates on clusters while they are in use)? My situation is like this: Some data comes from the "back office" - this data is imported to Cassandra from Oracle. Some data comes from users - this data is imported to Oracle from Cassandra. For normalized data, I don't see a problem updating Cassandra from Oracle. But updating my indexes on a live database is tricky, because an update of a single normalized item results in a complex set of updates to indexes, in which there are a lot of data inter-dependencies (i.e. the Cassandra equivalent of complex SQL queries). Deletes are especially tricky. How about using two Cassandra keyspaces, one live and one off-line, updating the off-line keyspace, and then making it live? This would allow me to start fresh with each update - easier than an incremental update. What tools do I have to help me with this? Any other thoughts? All suggestions are welcome!
Staging repositories updates
Just to get everyone updated on where we are on the staging repositories.. Last Saturday, we reviewed and revised the patch that was submitted for MRM-980. The patch contains the new module for the repo merge, with the class for getting the artifacts in the source repo implemented. One of the things that was encountered in using the new repository API was having to do nested loops through the namespace, project & version then through a list of ArtifactMetadata in order to get the artifacts. (Brett, is there an easier way to do this?) The next item that needs to be done now is to implement the repository merge API. As a starting point, here are the test cases we've identified that need to be covered:
- no artifacts with the same groupId & artifactId exist in the target repo
- artifacts with the same groupId & artifactId (but with a different version) exist in the target repo
- artifacts with the same version exist in the target repo
- repositories have different layouts (ex. source repo is legacy type, target repo is default type)
- source repo is empty
- target repo is empty
Thanks, Deng
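To make the listed test cases concrete, here is a minimal Python sketch of how a merge could sort source artifacts into those buckets. It is not the repository merge API from the patch; the `Artifact` tuple, the `plan_merge` function and the bucket names are hypothetical, and the different-layouts case is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for an artifact's coordinates.
Artifact = namedtuple("Artifact", "group_id artifact_id version")


def plan_merge(source, target):
    """Classify each source artifact against the contents of the target repo."""
    existing = set(target)
    existing_ga = {(a.group_id, a.artifact_id) for a in target}
    plan = {"new": [], "new_version": [], "conflict": []}
    for artifact in source:
        if artifact in existing:
            # Same groupId, artifactId and version already in the target.
            plan["conflict"].append(artifact)
        elif (artifact.group_id, artifact.artifact_id) in existing_ga:
            # Same groupId & artifactId, but a different version.
            plan["new_version"].append(artifact)
        else:
            # Nothing with this groupId & artifactId in the target.
            plan["new"].append(artifact)
    return plan


target = [Artifact("org.example", "core", "1.0")]
source = [
    Artifact("org.example", "core", "1.0"),
    Artifact("org.example", "core", "1.1"),
    Artifact("org.example", "util", "1.0"),
]
plan = plan_merge(source, target)
```

An empty source yields three empty buckets, and an empty target puts every source artifact under "new", which covers the last two cases in the list.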
Latest OpenJPA trunk updates
As of r966020, the OpenJPA trunk (2.1.0-SNAPSHOT) now has the following long sought after improvements -
OPENJPA-1732 LogFactory adapter for SLF4J
Now, the SLF4J API can be used by setting openjpa.Log=slf4j and including the required slf4j-api and backend adapter on the classpath.
OPENJPA-1735 Mark commons-logging as provided in the build to remove transient maven dependency
Now, users of openjpa-2.1.0-SNAPSHOT.jar no longer need to include a dependency exclusion for commons-logging.
Enjoy! Donald
Routing new data from a DB query, updates from JMS
A quick rundown of our situation...Our system operates in a very dynamic environment where accessing the data to route will have to be done in two 'phases':
1. Perform an initial query to a database(s) and route any new records
2. Listen for updates pushed to a JMS Queue and route
I have the book and have been reading as much as I can but I can't figure out how the functionality of querying for new data should best be handled. One thought is to have a component (outside of Camel) that queries for data and pushes to a JMS Queue...at that point Camel could take it from there (using the JMS Queue as a 'from' in a route). My other thought is to integrate this query logic in a Camel route but I get really lost when I think of the best way to do it. Maybe I could use a JPA component to get new data and at the same time have a route that pulls from a JMS Queue? It seems that Camel is exactly what we need for the routing/filtering capabilities but I'm having serious problems in figuring out how to get the data to the Camel infrastructure. Any help/advice would be greatly appreciated!
Geronimo dependency updates, round 2.
I've finally managed to get a pass through with looking at additional packages we might want to update for Geronimo 3.0. This pass focused on the components we create in the bundles geronimo group. For these updates, we'll need to release new component bundles, so this should probably be done soon. Here are the updates that appear to work.
Derby 10.5.3.0_1 -> 10.6.1.0
aspectjrt 1.6.2 -> 1.6.8
aspectjweaver 1.6.2 -> 1.6.8
woodstox 3.2.9 -> 4.0.6
sxc-jaxb 0.7.2 -> 0.7.3
sxc-runtime 0.7.2 -> 0.7.3
Mostly minor revision updates. The biggest updates are derby and woodstox. Geronimo builds fine with these changes and the server launches ok. If there are no objections, I'll update the bundles trunk for these package versions and start the process of rolling out new bundle releases for these.
Rick
Re: [PATCH/puppet 0/3] Updates to Red Hat config files
Todd Zullinger wrote:
These are a few minor updates to the conf/redhat files for 2.6.0. The last patch is something I've submitted previously but hasn't been included yet.
Todd Zullinger (3):
conf/redhat: Rebase rundir-perms patch
conf/redhat: Update conf/init files for single binary
conf/redhat: Consistently pass pidfile option to daemon, killproc, and status
conf/redhat/client.init | 19 +++++++++++-------
conf/redhat/puppet.conf | 2 +-
conf/redhat/rundir-perms.patch | 26 +++++++++++++
[PATCH/puppet 0/3] Updates to Red Hat config files
These are a few minor updates to the conf/redhat files for 2.6.0. The
last patch is something I've submitted previously but hasn't been
included yet.
Todd Zullinger (3):
conf/redhat: Rebase rundir-perms patch
conf/redhat: Update conf/init files for single binary
conf/redhat: Consistently pass pidfile option to daemon, killproc,
and status
conf/redhat/client.init | 19 +++++++++++--------
conf/redhat/puppet.conf | 2 +-
conf/redhat/rundir-perms.patch | 26 +++++++++++++
Updated: (AMQ-2281) activemq-flow: Updates to webgen architecture page.
[
https://issues.apache.org/activemq/br...nels:all-tabpanel
]
Rob Davies updated AMQ-2281: