Best unofficial Apache Server developers community
Aug 5, 2010
Oleg Anastasjev
Similar Threads
Cassandra Scaling Questions
Hi All,
I've got a couple of questions that have come up about how Cassandra works and what others are seeing in their environments. Here goes:
1.) What have you found to be the best ratio of Cassandra row cache to memory left free on the system for the filesystem cache? Are you tuning it like an RDBMS, so Cassandra has the vast majority of the RAM in the system, or are you letting the filesystem cache do some of the work?
2.) Is the Cassandra cache write-through (i.e. are new records held in the row cache as they're written to disk)?
3.) When using the random partitioner, how much difference should be expected (or has been observed) between nodes? 2%? 10%?
3.5) Can a load balance be expected to bring the data distribution pretty close to even among all nodes in the ring? Is the correct process for a loadbalance to run the loadbalance operation on each node in the ring?
Thanks! I'm curious to hear what others have observed.
-Aaron
Some questions about using Cassandra
We are currently looking at a distributed database option and so far Cassandra ticks all the boxes. However, I still have some questions.
Is there any need for archiving of Cassandra data, and what backup options are available? As it is a no-data-loss system, I'm guessing archiving is not exactly relevant.
Is there any concept of listeners, such that when data is added to Cassandra we can fire off another process to do something with that data? E.g. create a copy in a secondary database for Business Intelligence reports, or send the data to an LDAP server?
Anthony Ikeda
Java Analyst/Programmer
Cardlink Services Limited
Cassandra questions
Hi,
Being fairly new to Cassandra I have a couple of questions:
1) Is there a way to remove multiple keys/rows in one operation (batch) or
must keys be removed one by one?
2) I see API references to version 0.7, but I couldn't find an alpha or
beta anywhere. Does it exist already, and if so, where can I get it? Or
else, when is it planned to be public/released?
Thanks in advance, Hugo.
more questions on Cassandra ACID properties
Hi, I have more questions on Cassandra ACID properties.
Say I have a row that already has 3 columns: colA, colB and colC. If two *concurrent* clients each perform a different insert(...) into the same row, one for colD and the other for colE, then Cassandra would guarantee that both columns are added to the same row. Is that correct?
That is, insert(...) of a column does NOT involve reading and rewriting the other existing columns of the same row? In other words, we do not face the following situation:
client X: read colA, colB and colC; then write: colA, colB, colC and colD
client Y: read colA, colB and colC; then write: colA, colB, colC and colE
BTW, it seems to me that the insert() API as described on the wiki page http://wiki.apache.org/cassandra/API should also handle updating an existing column by replacing the existing column value. If that is the case, I guess we should change the wording from "insert" to "insert or update" in the wiki doc. Ideally, the insert(...) API operation would also be renamed to update_or_insert(...).
Looking forward to replies that may confirm my understanding. Thanks!
Regards,
Alex Yiu
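The behaviour asked about above can be pictured with a toy model. This is only a sketch under stated assumptions, not Cassandra code: a row is modelled as a dict of column name to (value, timestamp), the `insert` helper is hypothetical, and conflicts are assumed to be resolved per column by the newest timestamp. Under those assumptions, two writers that touch different columns never overwrite each other, and the same call serves as "insert or update".

```python
# Toy model only: a "row" is a dict of column name -> (value, timestamp).
# This is not Cassandra code; it mimics per-column, timestamp-based writes.

def insert(row, column, value, timestamp):
    """Write one column; the write wins only if its timestamp is newer."""
    current = row.get(column)
    if current is None or timestamp > current[1]:
        row[column] = (value, timestamp)
    return row


row = {"colA": ("a", 1), "colB": ("b", 1), "colC": ("c", 1)}

# Two clients insert different columns without reading the row first.
insert(row, "colD", "d", 2)   # client X
insert(row, "colE", "e", 2)   # client Y

# The same call doubles as an update when the column already exists.
insert(row, "colA", "a2", 3)

print(sorted(row))
print(row["colA"])
```

Because neither client rewrites colA, colB or colC, the read-then-rewrite race described in the post cannot occur in this model.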
backlogging and scaling
I'm curious if there are any efforts ongoing to amortize the background tasks in Cassandra over time? Specifically, the cost of compaction, AE, rebalancing, etc. seems to be a problem for some users when they are expecting more steady-state performance. While this may sometimes be the result of a cluster which is at its marginal capacity, users are still surprised by the performance hit or downtime required for common operations. Making the cluster able to make finer-grained and measurable progress towards the ideal state may help other users, too.
Is there a feasible design or enhancement which may allow these types of background tasks to be broken apart into smaller pieces without compromising overall consistency? It would be excellent if the user could see the overall state of the storage cluster and choose the proportion of resources allocated to recovering backlog vs servicing clients, etc. Even better would be some basic heuristics which worked well for the general case, so that users would only have to see the scheduling plan in special situations.
How would you go about doing that? Does the current architecture lend itself to this type of optimization, or otherwise?
questions about cache?
Hi, I've seen people ask before about the entire cache being flushed for a single object commit, and that there is no evict. My question is why? Is there some special reason why the whole cache has to be flushed and we can't just evict a single object?
Secondly, under what circumstances does a SQL query go to the cache, and when does it bypass the cache and go straight to the database? I ask this because we have lots of queries like "select * from myTable where field1 = someValue and field2 = anotherValue". Is this sort of query cacheable, or only select-by-primary-key queries?
Raj.
ZK recovery questions
Hi, I've been reading the docs and trying out some basic Zookeeper examples.
I have a few simple questions related to recovery.
It would be good to have questions like these on the Wiki/docs to avoid noobs like me asking the same thing over and over.
- If 1 out of 3 servers crashes and the log files are unrecoverable, how do we provision a replacement server?
- If the server log is recoverable but provisioning takes a long time, then what happens if the old log file is far behind the current state? The docs say that recovery is based on fuzzy checkpointing and snapshots, but I wasn't clear as to how long "catching up" would take.
- What happens in the client-side code if a server quorum is lost? Does the ZK service freeze or continue to service just reads?
  - If there was a temporary glitch (network or GC) and the replica to which the client is connected breaks away from the quorum, does the client get notified? Does it stop processing client requests? Does it rejoin the cluster without manual intervention?
  - Now if even the client cannot connect to other servers (split brain) ... well, I suppose this question is moot.
- Do the servers really have to run with file-based persistence? I saw that someone wanted this in-memory mode for unit testing (ZK 694, https://issues.apache.org/jira/browse/ZOOKEEPER-694), but there are cases where only a transient ZK service is needed. Most enterprise systems have replicated databases anyway, so the fear of data loss is minimal. If ZK logs are the only means of recovery, then this might be harder to implement.
- A client example with full-fledged error handling would be very useful for starters. I'm not sure if http://github.com/sgroschupf/zkclient and http://code.google.com/p/cages/ have everything, but they do look promising. Plain ZK API is a bit overwhelming :)
Thanks,
Ashwin.
New To Puppet - Two Questions
New to Puppet, heard about it for the first time at OSCON. Two quick questions:
1. Is there a web interface? This is really key to our company, since we have some dev/ops people but also some customer service people (not command-line savvy) who need to do things.
2. Does it just manage server configuration, or could I write custom extensions or modules to do things like list all of our customers who have accounts on a server, add/remove customers from our database, enable/disable logins to our web app, etc.? These would be more like "business operations", not "IT/server management operations".
-K.R.
Questions in ForEachSupport
In bug 45197 https://issues.apache.org/bugzilla/show_bug.cgi?id=45197 Henri wrote:
* Look at questions in the length method in ForEachSupport.java
* Look at commented out code in prepare in ForEachSupport.java
By "in the length method" did you mean line #241, where the length gets set to 0? That's different from the non-deferred case, which throws an exception on line #402. I'd suggest we do the same for both (throwing the exception).
The commented code in prepare that sets the itemsName EL variable is redundant, as that is also done in LoopTagSupport line #532 (assuming the Apache implementation of the JSTL API). Looks like this can just be deleted.
If that sounds good, I'll contribute a patch for those changes.
Cheers
Jeremy
newbie questions
hi
I have a few questions on hive and its use case.
1. hive-on-hadoop-20 accessing/processing data stored on hadoop-18-dfs
The actual files are on hadoop-18 dfs, and I would create an external
table on hive-on-hadoop-20 with files pointing to hadoop-18-dfs.
I don't think this is possible, given the hadoop version incompatibility,
but it never hurts to ask.
2. We download tons of urls and massage the data. The massaging goes through
various stages. We would like to monitor these stages,
so I was thinking of a schema like the following:
One table:
url as STRING,
massage_step1 is a STRUCT
massage_step2 is a STRUCT
.
.
feature_set is ARRAY<STRING>
The STRUCT can have arrays of longs, ids, timestamps,
success/failure, reasons
Assuming that I am on the correct track here,
will I be able to run queries like:
q1. where massage_step1.reasons like '%Failed on fetching%'
q2. where feature_set like 'shopping'
(feature set is an array, I think I have to implement a UDFLike for
Arrays)
q3. where massage_step2.ids < 10K
q4. where count(*) as count where timestamps < 'SOME_DATE' group
by massage_step1.success = true
In short, can I query on data in the complex types like Struct, Array,
Map etc.?
3. Some queries will require data from 2 or more structs and some won't.
In the above example, I am keeping it in one table (external table). The
other option is multiple tables: one for each massage_step.
In the case of multiple tables, I will have to fire JOIN queries, and in
the case of a single table, I will filter data using a where clause.
Which is more expensive: JOIN queries or filtering data using a where clause?
Feedback is greatly appreciated
Thanks,
Sagar
Questions about WS-RM behavior in CXF
Hi CXF-experts,
Some days ago, I asked a question about the expected behavior of the WS-RM client. I am confused that the network exception is passed back to the client application while the RetransmissionInterceptor on the client keeps retransmitting the message. I expected that the exception would be handled by one of the RM interceptors, so that the client would not get the network exception directly. (my original question http://mail-archives.apache.org/mod_m### @mail.gmail.com>)
Looking further into the code today, I see that the exception is passed back to the client because it is set both in the message's content and in its exchange map in the PhaseInterceptorChain's doIntercept method below:
    message.setContent(Exception.class, ex);
    boolean isOneWay = false;
    if (message.getExchange() != null) {
        message.getExchange().put(Exception.class, ex);
        isOneWay = message.getExchange().isOneWay();
    }
    unwind(message);
Later, the existence of the exception object is checked by the ClientImpl class in its processResult method and, if found, the exception is thrown to the client.
Here I have two questions.
1. Why does my oneway method, initiated by the port.greetMeOneWay method of the demo sample, invoke the ClientImpl code that calls the processResult method? Isn't the processResult method only needed for request-response processing?
2. If the processResult method is called for oneway calls, should the RetransmissionInterceptor's handleFault method remove the exception object from the message and its exchange map? If this is done, the exception would not be removed during the unwind method of the doIntercept method above and the original exception would not be forwarded to the client application. This seems to work for me.
I would appreciate it if someone can answer my questions. Thanks.
Regards, aki
Questions about data modeling
I'm currently trying to wrap my head around Cassandra, which is definitely not easy for a mind deeply entrenched in SQL :) I see how blogs/tweets etc. can be modeled in Cassandra. However, I have a slightly different problem. Let's say we let the user see a random item (article/picture/recipe/you-name-it) and vote for it. We should show the most popular items, the last articles/pictures the user has voted for, etc.
1. How can I show the most popular items?
2. How can I present the user with a random item he hasn't seen yet?
For the first question I figured I could have a <ColumnFamily CompareWith="LongType" Name="Rating"/> and store lists of items per each rating, updating them as necessary. I can't figure out a way to correctly implement question number 2.
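The "lists of items per rating, updated as necessary" idea for question 1 can be sketched in plain Python. This is an in-memory illustration only, not a Cassandra schema or client call: the `RatingIndex` class and its method names are hypothetical, each vote is assumed to raise an item's rating by one, and the sorted rating keys stand in for the LongType ordering of the Rating column family.

```python
from collections import defaultdict


class RatingIndex:
    """In-memory stand-in for a 'Rating' index: rating -> set of items."""

    def __init__(self):
        self.rating_of = {}                 # item -> current rating
        self.items_at = defaultdict(set)    # rating -> items with that rating

    def vote(self, item):
        """Move the item from its old rating bucket to the next one up."""
        old = self.rating_of.get(item, 0)
        if old:
            self.items_at[old].discard(item)
            if not self.items_at[old]:
                del self.items_at[old]      # drop empty buckets
        self.rating_of[item] = old + 1
        self.items_at[old + 1].add(item)

    def most_popular(self, n):
        """Walk the buckets from highest rating down, collecting n items."""
        result = []
        for rating in sorted(self.items_at, reverse=True):
            for item in sorted(self.items_at[rating]):
                result.append((item, rating))
                if len(result) == n:
                    return result
        return result


index = RatingIndex()
for voted in ["pie", "pie", "cat", "pie", "cat", "dog"]:
    index.vote(voted)
print(index.most_popular(2))
```

The cost of this layout is that every vote is a delete from one bucket plus an insert into another, which is the "updating them as necessary" step mentioned in the post.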
XMPP Component Questions
Greetings,
I'll explain my problem briefly. I'm currently working on a project that
uses Apache ServiceMix and Apache Camel for routing messages. I'm now
working on the XMPP-related parts and I have some problems (please forgive
my slight incompetence).
1) Configuring the route in the usual way, I am unable to specify
different destinations. In my project I should be able to retrieve
messages from an XMPP port, and that works well with the URI
from("xmpp://XMPPConsumer?password=xxxx"). Note that I'll use two
components (a consumer named systemrx and a producer named systemtx)
because I have a choice about this (it is an academic project).
The problem comes when I want to send the message to the correct address,
because the route is built when the Camel context starts, for example by
setting a URI like this:
//.to("xmpp://XMPPProducer/SomeReceiver?password=xxxx");
Specifying the receiver this way works well, but I want to change the receiver.
I see two possible solutions:
1) Create a route each time I receive a message (I receive the message,
extract the destination from the message body, instantiate the component and
then send).
2) Have a dynamic route like this (is this possible?):
from("xmpp://XMPPConsumer?password=xxxx").process(SomeProcess).toF("xmpp://XMPPConsumer/%s?password=xxxx",destination);
(destination would be set in SomeProcess)
Reading the manual, I discovered that the XMPP component supports headers only
for In messages. I suppose that setting a "participant" header on the
outgoing message could do the trick, but I don't know if this is possible.
Otherwise, are there any other possible solutions that I don't know of?
Thanks for any advice; any help would be greatly appreciated.
Alessandro
A couple questions on the RPC spec.
This will be my last question for at least the next day or two. :) I just
want to double check my interpretation of the message framing. Assuming
that the client and server have already gone through their handshake, does this
sound right for a request/response on the wire?
Request:
    4 byte length
    map of bytes - request metadata
    4 byte length
    string - message name
    for each parameter {
        4 byte length
        parameter bytes
    }
    null-terminate
Response:
    4 byte length
    map of bytes - response metadata
    1 byte - error flag (0 or 1)
    if (error flag is false) {
        4 byte length
        bytes containing response
    } else {
        4 byte length
        bytes containing error
    }
SocketTransceiver seems to support this, but it also has this comment
stating it's not standard.
/** A socket-based {@link Transceiver} implementation. This uses a simple,
 * non-standard wire protocol and is not intended for production services.
 */
A *call* consists of a request message paired with its resulting response or
error message. Requests and responses contain extensible metadata, and both
kinds of messages are framed as described above.
The format of a call request is:
- *request metadata*, a map with values of type bytes
- the *message name*, an Avro string, followed by
- the message *parameters*. Parameters are serialized according to the
  message's request declaration.
The format of a call response is:
- *response metadata*, a map with values of type bytes
- a one-byte *error flag* boolean, followed by either:
  - if the error flag is false, the message *response*, serialized per
    the message's response schema.
  - if the error flag is true, the *error*, serialized per the message's
    error union schema.
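The length-prefixed layout in the question can be sketched as follows. This is a hedged illustration, not Avro's implementation: the `frame` and `unframe` helpers are hypothetical names, the payloads are treated as opaque bytes (no Avro encoding of the metadata map or message name is attempted), and two details are assumptions the post does not state: that each length is a four-byte big-endian integer, and that the "null-terminate" step is a zero-length buffer.

```python
import struct


def frame(buffers):
    """Prefix each buffer with a 4-byte big-endian length, then terminate."""
    out = bytearray()
    for buf in buffers:
        if not buf:
            # An empty buffer would be indistinguishable from the terminator.
            raise ValueError("empty buffers cannot be framed")
        out += struct.pack(">I", len(buf)) + buf
    out += struct.pack(">I", 0)     # assumed terminator: zero-length buffer
    return bytes(out)


def unframe(data):
    """Split framed bytes back into buffers, stopping at the terminator."""
    buffers = []
    offset = 0
    while True:
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        if length == 0:
            return buffers
        if offset + length > len(data):
            raise ValueError("truncated frame")
        buffers.append(data[offset:offset + length])
        offset += length


wire = frame([b"meta", b"hello"])
print(wire)
print(unframe(wire))
```

Whether one buffer carries a single field (as in the question's layout) or a whole serialized message is exactly the point the poster is asking the list to confirm; the sketch works the same way in both readings.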
hbase evaluation questions
I am trying to evaluate hbase to be used as an analytical data store, and I have a few questions I have not been able to answer from the wiki or googling in general.
1) How can hbase be configured for a multi-tenancy model? What are the options to create a solid separation of data? In a relational database, schemas would provide this, and in cassandra the keyspace can provide the same. Of course we can add the tenancy key to the row key and create tenant-specific tables/column families, but that does not provide the same level of confidence of separation. We could also create separate clusters for each client, but that defeats part of the point of going to a distributed database cluster to improve overall throughput and utilization across all clients. We currently run single MySQL databases for each of our clients (1-3 TBs each).
2) I am trying to model data within hbase and I am unable to truly model it as a column-based data store due to the limitations of the API (hbase.thrift) in terms of getting back data for certain columns. I see information for defining a bloom filter, which I believe could help speed up the retrieval of certain columns within a large row, but the API does not seem to offer the ability to iterate through the columns. The API supports the ability to request a list of columns, but I have seen no way to scan columns for a given row key based on a start/stop column. This forces us to create a tall data model vs. a wide data model, which in the end we think will hurt performance as more rows will be required. The data model is a standard star schema in relational terms with a time dimension. Time is only down to the daily granularity and we would prefer to have this be part of the column key instead of the row key. In all the examples I have seen, time has always been added to the end of the row key to be accessed via row scans.
In Cassandra, for example, time is modeled as a super column or column composite index, and the API supports a range get against a set of columns within a single row.
Any advice or pointers would be greatly appreciated. Thanks in advance!
Wayne
a questions about calling opensaml lib
Hi, I am writing an Apache module mainly for validating SAML assertions. I have already posted signed assertion data to my Apache server, and my handler module receives the assertion data. But when I validate the assertion with the opensaml lib, a problem appears: I get "Opensaml Library initialization failed." and the program stays stuck at this statement all the time. Why does this happen?
Thanks for your consideration.
Regards.
Jia
Failover and replication questions
Hello, I am not sure what the failover functionality does in ActiveMQ. I know that if you have machine A and machine B and machine A dies, then anyone wanting to use the service will transparently fail over to using B, so there is no loss of service. But I don't know if the data held in the queues is replicated between A and B. Can anyone enlighten me please?
Regards,
Andrew Marlow
questions regarding bibliographic citations?
Hello: I am employing Apache Commons Math 2.2 in the course of my research, and I wonder if there is a format to cite my use of Apache Commons Math? If anyone has a bibtex entry for citing Apache Commons projects, that would be most helpful.
Tanim Islam
[scxml-js] a few questions on the data module
Hi, I'm currently adding support for the data module, and I have a quick question about the specification that I was hoping you could answer for me.
For the <data> element, the src attribute can reference a URI containing a legal data value. I think that what constitutes legal data values is defined in the profile. The specification of the ecmascript profile seems to imply that legal data values should be formatted as JSON. But the browser runtime also has good support for XML, and so I think it would be useful to be able to specify whether the data referenced at the URI should be handled as JSON or XML. The handleAs property of the Dojo toolkit's dojo.xhrGet API (http://api.dojotoolkit.org/jsdoc/dojo.xhrGet) is a good example of what I have in mind for this.
In the scxml specification, is there any way to specify the way a referenced URI should be handled? Please let me know what you think.
Thanks,
Jake
questions on documentation for configuring AJP connector
We are currently using:
- Tomcat 5.5.25
- JDK 1.5
- IIS 6
- Windows XP 64bit and 32bit machines
We are trying to upgrade to the latest connector. While going through the worker properties variables to set, we have a few questions regarding the following:
1) connection_pool_size
> Usually this is the same as the number of threads per web server process.
(cut-paste from the description for connection_pool_size)
I am not familiar with IIS, so how do you determine the above?
> You should measure how many connections you need during peak activity without performance problems, and then add some percentage depending on your growth rate.
How do you determine what is a good percentage? Also, does this property have any correlation with the MaxThreads attribute in the <Connector> tag of server.xml? How do you determine what value you should put for MaxThreads?
2) connection_pool_timeout
In server.xml, the default value if not specified explicitly is 60000 (60 secs). In our server.xml the AJP connector tag does not specify it, which means I need to set connection_pool_timeout in our worker.properties to 60? The documentation says the default for connection_pool_timeout is 0; shouldn't it be 60 if this has to be in sync with server.xml?
3) worker.loadbalancer.method
This is currently not set, but we are thinking of using B instead of the default R. What do you use in general? Is there a disadvantage to switching from Request to Busyness?
4) Questions on server.xml: maxSpareThreads, maxThreads, minSpareThreads
What are the criteria for selecting appropriate values? For production servers, how do you determine the values to set? Is there a correlation between the values above (maxSpareThreads, maxThreads, minSpareThreads)? For example, does maxSpareThreads have to be a certain % of maxThreads?
Thank you for reading the question.
Regards,
Rumpa Giri
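The arithmetic behind points 1 and 2 can be written out as a small sketch. Nothing here comes from the connector itself: the function names, the peak figure of 200 connections and the 25% margin are hypothetical, and the unit conversion assumes what the post's own numbers suggest, namely that connection_pool_timeout is counted in seconds while the server.xml timeout is counted in milliseconds.

```python
import math


def pool_size(peak_connections, growth_margin):
    """Measured peak connections plus a growth margin, rounded up."""
    return math.ceil(peak_connections * (1 + growth_margin))


def connector_timeout_ms(pool_timeout_seconds):
    """Convert a timeout in seconds to the matching value in milliseconds."""
    return pool_timeout_seconds * 1000


# Hypothetical figures: 200 connections at peak, 25% headroom for growth.
print(pool_size(200, 0.25))
# 60 seconds on the worker side lines up with 60000 ms on the Tomcat side.
print(connector_timeout_ms(60))
```

The margin itself is still a judgment call, which is what the post is asking about; the sketch only shows how the two settings relate once a margin has been chosen.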