Sep 19, 2010
Dormando
Similar Threads
Re: Does anyone actively use rebalance?
Rebalance has been immensely useful to us. When adding new storage nodes, we are able to mitigate the damage from HD failures and add nodes more sparingly, and it generally balances out load across storage nodes. (12 storage nodes, 6 drives each, if it helps.)

On Wed, Jul 14, 2010 at 3:12 AM, dormando <dorm### @rydia.net> wrote:
> Hey,
>
> Are there any of you out there who actively use the existing "rebalance"
> feature and have measurable benefits from it? Please confirm that you
> aren't just running it because it felt like a good idea, and that you
> actually get results from it.
>
> I have a larger plan for rewriting rebalance by wiring it over the new
> drain code, but I can also just get the new drain code out very quickly,
> which will fix many problems for many people. However, in the process I
> might disable/destroy the existing rebalance code, and it'll stay that
> way until we can finish writing the rebalance stuff on top of it.
>
> If there are enough complainers I'll try not to break the old code, or
> just wait until I can replace all of it at once...
>
> Thanks,
> -Dormando
Rebalance stuck?
Hi,
I have a problem with enable_rebalance. It seems that after a while it
stopped working. I read in Store.pm that List::Util::shuffle() is not
really random; is this still true?
I've also set up a test environment to analyse the rebalance problems, and
after three days I'm still not sure why it also seems to get stuck with
only 14 files. The current device usage situation is as follows:
Checking devices...
  host device         size(G)    used(G)    free(G)   use%   ob state   I/O%
  ---- ------------ ---------- ---------- ---------- ------ ---------- -----
  [ 1] dev1              0.176      0.055      0.121 31.01%  writeable   0.0
  [ 1] dev2              0.176      0.074      0.102 42.16%  writeable   0.0
  [ 2] dev4              0.176      0.128      0.048 72.51%  writeable   0.0
  [ 3] dev3              0.176      0.123      0.053 70.03%  writeable   0.0
  ---- ------------ ---------- ---------- ---------- ------
             total:      0.703      0.379      0.324 53.93%
where the devices are spread across the three hosts like this:
virtualmedia1 [1]: alive
                 used(G)  free(G)  total(G)
  dev1: alive      0.054    0.122     0.176
  dev2: alive      0.073    0.103     0.176
virtualmedia2 [2]: alive
                 used(G)  free(G)  total(G)
  dev4: alive      0.127    0.049     0.176
virtualmedia3 [3]: alive
                 used(G)  free(G)  total(G)
  dev3: alive      0.123    0.053     0.176
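The imbalance is easier to see when the device figures above are rolled up per host. A minimal Python sketch, using only the numbers from the "Checking devices..." output; the `Device` tuple and `usage_by_host` helper are illustrative names, not part of MogileFS or mogadm:

```python
from collections import defaultdict
from typing import NamedTuple


class Device(NamedTuple):
    host: int
    name: str
    size_g: float
    used_g: float


# Figures copied from the "Checking devices..." output above.
DEVICES = [
    Device(1, "dev1", 0.176, 0.055),
    Device(1, "dev2", 0.176, 0.074),
    Device(2, "dev4", 0.176, 0.128),
    Device(3, "dev3", 0.176, 0.123),
]


def usage_by_host(devices):
    """Return {host_id: percent used}, summed over that host's devices."""
    size = defaultdict(float)
    used = defaultdict(float)
    for dev in devices:
        size[dev.host] += dev.size_g
        used[dev.host] += dev.used_g
    return {host: 100.0 * used[host] / size[host] for host in size}


if __name__ == "__main__":
    for host, pct in sorted(usage_by_host(DEVICES).items()):
        print(f"host {host}: {pct:.1f}% used")
```

On these figures host 1 sits at roughly 37% used while hosts 2 and 3 are at roughly 73% and 70%, which is the gap rebalance would be expected to narrow.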
I've set enable_rebalance=1 and it never gets reset. For all rebalance
attempts it says:
Rebalance for DevFID[d=3;f=36]
(http://192.168.210.1:7500/dev3/0/000/000/0000000036.fid) failed: no
suitable destination devices available
and it seems to iterate over all fids endlessly.
Could someone possibly shed some light on this? In our production
environment we have a new server on which each device is only 9% full,
while the other two are around 80% full, and rebalance doesn't work there
either. We really need this because we pay for bandwidth per server (if we
consume too much), so we need to get this balanced.
thanks in advance,
Martijn
source socket, rebalance issues
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm running mogilefs 2.30 with 14 hosts with 4 devices each. When I !watch on one of my trackers, I see lots (several to many a minute) of messages like these:

:: [replicate(9979)] Unable to create source socket to 10.2.128.90:7500 for /dev90128/0/670/784/0670784277.fid
:: [replicate(9979)] Failed copying fid 670784277 from devid 90128 to devid 96208 (error type: src_error)
:: [replicate(9979)] copy_error: error copying fid 670784277 from devid 90128 during replication
:: [replicate(9977)] Unable to create source socket to 10.2.131.210:7500 for /dev210321/0/670/783/0670783216.fid
:: [replicate(9977)] Failed copying fid 670783216 from devid 210321 to devid 96208 (error type: src_error)

I ran a fsck a while ago (now long completed, according to status) and occasionally see lines like this:

:: [fsck(9970)] node 10.2.128.90 seems to be down in get_file_size
:: [fsck(9970)] Connectivity problem reaching device 90228 on host 10.2.128.90

Very rarely do I actually see real monitor timeouts. mogadm check shows the cluster is fairly bored, with not much IO on the hosts. The DB machine isn't overloaded, either. I'm using the zonelocal and network plugins.

I'm also noticing that I have a significant number of files that are way over-replicated. My replication policy has a max of 4 for any class; however, out of my 212Mn files in mogilefs, about 2.5Mn have 10 or more copies. Many millions more are replicated 6 times or more.

I added some new nodes and ran a rebalance. After only a couple of percent it would stop. So I start it again, but it stops after a couple more percent, repeat.

Any thoughts appreciated.
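One way to check whether the source-socket failures cluster on a few hosts or devices is to tally them from saved !watch output. A minimal Python sketch, assuming the lines have been captured to a list of strings; the regular expression is written against the three sample lines quoted in this message only, and `count_source_errors` is an illustrative name, not a MogileFS tool:

```python
import re
from collections import Counter

# Matches the replicate worker's "Unable to create source socket" lines
# as they appear in the message above.
SOCKET_ERROR = re.compile(
    r"Unable to create source socket to (?P<host>[\d.]+):(?P<port>\d+) "
    r"for /dev(?P<devid>\d+)/"
)


def count_source_errors(lines):
    """Count source-socket failures per (host, devid) pair."""
    counts = Counter()
    for line in lines:
        match = SOCKET_ERROR.search(line)
        if match:
            counts[(match.group("host"), int(match.group("devid")))] += 1
    return counts


# Sample lines copied from the message above.
SAMPLE = [
    ":: [replicate(9979)] Unable to create source socket to 10.2.128.90:7500 "
    "for /dev90128/0/670/784/0670784277.fid",
    ":: [replicate(9979)] Failed copying fid 670784277 from devid 90128 "
    "to devid 96208 (error type: src_error)",
    ":: [replicate(9977)] Unable to create source socket to 10.2.131.210:7500 "
    "for /dev210321/0/670/783/0670783216.fid",
]

if __name__ == "__main__":
    for (host, devid), n in count_source_errors(SAMPLE).most_common():
        print(f"{host} dev{devid}: {n} failure(s)")
```

If a handful of hosts account for most failures, that points at those mogstored instances; an even spread would point more toward the tracker side or the network.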
Here's some more info:

!stats
uptime 8367694
pending_queries 0
processing_queries 0
bored_queryworkers 10
queries 2648773
work_queue_for_delete 70
work_queue_for_fsck 150
work_queue_for_replicate 10

!jobs
delete count 1
delete desired 1
delete pids 21633
fsck count 1
fsck desired 1
fsck pids 9970
job_master count 1
job_master desired 1
job_master pids 9971
monitor count 1
monitor desired 1
monitor pids 9986
queryworker count 10
queryworker desired 10
queryworker pids 533 1976 2957 12534 22680 24926 27103 27710 28132 29976
reaper count 1
reaper desired 1
reaper pids 22503
replicate count 5
replicate desired 5
replicate pids 9965 10016 14167 14317 31029

mogilefsd.conf:
db_dsn = DBI:mysql:blah:blah
local_network = 10.0.128.0/22
db_user = ...
db_pass = ...
listen = 0.0.0.0:7001
conf_port = 7001
listener_jobs = 10
delete_jobs = 1
replicate_jobs = 5
mog_root = /var/lib/mogdata
reaper_jobs = 1
plugins = ZoneLocal

mogstored.conf:
httplisten=0.0.0.0:7500
mgmtlisten=0.0.0.0:7501
docroot=/var/lib/mogdata

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iD8DBQFMbYNa+Idx1gGGQ1YRAgyEAJ9Rxjbo9ajioA3cb8iRJWJLpG19egCfXzxA
ot3kTHy2+5k5ZRmxpvWD1tw=
=e/jJ
-----END PGP SIGNATURE-----
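The !jobs output above can be checked mechanically for workers that are missing or not at their desired count. A minimal Python sketch, assuming the output is available as whitespace-separated text in the "job field value..." shape shown in this message; `parse_jobs` and `unhealthy` are illustrative helpers, not part of MogileFS:

```python
def parse_jobs(text):
    """Parse '!jobs' text into {job: {'count': int, 'desired': int, 'pids': [int]}}."""
    jobs = {}
    tokens = text.split()
    i = 0
    while i < len(tokens):
        job, field = tokens[i], tokens[i + 1]
        i += 2
        values = []
        # Everything numeric up to the next job name belongs to this field.
        while i < len(tokens) and tokens[i].isdigit():
            values.append(int(tokens[i]))
            i += 1
        entry = jobs.setdefault(job, {})
        entry[field] = values if field == "pids" else values[0]
    return jobs


def unhealthy(jobs):
    """Return job names whose running count or pid list differs from 'desired'."""
    return sorted(
        name
        for name, info in jobs.items()
        if info.get("count") != info.get("desired")
        or len(info.get("pids", [])) != info.get("count")
    )


# Copied from the '!jobs' output in the message above.
JOBS = """
delete count 1 delete desired 1 delete pids 21633
fsck count 1 fsck desired 1 fsck pids 9970
job_master count 1 job_master desired 1 job_master pids 9971
monitor count 1 monitor desired 1 monitor pids 9986
queryworker count 10 queryworker desired 10
queryworker pids 533 1976 2957 12534 22680 24926 27103 27710 28132 29976
reaper count 1 reaper desired 1 reaper pids 22503
replicate count 5 replicate desired 5 replicate pids 9965 10016 14167 14317 31029
"""

if __name__ == "__main__":
    print("jobs not at desired count:", unhealthy(parse_jobs(JOBS)) or "none")
```

For the output posted here every job is at its desired count, so worker counts alone do not explain the replication errors.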
Re: source socket, rebalance issues
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/23/10 12:24 PM, dormando wrote:
>> It looks like the maxconns default is 10K and I don't set it explicitly.
>> I assume that is connections that would show up in netstat? I'm seeing
>> less than 1K active connections there.
>>
>> I'm even seeing this occasionally:
>>
>> ro### @a0100:/etc/mogilefs# mogadm check
>> Checking trackers...
>> 127.0.0.1:7001 ... REQUEST FAILURE (is the tracker up?)
>> Unable to retrieve host information from tracker(s).
>>
> Is it hovering around 1k as in suspiciously close to 1k? or well below
> 1k? There's a chance that it would have failed to increase the maxconns
> if not started from root or from a user with adjusted maxconns.

Actually, I just changed my methodology slightly to weed out the TIME_WAITS and other stuff and now run:

netstat -anp | grep mogstored | wc -l

and now see only 300-400 sockets open and still see timeouts when I !watch.

> when you run mogadm check, is it failing immediately or does it feel
> like a timeout?

When it fails, which is probably only 5% of the time, it seems to do so pretty quickly, which could be in about 2s. When I normally do a "mogadm check" there's a bit of a pause before it returns "ok" and when it fails the pause length is about the same amount of time.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iD8DBQFMcwP6+Idx1gGGQ1YRAmnjAJ4qL+0yWP/gPcxP+rJPI4rlRCWIbACfQuLr
TfXVy4kxju3kvaatxo/cXrU=
=ohD5
-----END PGP SIGNATURE-----
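The socket count in this message comes from a single `wc -l`, which hides how the connections break down by TCP state. A minimal Python sketch that tallies states from saved `netstat -anp` text, assuming the usual Linux column layout (protocol, queues, local address, foreign address, state, PID/program); the sample rows are made up for illustration and `count_by_state` is a hypothetical helper:

```python
from collections import Counter


def count_by_state(netstat_lines, program=None):
    """Tally TCP sockets by state, optionally only those owned by `program`."""
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers, UDP rows, and anything too short to carry a state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        state = fields[5]
        owner = fields[6] if len(fields) > 6 else "-"
        if program and program not in owner:
            continue
        counts[state] += 1
    return counts


# Made-up rows in `netstat -anp` layout, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    State       PID/Program name
tcp        0      0 0.0.0.0:7500       0.0.0.0:*          LISTEN      1234/mogstored
tcp        0      0 10.0.0.1:7500      10.0.0.2:40001     ESTABLISHED 1234/mogstored
tcp        0      0 10.0.0.1:7500      10.0.0.3:40002     TIME_WAIT   -
tcp        0      0 10.0.0.1:7001      10.0.0.4:40003     ESTABLISHED 999/mogilefsd
""".splitlines()

if __name__ == "__main__":
    print(dict(count_by_state(SAMPLE, program="mogstored")))
```

Filtering on the program name drops TIME_WAIT rows as a side effect, since those sockets no longer have an owning process; the `grep mogstored` in the command above behaves the same way.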