[01:29:28] *** Quits: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com) (Ping timeout: 260 seconds)
[01:31:36] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[03:16:40] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[03:30:20] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[04:17:05] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[08:23:51] drv, I assume you meant to change all the unit test subdir names from _c or .c (like in nvme, etc)? Just making sure, a dir name with a .c at the end threw me for a second :)
[09:17:12] yes, I did the initial one that way and nobody commented on it, so I assumed it was OK
[09:26:58] sethhowe: any idea on this one? http://spdk.intel.com/public/spdk/builds/review/37e3f5b00a62d8178ce5cf424d8ffa346522bba6.1498147155/
[09:27:10] they are talking about it behind me right now
[09:27:32] i was monitoring it during the run, and there was no output in the vm results directories
[09:27:41] but it's there now
[09:32:07] those VMs are running on a machine in the other lab, so I think the sshfs connection probably went down temporarily
[09:32:48] jimharris: I pushed your DPDK patch, so you should be able to do a submodule update now
[09:33:03] ok
[09:40:33] pool is hung up again
[09:40:46] looks like all of the test systems are done
[09:40:51] but status still says incomplete
[09:46:33] sethhowe was scrolling around in the tmux session running the pool, so it was blocked
[09:47:32] jimharris: your submodule update has the wrong commit, I think - did you use the local commit before gerrit rebased it?
[09:47:47] My bad on that one. lesson learned.
[09:49:32] np
[09:49:40] test pool still looks like its stuck
[09:49:48] not starting the next patch in the queue
[09:49:58] drv: thanks for the heads-up - fixed
[09:50:59] It is. It looks like the submodule hash for dpdk on the current build doesn't exist. My error handling is not currently equipped to deal with that. It thinks it is a problem with the gerrit server and is trying to rerun the build over and over.
[09:54:27] I will add error handling for that specific case.
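For reference, a minimal sketch of repointing the dpdk submodule at the commit gerrit actually merged (rather than a local pre-rebase commit); the repo path and merged sha are placeholders, not taken from the log:

    cd spdk                                    # top of the SPDK tree (placeholder path)
    git submodule update --init dpdk           # make sure the submodule checkout exists
    cd dpdk
    git fetch origin                           # pick up the commit that gerrit actually merged
    git checkout <merged-commit-sha>           # the post-rebase sha, not the local pre-rebase one
    cd ..
    git add dpdk                               # record the new submodule pointer
    git commit -m "dpdk: point submodule at merged commit"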
[09:56:58] jimharris: your stub patch has a typo in the commit message :) ("sentinal")
[10:44:15] drv: thanks - i've found another issue i need to fix in that patch so will update the commit message while i'm at it
[10:45:25] just using the env layer, the thread was not bound to a core so the scheduler would keep that sleep(1) loop on an idle core
[10:46:19] sethhowe: thanks - i'll try not to break the test pool with bad submodule git commit ids :)
[12:06:38] drv, OK, just thought maybe my file system was corrupted or something there at first (or maybe I had taken a sleeping pill this morning instead of a vitamin) :)
[13:09:36] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[13:14:06] drv: i made a few cosmetic changes to the nbd patch
[13:14:28] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Ping timeout: 260 seconds)
[13:15:26] drv, sethhowe: noticing there seems to be quite a bit of difference lately between the max agent test time and the time reported by the pool
[13:15:40] 63 seconds: http://spdk.intel.com/public/spdk/builds/release/master/2462/
[13:16:02] 79 seconds: http://spdk.intel.com/public/spdk/builds/review/fcb514479d007d7d625cad6e24f9a374f134c6a0.1498160996/
[13:16:13] *** Joins: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com)
[13:16:14] we're thinking most of that is related to the sshfs copies, etc. right?
[13:16:40] it seems to vary wildly between test runs though
[13:16:46] this one was only 28 seconds: http://spdk.intel.com/public/spdk/builds/review/fce0e5a15599bcc43ec94183e2a94f87b13a7308.1498161860/
[13:17:51] drv: some of the failures on my stub series had the dpdk submodule change in it - i'm suspicious about switching the dpdk submodule for now
[13:22:31] Thanks Jim, yes, that's still the belief. I have been meaning to take a deeper look at this for a while. I have been prioritizing configuring the virtual machines, but let me take a deeper look at this today as soon as I add the ubuntu 17.04 machine to the pool.
[13:25:32] Also, did you all see my e-mail yesterday about the technical training? Did that idea seem good to all of you? I have gotten positive feedback from anu, peluse, and john. jimharris bwalker drv
[13:29:40] i may have to skip the technical training today
[13:44:13] jimharris: I was looking back at those two builds you referenced - those two occurred when I was testing a hunch about the timeout failure from the virtual machines this morning.
[13:45:32] ah - ok
[13:45:57] when the host machine is being used for something else (in the case of those two builds I was shell copying a disk image) it blocks the guest VMs from updating the sshfs drive.
[13:47:58] But I did take the liberty of checking the timing differences from the other 15 passing builds on the webpage and there was still a lot of variance (22s min value - 44s max value).
[13:49:09] I'll keep looking into the variances and see if I can pinpoint it. Thanks.
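The actual binding fix belongs in the app/env layer, but as a rough external illustration of the same idea, a process (and its sleep(1) loop) can be pinned with taskset so the scheduler stops parking it on an otherwise idle core; the binary name here is hypothetical:

    taskset -c 0 ./stub_app                    # launch the app pinned to core 0 (hypothetical binary)
    taskset -cp "$(pidof stub_app)"            # show which cores a running instance may use
    taskset -cp 0 "$(pidof stub_app)"          # re-pin a running instance to core 0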
[14:57:53] hmm, this last hang seems to be in the new test that does iscsi_tgt + nvme bdev connecting to nvmf_tgt
[14:57:59] via config file
[15:05:06] i don't think it's related to my patch
[15:05:20] same thing happened on a previous test run on my nbd patch (which definitely could not have caused that)
[15:05:32] http://spdk.intel.com/public/spdk/builds/review/2ba0f3a62856432399c25f9ec15a2ac8e7df15cb.1498163200/
[15:06:44] *** Quits: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com) (Ping timeout: 255 seconds)
[15:09:45] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[15:35:32] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[15:43:54] drv, bwalker: patch to revert the NVMe-oF conf file tests that are causing these test failures: https://review.gerrithub.io/#/c/366696/
[16:16:56] jimharris: I have the rocksdb plugin ported to 5.4.5 - how should I push that?
[16:17:10] I can't really do a patch because it's a rebase onto a different history in git
[16:17:29] I could push a patch that's a diff between 5.4.5 tip and what would be our new spdk code
[16:17:34] ugh
[16:17:53] I also confirmed it fixes the XFS thing
[16:17:55] I think you just do a git reset
[16:18:07] I'm on kernel 4.11 and doing lots of rocksdb runs using it, no prob
[16:18:18] I can just force push the spdk branch to be my new one
[16:18:22] without review
[16:18:30] I just didn't know if you wanted to review the changes I made
[16:18:43] I don't think I did very much at all actually
[16:18:55] maybe nothing now that I look at it
[16:19:09] is it basically 5.4.5 + our minor patches?
[16:19:12] yeah
[16:19:15] like before?
[16:19:19] I just rebased the spdk branch onto 5.4.5
[16:19:27] I squashed our branch as I rebased
[16:19:36] it was too hard to make the individual patches each pass
[16:19:41] excellent - i meant to bring that up last night
[16:19:51] no - squashing was the way to go
[16:20:00] so it's just a single patch on top of 5.4.5
[16:20:05] do you want me to push that as a review?
[16:20:10] how many changes to our .cc file?
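A sketch of the rebase-and-squash flow being discussed here, assuming a clone of the SPDK RocksDB fork with an "upstream" remote pointing at stock RocksDB (remote, branch, and tag names are assumptions, not taken from the log):

    git fetch upstream --tags                  # pick up the upstream v5.4.5 tag
    git checkout spdk                          # branch carrying the SPDK env plugin changes
    git rebase -i v5.4.5                       # replay onto 5.4.5, squashing our commits into one
    git push --force origin spdk               # rewrite the published branch (bypasses review)
    # or publish it as a separate branch instead, as suggested below:
    git push origin HEAD:refs/heads/spdk-5.4.5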
[16:20:19] I believe 0
[16:20:29] yes 0
[16:20:34] our .cc file is in the spdk repo
[16:20:36] so no changes at all
[16:20:49] I did have to change the makefile fragment in our repo to include env/ instead of util/
[16:20:51] so there's no synchronization needed between moving the spdk branch and changes to our .cc file
[16:21:11] I can push a patch that adds -Ienv/ to the spdk makefile fragment
[16:21:16] but leave -Iutil/
[16:21:21] good idea
[16:21:23] and then our makefile fragment will work with both
[16:21:33] I'll do that here in a minute
[16:21:47] if it's not too hard, sure I wouldn't mind reviewing the patch on top of 5.4.5
[16:22:04] gerrithub, text, whatever
[16:22:08] I'll make that happen - I think I'll have to force push the spdk branch to be 5.4.5, temporarily losing our changes
[16:22:13] then I'll push the patch on top of that
[16:22:24] we could make a new branch, the way we're doing dpdk
[16:22:24] I can name it spdk-5.4.5
[16:22:40] ok I'll sort that out and send the patches
[16:28:34] https://review.gerrithub.io/#/c/366698/ and https://review.gerrithub.io/#/c/366699/
[16:28:51] don't merge the rocksdb patch yet - I'm doing a ton of benchmarking so we may as well wait to make sure everything is solid
[16:29:35] I don't want to find out that we introduced an intermittent failure or something
[16:31:15] all looks good to me - go ahead and push as you see fit
[16:38:48] I've run a few rocksdb experiments for randread with just the kernel
[16:39:10] if I turn off the rocksdb cache and rely entirely on the kernel, I get 121k I/Ops
[16:39:15] Get() I/Ops
[16:39:43] if I switch to direct I/O with a 16GB block cache and a 16GB compressed cache, I get ~111k I/Ops
[16:39:54] compressed cache?
[16:40:07] it's a newish thing - it caches the compressed data from sst files
[16:40:14] but I believe it is only used for compaction
[16:40:19] so it isn't having an impact here
[16:40:31] in fact, my third run was direct I/O with 16GB block cache and no compressed cache
[16:40:39] same performance as with the compressed cache
[16:40:53] if we have compression turned off, is that compressed cache still used?
[16:41:02] we don't enable compression in our run_tests.sh script
[16:41:19] I tentatively believe so - I think it is specifically triggered when doing compaction reads and writes
[16:41:25] and they just didn't have a better name for it
[16:41:40] but I definitely could be wrong
[16:42:09] my 4th run, which is going now, is direct I/O with a 38GB cache (no compressed cache)
[16:42:30] 38GB is the amount of free memory my system has at the moment, more or less
[16:42:39] with a little wiggle room
[16:43:00] if the performance goes up, then we know the kernel is just beating the rocksdb cache because it is using more memory
[16:43:13] if it doesn't, then the kernel is doing some wizardry with a better algorithm or something
[16:44:46] the default algorithm in RocksDB is just LRU
[16:44:53] but it has some other algorithm called CLOCK
[16:44:56] I'll try that too
[16:45:03] LRU is tough to beat in reality from my experience
[16:52:37] jimharris, sethhowe: we merged the qemu changes for vhost-blk, so do we need to update the qemu source on wkb-fedora-08?
[16:52:48] I forget what the process is for updating that
[16:55:42] I assume so. You mean that you made changes to spdk/qemu.git right?
[16:56:15] yeah - we need to update the source and build it
[16:56:42] we don't rebuild qemu on each patch (since there's no reason to)
[16:56:58] I can update it.
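The cache experiments being compared above map roughly onto db_bench flags like the ones below. This is a sketch only: the database path, sizes, thread count, and duration are placeholders, and flag spellings and defaults should be checked against db_bench --help for the RocksDB version in use.

    # Common flags for the readrandom comparison (placeholder path/threads/duration).
    COMMON="--benchmarks=readrandom --use_existing_db=1 --db=/mnt/rocksdb_db \
            --compression_type=none --threads=16 --duration=300"

    # Buffered I/O with only the small default block cache: reads lean on the kernel page cache.
    ./db_bench $COMMON

    # Direct reads with a 16 GB block cache plus a 16 GB compressed block cache.
    ./db_bench $COMMON --use_direct_reads=true \
        --cache_size=$((16 * 1024 ** 3)) --compressed_cache_size=$((16 * 1024 ** 3))

    # Direct reads with a larger (38 GB) block cache and no compressed cache.
    ./db_bench $COMMON --use_direct_reads=true --cache_size=$((38 * 1024 ** 3))

    # Same again, but with the CLOCK cache implementation instead of the default LRU.
    ./db_bench $COMMON --use_direct_reads=true --cache_size=$((38 * 1024 ** 3)) --use_clock_cache=true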
[16:57:04] 38GB cache is only marginally better than 16GB cache
[16:57:13] I see the hit rate go up a bit, but only a few $
[16:57:15] %
[16:57:24] btw - karol is going to start a jenkins job on their systems to test against qemu master as it evolves
[16:57:26] and it ends up with like a 1% performance improvement
[16:58:06] that mostly makes sense - the database is ~250GB
[16:58:17] do I just pull from master? or is there a certain tag I should pull from?
[16:58:54] the cache is somewhere around 10% of that size, so if you double the cache size I'd expect only a small gain in hit rate
[16:59:07] for a truly random sample
[17:00:18] are you running on P4800X?
[17:00:23] P3700
[17:00:36] I have a P4800X, but Vishal was using a P3700 so I'm matching that for now
[17:00:37] sethhowe: the qemu copy on that machine should already be on the 'spdk' branch
[17:00:47] so you should be able to just do a git pull
[17:00:58] I think you also need to configure it and do a make install, but I don't remember the details
[17:01:24] that's fine - just wondering about the ~250GB size - did you shrink it from the default 500GB run_test.sh to speed up test runs?
[17:01:36] yes - well Vishal did and I'm matching him
[17:01:51] makes sense
[17:02:11] there are so many knobs that this is going to take a few days to sort out
[17:03:04] sethhowe: check out the qemu build code in test/vhost/fiotest/common.sh to see how we should configure it
[17:03:12] qemu_build_and_install()
[17:03:58] not sure what $INSTALL_DIR is meant to be there, but you should be able to find where the current version is installed
[20:15:24] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[21:14:41] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Quit: Leaving.)
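For the qemu refresh on wkb-fedora-08, the steps described above amount to roughly the following; the clone path and install prefix are placeholders, and the authoritative configure invocation is the one in qemu_build_and_install() in test/vhost/fiotest/common.sh:

    cd /path/to/qemu                           # existing spdk/qemu clone on wkb-fedora-08 (path not given in the log)
    git checkout spdk && git pull              # the clone should already be tracking the 'spdk' branch
    ./configure --prefix="$INSTALL_DIR" --target-list=x86_64-softmmu   # example flags only; see common.sh
    make -j"$(nproc)"
    make install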