[01:29:28] *** Quits: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com) (Ping timeout: 260 seconds)
[01:31:36] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[03:16:40] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[03:30:20] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[04:17:05] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[08:23:51] drv, I assume you meant to change all the unit test subdir names from _c or .c (like in nvme, etc)? Just making sure, a dir name with a .c at the end threw me for a second :)
[09:17:12] yes, I did the initial one that way and nobody commented on it, so I assumed it was OK
[09:26:58] sethhowe: any idea on this one? http://spdk.intel.com/public/spdk/builds/review/37e3f5b00a62d8178ce5cf424d8ffa346522bba6.1498147155/
[09:27:10] they are talking about it behind me right now
[09:27:32] i was monitoring it during the run, and there was no output in the vm results directories
[09:27:41] but it's there now
[09:32:07] those VMs are running on a machine in the other lab, so I think the sshfs connection probably went down temporarily
[09:32:48] jimharris: I pushed your DPDK patch, so you should be able to do a submodule update now
[09:33:03] ok
[09:40:33] pool is hung up again
[09:40:46] looks like all of the test systems are done
[09:40:51] but status still says incomplete
[09:46:33] sethhowe was scrolling around in the tmux session running the pool, so it was blocked
[09:47:32] jimharris: your submodule update has the wrong commit, I think - did you use the local commit before gerrit rebased it?
[09:47:47] My bad on that one. lesson learned.
[09:49:32] np
[09:49:40] test pool still looks like its stuck
[09:49:48] not starting the next patch in the queue
[09:49:58] drv: thanks for the heads-up - fixed
[09:50:59] It is. It looks like the submodule hash for dpdk on the current build doesn't exist. My error handling is not currently equipped to deal with that. It thinks it is a problem with the gerrit server and is trying to rerun the build over and over.
[09:54:27] I will add error handling for that specific case.
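For reference, a minimal sketch of repointing the dpdk submodule at the commit gerrit actually merged (rather than a local pre-rebase commit); the repo path and merged sha are placeholders, not taken from the log:

    cd spdk                                    # top of the SPDK tree (placeholder path)
    git submodule update --init dpdk           # make sure the submodule checkout exists
    cd dpdk
    git fetch origin                           # pick up the commit that gerrit actually merged
    git checkout <merged-commit-sha>           # the post-rebase sha, not the local pre-rebase one
    cd ..
    git add dpdk                               # record the new submodule pointer
    git commit -m "dpdk: point submodule at merged commit"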
[09:56:58] jimharris: your stub patch has a typo in the commit message :) ("sentinal")
[10:44:15] drv: thanks - i've found another issue i need to fix in that patch so will update the commit message while i'm at it
[10:45:25] just using the env layer, the thread was not bound to a core so the scheduler would keep that sleep(1) loop on an idle core
[10:46:19] sethhowe: thanks - i'll try not to break the test pool with bad submodule git commit ids :)
[12:06:38] drv, OK, just thought maybe my file system was corrupted or something there at first (or maybe I had taken a sleeping pill this morning instead of a vitamin) :)
[13:09:36] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[13:14:06] drv: i made a few cosmetic changes to the nbd patch
[13:14:28] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Ping timeout: 260 seconds)
[13:15:26] drv, sethhowe: noticing there seems to be quite a bit of difference lately between the max agent test time and the time reported by the pool
[13:15:40] 63 seconds: http://spdk.intel.com/public/spdk/builds/release/master/2462/
[13:16:02] 79 seconds: http://spdk.intel.com/public/spdk/builds/review/fcb514479d007d7d625cad6e24f9a374f134c6a0.1498160996/
[13:16:13] *** Joins: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com)
[13:16:14] we're thinking most of that is related to the sshfs copies, etc. right?
[13:16:40] it seems to vary wildly between test runs though
[13:16:46] this one was only 28 seconds: http://spdk.intel.com/public/spdk/builds/review/fce0e5a15599bcc43ec94183e2a94f87b13a7308.1498161860/
[13:17:51] drv: some of the failures on my stub series had the dpdk submodule change in it - i'm suspicious about switching the dpdk submodule for now
[13:22:31] Thanks Jim, yes, that's still the belief. I have been meaning to take a deeper look at this for a while. I have been prioritizing configuring the virtual machines, but let me take a deeper look at this today as soon as I add the ubuntu 17.04 machine to the pool.
[13:25:32] Also, did you all see my e-mail yesterday about the technical training? Did that idea seem good to all of you? I have gotten positive feedback from anu, peluse, and john. jimharris bwalker drv
[13:29:40] i may have to skip the technical training today
[13:44:13] jimharris: I was looking back at those two builds you referenced - those two occurred when I was testing a hunch about the timeout failure from the virtual machines this morning.
[13:45:32] ah - ok
[13:45:57] when the host machine is being used for something else (in the case of those two builds I was shell copying a disk image) it blocks the guest VMs from updating the sshfs drive.
[13:47:58] But I did take the liberty of checking the timing differences from the other 15 passing builds on the webpage and there was still a lot of variance (22s min value - 44s max value).
[13:49:09] I'll keep looking into the variances and see if I can pinpoint it. Thanks.
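The actual binding fix belongs in the app/env layer, but as a rough external illustration of the same idea, a process (and its sleep(1) loop) can be pinned with taskset so the scheduler stops parking it on an otherwise idle core; the binary name here is hypothetical:

    taskset -c 0 ./stub_app                    # launch the app pinned to core 0 (hypothetical binary)
    taskset -cp "$(pidof stub_app)"            # show which cores a running instance may use
    taskset -cp 0 "$(pidof stub_app)"          # re-pin a running instance to core 0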
[14:57:53] hmm, this last hang seems to be in the new test that does iscsi_tgt + nvme bdev connecting to nvmf_tgt
[14:57:59] via config file
[15:05:06] i don't think it's related to my patch
[15:05:20] same thing happened on a previous test run on my nbd patch (which definitely could not have caused that)
[15:05:32] http://spdk.intel.com/public/spdk/builds/review/2ba0f3a62856432399c25f9ec15a2ac8e7df15cb.1498163200/
[15:06:44] *** Quits: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com) (Ping timeout: 255 seconds)
[15:09:45] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[15:35:32] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[15:43:54] drv, bwalker: patch to revert the NVMe-oF conf file tests that are causing these test failures: https://review.gerrithub.io/#/c/366696/
[16:16:56] jimharris: I have the rocksdb plugin ported to 5.4.5 - how should I push that?
[16:17:10] I can't really do a patch because it's a rebase onto a different history in git
[16:17:29] I could push a patch that's a diff between 5.4.5 tip and what would be our new spdk code
[16:17:34] ugh
[16:17:53] I also confirmed it fixes the XFS thing
[16:17:55] I think you just do a git reset
[16:18:07] I'm on kernel 4.11 and doing lots of rocksdb runs using it, no prob
[16:18:18] I can just force push the spdk branch to be my new one
[16:18:22] without review
[16:18:30] I just didn't know if you wanted to review the changes I made
[16:18:43] I don't think I did very much at all actually
[16:18:55] maybe nothing now that I look at it
[16:19:09] is it basically 5.4.5 + our minor patches?
[16:19:12] yeah
[16:19:15] like before?
[16:19:19] I just rebased the spdk branch onto 5.4.5
[16:19:27] I squashed our branch as I rebased
[16:19:36] it was too hard to make the individual patches each pass
[16:19:41] excellent - i meant to bring that up last night
[16:19:51] no - squashing was the way to go
[16:20:00] so it's just a single patch on top of 5.4.5
[16:20:05] do you want me to push that as a review?
[16:20:10] how many changes to our .cc file?
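A sketch of the rebase-and-squash flow being discussed here, assuming a clone of the SPDK RocksDB fork with an "upstream" remote pointing at stock RocksDB (remote, branch, and tag names are assumptions, not taken from the log):

    git fetch upstream --tags                  # pick up the upstream v5.4.5 tag
    git checkout spdk                          # branch carrying the SPDK env plugin changes
    git rebase -i v5.4.5                       # replay onto 5.4.5, squashing our commits into one
    git push --force origin spdk               # rewrite the published branch (bypasses review)
    # or publish it as a separate branch instead, as suggested below:
    git push origin HEAD:refs/heads/spdk-5.4.5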
[16:20:19] I believe 0
[16:20:29] yes 0
[16:20:34] our .cc file is in the spdk repo
[16:20:36] so no changes at all
[16:20:49] I did have to change the makefile fragment in our repo to include env/ instead of util/
[16:20:51] so there's no synchronization needed between moving the spdk branch and changes to our .cc file
[16:21:11] I can push a patch that adds -Ienv/ to the spdk makefile fragment
[16:21:16] but leave -Iutil/
[16:21:21] good idea
[16:21:23] and then our makefile fragment will work with both
[16:21:33] I'll do that here in a minute
[16:21:47] if it's not too hard, sure I wouldn't mind reviewing the patch on top of 5.4.5
[16:22:04] gerrithub, text, whatever
[16:22:08] I'll make that happen - I think I'll have to force push the spdk branch to be 5.4.5, temporarily losing our changes
[16:22:13] then I'll push the patch on top of that
[16:22:24] we could make a new branch, the way we're doing dpdk
[16:22:24] I can name it spdk-5.4.5
[16:22:40] ok I'll sort that out and send the patches
[16:28:34] https://review.gerrithub.io/#/c/366698/ and https://review.gerrithub.io/#/c/366699/
[16:28:51] don't merge the rocksdb patch yet - I'm doing a ton of benchmarking so we may as well wait to make sure everything is solid
[16:29:35] I don't want to find out that we introduced an intermittent failure or something
[16:31:15] all looks good to me - go ahead and push as you see fit
[16:38:48] I've run a few rocksdb experiments for randread with just the kernel
[16:39:10] if I turn off the rocksdb cache and rely entirely on the kernel, I get 121k I/Ops
[16:39:15] Get() I/Ops
[16:39:43] if I switch to direct I/O with a 16GB block cache and a 16GB compressed cache, I get ~111k I/Ops
[16:39:54] compressed cache?
[16:40:07] it's a newish thing - it caches the compressed data from sst files
[16:40:14] but I believe it is only used for compaction
[16:40:19] so it isn't having an impact here
[16:40:31] in fact, my third run was direct I/O with 16GB block cache and no compressed cache
[16:40:39] same performance as with the compressed cache
[16:40:53] if we have compression turned off, is that compressed cache still used?
[16:41:02] we don't enable compression in our run_tests.sh script
[16:41:19] I tentatively believe so - I think it is specifically triggered when doing compaction reads and writes
[16:41:25] and they just didn't have a better name for it
[16:41:40] but I definitely could be wrong
[16:42:09] my 4th run, which is going now, is direct I/O with a 38GB cache (no compressed cache)
[16:42:30] 38GB is the amount of free memory my system has at the moment, more or less
[16:42:39] with a little wiggle room
[16:43:00] if the performance goes up, then we know the kernel is just beating the rocksdb cache because it is using more memory
[16:43:13] if it doesn't, then the kernel is doing some wizardry with a better algorithm or something
[16:44:46] the default algorithm in RocksDB is just LRU
[16:44:53] but it has some other algorithm called CLOCK
[16:44:56] I'll try that too
[16:45:03] LRU is tough to beat in reality from my experience
[16:52:37] jimharris, sethhowe: we merged the qemu changes for vhost-blk, so do we need to update the qemu source on wkb-fedora-08?
[16:52:48] I forget what the process is for updating that
[16:55:42] I assume so. You mean that you made changes to spdk/qemu.git right?
[16:56:15] yeah - we need to update the source and build it
[16:56:42] we don't rebuild qemu on each patch (since there's no reason to)
[16:56:58] I can update it.
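The cache experiments being compared above map roughly onto db_bench flags like the ones below. This is a sketch only: the database path, sizes, thread count, and duration are placeholders, and flag spellings and defaults should be checked against db_bench --help for the RocksDB version in use.

    # Common flags for the readrandom comparison (placeholder path/threads/duration).
    COMMON="--benchmarks=readrandom --use_existing_db=1 --db=/mnt/rocksdb_db \
            --compression_type=none --threads=16 --duration=300"

    # Buffered I/O with only the small default block cache: reads lean on the kernel page cache.
    ./db_bench $COMMON

    # Direct reads with a 16 GB block cache plus a 16 GB compressed block cache.
    ./db_bench $COMMON --use_direct_reads=true \
        --cache_size=$((16 * 1024 ** 3)) --compressed_cache_size=$((16 * 1024 ** 3))

    # Direct reads with a larger (38 GB) block cache and no compressed cache.
    ./db_bench $COMMON --use_direct_reads=true --cache_size=$((38 * 1024 ** 3))

    # Same again, but with the CLOCK cache implementation instead of the default LRU.
    ./db_bench $COMMON --use_direct_reads=true --cache_size=$((38 * 1024 ** 3)) --use_clock_cache=true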
[16:57:04] 38GB cache is only marginally better than 16GB cache
[16:57:13] I see the hit rate go up a bit, but only a few $
[16:57:15] %
[16:57:24] btw - karol is going to start a jenkins job on their systems to test against qemu master as it evolves
[16:57:26] and it ends up with like a 1% performance improvement
[16:58:06] that mostly makes sense - the database is ~250GB
[16:58:17] do I just pull from master? or is there a certain tag I should pull from?
[16:58:54] the cache is somewhere around 10% of that size, so if you double the cache size I'd expect only a small gain in hit rate
[16:59:07] for a truly random sample
[17:00:18] are you running on P4800X?
[17:00:23] P3700
[17:00:36] I have a P4800X, but Vishal was using a P3700 so I'm matching that for now
[17:00:37] sethhowe: the qemu copy on that machine should already be on the 'spdk' branch
[17:00:47] so you should be able to just do a git pull
[17:00:58] I think you also need to configure it and do a make install, but I don't remember the details
[17:01:24] that's fine - just wondering about the ~250GB size - did you shrink it from the default 500GB run_test.sh to speed up test runs?
[17:01:36] yes - well Vishal did and I'm matching him
[17:01:51] makes sense
[17:02:11] there are so many knobs that this is going to take a few days to sort out
[17:03:04] sethhowe: check out the qemu build code in test/vhost/fiotest/common.sh to see how we should configure it
[17:03:12] qemu_build_and_install()
[17:03:58] not sure what $INSTALL_DIR is meant to be there, but you should be able to find where the current version is installed
[20:15:24] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[21:14:41] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Quit: Leaving.)
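For the qemu refresh on wkb-fedora-08, the steps described above amount to roughly the following; the clone path and install prefix are placeholders, and the authoritative configure invocation is the one in qemu_build_and_install() in test/vhost/fiotest/common.sh:

    cd /path/to/qemu                           # existing spdk/qemu clone on wkb-fedora-08 (path not given in the log)
    git checkout spdk && git pull              # the clone should already be tracking the 'spdk' branch
    ./configure --prefix="$INSTALL_DIR" --target-list=x86_64-softmmu   # example flags only; see common.sh
    make -j"$(nproc)"
    make install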