[00:16:02] *** Joins: tomzawadzki (tzawadzk@nat/intel/x-nmjgyarsgqbilkpx)
[01:00:26] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.72) (Quit: Leaving)
[02:01:26] *** Quits: cunyinch_ (~cunyinch@134.134.139.82) (Remote host closed the connection)
[02:45:45] *** Quits: whitepa (~whitepa@2601:601:1200:f23b:510d:fd92:df29:a7ba) (Read error: Connection reset by peer)
[04:07:21] *** Quits: tsuyoshi (b42b2067@gateway/web/freenode/ip.180.43.32.103) (Quit: Page closed)
[08:31:01] *** Quits: tomzawadzki (tzawadzk@nat/intel/x-nmjgyarsgqbilkpx) (Ping timeout: 258 seconds)
[08:53:36] the i/o counting patch didn't fail because of calsoft - i misread the log and contents of the results directory
[08:53:47] it was a bug in the error vbdev module
[08:54:20] it is getting a new bdev_io via get_child_io() but then calling resubmit on that new I/O
[08:55:00] because we don't have an API available to the error injection module to submit this child_io as a "new" I/O
[08:55:21] for now i'm just fixing the error module to just fail the I/O immediately (as drv had suggested in passing last week)
[08:55:54] yeah, I think we can just chop out all of the child I/O stuff
[09:22:33] jimharris: were you going to add a SPDK_TEST_ flag for unit tests? if not, I can put that together
[09:23:32] i was, but if you want to throw it together i'm ok with that too :)
[09:24:26] ok, I will make a patch
[09:25:07] yeah - let's get that in asap and hold off on committing patches until we get that in - it will actually save us about 40s per patch
[09:26:02] it was 30 on my system but looks like 40-45s on fedora-07 (currently the long pole in the tent among the test systems)
[09:27:33] https://review.gerrithub.io/#/c/362390/ - I already set the flag to 0 on wkb-fedora-07
[09:28:56] need that reorderable build queue
[09:29:07] i'd go ahead and remove it from fedora-04 too
[09:29:55] ok
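A minimal sketch of how an SPDK_TEST_ flag like the one discussed above might gate the unit tests in the per-machine test script; the flag name SPDK_TEST_UNITTEST and the unittest.sh entry point are assumptions for illustration, not confirmed by this log:

```bash
#!/usr/bin/env bash
# Sketch only: the flag and script names are assumptions.
# A test machine that should skip unit tests (e.g. wkb-fedora-07 above) would
# export SPDK_TEST_UNITTEST=0 in its per-machine configuration.
: "${SPDK_TEST_UNITTEST:=1}"    # default: run the unit tests

if [ "$SPDK_TEST_UNITTEST" -eq 1 ]; then
        ./unittest.sh           # the ~30-45s per patch that gets saved when skipped
fi
```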
[09:34:43] have we started at all on the dynamic patch queue?
[09:35:14] no - he's working on putting the agents in VMs right now
[09:35:24] so we can quickly deploy lots of agents
[09:35:32] cool
[09:35:48] also he's writing unit tests
[09:35:52] for the real code
[09:35:53] i was just curious - this is a case where bumping drv's patch to the front would be really nice
[09:35:57] yeah - i saw that
[09:36:06] I agree - reorderable queue would be awesome
[09:37:20] we could actually do it by hand-editing the file right now
[09:37:27] do it :)
[09:37:31] I think...
[09:38:14] well actually that wouldn't help that much - the patches already in the queue wouldn't skip the unit tests
[09:38:34] but for the ones we own we can rebase them real quick if we merge
[09:38:37] sure
[09:38:52] if we rebase before they run, they'll run the rebased version instead
[09:38:55] the release builds in the queue would still run long
[09:38:56] it's smart enough to work all of that out
[09:38:59] yeah
[10:58:36] jimharris, so my vagrant VM doesn't have an NVMe device for sure, I'm assuming yours does? When I run on my Mac it is there. Wondering if there's a version and/or syntax thing with the addition of the NVMe device in Vagrantfile. Or maybe my vbox ext versions, will check...
[11:00:11] mine does but it's also Mac - I haven't tried Linux yet
[11:04:14] hmmm, my linux vbox is older than my mac vbox. I'll update and see if it works
[11:42:48] bwalker, drv: could you review my nvme perf latency patches? i'd like to get those committed before i write up info on my histogram approach for girish
[11:45:35] reviewed. there are comments on the second one
[11:47:58] drv: do you want me to respin to add 'const' to that cutoff array?
[12:29:50] Hey, I just realized that you guys have made dpdk a submodule of spdk in gerrithub
[12:31:30] This is going to make things a little more complicated.
[12:32:09] What are we supposed to do if we have patches in DPDK that we need in order to get our SPDK changes to work?
[12:32:38] This means I need to upstream my patches to DPDK as well as my changes to SPDK
[12:48:28] jimharris: if it's not too much trouble, I think it would be a good idea
[12:48:53] johnmeneghini1: we still support building with out-of-tree DPDK, so you should be able to point at your own patched version of DPDK
[12:49:04] the submodule is just to make it easier to get a known-good version of DPDK that works with the current version of SPDK
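For the out-of-tree case described above, the build is pointed at a locally patched DPDK instead of the submodule. A sketch, assuming SPDK's configure-time --with-dpdk option; the DPDK path is illustrative:

```bash
# Build SPDK against a locally patched DPDK instead of the dpdk submodule.
# The DPDK build directory below is illustrative; point it at your own patched tree.
cd spdk
./configure --with-dpdk=/path/to/patched/dpdk/x86_64-native-linuxapp-gcc
make
```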
[12:51:52] Agreed.
[12:53:45] Also, what does it mean that "gerrithub automatically updates your github project"?
[12:54:06] While setting this up I imported my github spdk fork.
[12:54:19] At this point I'm thinking that was a bad idea.
[12:54:24] if you use Gerrithub to merge patches (using the Submit button), they would get synced to your github project
[12:54:32] but if you don't do that with gerrit, nothing will happen
[12:55:11] if you just directly interact with the GitHub project and don't touch it on Gerrithub, it won't be a problem
[12:56:26] So my github fork will be synced with patches that I merge into gerrithub but not with patches that others merge?
[13:00:27] jimharris, FYI w/latest vbox on linux I get the NVMe device in the vagrant VM and hello world works as expected including setup.sh in Vagrantfile. Will update patch accordingly...
[13:01:13] johnmeneghini1: your github fork won't be modified unless you push things via gerrithub
[13:03:44] you can probably also delete your project on gerrithub if you didn't want to import it
[13:06:32] OK. I think I'm going to remove my github fork on gerrithub. I maintain this fork manually by pulling from github/spdk and pushing to github/johnmeneghini/spdk. And I do plan to push things into spdk via gerrithub. I'm planning to move completely away from github.
[13:09:08] just to be clear, the main spdk repo is called 'spdk/spdk' on gerrithub, and pushing things to that won't affect your johnmeneghini/spdk fork on gerrithub or github
[13:09:30] but yeah, if you aren't going to use the gerrithub integration for your fork, I would just remove it to avoid confusion
[13:10:19] Agreed
[13:14:28] jimharris: your new latency patch has a dpdk submodule change in it: https://review.gerrithub.io/#/c/362270/
[13:14:35] probably not intentional
[13:16:19] good grief
[13:17:47] fixed
[13:18:29] so another thing I noticed is the total build time gets reported as about 45s longer than the longest of the individual test systems
[13:18:36] http://spdk.intel.com/public/spdk/builds/release/master/2219/
[13:18:54] hm, that is quite a bit
[13:19:27] some of that is probably from copying the source tree over sshfs, but I can't imagine it takes 45 seconds
[13:24:35] Is there some way I can download a patch set that I am reviewing in gerrithub? It's hard to review some of these changes w/out being able to see the whole bundle. I'd prefer to download a patch and use cscope.
[13:25:13] in the upper right corner, you should see a menu called "Download"
[13:25:32] there are several options there for how to download it
[13:25:49] personally i usually use the Checkout option
[13:26:53] OK. Thanks
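The Checkout option copies a command of roughly this shape, which fetches the patch set into a local tree for review with cscope or any other tool; the change number and patch set below (362390, patch set 1) are just examples:

```bash
# Fetch a specific patch set from GerritHub and check it out locally.
# Gerrit refs have the form refs/changes/<last two digits>/<change>/<patch set>.
git fetch https://review.gerrithub.io/spdk/spdk refs/changes/90/362390/1
git checkout FETCH_HEAD
```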
[13:30:46] ooh, my daily reminder just went off to harass bwalker about setting up an office hours time... :)
[13:40:07] jimharris, thanks for pointing that out. I'm going to analyze this next build and see where the hangup is.
[13:42:36] I vote that we schedule our office hours in Swatch internet time
[13:42:52] what does that mean
[13:52:09] http://lmgtfy.com/?q=swatch+internet+time
[13:52:57] It looks like we lose about 10 seconds while the pool and agent communicate back and forth getting ready to start the build (there are a few intermediate states). Still looking for other causes.
[13:53:06] they taught me that in school :)
[13:54:02] it's a useful life skill
[13:54:25] bwalker, one more try on that timeslot please.... see email
[13:56:35] *** Quits: johnmeneghini1 (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Quit: Leaving.)
[13:57:40] like even an hour earlier would be good, or that time on W or F
[14:06:34] 9am is pretty early - I'm not always caught up on email and done putting out fires by then
[14:06:44] and 11-12 bleeds into lunch
[14:06:54] so I want it at 10am
[14:07:17] but you have something every day at 10am except wed.
[14:07:27] while the rest of us have nothing at 10am every day except wed
[14:29:40] I can do Mon
[14:30:03] does that work?
[14:33:20] yeah monday works
[14:33:34] great, thanks
[14:37:55] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[14:38:17] Ben and Daniel
[14:38:38] I'm looking at https://review.gerrithub.io/#/c/362255/2/lib/util/io_channel.c
[14:39:04] I have a comment, but it appears to be stuck in the draft state.
[14:40:02] to publish comments, you have to hit Reply... at the top level and Post
[14:40:17] you can also do +1/-1 depending on what you think of the patch
[14:40:20] yeah - if you go to the top level of the review (up button if you are looking at a file)
[14:40:28] at the top is a "Reply..." button
[14:40:37] you get to vote and it publishes all of your comments at once
[14:49:12] OK. Thanks. You've got my first comment, but I'm now done for a while. I'll be offline for the next few days. My daughter is getting married this weekend!
[14:50:13] oh congrats
[15:01:18] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[15:12:53] nice!
[15:20:48] jimharris: I'm investigating the unit test failure from this patch https://review.gerrithub.io/#/c/362254/
[15:20:55] the patch should be totally unrelated to the failure
[15:21:21] and I've made some progress - namely that it is timing related
[15:21:36] if the code runs slower, because for instance thread sanitizer is on, it hits
[15:21:39] but if it runs fast it doesn't
[15:21:59] so then I dug into how it is doing the actual message passing
[15:22:37] that's the first time I've seen that failure
[15:22:39] and am I correct in understanding that it's assuming the blobfs operations will each submit at most queue depth one worth of requests?
[15:23:15] yes
[15:23:18] because I made a quick modification where in ut_send_request and in spdk_thread I set g_req to NULL
[15:23:23] when I'm done with it
[15:23:29] i.e. whenever req->done is set to 1
[15:23:38] and then in send_request, I assert that g_req is NULL
[15:23:45] if I run normally, it passes
[15:23:55] if I turn on thread sanitizer or put in delays
[15:23:56] it fails
[15:24:08] so two reqs are being submitted
[15:24:22] specifically, from spdk_file_truncate
[15:24:24] hold on - i'll need to take a look
[15:26:25] where are two requests getting sent from truncate?
[15:26:36] that's just the stack when I hit the assert
[15:27:04] the rest of the stack goes up to cache_write, line 154
[15:27:17] so there could be an I/O outstanding from spdk_fs_open_file maybe?
[15:27:24] but spdk_file_truncate only calls ->send_request once
[15:27:26] I'm trying to make sure that my asserts are all correct
[15:29:51] yeah - when truncate calls send_request
[15:29:58] there is still an __fs_open_file req out
[15:30:16] triple checking that I'm doing the right thing with my asserts here
[15:30:44] to be honest I need to rethink how this blobfs_sync_ut code works - I wanted to run it all without relying on the spdk framework, but my "solution" isn't very ideal
[15:31:01] yeah - thread sanitizer thinks it is totally broken
[15:31:05] so does valgrind
[15:31:08] but it's not assuming x86 atomicity rules
[15:31:24] i can take a look at this tomorrow
[15:31:42] btw - i have a new idea to reduce the iscsi (and probably nvmf) test times dramatically
[15:31:54] i've tested it out locally and it works - needs a few patches to get it ready
[15:32:13] the basic idea is to create a dpdk "stub" app that basically just does the spdk_env_init() as a primary process
[15:32:18] also probes nvme devices
[15:32:32] then all of the iscsi tests just pass an instance id to bind to that stub process
[15:32:40] so we avoid dpdk init time + nvme probe time for all of those tests
[15:33:26] that also allows us to be much more granular on our system-level automated tests
[15:34:07] I like it
[15:38:01] well, my assert was very slightly wrong
[15:38:26] in the spdk_thread, I was setting g_req to NULL after I called req->fn
[15:38:41] but req->fn kicks the semaphore, so the main thread moves on and tries to call the next I/O
[16:01:04] jimharris: I did a little more looking into the timing issues. There is about a fifteen second latency before all of the agents get to the point of running the builds, and the build finishes very quickly after the last agent finishes.
[16:02:45] I think a lot of the disparity is coming from the fact that I am using the data generated by the timing function in autorun to generate the agent time while I am measuring the pool time directly from python.
[16:04:17] I will change the scripts to measure agent time directly so that we get a better idea of real issues slowing down the builds.
[16:09:21] sethhowe: thanks! some of these issues might not be fixable - hope you don't mind sharing what you find to see if there are easy things we can do to reduce the test times
[16:14:52] Not at all! Keeping test times down will go a long way towards improving development turnaround times.
[16:21:23] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[18:16:01] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.77)
[21:59:00] *** Joins: changpeng (changpeng@nat/intel/x-fgtafscqyasqtbjv)
[22:00:09] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Quit: Leaving.)
[22:49:14] *** Joins: tsuyoshi (b42b2067@gateway/web/freenode/ip.180.43.32.103)
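A rough sketch of the dpdk "stub" idea described at 15:32 above: one long-lived primary process pays the DPDK init and NVMe probe cost once, and each iscsi/nvmf test then attaches to it as a secondary process via a shared instance id. The binary name, the -i option, and the config file here are assumptions for illustration, not an existing interface at this point in the log:

```bash
# Hypothetical workflow, following the idea sketched in the discussion above.
./stub -i 0 &                        # primary process: spdk_env_init() + NVMe probe, then idles
STUB_PID=$!
sleep 5                              # crude wait for the stub to finish initializing

# Each test starts its target with the same instance id, attaching to the stub
# as a secondary process and skipping DPDK init and NVMe probe entirely.
./app/iscsi_tgt/iscsi_tgt -i 0 -c iscsi.conf

kill "$STUB_PID"
```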