[08:10:29] Hi everyone, if there are no objections, after the current running build finishes I am planning on putting the build pool on hold for about 10-15 minutes to move the VMs to a different physical machine.
[08:50:03] build pool is back up and running.
[09:21:55] jimharris: I removed your +2 from this since Pawel reviewed it -1: https://review.gerrithub.io/#/c/365327/
[10:43:42] drv: sounds good
[10:50:05] jimharris, FYI I updated the mock patch to include support for * stubs as well and added an example of one. Let me know if you see a cleaner way of doing it...
[11:06:49] jimharris: any insight into what might be causing this hugepage problem on wkb-fedora-04?
[11:46:52] drv: I have no idea why it is all of a sudden happening all of the time - can you try that patch where we block in setup.sh until all of the pages are available in /sys/kernel/mm/hugepages/hugepages-XXX/free_hugepages?
[11:47:06] maybe also have a bail out for now in case they never become free
[12:10:15] same/similar hugepage issue to the one I'm seeing, but you see yours in env.sh?
[12:49:49] sethhowe, everything OK with CI? I have a failed patch from 24 min ago that still doesn't look like it has logs available yet... https://review.gerrithub.io/#/c/362847/
[12:52:44] peluse: your patch just needs to be rebased (missing 27becf40d0)
[12:53:18] the copy process for ci.spdk.io takes a long time now since we have tons of outstanding reviews - we should look into reducing the size of each build
[12:53:47] if you are on the VPN, you can get the build output before the copy is finished: http://spdk.intel.com/public/spdk/status/
[12:54:47] drv, I rebased right before pushing
[12:56:36] my failure is on the ubuntu system:
[12:56:36] make clean left the following files:
[12:56:36] + git status --porcelain
[12:56:36] ?? OLD_STDERR
[12:56:36] ?? OLD_STDOUT
[12:56:37] + exit 1
[12:57:17] don't believe that's me and/or related to a rebase, no?
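
For reference, a minimal sketch of the wait-with-bailout idea from the 11:46/11:47 messages above. Only the /sys/kernel/mm/hugepages path comes from the log; the function name, arguments, and 10-second timeout are assumptions, not the actual setup.sh change.

# Hypothetical helper for the wait-with-bailout idea; the name, arguments, and
# timeout are assumed - only the sysfs path comes from the discussion above.
wait_for_free_hugepages() {
	local sysfs_dir=$1     # e.g. /sys/kernel/mm/hugepages/hugepages-2048kB
	local expected=$2      # number of pages that were just reserved
	local timeout=${3:-10} # seconds to wait before giving up
	local waited=0

	# Block until the kernel reports the pages as free, but bail out in case
	# they never become free (e.g. a leftover primary process still holds them).
	while [ "$(cat "$sysfs_dir/free_hugepages")" -lt "$expected" ]; do
		if [ "$waited" -ge "$timeout" ]; then
			echo "only $(cat "$sysfs_dir/free_hugepages")/$expected hugepages free after ${timeout}s" >&2
			return 1
		fi
		sleep 1
		waited=$((waited + 1))
	done
}

# Usage: wait_for_free_hugepages /sys/kernel/mm/hugepages/hugepages-2048kB 1024 || exit 1
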
[12:58:02] that is the thing fixed in 27becf40d0
[12:58:06] if you rebase, it should be fine
[12:58:40] lcov on ubuntu leaves some files lying around
[12:59:19] got it, thanks
[13:02:46] drv, so that change was already in my patch
[13:05:35] double confirmed it so I'm *pretty sure* :)
[13:05:47] peluse: your readme change is based on 0c25463da05a6d3c9c99f4d24b3f0c4b2ffbbe0b, which is before 27becf40d08a87cf2577e1e5bf0f299e58e0743c
[13:05:57] at least the version that's pushed to Gerrit
[13:06:32] the unit test stubs patch is fine - that one is already rebased on something new enough
[13:07:32] I just pulled patch set 11 from the gerrithub download link and it's in there. Plus when I rebase it doesn't do anything because there's nothing for it to do - where are you looking for the info you just mentioned?
[13:08:12] if you look at the "parent" field on gerrit, you can grab that and do a 'git log' and see what was before that point
[13:08:29] you can also do a simple rebase in the gerrit ui if you prefer (hit the Rebase button, check "change parent revision", and leave the field blank)
[13:09:12] drv, well, I'd like to know what I'm doing wrong from the cmd line I guess. Why would a git pull from the download link provided for patch set 11 give me the correct code?
[13:10:01] no idea - does the version you grabbed have 0c25463da05a6d3c9c99f4d24b3f0c4b2ffbbe0b as the previous commit?
[13:10:18] I just downloaded it with the git fetch command and it does
[13:11:22] I suspect the 'git pull' version will auto-merge it onto whatever you have checked out currently
[13:11:34] I always use the 'git fetch' version since you get exactly what was pushed to gerrit
[13:12:14] ahhh
[13:12:30] also because it's the first thing in the list :)
[13:12:59] yeah - git pull is meant more for merging someone else's commit into whatever you are currently working on
[13:14:09] OK, I think I'm good now. Will see...
[14:03:07] jimharris: I tried adding a loop to wait for free_hugepages, but it didn't work
[14:03:22] http://spdk.intel.com/public/spdk/builds/review/77fb0b54abd8ce4e251c91fa663c8fd3b6bd4d5c.1498510430/wkb-fedora-04/build.log
[14:06:42] wtf
[14:11:54] drv, are there supposed to be a bunch free already just before that loop? Maybe print before trying to get some and see when the last time they were available was?
[14:12:36] there are 0 free the whole time, so I suspect the stub from the previous run hasn't released its hugepages
[14:13:12] is there a way to change the rocksdb test machine to not use the stub easily?
[14:13:37] what "stub" do you mean?
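
As a concrete companion to the 'git fetch' vs 'git pull' exchange at 13:07-13:12, a sketch of the fetch-based flow. The ref below is only illustrative (change 362847, patch set 11, both mentioned above); copy the real one from the Gerrit download box for the change in question.

# Fetch exactly what was pushed to Gerrit and check it out detached.
git fetch https://review.gerrithub.io/spdk/spdk refs/changes/47/362847/11
git checkout FETCH_HEAD

# See which parent the patch set is actually based on:
git log --oneline -3 FETCH_HEAD

# Check whether a required commit (here the lcov cleanup fix) is an ancestor:
git merge-base --is-ancestor 27becf40d0 FETCH_HEAD && echo "contains fix" || echo "needs rebase"

# 'git pull <url> <ref>' would instead fetch the patch set and merge it into
# whatever is currently checked out, which can hide an out-of-date parent.
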
[14:14:42] test/app/stub - it uses the multi-process support to initialize the DPDK hugepages and NVMe devices once to speed up the tests
[14:14:51] I pushed a patch to try it without stub to see if that makes any difference
[14:15:10] ahh
[14:16:02] drv: just change test/blobfs/rocksdb/rocksdb.sh to not use the stub
[14:16:28] that's the only thing running on wkb-fedora-04, and wkb-fedora-04 is the only system that uses it in the test pool
[14:16:44] yeah, I just pushed a patch to do that
[14:18:16] I'll try that in nvme.sh on my system here if you think it's worth the effort
[14:19:27] the NVMe multi-process tests (everything inside timing_enter multi_process/timing_exit multi_process) are doing something equivalent, so that's probably worth a try
[14:19:49] you could also try removing stub from nvme.sh
[14:20:17] all the tests should still work if you just comment out the start_stub line in nvme.sh
[14:20:23] running it that way now
[14:21:51] jimharris: just a heads up, the qemu source on wkb-fedora-08 has been updated to the latest commit and is working properly.
[14:21:55] kill_stub too
[14:22:03] ok, the stub-removal patch failed on the rocksdb machine on the first run, which is expected, since the previous run was with the stub
[14:22:06] yeah, I commented that out also
[14:22:17] so I'll re-run this two more times to see if it actually works on a clean run without the stub
[14:24:16] wow, w/o the stub I pass the multi and the vtophys tests for the first time (without adding delays)
[14:24:22] will re-run a few more times...
[14:24:24] cool, so we can blame it all on jimharris ;)
[14:24:28] who wrote that stub app?
[14:25:29] do we have to do something special to clean up a primary process?
[14:25:36] it doesn't seem like this has been hitting the NVMe multiprocess tests
[14:25:44] not sure what the stub is doing differently
[14:25:54] unless it's just not actually getting killed?
[14:26:06] the stub app allocates all of the memory
[14:26:31] because we don't know for sure how much memory a second app might need
[14:26:59] oh, the NVMe ones are explicitly requesting less than the max - hmm...
[14:27:11] yeah
[14:27:27] would it work if we just requested the max minus one page or something to get past that DPDK check for 0 free hugepages?
[14:27:30] maybe add a ps aux dump?
[14:27:41] it seems to sort of clean up after itself somehow
[14:28:20] need to go look at that more - the huge-unlink option might work here too
[14:28:37] also odd that it doesn't seem to happen on any of the other machines
[14:28:41] ps aux just to see if the stuff is still running
[14:28:46] stuff => stub
[14:29:31] 2 passes w/o using stub...
[14:33:08] 3
[14:33:40] so I can test something in about 4 min if you have anything you want to try - it failed every time for me before. Otherwise I'm going to move on to the next test on this system
[14:35:09] we should try to figure out what the deal is with stub, since it speeds up the tests a lot
[14:39:19] happy to help, but it would be good to get a quick phone call to explain how/why it works the way it does (or doesn't, I guess) :)
[14:46:10] Running the spdk fio_plugin with 2 and 4 NVMe drives. Unfortunately, performance is not scaling going from 1 drive (460K) to 2 NVMe drives (460K-470K max). This is all on a single core. No problem scaling when running with multiple cores.
[14:49:00] I'm at Oracle so can't do a call until tomorrow
[14:52:43] jimharris, OK cool. Yeah, would be good to get a better mental picture of the details
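
For illustration, what "comment out the start_stub line" (14:20) might look like. This is a hypothetical excerpt: only the start_stub/kill_stub and timing_enter/timing_exit helper names come from the log, not the real nvme.sh contents or the stub's arguments.

# Hypothetical nvme.sh excerpt - helper names from the discussion above,
# everything else (arguments, test list) is made up for illustration.
# With both lines commented out, each test initializes DPDK itself instead
# of attaching to the shared stub primary process.

#start_stub "-s 2048 -i 0 -m 0xF"    # arguments are assumed, not the real ones

timing_enter nvme
# ... the individual NVMe tests still run here unchanged ...
timing_exit nvme

#kill_stub
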
[15:01:48] the rocksdb run without stub passed twice in a row, so it looks like it is stub-related as well
[15:02:40] Looks like the core running FIO is spending about 70% of its time across these functions: 35% in nvme_pci_qpair_completions, 22% in spdk_nvme_qpair_process_completions, and 13% in spdk_fio_getevents. Each drive has its own FIO thread, and there is a system call involved to get time intervals. Scheduler context switching among threads seems to hamper performance with more than 1 drive.
[15:05:51] The same core running with perf has a completely different profile. Was testing this just to see if it's a good idea to start using just FIO (not perf) for performance testing.
[15:07:24] drv, cool
[15:10:27] vermavis: can you specify multiple filenames in one job with fio?
[15:18:11] jimharris: Thanks. QD=128, multiple filenames with 1 job, so 1 FIO thread running. 460K (1 drive), 850K (2 drives), 1M (3 drives), and 1.1M (4 drives). It seems 20% of this core is spent waiting (io_u_queued_complete (10%), wait_for_completions (10%))
[15:18:52] so before you just had 1 filename per job?
[15:18:59] yes
[15:22:49] I remember John K. also got something like this (1.2M) with 24x NVMe drives using the spdk fio plugin on a single core.
[15:26:48] jimharris: I had 1 filename per job along with cpumask=0x1 on each job to ensure all these threads are running on a single core. If we don't use cpumask then we see performance scaling, but of course at the cost of using additional core(s)
[15:59:21] FYI this didn't fix things for me w/stub enabled again, but I did notice in the DPDK doc that we should set sysctl -w kernel.randomize_va_space=0, which mine was not by default - I think it was 2
[16:00:05] I think we'd be seeing a different error if KASLR were interfering (we set --base-virtaddr to try to work around that)
[17:40:13] drv, FYI I changed things around a little so that stub exits gracefully when told by autotest.sh, but that didn't make any difference - well, unless I did something wrong of course. Will chat more with ya about it tomorrow. Daughter turns 17 today, gotta run!
[17:40:20] cool
[18:15:54] how about having kill_stub remove the spdk map entries in the hugetlbfs mount?
[18:16:46] hmmmm - so one recent difference is that I changed the stub app to build on the event framework instead of just the env layer last week
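
To make the single-core comparison at 15:10-15:26 concrete, here is a hedged sketch of a one-job, multi-filename fio configuration pinned to core 0. The ioengine/filename syntax follows the SPDK fio_plugin examples as I understand them; the plugin path, PCI addresses, and block size are placeholders, not values from the log.

# Hypothetical single-core fio job: one job/thread at QD=128 driving two NVMe
# devices, pinned to core 0 via cpumask (as in the 15:26 setup, but with the
# filenames collapsed into a single job). Plugin path and PCI addresses are
# placeholders.
cat > single_core.fio <<'EOF'
[global]
ioengine=spdk
thread=1
rw=randread
bs=4k
iodepth=128
cpumask=0x1

[single-core]
filename=trtype=PCIe traddr=0000.01.00.0 ns=1
filename=trtype=PCIe traddr=0000.02.00.0 ns=1
EOF

LD_PRELOAD=./examples/nvme/fio_plugin/fio_plugin fio single_core.fio
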
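
And for the leftover-hugepage theory (14:12) plus the kill_stub cleanup idea at 18:15, a few quick checks one could run on the failing machine. The /dev/hugepages mount point and the spdk*map_* file pattern are assumptions; the real names depend on where hugetlbfs is mounted and which --file-prefix the stub uses.

# Is a stub (or any other primary process) still alive and holding hugepages?
ps aux | grep -E '[s]tub|[s]pdk'

# What does the kernel think is free right now?
grep . /sys/kernel/mm/hugepages/hugepages-*/free_hugepages

# Leftover per-process hugepage backing files in the hugetlbfs mount.
# Mount point and file pattern are assumptions - verify them before wiring
# anything like this into kill_stub.
ls -l /dev/hugepages/
rm -f /dev/hugepages/spdk*map_*

# The DPDK guide suggestion mentioned at 15:59 (ASLR off for multi-process):
sysctl -w kernel.randomize_va_space=0
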