[07:11:51] jimharris, so I just realized that if I run everything clean from master, I'm now back to hanging in the multi-process loop with a few instances of perf stuck in rte_spinlock_lock(), which was my original issue with these tests that went away with Gang's patch a little while ago...
[07:12:23] here's the bt for one of them:
[07:12:23] #0  0x000000000047cf19 in rte_spinlock_lock (sl=0xffffffa1c) at /home/peluse/spdk/dpdk/build/include/rte_spinlock.h:55
[07:12:24] #1  0x000000000047d536 in malloc_elem_free (elem=0x107b332fc0) at /home/peluse/spdk/dpdk/lib/librte_eal/common/malloc_elem.c:277
[07:12:24] #2  0x000000000047c8ca in rte_free (addr=0x107b333000) at /home/peluse/spdk/dpdk/lib/librte_eal/common/rte_malloc.c:61
[07:12:25] #3  0x000000000045998b in spdk_dma_free (buf=0x107b333000) at env.c:90
[07:12:27] #4  0x000000000041603b in unregister_controllers () at perf.c:1255
[07:12:29] #5  0x0000000000416d74 in main (argc=13, argv=0x7ffd97793148) at perf.c:1422
[07:13:00] here's the other:
[07:13:00] #0  0x000000000047cf19 in rte_spinlock_lock (sl=0xffffffa1c) at /home/peluse/spdk/dpdk/build/include/rte_spinlock.h:55
[07:13:00] #1  0x000000000047d536 in malloc_elem_free (elem=0x107a40dfc0) at /home/peluse/spdk/dpdk/lib/librte_eal/common/malloc_elem.c:277
[07:13:00] #2  0x000000000047c8ca in rte_free (addr=0x107a40e000) at /home/peluse/spdk/dpdk/lib/librte_eal/common/rte_malloc.c:61
[07:13:02] #3  0x000000000045998b in spdk_dma_free (buf=0x107a40e000) at env.c:90
[07:13:04] #4  0x0000000000445845 in nvme_pcie_qpair_destroy (qpair=0x107a5f1cd0) at nvme_pcie.c:1174
[07:13:06] #5  0x000000000044749c in nvme_pcie_ctrlr_delete_io_qpair (ctrlr=0x107c87a700, qpair=0x107a5f1cd0) at nvme_pcie.c:1499
[07:13:09] #6  0x0000000000432e30 in nvme_transport_ctrlr_delete_io_qpair (ctrlr=0x107c87a700, qpair=0x107a5f1cd0) at nvme_transport.c:158
[07:13:13] #7  0x000000000041b376 in spdk_nvme_ctrlr_free_io_qpair (qpair=0x107a5f1cd0) at nvme_ctrlr.c:236
[07:13:15] #8  0x00000000004119c1 in cleanup_ns_worker_ctx (ns_ctx=0x1fa1270) at perf.c:611
[07:13:17] #9  0x0000000000411dd2 in work_fn (arg=0x1f32cc0) at perf.c:663
[07:13:21] #10 0x0000000000416cfb in main (argc=13, argv=0x7ffec0888c18) at perf.c:1413
[13:08:21] heads up, I will be pausing the build pool for a moment to reconfigure the boot settings in wkb-freebsd-0.
[13:10:30] Q: why, in many cases in a good build.log, do we see the "no free hugepages" message but not the DPDK fatal message about not being able to init memory? For example:
[13:10:30] /home/sys_sgsw/build_pool/agent/repo/test/lib/bdev/bdev.conf
[13:10:30] Starting DPDK 17.05.0 initialization...
[13:10:30] [ DPDK EAL parameters: bdevtest -c 0x3 --file-prefix=spdk_pid80728 ]
[13:10:30] EAL: Detected 32 lcore(s)
[13:10:31] EAL: No free hugepages reported in hugepages-1048576kB
[13:10:42] but then that test runs just fine...
[13:11:38] peluse: that's normal - there are two possible hugepage sizes on x86, 2 MB and 1 GB
[13:11:53] the message you're seeing just says that there are no 1 GB hugepages available
[13:12:04] (we only set up 2 MB pages)
[13:12:17] ahh, OK. so that's just saying it couldn't get 1 GB pages?
[13:12:34] yeah, the fact that it is printed even when things are working is a little confusing, but it's normal
[13:13:10] got it, thanks. continuing to mess with the multi-process tests in nvme.sh. Man, they seem to be overly fragile
[13:13:24] but having fun :)
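A quick way to confirm the explanation above is to read the per-size hugepage pools the kernel exposes in sysfs. A minimal sketch, assuming a Linux host with hugepage support:

    #!/usr/bin/env bash
    # List every hugepage pool the kernel knows about. On a machine set up
    # the way described above, hugepages-1048576kB (1 GB) reports 0 pages,
    # which is all the harmless EAL message is saying.
    for pool in /sys/kernel/mm/hugepages/hugepages-*; do
        size=$(basename "$pool")
        total=$(cat "$pool/nr_hugepages")
        free=$(cat "$pool/free_hugepages")
        echo "$size: total=$total free=$free"
    done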
[13:16:45] Test pool is back online.
[14:39:22] my brain is offline.
[14:39:45] have you tried turning it off and on again?
[14:40:25] speaking of, I'm going to go get some coffee
[14:46:04] I might have to go straight for something similar but illegal...
[14:47:17] I didn't know they still made Four Loko ;)
[14:52:40] heh... messing with some of the sample apps in DPDK and had one problem after another on Fedora; compiled right away for me on Ubuntu
[15:47:32] bwalker, drv, not that it's a huge priority, but it's been open for a while... the README patch has all of the latest comments addressed ;)
[15:49:25] the mock macro deal is also up to date...
[16:01:58] so one of my problems in nvme.sh seems to be present in the build pool as well, but for some reason the test fails without bailing out of nvme.sh. See https://ci.spdk.io/builds/review/23a3261b95daede26193d9fef2385055a33a8e8e.1498509263/wkb-fedora-03/build.log
[16:02:46] it's that same while loop at the end, and I think the secondary is coming up before the primary is ready. I've seen that on mine, and it might explain why a delay between perf and identify fixes it for me
[16:03:12] so the first question, I guess, is why isn't the failure in the log above causing nvme.sh to bail out?
[16:03:35] search for "EAL: FATAL: Cannot init memory" and you'll find it quick
[16:06:58] it's not exiting the script because the programs are launched in the background (&), so 'set -e' doesn't affect them
[16:07:13] it does look like it is busted, though
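To illustrate the point about 'set -e' and background jobs: backgrounding a command always succeeds immediately, so the shell never sees the child's failure unless the script explicitly waits on it. A minimal, self-contained sketch (not taken from nvme.sh):

    #!/usr/bin/env bash
    set -e

    # Backgrounding a command returns status 0 right away, so even though
    # 'false' fails, 'set -e' has nothing to react to at this point.
    false &
    bgpid=$!

    # 'wait' reports the background job's real exit status; under 'set -e'
    # a non-zero status aborts the script here.
    wait $bgpid
    echo "only reached if the background job succeeded"

This is also why the 'wait $stubpid' change discussed below surfaces failures that would otherwise be lost.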
[16:07:41] jimharris: I mentioned earlier that async RPC responses should work, but actually I don't think they are correct in the current jsonrpc_server_tcp implementation
[16:08:09] yeah, I think so too. I'll keep poking at it; I have a few more trial things I want to do to get a better understanding of what's happening. will keep you posted
[16:08:35] jimharris: there is a single buffer per connection that is used for both the request and response, which can't be used for a response if we're in the middle of reading a request
[16:08:44] (or writing another unrelated response)
[16:09:01] peluse: I'll take a look at your patches in a bit too - just finishing up the vhost-blk review
[16:09:17] drv, thanks, no hurry, just didn't want them to be forgotten :)
[16:09:42] jimharris: but I do think we should make it support async responses - it will just need to allocate a buffer on the fly for responses, or something similar
[16:58:13] pretty sure what I mentioned above has been the root of all my problems. If we start any secondary and it comes up before the primary is ready, one of two or three bad things happens, and depending on which one, vtophys fails with the hugepage thing
[16:58:44] the wait for $stubpid is totally required, and without it, it's a race as to whether the first perf is primary or secondary
[16:59:17] ok, I think that wait $stubpid patch should definitely go in
[16:59:23] and without similar coordination between the perf and identify runs that happen in the background - or just delays like I've tested - the same thing happens, but the exact behavior depends on whether stub was still alive when the first perf ran or not
[16:59:42] I keep misreading stubpid as stupid
[16:59:48] LOL
[17:00:05] with all of these things in, it works every time for me
[17:00:13] with any one of them not in, it's intermittent at best
[17:00:33] how did you do the second one? is that just a fixed-length sleep?
[17:00:48] yeah, not super deterministic, but good enough to figure out what was happening
[17:01:15] we could probably wait for the /var/run/spdk_stub%d file to show up
[17:01:20] we can do something fancier, obviously, but I don't think we want to put too much engineering time into coordinating two of our example apps to run back to back :)
[17:01:24] but a sleep is a good starting point
[17:01:43] drv, yeah I'll try that tomorrow. That could work
[17:02:01] been testing all these on 2 machines, one Ubuntu and one Fedora, and now I can make them behave the same every time
[17:02:30] ...and it only took building and playing with the DPDK simple_mp example to figure it out :)
[17:03:26] drv, I guess I can look real quick, but do the perf and/or identify apps (or arb) drop a sentinel file like spdk_stub when they start? if not, that won't work
[17:04:11] oh yeah, the other apps don't have that equivalent file, so it won't work for multi_process in nvme.sh
[17:04:16] so a sleep is probably as good as it gets for now
[17:04:17] doesn't look like it, but easy enough to add
[17:04:35] either way :) tomorrow...
[17:04:39] sounds good
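For reference, the sentinel-file idea floated above could look something like the sketch below. The filename is hypothetical - the log only mentions a /var/run/spdk_stub%d pattern - and it assumes the primary process creates the file once its initialization is done:

    #!/usr/bin/env bash
    # Hypothetical sketch: block until the primary process drops its
    # sentinel file, so secondaries never start against a half-initialized
    # primary. Assumes /var/run/spdk_stub0 is created post-init.
    sentinel=/var/run/spdk_stub0
    timeout=10

    for ((i = 0; i < timeout * 10; i++)); do
        [ -e "$sentinel" ] && break
        sleep 0.1
    done

    if [ ! -e "$sentinel" ]; then
        echo "primary never came up; giving up" >&2
        exit 1
    fi

    # Safe to launch the secondary perf/identify processes now.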