[00:21:13] *** Joins: mayank (6a3324e8@gateway/web/freenode/ip.106.51.36.232)
[00:22:32] Hello
[00:22:32] I am using spdk memory allocation with my application.
[00:23:59] what i observe is, if i request for memory with alignment, it is taking too much time to allocate aligned memory
[00:24:24] i tried by allocating more hugepages (only 1G hugepage), but same problem
[00:24:55] Samples: 1M of event 'cycles:ppp', Event count (approx.): 772049010511
             Overhead  Command  Shared Object      Symbol
               33.47%  fio      fio                [.] malloc_elem_free
               33.07%  fio      fio                [.] malloc_heap_alloc
                6.14%  fio      fio                [.] malloc_elem_can_hold
                4.91%  fio      libzpool.so.2.0.0  [.] init_
[00:24:55] snapshot from perf report
[00:25:12] 33.47% fio fio [.] malloc_elem_free
[00:25:17] 33.07% fio fio [.] malloc_heap_alloc
[00:25:20] 6.14% fio fio [.] malloc_elem_can_hold
[00:26:10] what could be the reason?
[00:53:04] *** Quits: Aneesh (31cf3740@gateway/web/freenode/ip.49.207.55.64) (Ping timeout: 260 seconds)
[01:32:08] *** Quits: sethhowe (sethhowe@nat/intel/x-tejkjkniqfbwehhb) (Remote host closed the connection)
[01:32:26] *** Joins: sethhowe (~sethhowe@134.134.139.72)
[02:40:24] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[02:59:04] *** Quits: mayank (6a3324e8@gateway/web/freenode/ip.106.51.36.232) (Ping timeout: 260 seconds)
[03:24:37] mayank: I guess it could be related to some per-thread cache
[03:24:56] are you allocating memory from SPDK-initialized threads?
[03:27:00] memory allocation from non-SPDK threads might get stuck on some mutex
[03:27:40] it's just a guess though - I haven't seen the allocation code
[04:36:25] 10:53 -!- mayank [6a3324e8@gateway/web/freenode/ip.106.51.36.232] has quit [Ping timeout: 260 seconds]
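[editor's note] The hotspots in the perf snapshot above (malloc_heap_alloc / malloc_elem_free) appear to be DPDK's rte_malloc heap internals, which run on every aligned allocation. Independent of the root cause discussed here, one way to keep that cost out of the hot path is to allocate aligned DMA buffers once at startup and reuse them. A minimal sketch using spdk_dma_malloc(); the buffer count, size, and alignment below are made-up values, not taken from this discussion:

    /* Sketch: pre-allocate a fixed set of aligned DMA buffers at init time
     * instead of calling spdk_dma_malloc() per I/O.
     * BUF_COUNT / BUF_SIZE / BUF_ALIGN are hypothetical values. */
    #include "spdk/env.h"

    #define BUF_COUNT 256           /* hypothetical queue depth */
    #define BUF_SIZE  (128 * 1024)  /* hypothetical I/O size */
    #define BUF_ALIGN 4096          /* hypothetical alignment requirement */

    static void *g_bufs[BUF_COUNT];

    static int
    preallocate_aligned_bufs(void)
    {
        for (int i = 0; i < BUF_COUNT; i++) {
            /* Pay the heap-walk cost once here, not in the I/O path. */
            g_bufs[i] = spdk_dma_malloc(BUF_SIZE, BUF_ALIGN, NULL);
            if (g_bufs[i] == NULL) {
                return -1;
            }
        }
        return 0;
    }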
[10:51:45] jimharris: I looked at the thin provision review. It has a power fail safety hole
[10:55:32] yeah - so this gets fixed in a later patch but we can have him move it up to here
[10:55:44] in which patch?
[10:55:53] one sec
[10:56:15] the next patch actually
[10:56:23] that one is totally wrong too though
[10:56:38] you do want to claim the cluster right away (in memory)
[10:56:44] to prevent other threads from claiming it
[10:57:11] it does
[10:57:12] the only part that's right about it is the atomic compare and swap on completion to make sure another thread didn't race and win
[10:57:46] oh I see the update map applies to the local blob metadata
[10:57:48] not the global mask
[10:58:23] there is no atomic compare and swap in that patch though - that was the one piece of feedback I provided on it
[10:58:29] yeah
[10:58:48] since these two patches are contiguous we could ask mszwed to just squash them
[10:59:01] I don't think they're separable without introducing a temporary bug
[10:59:22] I'll add a note
[10:59:38] *** Joins: lhodev (~Adium@inet-hqmc02-o.oracle.com)
[11:00:35] I had suggested originally to make it a different patch - but with the refactoring we've done that second patch is so small now I think merging is the better choice
[11:01:07] I almost think that instead of a flag to say whether to update the cluster map
[11:01:20] they should look at just moving the code to update the cluster map outside of that function
[11:01:24] to all of the call sites
[11:01:30] not sure how many there are or how bad it would be
[11:01:42] but that's just code organization - other than that it looks fine
[11:01:51] (once it has the atomic)
[11:02:20] overall i'm pretty happy with how it has turned out
[11:02:43] like you said there are some code organization things we can go back and look at but i think we've kept the changes fairly localized
[11:46:34] now that I get through more of it I agree
[11:46:45] it's looking fine - just need to merge first two and swap order of operations
[11:47:05] and later find some elegant way to deal with the channel issue
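[editor's note] A rough illustration of the "claim the cluster in memory right away, then compare-and-swap on completion" pattern discussed above. This is not the blobstore patch itself; g_cluster_map and CLUSTER_UNALLOCATED are hypothetical names used only to show the shape of the atomic check:

    #include <stdbool.h>
    #include <stdint.h>

    #define CLUSTER_UNALLOCATED 0

    /* Hypothetical per-blob logical-to-physical cluster map. */
    extern uint64_t g_cluster_map[];

    static bool
    claim_cluster(uint64_t cluster_idx, uint64_t newly_allocated)
    {
        /* Install the newly allocated cluster only if the slot is still
         * unallocated; if another thread raced and won, the caller should
         * release its allocation and use the winner's cluster instead. */
        return __sync_bool_compare_and_swap(&g_cluster_map[cluster_idx],
                                            CLUSTER_UNALLOCATED,
                                            newly_allocated);
    }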
[11:57:52] Are there any plans for adding a fault injection framework into the spdk?
[11:58:31] it depends what you mean by that
[11:58:48] do you want to randomly fail syscalls like malloc? Or do you want to randomly inject disk errors?
[11:59:08] (and maybe not randomly)
[12:02:37] NVMe/disk errors.
[12:04:11] so we already have a bdev module that can inject errors at that level
[12:04:30] you can stack it on top of any "real" bdev, and then inject different types of errors via RPC at any time
[12:05:29] Are they block read/write errors, or are they nvme commands being failed?
[12:05:42] it's inside of the block stack in SPDK
[12:05:47] so block errors
[12:06:12] it tests anything above it in the stack - so other logical bdevs, iscsi, nvmf, etc.
[12:06:19] it does not test the low level NVMe driver's handling
[12:06:43] Would it be useful to have a facility to create errors at a lower level?
[12:06:59] probably - do you have a strategy in mind?
[12:07:14] right now we just unit test those sort of errors
[12:07:39] someone could also use a PCIe analyzer that is NVMe aware - I believe you can buy those
[12:12:36] I picked up the hello_world/error displaying "Low Hanging Fruit" item from the trello board to get familiar with the spdk dev process, and I'm trying to figure out how to test it :) Looks like an nvme fault injection mechanism is being added to the kernel driver.
[12:14:51] you mean the "Decode IO errors in NVMe hello_world example"?
[12:15:00] yes
[12:15:04] ah - that's a good question
[12:15:32] that one is even simpler I think - it doesn't want you to actually figure out how to inject errors
[12:15:41] just add code showing how you'd parse an error to the hello_world example
[12:15:52] but I think Alan wants to make sure he did it right :)
[12:16:36] there is no programmatic way to really do that today - a quick and dirty test would be to modify the hello_world app locally to use an invalid LBA instead of LBA 0
[12:16:47] correct :) But an error injection facility could be an interesting project.
[12:17:28] it would - something that could insert itself way down at the nvme completion handler, for example
[12:17:59] did you end up writing some utility functions to parse errors? expanding the very simple spdk_nvme_cpl_is_error macro for example?
[12:18:02] without impacting performance of course (maybe it's a build time option to enable it)
[12:18:32] is that question to me or AlanP_?
[12:18:38] AlanP_
[12:20:05] a function that could parse the 16 byte completion and print a string would be really cool
[12:20:14] and then you could just unit test that
[12:20:29] we already have it in lib/nvme/qpair.c - get_status_string()
[12:20:35] I'd like to see something like that (fault injection at the lower level) too, and would chime in with my preference for making it a build-time option to avoid compromising performance. There may even be some open source application-level fault injection frameworks rather than rolling our own.
[12:20:36] it just needs to be exposed through the public API
[12:23:08] I just added code to hello_world.c to parse and display the error. To test it I just "corrupted" the nvme command to force an error to be returned by the device.
[12:23:09] injecting the error is easy - determining when and where to inject it is the trickier part - probably an operation done on an nvme qpair
[12:25:59] nvme_pcie_qpair_complete_tracker - all completions go through this function - i'll bet this could even be compiled in by default and not impact performance
[12:26:40] I'll take a look at this.
[12:27:26] great!
[12:28:30] you'll see some bools in spdk_nvme_qpair already which are touched in the fast I/O path - so adding a bool denoting whether error injection is enabled should add little to no overhead, especially if you wrap the check with spdk_unlikely()
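[editor's note] A minimal sketch of the kind of completion decoding discussed above for the hello_world example. spdk_nvme_cpl_is_error(), spdk_unlikely(), and the sct/sc/cid fields of struct spdk_nvme_cpl are existing SPDK definitions; the callback itself and its output format are illustrative only, not the code Alan submitted:

    #include <stdio.h>
    #include "spdk/likely.h"
    #include "spdk/nvme.h"

    static void
    io_complete(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        (void)arg;

        if (spdk_unlikely(spdk_nvme_cpl_is_error(cpl))) {
            /* The 16-byte completion carries a status code type (sct) and
             * status code (sc) that identify what went wrong. */
            fprintf(stderr, "I/O failed: sct=0x%x sc=0x%x cid=%u\n",
                    cpl->status.sct, cpl->status.sc, cpl->cid);
            return;
        }
        printf("I/O completed successfully\n");
    }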
[12:30:24] sethhowe: it looks like fedora-05 and fedora-08 are missing qemu-img - can we install the normal qemu package, or will that conflict with our custom qemu build for vhost?
[12:33:30] alternatively, why isn't the qemu-img etc. from our custom qemu installed in the path?
[12:34:53] drv/bwalker: can one of you +2 https://review.gerrithub.io/#/c/395865/ again?
[12:35:17] done
[12:35:22] -2!
[12:35:31] jk, done
[12:36:53] i get a -2 for my review of that patch
[12:37:44] drv: trying the bdevperf microbenchmark with LTO enabled
[12:39:30] cool, hopefully we get some free performance :)
[12:41:31] seems to be in the noise
[12:41:54] is the binary larger?
[12:47:21] if it's not, then the compiler didn't seem to think there were any good LTO candidates
[12:48:02] linker rather
[12:48:15] I'm not sure inline hinting works for LTO
[12:48:24] or if we have any LTO functions in mind even
[12:49:00] some of the stuff in env probably is a good target
[12:49:29] but when we moved env to a library from just a header, I didn't measure any negative performance impact
[12:49:32] something like spdk_vhost_gpa_to_vva() defined in lib/vhost/vhost.c is a good target
[12:49:34] so that matches up with what you are seeing
[12:50:14] do we do -O3 today?
[12:50:19] no, just -O2
[12:50:31] crank it up to -Oamillion
[12:50:46] I doubt it will make much difference - I think -O3 mostly turns on loop unrolling and things like that
[12:51:39] although it might increase inlining limits and things like that; worth a try
[13:21:30] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[13:43:19] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[13:57:58] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[14:16:51] looking at rte_mbuf some more
[14:16:59] for the DPDK compressdev API
[14:17:22] it's two cachelines, but only the first line would/should be touched for compression operations
[14:17:44] so we grab one per iov, fill in the phys addr
[14:17:47] and we obviously don't have to use an mbuf pool for all allocations
[14:18:05] we just need a pool of mbufs sufficient to cover the requested queue depth
[14:18:10] we can do something similar to bdev_io - we have a big pool, but then each channel has its own TAILQ to optimize alloc/dealloc
[14:18:11] jimharris: do you need this [TEST] patch? https://review.gerrithub.io/#/c/396281/
[14:18:13] or maybe to cover the optimal queue depth to the hardware
[14:18:19] drv: no
[14:18:31] seems to be empty somehow too :)
[14:18:31] i was testing if you can push a patch to gerrithub with no code changes
[14:32:46] bwalker: i'm thinking that in terms of compression, the extra rte_mbuf overhead may not be terrible
[14:34:40] we could probably make a test bdev module to estimate the overhead
[15:00:41] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Read error: Connection reset by peer)
[15:00:58] drv, yeah I can repro that bdev ASAN error on my system pretty easily, will take a closer look
[15:01:05] cool
[15:01:29] *** Quits: sethhowe (~sethhowe@134.134.139.72) (Remote host closed the connection)
[15:01:45] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[15:12:06] *** Joins: sethhowe (~sethhowe@192.55.54.42)
[15:12:39] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[15:50:42] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[15:53:46] *** Joins: nvme-newbie (d05b0202@gateway/web/freenode/ip.208.91.2.2)
[15:54:11] Hi
[15:55:35] i am not sure if this is the right forum.. i am looking to configure the nvmeof target to have an ipv6 address.. i have been able to do initiator - target communication with an ipv4 address.. "Listen RDMA " in the nvmf.conf file..
[16:02:44] nvme-newbie: currently, the SPDK nvmf_tgt only supports IPv4 - there is a hard-coded inet_addr() call in lib/nvmf/rdma.c
[16:03:01] this needs to be fixed to use getaddrinfo(), which would allow IPv6 support as well
[16:03:31] Got it.. Thanks @drv
[16:05:19] I've added a Trello card to track this: https://trello.com/c/ktKfL95s
[16:09:08] awesome.. thanks..
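[editor's note] A sketch of the getaddrinfo()-based resolution suggested above, which accepts both IPv4 and IPv6 literals. This is not the actual lib/nvmf/rdma.c change; the function name and the use of AI_NUMERICHOST (assuming listen addresses are given as literals in the config file) are assumptions:

    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    static int
    resolve_listen_addr(const char *addr, const char *service,
                        struct sockaddr_storage *out, socklen_t *out_len)
    {
        struct addrinfo hints, *res;
        int rc;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;                   /* IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_NUMERICHOST | AI_PASSIVE;  /* literal listen address */

        rc = getaddrinfo(addr, service, &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo(%s): %s\n", addr, gai_strerror(rc));
            return -1;
        }

        memcpy(out, res->ai_addr, res->ai_addrlen);
        *out_len = res->ai_addrlen;
        freeaddrinfo(res);
        return 0;
    }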
[16:09:22] Are the "SPDK Automated Test System" (ci.spdk.io/builds/….) links resolving? I tried to go back and look at, granted, an older build, and got a '404'. And so I peeked at a more recently updated review on gerrithub, albeit not one I "owned", but when I attempted to follow its build link, I once again got a '404'.
[16:10:01] lhodev: we have tweaked the code that cleans up old builds recently - if it's a review that has already been merged, or if it's older than 30 days, it will get deleted
[16:10:23] but if it's not one of those, maybe there was a problem copying a particular build?
[16:10:52] I attempted to examine one today that, if I'm reading it correctly, appears to have been time stamped today at 1:11pm.
[16:11:48] Here's the Gerrit change: https://review.gerrithub.io/#/c/363114/
[16:12:31] The Automated Test System (Patch Set 15) build link is: https://ci.spdk.io/builds/review/ef3cf9b019e09ac74da764883828c2abd94fae30.1516914244
[16:12:55] hmm, yeah, that seems to have failed to copy for some reason
[16:13:45] *** Quits: nvme-newbie (d05b0202@gateway/web/freenode/ip.208.91.2.2) (Quit: Page closed)
[16:15:55] The most recent (per my last check) change, https://review.gerrithub.io/#/c/395556/, on the other hand does have a build link therein that does resolve.
[16:15:57] *** Quits: boutcher (~boutcher@66.113.132.66) (Quit: Textual IRC Client: www.textualapp.com)
[17:13:31] Hi Jim, I may not understand your concern about unlink correctly and want to confirm my understanding.
[17:13:42] I think duplicate binding to a unix domain socket among multiple applications is not possible.
[17:13:54] hence even if iscsi-tgt and nvmf-tgt run concurrently, only one application can use a given unix domain socket.
[17:14:05] What you are concerned about -> is the possible race condition such that -> -> -> or -> -> ->
[17:14:14] Correct?
[17:15:15] hi Shuhei
[17:15:32] i agreed with you, that application should not unlink the rpc socket if it already exists
[17:16:01] we try to unlink it on app exit
[17:16:10] sigkill can obviously not unlink it
[17:16:12] I like the idea of adding a lock file
[17:16:18] but we could catch sigsegv
[17:16:45] hmmm
[17:17:05] that article Shuhei linked makes it look not too hard
[17:17:24] and the kernel will release the lock if the process crashes, so we can differentiate between "another process is using this socket" vs. "a previous instance crashed"
[17:17:27] makes what not look too hard? adding a lock file, or catching sigsegv?
[17:17:31] adding a lock file
[17:17:42] i will need to look at that
[17:17:53] yeah - we could have /var/tmp/spdk.sock and /var/tmp/spdk.sock.lock
[17:18:03] right
[17:19:13] oh yeah - i like shuhei's idea much better
[17:19:41] Hi Jim Harris, thanks - dual iscsi-tgt and nvmf-tgt will become an interesting solution.
[17:20:22] we are also talking about running iscsi-tgt and nvmf-tgt in the same application - we have made lots of improvements to make this possible
[17:20:53] so have spdk-tgt, which could do iscsi, nvme-of, and/or vhost
[17:21:13] Thank you, sounds great, I'll try the locking file idea if you are OK
[17:21:29] maybe you could ask gang to try the locking file idea?
[17:22:11] or you could do it - but i am sure gang would be happy to do it with your direction
[17:23:52] Sorry, of course - I intended it for Gang or myself, and I will talk with Gang.
[17:24:22] And Gang caught this issue officially, so Gang will do the best fix.
[17:25:03] I will update the result of this chat to Gerrit and wait for Gang's feedback.
[17:25:19] thank you!
[17:25:32] Gang and I have communicated well with QoS already.
[17:25:55] Hence we will be able to do a good job on this too.
[17:26:02] you will be able to meet him at the spdk summit in beijing
[17:26:55] Oh, I'll be happy to meet with Gang, thank you.
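[editor's note] A sketch of the lock-file idea discussed above, assuming a companion path such as /var/tmp/spdk.sock.lock; this is not the change that eventually landed in SPDK. flock() is released by the kernel when the holding process exits or crashes, which is what lets a new instance tell a live socket owner apart from a stale socket:

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Returns an fd that must stay open for the lifetime of the application,
     * or -1 if another live process already owns the RPC socket. */
    static int
    claim_rpc_socket_lock(const char *lock_path)
    {
        int fd = open(lock_path, O_CREAT | O_RDONLY, 0600);

        if (fd < 0) {
            return -1;
        }

        /* Non-blocking exclusive lock: if a running iscsi_tgt/nvmf_tgt holds
         * it, fail instead of unlinking that process's socket.  If the lock
         * is free but the socket file exists, a previous instance crashed
         * and the stale socket can be unlinked safely. */
        if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
            close(fd);
            return -1;
        }

        return fd;
    }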
[17:31:18] Was looking at the spdk.io page where I remembered reading previously of the "SPDK's First Developer Meetup", but I don't see anything about an "spdk summit in beijing". Where can I read more details about that?
[17:32:27] hi lhodev - let me find a link for you
[17:39:57] Thanks Jim. I just googled "spdk summit beijing" which returned: https://www.bagevent.com/event/1177885. Did I overlook a ref to the summit on spdk.io? Didn't see anything on the home page or the blog or news subpages.
[17:45:18] hi lance - no, you didn't miss it - an announcement will be going out very shortly
[17:46:29] but you did find the correct link
[17:48:31] Thx Jim. Just wanting to keep abreast of such events, especially if they're in the U.S. where I might have greater opportunity of attending. Bummed I missed out on the meetup in AZ a few months ago.
[17:56:16] nothing confirmed yet - but there should be a similar summit/meetup in the US, probably Q2 this year
[18:48:02] *** Quits: lhodev (~Adium@inet-hqmc02-o.oracle.com) (Quit: Leaving.)
[18:50:54] QQ: what does "part" stand for in the bdev layer (as in spdk_bdev_part_base, etc.)?
[19:01:48] *** Joins: ziyeyang_ (ziyeyang@nat/intel/x-skgnqjsvlrmkubil)
[20:19:20] *** Quits: ziyeyang_ (ziyeyang@nat/intel/x-skgnqjsvlrmkubil) (Ping timeout: 252 seconds)
[20:53:15] *** Joins: ziyeyang_ (ziyeyang@nat/intel/x-aeosrebecqlnbknv)
[23:51:23] *** Joins: ziyeyang__ (ziyeyang@nat/intel/x-twrixpddsybjddal)
[23:51:23] *** Quits: ziyeyang_ (ziyeyang@nat/intel/x-aeosrebecqlnbknv) (Remote host closed the connection)
[23:53:58] *** Quits: ziyeyang__ (ziyeyang@nat/intel/x-twrixpddsybjddal) (Remote host closed the connection)
[23:54:21] *** Joins: ziyeyang_ (ziyeyang@nat/intel/x-pdphfilzwfiktvqc)