[01:48:58] *** Joins: guerby (~guerby@april/board/guerby)
[04:01:15] *** Joins: felipef (~felipef@62.254.189.133)
[04:01:46] *** Quits: felipef (~felipef@62.254.189.133) (Read error: Connection reset by peer)
[04:13:42] *** Quits: mszwed (~mszwed@134.134.139.75) (Quit: WeeChat 1.9.1)
[05:12:42] *** Joins: mszwed (mszwed@nat/intel/x-fpommfrlztkhttat)
[07:50:16] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[07:52:17] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[07:52:48] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[09:27:09] *** Joins: travis-ci (~travis-ci@ec2-35-174-153-231.compute-1.amazonaws.com)
[09:27:10] (spdk/master) ut/bdev_mt: fix io_channel leak in "unregister_and_close" (Darek Stojaczyk)
[09:27:10] Diff URL: https://github.com/spdk/spdk/compare/1fb629c4d250...bb64a7e5118f
[09:27:10] *** Parts: travis-ci (~travis-ci@ec2-35-174-153-231.compute-1.amazonaws.com) ()
[09:27:24] *** Joins: travis-ci (~travis-ci@ec2-54-166-233-198.compute-1.amazonaws.com)
[09:27:25] (spdk/master) ocf: wait for pendings reqs in unregister (Vitaliy Mysak)
[09:27:25] Diff URL: https://github.com/spdk/spdk/compare/bb64a7e5118f...e862b6434aaf
[09:27:25] *** Parts: travis-ci (~travis-ci@ec2-54-166-233-198.compute-1.amazonaws.com) ()
[10:18:36] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[10:22:57] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Ping timeout: 245 seconds)
[11:21:10] *** Joins: mszwed_ (~yaaic@apn-31-0-34-203.dynamic.gprs.plus.pl)
[11:45:54] *** Quits: mszwed_ (~yaaic@apn-31-0-34-203.dynamic.gprs.plus.pl) (Quit: Yaaic - Yet another Android IRC client - http://www.yaaic.org)
[11:51:27] I dropped some more info into the github issue for the iscsi hang, looks like it's getting in a loop that avoids the logging that shuhei added
[11:54:42] It looks like line 3192 accidentally used < instead of <=
[11:54:46] i'm trying it with that switched
[12:00:16] i'm only inferring this because there's an if statement inside of the if from 3192 that checks the same values with `==`, despite that being dead code as a result of the check on `3192`
[12:00:24] assuming that the second check is correct because it matches the assert
[12:00:36] yay for changing code without understanding it
[12:01:18] hi jrlusby
[12:01:38] i saw your updates but haven't had a chance to look at them in much detail yet
[12:05:50] this is helpful info - i'm looking through this abort code now
[12:11:22] oh
[12:11:31] this might be an easy fix
[12:11:48] i'll have a patch to you in 5 minutes
[12:19:09] jrlusby: can you try https://review.gerrithub.io/c/spdk/spdk/+/450301
[12:19:52] oh okay
[12:19:54] i'll try that instead
[12:26:15] i'm not 100% confident that's going to fully fix it - this should get you out of the infinite loop, but i'm still not clear on how it can get to that point
[12:26:33] but please give it a shot
[12:33:06] ACK
[12:34:33] i'm just crazy excited to make progress
[12:37:08] it worked?
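For readers without the SPDK source open, here is a minimal sketch of the loop shape described at 11:54 and 12:00: an outer `<` comparison makes the inner `==` check unreachable, so a task whose offset already equals its transfer length is never removed from the queue and the abort path spins forever. All names below (struct, fields, function) are illustrative guesses, not the actual code in lib/iscsi, and the Gerrit patch referenced above reportedly fixes this by hoisting the inner check up one scope rather than changing the comparison.

```c
/*
 * Hypothetical reconstruction of the reported loop shape; not SPDK's
 * actual iscsi.c code.
 */
#include <stdint.h>
#include <sys/queue.h>

struct datain_task {
	uint32_t current_datain_offset;
	uint32_t transfer_len;
	TAILQ_ENTRY(datain_task) link;
};

TAILQ_HEAD(task_queue, datain_task);

void
abort_queued_datain_tasks(struct task_queue *queue)
{
	struct datain_task *task;

	while (!TAILQ_EMPTY(queue)) {
		task = TAILQ_FIRST(queue);
		/*
		 * With '<' here instead of '<=', a task whose offset already
		 * equals transfer_len never enters this block, is never
		 * removed, and the while loop spins forever -- the hang.
		 */
		if (task->current_datain_offset <= task->transfer_len) {
			if (task->current_datain_offset == task->transfer_len) {
				/* dead code under the original '<' condition:
				 * the transfer is already complete */
				TAILQ_REMOVE(queue, task, link);
				continue;
			}
			/* ... abort the remaining DataIn for this task ... */
			TAILQ_REMOVE(queue, task, link);
		}
	}
}

int main(void)
{
	struct task_queue queue = TAILQ_HEAD_INITIALIZER(queue);
	struct datain_task done = { .current_datain_offset = 1048576,
				    .transfer_len = 1048576 };

	TAILQ_INSERT_TAIL(&queue, &done, link);
	abort_queued_datain_tasks(&queue); /* hangs if '<=' reverts to '<' */
	return 0;
}
```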
[12:37:27] well
[12:37:29] the version I did
[12:37:35] where I changed just < to <=
[12:37:39] passed my first repro attempt
[12:37:45] but that's happened before
[12:37:57] it repros maybe 70% of the time
[12:38:09] and now i'm rolling out the version you made
[12:38:21] where you moved the inner if statement up one scope
[12:38:28] and going to repeat the test
[12:38:32] but so far so good
[12:50:51] the assert in that while loop
[12:51:05] that would trigger an abort, correct?
[12:51:23] we have a giant pile of iscsi_tgtd core files somewhere
[12:51:30] i can check if any of them are from that specific abort
[12:51:40] I don't think they are though because I would expect the repro to usually abort then
[12:51:58] (I'm thinking could the root cause be memory corruption of one of those two variables)
[12:52:49] we definitely don't have consistent repros for dumping core files though so yea...
[12:55:57] now i need to rework how I'm assessing if we've run into issues or not :/
[12:56:14] looks like the hang is fixed
[12:56:15] as expected
[13:01:43] great! how many times have you run it so far?
[13:04:55] twice
[13:04:59] running it for a third time now
[13:05:07] but only once with your version of the change in place
[13:27:35] still looking good?
[13:56:31] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[13:56:32] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Remote host closed the connection)
[13:57:39] bwalker: I noted with surprise and amusement the DPDK announcement that 18.08.1 was released. Still no word on 18.11.1 ;-)
[14:17:27] still looking good
[14:17:32] we're just doing more in-depth analysis on it
[14:35:38] it's looking real good now
[14:39:27] thanks! the details on where it was looping, plus the datain_offset and transfer_len values were a huge hint
[14:39:40] i just can't believe I didn't check those earlier
[14:39:50] it wasn't until shuhei's logs weren't printing that I even checked
[14:39:54] facepalm
[14:39:57] it's always obvious after the fact
[14:40:00] yeaaaa
[14:40:05] i'll probably be in touch
[14:40:10] because we've seen some crashes
[14:40:48] but for now it's looking like we might be able to start making progress without our test suites always failing for iscsi_tgtd related reasons
[14:45:56] is there an e-mail address i can use to add a Tested-by credit for you on this patch?
[15:01:15] *** Quits: sethhowe (~sethhowe@134.134.139.75) (Ping timeout: 255 seconds)
[15:01:18] *** Quits: MaZ- (~maz@dont.feed.the.lawlr.us) (Ping timeout: 246 seconds)
[15:01:19] *** Quits: pniedzwx (~pniedzwx@134.134.139.75) (Ping timeout: 246 seconds)
[15:16:21] jlusby@scalecomputing.com
[15:19:07] what is the i/o size you're using in your test?
[15:19:50] answering that with certainty would require that I dig into our load-profile vms and figure out how they're configured
[15:20:08] but I can give you the manual fio we're running that I'm told is based on the load we run through our vms
[15:20:30] https://www.irccloud.com/pastebin/zxK8xpmA/
[15:20:49] ok - they're obviously large i/o since they're going through the datain path and the data you showed was 1048576
[15:21:20] but it's possible the i/o was actually larger than that and it got cut off when the aborts happened
[15:21:33] aborts?
[15:21:40] ah - ok, 512B - 1MB
[15:21:42] are you talking about stack traces from the core files I've posted?
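On the question at 12:51: yes, a failed assert() in C prints the failing expression and calls abort(), which raises SIGABRT, so a core dumped by that specific assert would show abort() (and, on glibc, __assert_fail()) at the top of the crashing thread's stack. A standalone illustration, not SPDK code, reusing the 1048576 value mentioned above at 15:20:

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t current_datain_offset = 1048576; /* value reported in the hang */
	uint32_t transfer_len = 1048576;

	/* The invariant the inner '==' check was said to match: the offset
	 * must never exceed the transfer length. This assertion holds, so
	 * the program exits cleanly. */
	assert(current_datain_offset <= transfer_len);

	/* If the condition were the stricter 'current_datain_offset <
	 * transfer_len', this same state would fail the assert, raise
	 * SIGABRT, and (ulimit -c permitting) leave a core file. */
	return 0;
}
```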
[15:21:59] yeah
[15:22:56] so iSCSI has this concept of MaxRecvDataSegmentLength - it's basically a way for the initiator to say "don't send me more than this amount of data at a time"
[15:23:33] it's specified by the initiator?
[15:23:41] so that even if the initiator sends a 1MB read, the target will send that data back in smaller chunks (i.e. 64KB)
[15:23:43] yes
[15:23:45] aaah
[15:23:51] understood
[15:23:54] each of these chunks is a "DataIn PDU"
[15:24:49] sometimes we need to queue up some of these very large reads, so that one connection can't hog too many buffers at once - this is the queued_datain_tasks list
[15:25:11] when we get something like a LUN reset, we have to abort all of the existing I/O - including any of these queued large reads
[15:25:28] and that's what that function is doing - the one that i patched
[15:25:30] and was hung
[15:26:35] what i can't figure out, is how this function gets a task where the current_datain_offset == transfer_len - if this had already happened, it should have been removed from the list already
[15:27:23] If you want more help figuring out how, I'm not gonna disappear and I'm interested in getting to the bottom of it too
[15:27:51] I also wanted to attend the next bug scrub to possibly get a post mortem on the issue and any unanswered questions that may be left over
[15:28:23] next one is next tuesday morning 8AM Pacific time
[15:28:39] 👍
[15:44:52] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[15:52:46] we just saw one rpc timeout but the test wasn't running a version of our management daemon that leaves iscsi_tgtd in its bad state for triaging
[15:53:00] so it murdered the shit out of iscsi_tgtd and we don't know if it was just a spurious timeout or a hang
[15:53:10] digging into it
[15:59:23] do you know which rpc timed out?
[16:00:08] jimharris: Do you have much/any historical knowledge about the DPDK's implementation of determining primary vs. secondary?
[16:00:56] get_portal_groups get_target_nodes get_initiator_groups
[16:01:31] lhodev: it's based around the --file-prefix passed to rte_eal_init
[16:02:16] DPDK creates a shared memory file using this name, and puts the mappings in there for hugepage files and the addresses to map them to
[16:02:18] Oh yeah, I've come to see how it's doing it. It's that a group ran into an issue. It's arguable that their execution environment should be altered, but it was a weird edge case.
[16:02:41] you may want to try #dpdk
[16:03:20] The core of the issue is that one process -- the primary -- was started as id root, and then another process that is expected to run as secondary was started as non-root.
[16:03:43] that sounds like a problem
[16:04:27] The trouble is that DPDK's eal_create_runtime_dir() uses different logic -- analyzing the user id -- as part of crafting the directory name where it's gonna stuff the config file.
[16:06:59] jrlusby: i see some questionable locking (or rather lack of locking) in some of those RPC implementations
[16:07:11] Was curious how that came to be. If you're non-root (and there is no special env variable, XDG_RUNTIME_DIR, set), then it will default the starting point of the directory to /tmp. If you're root, then it defaults to /var/run.
[16:07:39] lhodev: i'm really not sure how that came to be
[16:08:12] No worries. Again, was just curious if you knew that history, the use of the user-id, special env variable, etc.
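Circling back to the DataIn explanation at 15:22-15:25, the sketch below shows how a single large read becomes a series of DataIn PDUs no bigger than the negotiated MaxRecvDataSegmentLength. The 64KB limit and the function are hypothetical stand-ins rather than SPDK's implementation, but they illustrate why a task still sitting on a queued_datain_tasks-style list with offset == transfer_len is surprising: at that point the transfer is done and the task should already have been taken off the list.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Example negotiated value; the real limit comes from iSCSI login. */
#define MAX_RECV_DATA_SEGMENT_LENGTH (64u * 1024u)

static void
send_datain_pdus(uint32_t transfer_len)
{
	uint32_t offset = 0;

	while (offset < transfer_len) {
		uint32_t seg = transfer_len - offset;

		if (seg > MAX_RECV_DATA_SEGMENT_LENGTH) {
			seg = MAX_RECV_DATA_SEGMENT_LENGTH;
		}
		/* one DataIn PDU carrying bytes [offset, offset + seg) */
		printf("DataIn PDU: offset=%" PRIu32 " length=%" PRIu32 "\n",
		       offset, seg);
		offset += seg;
	}
	/* offset == transfer_len here: the read is complete, so the task
	 * would normally be removed from the queued list at this point */
}

int main(void)
{
	send_datain_pdus(1024 * 1024); /* a 1MB read -> sixteen 64KB PDUs */
	return 0;
}
```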
[16:08:33] i try not to understand too much about how multi-process works ;-)
[16:08:40] You're a very wise man.
[16:10:33] before we jump to any conclusions
[16:10:48] let me get a repro where we don't actually immediately try to kill iscsi_tgtd
[16:11:00] so we can see if it's an issue with the rpcs or with the daemon
[16:13:02] sounds good - if you can repro such that you know the rpc timed out, i'd be curious if you can get a stack trace on iscsi_tgtd when that happens
[16:13:19] that's exactly what I'm planning on doing :)
[16:13:54] i got a bunch of tmux windows monitoring logs on every node in two different clusters currently running the test that my coworker ran when he got a few timeouts
[16:14:03] there is some synchronization that should be added to these RPCs, but i don't see yet how it could cause a problem
[16:14:03] all of them ready to beep at the first sign of an rpc timeout
[16:14:15] and i'll jump right in and pstack that node's iscsi_tgtd as soon as it happens
[16:14:21] and manually try querying with rpcs
[16:14:32] the sad thing is
[16:14:35] all this work for iscsi_tgtd
[16:14:50] is planned to get thrown away once we've stabilized the changes in our stack on top of it
[16:15:01] because this entire project is about optimizing our stack to take advantage of NVMe
[16:15:16] so we're going to replace the iscsi stack with nvme very next thing
[16:15:34] /shrug
[16:54:27] okay got the repro
[16:54:29] looks like a new issue
[16:54:53] https://www.irccloud.com/pastebin/2fuRJW2e/
[16:56:04] ` pthread_mutex_lock(&target->mutex);`
[16:56:26] let me gdb that mutex to figure out who has it
[16:56:29] yeah - now to figure out why it's still locked
[16:56:43] could you file a new issue in github for this?
[16:57:08] we can keep your first one focused on that hang in the iscsi datain abort code
[16:57:28] okay
[16:57:31] doing that right now
[17:05:25] https://github.com/spdk/spdk/issues/744
[17:05:56] posted a reply on github
[17:06:12] i'm heading out now - have a great weekend!
[17:06:46] Yabba, dabba, dooooo
[17:08:34] okay, I'll leave this be for now then, thank you so much and see you next week!
[20:08:53] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:09:21] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[20:11:24] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:11:54] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[20:13:57] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:14:27] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:02:34] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:03:02] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:05:03] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:05:34] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:07:37] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:08:08] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:10:10] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:10:41] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:12:43] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:13:14] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
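A note on the `pthread_mutex_lock(&target->mutex)` stack trace discussed at 16:56: on glibc, a default mutex records the owning thread's TID internally (the __data.__owner field), which is what inspecting the mutex in gdb reveals. The snippet below is only a hypothetical illustration of the most common way such a lock ends up permanently held (an error path that returns without unlocking); it is neither SPDK's actual code nor the confirmed root cause of issue #744.

```c
#include <errno.h>
#include <pthread.h>

struct target {
	pthread_mutex_t mutex;
	int num_luns;
};

int
add_lun(struct target *target, int lun_id)
{
	pthread_mutex_lock(&target->mutex);

	if (lun_id < 0) {
		/* BUG: returning while still holding the mutex; every later
		 * locker (e.g. an RPC handler such as get_target_nodes
		 * walking the target list) then blocks in
		 * pthread_mutex_lock() forever and the RPC times out. */
		return -EINVAL;
	}

	target->num_luns++;
	pthread_mutex_unlock(&target->mutex);
	return 0;
}
```

The usual fixes for this pattern are to unlock on every exit path or to narrow the critical section so no early return can escape while holding the lock.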
[22:27:41] Project autotest-nightly build #454: FAILURE in 27 min. See https://dqtibwqq6s6ux.cloudfront.net for results.
[22:36:13] Project autotest-nightly-failing build #320: STILL FAILING in 36 min. See https://dqtibwqq6s6ux.cloudfront.net for results.