[01:48:58] *** Joins: guerby (~guerby@april/board/guerby)
[04:01:15] *** Joins: felipef (~felipef@62.254.189.133)
[04:01:46] *** Quits: felipef (~felipef@62.254.189.133) (Read error: Connection reset by peer)
[04:13:42] *** Quits: mszwed (~mszwed@134.134.139.75) (Quit: WeeChat 1.9.1)
[05:12:42] *** Joins: mszwed (mszwed@nat/intel/x-fpommfrlztkhttat)
[07:50:16] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[07:52:17] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[07:52:48] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[09:27:09] *** Joins: travis-ci (~travis-ci@ec2-35-174-153-231.compute-1.amazonaws.com)
[09:27:10] (spdk/master) ut/bdev_mt: fix io_channel leak in "unregister_and_close" (Darek Stojaczyk)
[09:27:10] Diff URL: https://github.com/spdk/spdk/compare/1fb629c4d250...bb64a7e5118f
[09:27:10] *** Parts: travis-ci (~travis-ci@ec2-35-174-153-231.compute-1.amazonaws.com) ()
[09:27:24] *** Joins: travis-ci (~travis-ci@ec2-54-166-233-198.compute-1.amazonaws.com)
[09:27:25] (spdk/master) ocf: wait for pendings reqs in unregister (Vitaliy Mysak)
[09:27:25] Diff URL: https://github.com/spdk/spdk/compare/bb64a7e5118f...e862b6434aaf
[09:27:25] *** Parts: travis-ci (~travis-ci@ec2-54-166-233-198.compute-1.amazonaws.com) ()
[10:18:36] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[10:22:57] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Ping timeout: 245 seconds)
[11:21:10] *** Joins: mszwed_ (~yaaic@apn-31-0-34-203.dynamic.gprs.plus.pl)
[11:45:54] *** Quits: mszwed_ (~yaaic@apn-31-0-34-203.dynamic.gprs.plus.pl) (Quit: Yaaic - Yet another Android IRC client - http://www.yaaic.org)
[11:51:27] I dropped some more info into the github issue for the iscsi hang, looks like it's getting in a loop that avoids the logging that shuhei added
[11:54:42] It looks like line 3192 accidentally used < instead of <=
[11:54:46] i'm trying it with that switched
[12:00:16] i'm only inferring this because there's an if statement inside of the if from 3192 that checks the same values with `==`, despite that being dead code as a result of the check on `3192`
[12:00:24] assuming that the second check is correct because it matches the assert
[12:00:36] yay for changing code without understanding it
[12:01:18] hi jrlusby
[12:01:38] i saw your updates but haven't had a chance to look at them in much detail yet
[12:05:50] this is helpful info - i'm looking through this abort code now
[12:11:22] oh
[12:11:31] this might be an easy fix
[12:11:48] i'll have a patch to you in 5 minutes
[12:19:09] jrlusby: can you try https://review.gerrithub.io/c/spdk/spdk/+/450301
[12:19:52] oh okay
[12:19:54] i'll try that instead
[12:26:15] i'm not 100% confident that's going to fully fix it - this should get you out of the infinite loop, but i'm still not clear on how it can get to that point
[12:26:33] but please give it a shot
[12:33:06] ACK
[12:34:33] i'm just crazy excited to make progress
[12:37:08] it worked?
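For readers without the SPDK source open, here is a minimal sketch of the loop shape described at 11:54 and 12:00: an outer `<` comparison makes the inner `==` check unreachable, so a task whose offset already equals its transfer length is never removed from the queue and the abort path spins forever. All names below (struct, fields, function) are illustrative guesses, not the actual code in lib/iscsi, and the Gerrit patch referenced above reportedly fixes this by hoisting the inner check up one scope rather than changing the comparison.

```c
/*
 * Hypothetical reconstruction of the reported loop shape; not SPDK's
 * actual iscsi.c code.
 */
#include <stdint.h>
#include <sys/queue.h>

struct datain_task {
	uint32_t current_datain_offset;
	uint32_t transfer_len;
	TAILQ_ENTRY(datain_task) link;
};

TAILQ_HEAD(task_queue, datain_task);

void
abort_queued_datain_tasks(struct task_queue *queue)
{
	struct datain_task *task;

	while (!TAILQ_EMPTY(queue)) {
		task = TAILQ_FIRST(queue);
		/*
		 * With '<' here instead of '<=', a task whose offset already
		 * equals transfer_len never enters this block, is never
		 * removed, and the while loop spins forever -- the hang.
		 */
		if (task->current_datain_offset <= task->transfer_len) {
			if (task->current_datain_offset == task->transfer_len) {
				/* dead code under the original '<' condition:
				 * the transfer is already complete */
				TAILQ_REMOVE(queue, task, link);
				continue;
			}
			/* ... abort the remaining DataIn for this task ... */
			TAILQ_REMOVE(queue, task, link);
		}
	}
}

int main(void)
{
	struct task_queue queue = TAILQ_HEAD_INITIALIZER(queue);
	struct datain_task done = { .current_datain_offset = 1048576,
				    .transfer_len = 1048576 };

	TAILQ_INSERT_TAIL(&queue, &done, link);
	abort_queued_datain_tasks(&queue); /* hangs if '<=' reverts to '<' */
	return 0;
}
```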
[12:37:27] well
[12:37:29] the version I did
[12:37:35] where I changed just < to <=
[12:37:39] passed my first repro attempt
[12:37:45] but that's happened before
[12:37:57] it repros maybe 70% of the time
[12:38:09] and now i'm rolling out the version you made
[12:38:21] where you moved the inner if statement up one scope
[12:38:28] and going to repeat the test
[12:38:32] but so far so good
[12:50:51] the assert in that while loop
[12:51:05] that would trigger an abort, correct?
[12:51:23] we have a giant pile of iscsi_tgtd core files somewhere
[12:51:30] i can check if any of them are from that specific abort
[12:51:40] I don't think they are though because I would expect the repro to usually abort then
[12:51:58] (I'm thinking could the root cause be memory corruption of one of those two variables)
[12:52:49] we definitely don't have consistent repros for dumping core files though so yea...
[12:55:57] now i need to rework how I'm assessing if we've run into issues or not :/
[12:56:14] looks like the hang is fixed
[12:56:15] as expected
[13:01:43] great! how many times have you run it so far?
[13:04:55] twice
[13:04:59] running it for a third time now
[13:05:07] but only once with your version of the change in place
[13:27:35] still looking good?
[13:56:31] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[13:56:32] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Remote host closed the connection)
[13:57:39] bwalker: I noted with surprise and amusement the DPDK announcement that 18.08.1 was released. Still no word on 18.11.1 ;-)
[14:17:27] still looking good
[14:17:32] we're just doing more in-depth analysis on it
[14:35:38] it's looking real good now
[14:39:27] thanks! the details on where it was looping, plus the datain_offset and transfer_len values were a huge hint
[14:39:40] i just can't believe I didn't check those earlier
[14:39:50] it wasn't until shuhei's logs weren't printing that I even checked
[14:39:54] facepalm
[14:39:57] it's always obvious after the fact
[14:40:00] yeaaaa
[14:40:05] i'll probably be in touch
[14:40:10] because we've seen some crashes
[14:40:48] but for now it's looking like we might be able to start making progress without our test suites always failing for iscsi_tgtd related reasons
[14:45:56] is there an e-mail address i can use to add a Tested-by credit for you on this patch?
[15:01:15] *** Quits: sethhowe (~sethhowe@134.134.139.75) (Ping timeout: 255 seconds)
[15:01:18] *** Quits: MaZ- (~maz@dont.feed.the.lawlr.us) (Ping timeout: 246 seconds)
[15:01:19] *** Quits: pniedzwx (~pniedzwx@134.134.139.75) (Ping timeout: 246 seconds)
[15:16:21] jlusby@scalecomputing.com
[15:19:07] what is the i/o size you're using in your test?
[15:19:50] answering that with certainty would require that I dig into our load-profile vms and figure out how they're configured
[15:20:08] but I can give you the manual fio we're running that I'm told is based on the load we run through our vms
[15:20:30] https://www.irccloud.com/pastebin/zxK8xpmA/
[15:20:49] ok - they're obviously large i/o since they're going through the datain path and the data you showed was 1048576
[15:21:20] but it's possible the i/o was actually larger than that and it got cut off when the aborts happened
[15:21:33] aborts?
[15:21:40] ah - ok, 512B - 1MB
[15:21:42] are you talking about stack traces from the core files I've posted?
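On the question at 12:51: yes, a failed assert() in C prints the failing expression and calls abort(), which raises SIGABRT, so a core dumped by that specific assert would show abort() (and, on glibc, __assert_fail()) at the top of the crashing thread's stack. A standalone illustration, not SPDK code, reusing the 1048576 value mentioned above at 15:20:

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t current_datain_offset = 1048576; /* value reported in the hang */
	uint32_t transfer_len = 1048576;

	/* The invariant the inner '==' check was said to match: the offset
	 * must never exceed the transfer length. This assertion holds, so
	 * the program exits cleanly. */
	assert(current_datain_offset <= transfer_len);

	/* If the condition were the stricter 'current_datain_offset <
	 * transfer_len', this same state would fail the assert, raise
	 * SIGABRT, and (ulimit -c permitting) leave a core file. */
	return 0;
}
```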
[15:21:59] yeah
[15:22:56] so iSCSI has this concept of MaxRecvDataSegmentLength - it's basically a way for the initiator to say "don't send me more than this amount of data at a time"
[15:23:33] it's specified by the initiator?
[15:23:41] so that even if the initiator sends a 1MB read, the target will send that data back in smaller chunks (i.e. 64KB)
[15:23:43] yes
[15:23:45] aaah
[15:23:51] understood
[15:23:54] each of these chunks is a "DataIn PDU"
[15:24:49] sometimes we need to queue up some of these very large reads, so that one connection can't hog too many buffers at once - this is the queued_datain_tasks list
[15:25:11] when we get something like a LUN reset, we have to abort all of the existing I/O - including any of these queued large reads
[15:25:28] and that's what that function is doing - the one that i patched
[15:25:30] and was hung
[15:26:35] what i can't figure out, is how this function gets a task where the current_datain_offset == transfer_len - if this had already happened, it should have been removed from the list already
[15:27:23] If you want more help figuring out how, I'm not gonna disappear and I'm interested in getting to the bottom of it too
[15:27:51] I also wanted to attend the next bug scrub to possibly get a post mortem on the issue and any unanswered questions that may be left over
[15:28:23] next one is next tuesday morning 8AM Pacific time
[15:28:39] 👍
[15:44:52] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[15:52:46] we just saw one rpc timeout but the test wasn't running a version of our management daemon that leaves iscsi_tgtd in its bad state for triaging
[15:53:00] so it murdered the shit out of iscsi_tgtd and we don't know if it was just a spurious timeout or a hang
[15:53:10] digging into it
[15:59:23] do you know which rpc timed out?
[16:00:08] jimharris: Do you have much/any historical knowledge about the DPDK's implementation of determining primary vs. secondary?
[16:00:56] get_portal_groups get_target_nodes get_initiator_groups
[16:01:31] lhodev: it's based around the --file-prefix passed to rte_eal_init
[16:02:16] DPDK creates a shared memory file using this name, and puts the mappings in there for hugepage files and the addresses to map them to
[16:02:18] Oh yeah, I've come to see how it's doing it. It's that a group ran into an issue. It's arguable that their execution environment should be altered, but it was a weird edge case.
[16:02:41] you may want to try #dpdk
[16:03:20] The core of the issue is that one process -- the primary -- was started as id root, and then another process that is expected to run as secondary was started as non-root.
[16:03:43] that sounds like a problem
[16:04:27] The trouble is that DPDK's eal_create_runtime_dir() uses different logic -- analyzing the user id -- as part of crafting the directory name where it's gonna stuff the config file.
[16:06:59] jrlusby: i see some questionable locking (or rather lack of locking) in some of those RPC implementations
[16:07:11] Was curious how that came to be. If you're non-root (and there is no special env variable, XDG_RUNTIME_DIR, set), then it will default the starting point of the directory to /tmp. If you're root, then it defaults to /var/run.
[16:07:39] lhodev: i'm really not sure how that came to be
[16:08:12] No worries. Again, was just curious if you knew that history, the use of the user-id, special env variable, etc.
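Circling back to the DataIn explanation at 15:22-15:25, the sketch below shows how a single large read becomes a series of DataIn PDUs no bigger than the negotiated MaxRecvDataSegmentLength. The 64KB limit and the function are hypothetical stand-ins rather than SPDK's implementation, but they illustrate why a task still sitting on a queued_datain_tasks-style list with offset == transfer_len is surprising: at that point the transfer is done and the task should already have been taken off the list.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Example negotiated value; the real limit comes from iSCSI login. */
#define MAX_RECV_DATA_SEGMENT_LENGTH (64u * 1024u)

static void
send_datain_pdus(uint32_t transfer_len)
{
	uint32_t offset = 0;

	while (offset < transfer_len) {
		uint32_t seg = transfer_len - offset;

		if (seg > MAX_RECV_DATA_SEGMENT_LENGTH) {
			seg = MAX_RECV_DATA_SEGMENT_LENGTH;
		}
		/* one DataIn PDU carrying bytes [offset, offset + seg) */
		printf("DataIn PDU: offset=%" PRIu32 " length=%" PRIu32 "\n",
		       offset, seg);
		offset += seg;
	}
	/* offset == transfer_len here: the read is complete, so the task
	 * would normally be removed from the queued list at this point */
}

int main(void)
{
	send_datain_pdus(1024 * 1024); /* a 1MB read -> sixteen 64KB PDUs */
	return 0;
}
```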
[16:08:33] i try not to understand too much about how multi-process works ;-)
[16:08:40] You're a very wise man.
[16:10:33] before we jump to any conclusions
[16:10:48] let me get a repro where we don't actually immediately try to kill iscsi_tgtd
[16:11:00] so we can see if it's an issue with the rpcs or with the daemon
[16:13:02] sounds good - if you can repro such that you know the rpc timed out, i'd be curious if you can get a stack trace on iscsi_tgtd when that happens
[16:13:19] that's exactly what I'm planning on doing :)
[16:13:54] i got a bunch of tmux windows monitoring logs on every node in two different clusters currently running the test that my coworker ran when he got a few timeouts
[16:14:03] there is some synchronization that should be added to these RPCs, but i don't see yet how it could cause a problem
[16:14:03] all of them ready to beep at the first sign of an rpc timeout
[16:14:15] and i'll jump right in and pstack that node's iscsi_tgtd as soon as it happens
[16:14:21] and manually try querying with rpcs
[16:14:32] the sad thing is
[16:14:35] all this work for iscsi_tgtd
[16:14:50] is planned to get thrown away once we've stabilized the changes in our stack on top of it
[16:15:01] because this entire project is about optimizing our stack to take advantage of NVMe
[16:15:16] so we're going to replace the iscsi stack with nvme very next thing
[16:15:34] /shrug
[16:54:27] okay got the repro
[16:54:29] looks like a new issue
[16:54:53] https://www.irccloud.com/pastebin/2fuRJW2e/
[16:56:04] ` pthread_mutex_lock(&target->mutex);`
[16:56:26] let me gdb that mutex to figure out who has it
[16:56:29] yeah - now to figure out why it's still locked
[16:56:43] could you file a new issue in github for this?
[16:57:08] we can keep your first one focused on that hang in the iscsi datain abort code
[16:57:28] okay
[16:57:31] doing that right now
[17:05:25] https://github.com/spdk/spdk/issues/744
[17:05:56] posted a reply on github
[17:06:12] i'm heading out now - have a great weekend!
[17:06:46] Yabba, dabba, dooooo
[17:08:34] okay, I'll leave this be for now then, thank you so much and see you next week!
[20:08:53] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:09:21] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[20:11:24] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:11:54] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[20:13:57] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[20:14:27] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:02:34] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:03:02] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:05:03] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:05:34] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:07:37] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:08:08] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:10:10] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:10:41] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
[21:12:43] *** Quits: lhodev (~lhodev@inet-hqmc06-o.oracle.com) (Remote host closed the connection)
[21:13:14] *** Joins: lhodev (~lhodev@inet-hqmc06-o.oracle.com)
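A note on the `pthread_mutex_lock(&target->mutex)` stack trace discussed at 16:56: on glibc, a default mutex records the owning thread's TID internally (the __data.__owner field), which is what inspecting the mutex in gdb reveals. The snippet below is only a hypothetical illustration of the most common way such a lock ends up permanently held (an error path that returns without unlocking); it is neither SPDK's actual code nor the confirmed root cause of issue #744.

```c
#include <errno.h>
#include <pthread.h>

struct target {
	pthread_mutex_t mutex;
	int num_luns;
};

int
add_lun(struct target *target, int lun_id)
{
	pthread_mutex_lock(&target->mutex);

	if (lun_id < 0) {
		/* BUG: returning while still holding the mutex; every later
		 * locker (e.g. an RPC handler such as get_target_nodes
		 * walking the target list) then blocks in
		 * pthread_mutex_lock() forever and the RPC times out. */
		return -EINVAL;
	}

	target->num_luns++;
	pthread_mutex_unlock(&target->mutex);
	return 0;
}
```

The usual fixes for this pattern are to unlock on every exit path or to narrow the critical section so no early return can escape while holding the lock.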
[22:27:41] Project autotest-nightly build #454: FAILURE in 27 min. See https://dqtibwqq6s6ux.cloudfront.net for results.
[22:36:13] Project autotest-nightly-failing build #320: STILL FAILING in 36 min. See https://dqtibwqq6s6ux.cloudfront.net for results.