[00:20:00] *** Quits: lhodev (~lhodev@66-90-218-190.dyn.grandenetworks.net) (Ping timeout: 246 seconds)
[00:24:52] *** Joins: lhodev (~lhodev@66-90-218-190.dyn.grandenetworks.net)
[00:29:20] *** Quits: lhodev (~lhodev@66-90-218-190.dyn.grandenetworks.net) (Ping timeout: 244 seconds)
[00:30:54] *** Joins: lhodev (~lhodev@66-90-218-190.dyn.grandenetworks.net)
[03:35:21] <darsto> bwalker: please see lhodev's comments in https://review.gerrithub.io/c/spdk/spdk/+/447851 ([TEST] SPDK 18.10.1 against DPDK 18.11 + one patch)
[04:31:00] *** Joins: travis-ci (~travis-ci@ec2-3-95-160-2.compute-1.amazonaws.com)
[04:31:01] <travis-ci> (spdk/master) test/vhost: Reduce time needed to wait for vm boot in vhost tests. (Pawel Kaminski)
[04:31:01] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/092360ce49fc...8228a208a67d
[04:31:01] *** Parts: travis-ci (~travis-ci@ec2-3-95-160-2.compute-1.amazonaws.com) ()
[05:06:19] *** Joins: travis-ci (~travis-ci@ec2-54-162-255-218.compute-1.amazonaws.com)
[05:06:20] <travis-ci> (spdk/master) env_dpdk: Run DPDK in legacy memory mode through spdk_env_opts (Shuhei Matsumoto)
[05:06:21] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/8228a208a67d...3c4199d6c631
[05:06:21] *** Parts: travis-ci (~travis-ci@ec2-54-162-255-218.compute-1.amazonaws.com) ()
[05:06:53] *** Joins: travis-ci (~travis-ci@ec2-3-88-3-71.compute-1.amazonaws.com)
[05:06:54] <travis-ci> (spdk/master) bdev/ftl: treat null UUID as no UUID (Konrad Sztyber)
[05:06:55] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/3c4199d6c631...02b0230296c3
[05:06:55] *** Parts: travis-ci (~travis-ci@ec2-3-88-3-71.compute-1.amazonaws.com) ()
[08:16:29] *** Joins: felipef (~felipef@62.254.189.133)
[09:58:34] <bwalker> lhodev: Does ./configure --without-igb-uio-driver not work in disabling the build of the DPDK kernel modules?
[09:58:48] <bwalker> that's the SPDK configure script
[09:59:01] <lhodev> The more pressing problem is the KNI module.   My compile fails.
[09:59:02] <bwalker> oh I guess you're doing vanilla DPDK
[09:59:35] <bwalker> so the way the DPDK build system works is you can use one of the default config files
[09:59:37] <lhodev> Not vanilla DPDK, per se.   I tried building with the unique branch you created in our fork of the DPDK.
[09:59:39] <bwalker> or you can make your own
[10:00:05] <bwalker> SPDK makes its own, but if you build against the vanilla DPDK (which is more or less what my branch is)
[10:00:15] <bwalker> it's on by default I think
[10:00:20] <bwalker> I'll put in a quick patch that turns it off
[10:02:24] <darsto> we modified the default config in our DPDK fork
[10:02:37] <bwalker> oh - I'll find that patch
[10:03:08] <bwalker> I see it
[10:03:55] <lhodev> in both config/common_base and config/common_linuxapp
[10:04:24] <darsto> lhodev - is is DPDK build that's failing for you?
[10:04:30] <darsto> is it*
[10:04:32] <lhodev> darsto:  yes
[10:04:56] <darsto> so KNI module and anything related to it require the kernel headers
[10:05:01] <lhodev> Right
[10:05:26] <lhodev> I then added a 'DPDK_OPTS += CONFIG_RTE_KNI_KMOD=n' to the SPDK's dpdkbuild/Makefile
[10:05:37] <darsto> if you don't have them, then DPDK will just fail to compile by default with a very ambiguous error message
[10:05:40] <lhodev> Unsure if that was entirely necessary, but it certainly enabled me to get the build done.
[10:09:20] <darsto> okay - so what are the next steps here?
[10:09:36] <bwalker> I'm updating the DPDK branch he's working from now
[10:09:40] <lhodev> As for SPDK-originated dpdk commit 754c3db, it appeared this is necessary so that we can make it all the way through the tests.   Passing all the tests is essential for me to communicate to folks in my company to give them confidence to sign off on the validation.
[10:11:35] <darsto> hmm, dpdk-18.11.1-oracle contains just 18.11 + our custom patches, not 18.11.1
[10:11:52] <bwalker> yes DPDK 18.11.1 isn't out yet
[10:11:57] <bwalker> or wasn't, when I made that
[10:12:00] <lhodev> Correct.   That's because -- well, last time I checked -- upstream DPDK still had not yet tagged 18.11.1.
[10:12:18] <bwalker> so we're going for DPDK 18.11 plus the minimum number of patches just to get this out the door
[10:12:21] <darsto> i see
[10:14:57] <bwalker> https://review.gerrithub.io/c/spdk/spdk/+/447851
[10:14:58] <bwalker> ok updated
[10:15:03] <bwalker> we'll see how this looks
[10:15:12] <bwalker> it's 18.10.2 against a vanilla DPDK 18.11 + 3 patches
[10:22:42] <lhodev> bwalker:   In the series you just uploaded, I see that config/common_linuxapp still has:  'CONFIG_RTE_KNI_KMOD=y'; i.e. not commented out.   I'm concerned that may still try to build and thus break, right?
[10:26:47] <lhodev> bwalker:  yup, it fails to build for me.
[10:27:11] <lhodev> == Build drivers
[10:27:20] <lhodev> make: *** /lib/modules/4.14.14-6.el7uek.lghdebug.x86_64/build: No such file or directory.  Stop.
[10:27:29] <lhodev> make[7]: *** [rte_kni.ko] Error 2
[10:35:17] <bwalker> ok one sec
[10:39:44] <bwalker> try now
[10:40:14] <lhodev> Will do
[10:40:16] <lhodev> Thx
[10:50:57] <lhodev> bwalker:   Still dies.   Your patch commented out LIBRTE_.*KNI.    We need to comment out CONFIG_RTE_KNI_KMOD
[10:52:37] <bwalker> crap
[10:54:05] <bwalker> that's fixed now
[10:54:21] <lhodev> Fetching and retrying.....
[10:58:45] <lhodev> Aaargh.   Compile failed, but in a different way.
[11:00:31] <lhodev> dpdk/drivers/crypto/scheduler/scheduler_pmd.c:11:10: fatal error: rte_reorder.h: No such file or directory
[11:00:31] <lhodev>  #include <rte_reorder.h>
[11:00:31] <lhodev>           ^~~~~~~~~~~~~~~
[11:00:32] <lhodev> compilation terminated.
[11:01:38] <bwalker> I probably need to comment out crypto too
[11:02:24] <bwalker> we have these all commented out in the spdk fork, but they were done in separate patches
[11:02:27] <bwalker> I need to pull each of them over
[11:03:03] <lhodev> Dang.   I'm sorry that this peel-the-onion has been such a slog.
[11:09:35] <darsto> bwalker: you want me to take over?
[11:09:52] <darsto> I can disable all those option and test the build locally
[11:18:24] <bwalker> please do
[11:27:10] <lhodev> darsto:   Thanks for the help.   It's appx 1:30pm in my timezone, and so I'm going to run out for lunch.   Back in a little while.
[11:28:52] *** Quits: felipef (~felipef@62.254.189.133) (Remote host closed the connection)
[11:29:54] <darsto> kk. i need to finish up one small thing first
[11:56:19] <darsto> done
[12:20:59] <bwalker> nice work - https://review.gerrithub.io/c/spdk/spdk/+/447851
[12:21:17] <lhodev> darsto: bwalker:   Built for me, *and* I note per Gerrit that it appears to have passed all the tests!   *happy dance*
[13:11:16] <lhodev> The link I had for the SPDK's IRC log (https://ci.spdk.io/irclog/) is complaining about an invalid certificate.   Is there a new link?
[13:14:09] <bwalker> not today, but there will be shortly
[13:14:28] <bwalker> I told the DNS company to delete ci.spdk.io on Friday
[13:14:32] <bwalker> so of course they deleted it today
[13:14:50] <bwalker> it doesn't matter much - we're all converted over except the irc log links
[13:16:01] <lhodev> Where will the IRC logs live?   AWS?
[13:16:07] <bwalker> yeah
[13:16:09] <bwalker> S3 bucket
[13:17:07] <lhodev> My DNS dropped on me a couple times in the past hour or so hence I was gonna snoop at the IRC logs to see if I had missed anything.
[13:17:39] <lhodev> DNS?   I meant VPN.    *sheesh*   Life in the world of acronyms/abbreviations....
[13:22:09] *** bwalker sets mode: -o bwalker
[13:25:12] *** peluse sets mode: +o bwalker
[13:25:21] *** bwalker changes topic to 'Storage Performance Development Kit - https://spdk.io/. This channel is logged at https://ci.spdk.io/irclog/'
[13:25:28] *** bwalker changes topic to 'Storage Performance Development Kit - https://spdk.io/. This channel is logged at https://dqtibwqq6s6ux.cloudfront.net/irclog/index.html'
[13:25:35] <bwalker> there we go
[13:40:40] <lhodev> Cool.  Thanks!
[13:45:39] <lhodev> bwalker:   I assume darsto is done for the night.   Can we move forward to merge 447851 and tag 18.10.2 ?
[13:46:57] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[13:47:35] <darsto> I honestly wasn't sure if 447851 should be merged
[13:48:52] <bwalker> I wasn't going to merge 447851 - that was just a patch to confirm the scenario you need
[13:48:53] <darsto> adding support for a newer DPDK version is one one thing, and updating the submodule another
[13:49:11] <bwalker> right now the tip of 18.10.2 points at the spdk fork of DPDK 18.11.x
[13:49:30] <bwalker> that final patch is just to force the tests to run to confirm the way you want to package it also works
[13:49:50] <bwalker> which is SPDK 18.10.2 against vanilla DPDK 18.11 + a couple patches
[13:50:16] <bwalker> we can just tag 18.10.2 now that we've confirmed it
[13:50:42] <bwalker> I don't know if we need to do some internal release process
[13:50:51] <bwalker> usually we have to trigger some extra scans and stuff
[13:51:02] <bwalker> so it might need to be tomorrow
[13:51:10] <bwalker> they won't find any problems of course
[13:51:16] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Ping timeout: 250 seconds)
[13:51:19] <lhodev> But doesn't SPDK 18.10.2 contains some 30 patches to facilitate its ability to work with a DPDK 18.11.x ?
[13:52:53] <bwalker> it does
[13:53:18] <bwalker> we're using the exact same SPDK code for both scenarios
[13:53:33] <bwalker> https://review.gerrithub.io/c/spdk/spdk/+/448379 proves it works against the SPDK fork of DPDK 18.11.x
[13:53:52] <bwalker> and https://review.gerrithub.io/c/spdk/spdk/+/448379 proves it works against the exact code you'd like to package for Oracle as DPDK 18.11
[13:54:01] <lhodev> Oh, I misunderstood.  I thought you were stating that the only diff between SPDK 18.10.1 and 18.10.2 currently was an updated ptr to the special fork branch of dpdk.
[13:54:14] <bwalker> ah yeah
[13:54:36] <bwalker> the scans won't find anything because I know what the scans are looking for - not because we didn't make changes
[13:55:54] <bwalker> also all the changes we made are just cherry-picks of other already released patches
[13:56:04] <lhodev> I had been so focused on that last commit for updating the dpdk hash that I neglected to take note that all of the (29?) SPDK-commits for 18.10.2 had been merged
[13:56:15] <bwalker> yep
[13:56:24] <bwalker> we're basically waiting on internal tool process stuff and it's out
[13:56:40] <bwalker> Tomek can probably tag during EU business hours tomorrow
[13:58:31] <lhodev> Ok, then I'll hang tight for that tag and then do a pull to sync up our copy which will enable me to kick off our package build.
[14:03:43] <darsto> wait, we don't have tomek for this entire week
[14:04:06] <peluse> who approved that?? :)
[14:04:13] <lhodev> (chuckle)
[14:04:45] <lhodev> I assume someone else can do the tag business, huh?
[14:04:51] <bwalker> oh yeah he's on vacation
[14:05:08] <bwalker> either darsto or I can - I just have to figure out how he runs the scans
[14:05:16] <bwalker> not sure if you know darsto
[14:05:18] <darsto> I have no idea
[14:05:33] <darsto> not even where to start
[14:12:04] <bwalker> I think I can do it
[14:19:26] <jimharris> looks like EU is going to abolish daylight savings time?
[14:20:08] <jimharris> all of the Intel SPDK folks would never have to worry about time changes ;)
[14:24:19] <bwalker> hopefully the whole world sees the "light" soon enough
[14:24:29] <lhodev> Ouch
[14:24:49] <lhodev> I want to retain daylight-savings all the time.   Dispense with "standard" time.
[14:26:05] <lhodev> Didn't California recently pass that?
[14:41:17] <jrlusby> okay so forgive any ignorance on my part, I'm just hopping into figuring out this issue with iscsi_tgtd and im not super familiar with spdk in general, but here goes my question
[14:42:24] <jrlusby> we have some tests that do rude things to clusters like kick drives from the kernel and iscsi_tgtd seems to be hanging sometimes in these tests
[14:42:42] <jrlusby> pstack of hung iscsi https://www.irccloud.com/pastebin/GfDmwxwR/
[14:44:39] <jrlusby> I could use some help getting to the bottom of why its hanging and how to prevent it and minimize downtime
[14:45:31] <bwalker> is this the latest version of spdk?
[14:45:39] <jrlusby> we initially noticed this because our rpc.py calls would all start hanging once iscsi_tgtd got into a bad state, and we put together some hacky babysitting in another service to check for these commands hanging and kill iscsi_tgtd when that happens which mostly recovers it
[14:45:43] <jrlusby> we're on 19.01 i believe
[14:45:47] <bwalker> I think it's tripping a problem in DPDK's hotplug handling
[14:46:13] <jimharris> "kick drives from the kernel" -> can you explain exactly what this means?
[14:47:42] <jrlusby> i can do my best but these are parts of the system that I've not messed with much personally, let me dig into one of the tests and try to figure out exactly how its working
[14:48:26] <jrlusby> and in general im not 100% clear on whats causing the hang, it having to do with fiddling with drives is just our best guess based on what tests seem to run into this problem
[14:51:52] <jimharris> knowing what the test is doing would help in this case - entering this callstack that you've posted is related to a LOGICAL_UNIT_RESET on one of the iSCSI LUNs - would be good to know more about what's causing that LUN reset
[14:52:08] <jrlusby> something along the lines of injecting errors into a scsi sense buffer
[14:52:26] <jimharris> for example - is it an explicit reset sent by one of your tests?
[14:52:55] <jimharris> interesting - which bdev modules are you using as the LUN backing storage?
[14:53:34] <jrlusby> let me see if i can figure that out
[15:02:37] <jrlusby> aio
[15:02:46] <jimharris> ok
[15:03:11] <jimharris> how are errors injected into the sense buffer?
[15:04:15] <jimharris> and any idea how it is "kicking drives from the kernel"?  is it literally deleting the kernel block device while the iscsi_tgtd is using it?
[15:04:36] <jimharris> that's a valid test case that SPDK should handle - just trying to understand if that's what's being done in your test
[15:04:50] <jrlusby> understood
[15:05:11] <jrlusby> doing my best to explain but I've kinda black boxed all that stuff mentally up until now
[15:06:04] <jrlusby> https://www.irccloud.com/pastebin/OP6MpCsq/
[15:06:17] <jrlusby> aaand
[15:06:23] <jrlusby> https://www.irccloud.com/pastebin/Qyachrrd/
[15:08:30] <jrlusby> i dont know if this explains it well at all
[15:10:26] <jrlusby> I dont think this is the one that kicks it from the kernel
[15:10:28] <jrlusby> lets not get confused here
[15:10:38] <jrlusby> let me checkout the test that specifically mentions kicking drives from the kernel
[15:11:41] <jimharris> sorry - in a meeting now, will take a look afterwards (30 minutes or so)
[15:11:45] <jrlusby> the one that removes from the kernel is way easier to understand
[15:11:51] <jrlusby> https://www.irccloud.com/pastebin/b1a3VL1u/
[15:26:00] <jrlusby> so yea I definitely misrepresented the first test, that one is replicating io errors, which is causing our system to decide that a drive has gone bad and evac it, which is probably somewhere else kicking the drive out or something, most likely by directly telling iscsi that it should remove that drive, which is what is then causing the hang
[15:34:41] <jimharris> well that last one is certainly pretty clear
[15:35:04] <jimharris> how reproducible is this hang, when the backing kernel block device is deleted like this?
[15:39:43] <jrlusby> its not too difficult to reproduce with the tests but I've not yet figured out a good way to manually reproduce it quickly
[15:40:43] <jrlusby> im gonna try futzing with the above device delete but I still need to get my head around how the iscsi_tgtd management code was written
[15:40:54] <jrlusby> our code not the spdk stuff*
[15:42:09] <lhodev> bwalker:  Per our previous conversation about spdk_nvme_ctrlr_get_num_ns(), I am updating the doxygen comments in nvme.h.   Related, I believe the comments for spdk_nvme_ctrlr_get_ns() are incorrect in that it states "There will never be any gaps in the numbering."   Right?   Presuming so, I'll fix that too.
[15:42:26] <jimharris> i think i understand your test case enough to help guide someone from our team on repro - would you mind filing an issue on github for this?  https://github.com/spdk/spdk/issues - don't worry about filling out all of the template information, a quick synopsis of what you've described here will be sufficient
[15:42:57] <jimharris> we'll discuss at our next spdk bug scrub meeting
[15:43:15] <jimharris> once it's in github, it will get broader attention
[15:45:20] <jrlusby> will do
[15:45:45] <bwalker> well, it depends on what you mean by "gaps"
[15:46:00] <bwalker> let me go read the comment real quick
[15:47:03] <bwalker> so I think there aren't any gaps in the sense that you can call that function with any ns_id from 1 to the total number of namespaces and it's going to return to you a valid struct spdk_nvme_ns * pointer
[15:47:18] <bwalker> that pointer may just be pointing at a namespace that is "inactive"
[15:47:26] <lhodev> Remember how we came up with that example if one had MaxNamespaces set to say 10, but someone instantiated a new namespace with id 4.   Hence 1, 2 and 3 would be "inactive".
[15:47:44] <lhodev> So you can call spdk_nvme_getrlr_get_ns() on an INactive namespace ?
[15:47:51] <bwalker> I'll check
[15:48:13] <bwalker> if it returns NULL, then we should fix the comment
[15:49:53] <bwalker> yeah I think it's going to return a valid pointer
[15:50:33] <lhodev> Really?   I thought that's where those "other" guys' code failed...
[15:51:56] <bwalker> I can't remember at the moment - but looking at the function we store the namespaces as an array of struct spdk_nvme_ns
[15:51:58] <lhodev> Snooping through the uses of spdk_nvme_ctrlr_get_ns() in the SPDK code base, I consistently see it paired with spdk_nvme_ctrlr_get_first_active_ns() and spdk_nvme_ctrlr_get_next_active_ns() hence the assumption.
[15:52:01] <bwalker> not pointers - full structs
[15:52:14] <bwalker> and all the function does is return an index into that array
[15:52:17] *** Joins: travis-ci (~travis-ci@ec2-3-95-205-53.compute-1.amazonaws.com)
[15:52:18] <travis-ci> (spdk/master) test: remove duplicate fio.py script file (Karol Latecki)
[15:52:18] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/02b0230296c3...b6abc16b0501
[15:52:18] *** Parts: travis-ci (~travis-ci@ec2-3-95-205-53.compute-1.amazonaws.com) ()
[15:52:24] <bwalker> if you go beyond max nsid, it will return null
[15:54:11] <peluse> jimharris, do you think I should rebase my entire chain with the tip of your or wait til its merged and just rebase on master?
[15:54:13] <lhodev> Right.   I guess the question then becomes if that array entry is bounds but is associated with an INactive namespace, is there something in the entry that indicates that, or can it be garbage, stale, etc.
[15:54:46] <jimharris> peluse: good question - thinking...
[15:56:23] <jimharris> i'd say leave them as-is for now - i'm guessing most of my patches will get a second +2 by tomorrow and then you can rebase from master
[15:56:35] <peluse> OK, cool. thanks
[15:56:45] <jimharris> your patches are near the top of my review list so hopefully we can get at least the first couple of yours in shortly after that
[15:57:11] <peluse> thanks, I believe last time I checked (yesterday) all comments have been addressed
[15:57:57] *** Joins: travis-ci (~travis-ci@ec2-23-20-87-228.compute-1.amazonaws.com)
[15:57:58] <travis-ci> (spdk/master) ocf: Added zeroing memory returned from mempool (Michal Mielewczyk)
[15:57:58] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/b6abc16b0501...b949f1318ef9
[15:57:58] *** Parts: travis-ci (~travis-ci@ec2-23-20-87-228.compute-1.amazonaws.com) ()
[15:58:21] <lhodev> bwalker:   The enum spdk_nvme_ns_flags don't appear to reflect a value for whether the entry is unallocated.
[15:58:49] *** Joins: travis-ci (~travis-ci@ec2-3-95-160-2.compute-1.amazonaws.com)
[15:58:50] <travis-ci> (spdk/master) rpc: add get_spdk_version rpc method (Chunyang Hui)
[15:58:50] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/b949f1318ef9...38902a5a270b
[15:58:50] *** Parts: travis-ci (~travis-ci@ec2-3-95-160-2.compute-1.amazonaws.com) ()
[16:00:52] <lhodev> Perhaps one should use spdk_nvme_ns_is_active() to test for that?
[16:01:18] <bwalker> yes that's the function that tests for that
[16:01:28] <bwalker> but if we can make the API here more intuitive I'm open to changes
[16:10:50] <jimharris> thanks jrlusby for the github issue - i'm responding to shuhei's comment now
[16:23:31] <peluse> its been suggested that we include the following compiler options for SPDK. They seem to work OK on my dev system, curious if anyone has looked at these before or has an opinion...
[16:23:33] <peluse> -fno-strict-overflow tells the compiler NOT to assume that signed overflow does not occur.
[16:23:33] <peluse> -fno-delete-null-pointer-checks tells the compiler NOT to assume that null pointer deference does not exist.
[16:23:33] <peluse> -fwrapv tells the compiler that signed overflow always wraps.
[16:24:59] <jimharris>  let's have vishal run his nvme performance stress test with those options to see if there's any performance impact
[16:25:19] <jimharris> can you sync up with him?
[16:29:51] <bwalker> can you first confirm that those aren't the defaults?
[16:30:01] <peluse> yup on both accounts
[16:31:12] <bwalker> you're confirming that those are NOT the default settings?
[16:31:15] <bwalker> just to be clear
[16:32:59] *** Joins: travis-ci (~travis-ci@ec2-100-25-216-123.compute-1.amazonaws.com)
[16:33:00] <travis-ci> (spdk/master) blobstore: switch to spdk_*malloc(). (Darek Stojaczyk)
[16:33:00] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/38902a5a270b...530f481259ee
[16:33:00] *** Parts: travis-ci (~travis-ci@ec2-100-25-216-123.compute-1.amazonaws.com) ()
[16:34:33] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[16:35:24] *** Joins: travis-ci (~travis-ci@ec2-3-88-3-71.compute-1.amazonaws.com)
[16:35:25] <travis-ci> (spdk/master) ocf: batched request processing in queue poller (Vitaliy Mysak)
[16:35:25] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/530f481259ee...2c651f101cb3
[16:35:25] *** Parts: travis-ci (~travis-ci@ec2-3-88-3-71.compute-1.amazonaws.com) ()
[16:48:00] <jrlusby> @jimharris  i succeeded in reproducing the failure with the steps you described
[16:49:17] <jrlusby> not sure if its worth mentioning in the github issue that it doesn't always happen, first time I tried the steps spdk survived unscathed and I brought the disk back in and everybody was happy
[16:49:29] <jrlusby> second attempt caused a hang though
[16:49:53] <jimharris> definitely add a note in the github issue
[16:50:04] <jimharris> what kind of fio workload were you running when you deleted the block device?
[16:53:25] <jrlusby> https://www.irccloud.com/pastebin/YdmoR9jv/
[16:57:33] <jrlusby> I didnt follow the steps... "exactly"
[16:58:53] <jrlusby> I went through a few extra layers we have for our product rather than turning off our daemons and trying to connect to iscsi directly, though I mostly did it this way because I don't know the commands to mess with iscsi directly but I already know how to do it through our io stack
[17:38:33] <jrlusby> @Shuhei we can also talk here if you prefer
[17:49:56] <Shuhei> jimharris, jrlusby:
[17:50:08] <jimharris> hi Shuhei!
[17:50:40] <jimharris> unfortunately, i will be signing off here very shortly
[17:50:48] <Shuhei> I created aio bdev by pc.py construct_aio_bdev /dev/sdd aio0 and then echo 1 > /sys/block/sdd
[17:51:21] <Shuhei> No,
[17:51:39] <Shuhei> echo 1 > /sys/block/sdd/device/delete
[17:51:58] <Shuhei> I'm not sure this is correct operation.
[17:52:05] <Shuhei> during FIO runs.
[17:52:29] <Shuhei> I saw the FIO error but there was no hung.
[17:53:02] <Shuhei> I looked the configuration jrlusby posted.
[17:53:41] <Shuhei> major difference may be iodepth of FIO (8 and 1024), and number of iSCSI target (I used a target with 4 bdevs but jrlusby used 4 target with each bdev)
[17:54:20] <Shuhei> I'll try whatever but honestly this is the first time to use aio bdev.
[17:54:31] <Shuhei> So I'm afraid that my operation was wrong.
[17:54:54] <jimharris> hi Shuhei - what kind of device is /dev/sdd?
[17:55:34] <jimharris> and jrlusby - what kind of device are you using with aio?
[17:55:35] <Shuhei> HDD
[17:55:43] <Shuhei> I have NVMe SSD
[17:56:03] <jimharris> i would have guessed that HDD would be the type of device most likely to reproduce this problem
[17:56:28] <jimharris> but let's see what jrlusby was using - this could be key configuration difference
[17:56:41] <Shuhei> OK, make sense.
[17:56:51] <Shuhei> By the way, my operation of removal is correct?
[17:57:08] <jimharris> i think so, yes
[17:57:16] <Shuhei> Thank you!
[17:58:30] <jimharris> signing off now - if i have some extra time tomorrow i may try to reproduce this also
[17:58:51] <Shuhei> OK, thank you for your help!
[18:02:17] <jrlusby> the removal is correct @Shuhei
[18:02:26] <jrlusby> thats what I did to get the repro manually
[18:02:54] <jrlusby> but when I reproduced the error there were other processes running in the background that may have caused the actual issue
[18:04:41] <jrlusby> namely we have a daemon that gathers all sorts of status information about the system, including some info from iscsi_tgtd, and runs various commands based in response to certain statuses, so at a minimum theres various commands like `rpc.py get_portal_groups` and `rpc.py get_bdevs` running in the background regularly
[18:06:04] <jrlusby> we have hdds and ssds, i dont know off of the top of my head if sda was an hdd or an ssd
[18:07:01] <Shuhei> jrlusby: thank you, I think I have enough information now.
[18:07:04] <jrlusby> its definitely not an NVMe drive though
[18:07:15] <jrlusby> though all of this work is being done to eventually support NVMe
[18:08:25] <Shuhei> Please wait for my update, I hope I will give any update by your tomorrow morning.
[18:08:57] <Shuhei> Oh, I  have a question
[18:09:22] <Shuhei> you used only randwrite?
[18:09:42] <Shuhei> I haven't tried randwrite, I tried randread and randrw once for each.
[18:10:13] <jrlusby> thats the fio I used during the manual repro
[18:10:24] <Shuhei> OK
[18:10:51] <Shuhei> And thanks for your input, I'll focus on HDD for now.
[18:11:12] <jrlusby> can you post the fio cmdline you're using when you're trying to repro?
[18:12:22] <Shuhei> I had used fio config file.
[18:12:44] <Shuhei> I'll change to use command line simply and will share that.
[18:13:31] <jrlusby> i can use config file
[18:13:33] <Shuhei> And I have used physical HDD and every try will need server reboot, so every try will take a few minute.
[18:13:39] <jrlusby> im just unfamiliar with the fio as a utility
[18:13:45] <Shuhei> Me too.
[18:13:57] <Shuhei> So no problem.
[18:14:18] <Shuhei> I will follow what you did
[18:15:21] <jrlusby> okay
[18:15:33] <Shuhei> I have a hour meeting 10 minutes later.
[18:16:01] <Shuhei> I'll restart after that. thank you for your help.
[18:16:25] <jrlusby> okay
[18:17:02] <jrlusby> for when you do get back, when I kicked the drive I didnt kick it via the handle created by the iscsiadm login, aka `/dev/sde`, i used the original block device handle `/dev/sda`
[18:17:29] <jrlusby> or more accurately `echo 1 > /sys/block/sda/device/delete`
[18:17:42] <jrlusby> I'm not sure if that matters
[18:20:56] <Shuhei> I deleted the backing device to AIO bdev as you did.
[18:21:11] <Shuhei> I didn't delete device visible from FIO.
[18:22:30] <jrlusby> there might be an extra step to delete the bdev that our daemon does in the background
[18:24:37] <jrlusby> I haven't managed to get a clean repro without the rest of the system running, but if you wanna see where im at heres a vague outline of the test script im trying to write.
[18:24:51] <jrlusby> https://www.irccloud.com/pastebin/EX57Tgl2/
[18:30:17] <jrlusby> actually fk that script its garbage atm ill link it again when it vaguely does what I expect
[18:30:18] *** Joins: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net)
[18:34:32] *** Quits: felipef (~felipef@cpc92310-cmbg19-2-0-cust421.5-4.cable.virginm.net) (Ping timeout: 250 seconds)
[18:42:17] <jrlusby> just need to figure out why my fio command is broken but heres what I got atm @Shuhei
[18:42:29] <jrlusby> https://www.irccloud.com/pastebin/ZqcaougB/test_iscsi_hang
[19:20:53] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 256 seconds)
[19:47:01] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[20:02:49] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 256 seconds)
[21:37:05] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[23:10:04] *** Quits: guerby (~guerby@april/board/guerby) (Remote host closed the connection)
[23:14:54] *** Joins: guerby (~guerby@april/board/guerby)
[23:26:00] <spdk-jenkins-bot> Project autotest-nightly build #441: STILL FAILING in 25 min. See https://ci.spdk.io/spdk-jenkins for results.
[23:39:23] <spdk-jenkins-bot> Project autotest-nightly-failing build #310: STILL FAILING in 39 min. See https://ci.spdk.io/spdk-jenkins for results.