[00:01:14] Hi Ziye, looking into the stack dump and comparing with code is the first choice as you did. It's OK not to reply how to cope with calsoft. Thank you. [00:08:16] Hi Shuhei, [00:09:03] Got it. Currently the nop in will check every 1 sec for all connections in a cpu core, not sure this chosen value is applicable for not [00:12:49] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds) [00:45:22] *** Joins: Param (~Param@106.208.12.0) [00:49:00] *** Quits: Param (~Param@106.208.12.0) (Client Quit) [00:57:06] *** Quits: ziyeyang__ (~ziyeyang@192.55.54.40) (Quit: Leaving) [01:55:40] *** Joins: tkulasek (~tkulasek@192.55.54.39) [02:04:24] *** Quits: sbasierx (~sbasierx@192.198.151.44) (Quit: Going offline, see ya! (www.adiirc.com)) [02:46:24] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Ping timeout: 260 seconds) [02:49:43] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.72) [03:09:37] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.72) (Ping timeout: 248 seconds) [03:12:35] *** Joins: guerby_ (~guerby@ip165.tetaneutral.net) [03:13:10] *** Quits: guerby (~guerby@april/board/guerby) (Read error: Connection reset by peer) [03:13:44] *** guerby_ is now known as guerby [03:14:36] *** Quits: guerby (~guerby@ip165.tetaneutral.net) (Changing host) [03:14:36] *** Joins: guerby (~guerby@april/board/guerby) [03:25:04] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.72) [03:26:52] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) [03:40:21] *** Joins: sbasierx (sbasierx@nat/intel/x-yozyhrmupoacibdr) [03:58:09] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.72) (Remote host closed the connection) [04:10:44] *** Joins: ziyeyang_ (~ziyeyang@192.55.54.44) [04:35:05] Hi Jim, another three patches are provided to simply or refactor spdk_iscsi_conn_execute function [04:35:26] *** Quits: ziyeyang_ (~ziyeyang@192.55.54.44) (Quit: Leaving) [04:47:39] *** Joins: Param (~Param@106.208.95.1) [04:48:43] *** Quits: Param (~Param@106.208.95.1) (Client Quit) [05:57:21] *** Joins: johnmeneghini (~johnmeneg@216.240.30.5) [06:44:53] *** Quits: johnmeneghini (~johnmeneg@216.240.30.5) (Quit: Leaving.) [07:39:01] *** Quits: sbasierx (sbasierx@nat/intel/x-yozyhrmupoacibdr) (Remote host closed the connection) [08:49:44] intel.com [08:49:48] ugh [09:38:37] drv: "capacity * 512 must be a multiple of block_size." could we just print a warning and continue if that happens? [09:39:05] darsto_: seems ok to me [09:39:19] (this first batch of comments is from a while ago, let me refresh my memory) [09:39:32] ahh right, sorry [09:41:12] I just forgot to hit Post :) [09:41:38] most of these are minor details - if you want to fix them up in another patch, that's probably fine too [09:41:44] the JSON-RPC one is a real bug, though [09:41:51] nah, i'll get them [10:53:29] *** Quits: tkulasek (~tkulasek@192.55.54.39) (Ping timeout: 260 seconds) [10:57:10] jimharris: I've reviewed the DPDK 18.02 patches, and they look OK to me - just needs your +2 and then I'll merge the series [10:57:12] https://review.gerrithub.io/#/c/401664/ [11:00:21] lgtm [11:20:00] hmm, looks like some of the config changes didn't get applied correctly to the new common_spdk [11:20:07] CONFIG_RTE_EAL_IGB_UIO=n is commented out for some reason [11:20:17] causing https://ci.spdk.io/spdk/builds/review/9ed4f10d4891aa201d8933e1fb246cdddb4dfa6e.1519927722/centos6/build.log [11:22:14] (also, two machines failed in iSCSI tests, which seems suspicious) [11:44:20] jimharris, pwodkowx: please see https://review.gerrithub.io/#/c/402142/ (should fix DPDK 18.02 build on older distributions) [11:45:37] verified by grep/sort/diffing the two configs - they are the same now except for new =n additions from DPDK 18.02 [13:28:00] jimharris: Jim, please take a look at my responses to your review in https://review.gerrithub.io/#/c/401243/3. I want to ensure we have a meeting of the minds before submitting a follow-on patch. Thanks! [13:34:23] done - just posted a reply [13:36:44] jimharris: Thanks, and about my implementation in spdk_reactor_get() and spdk_reactors_fini()? The decision not to employ an assert()? [13:41:11] i think your changes there are ok [13:43:43] I just added one more comment. I think we should continue setting g_spdk_app.rc to whatever is passed via spdk_app_stop() rather than force setting it to a one (1) if non-zero. One of the test cases actually passes in a counter of the number of failures. [13:44:08] sounds good [14:13:30] hmm, both iSCSI test machines failed with the DPDK 18.02 update again: https://ci.spdk.io/spdk/builds/review/512252e54cb865f883ab5471e66df363023d1743.1519935236/ [14:13:47] must be related somehow, but I don't understand how [14:18:02] needs to be rebased - it doesn't have d95bb [14:18:31] oh [14:18:46] anyway, needs another DPDK submodule fix for that deprecated warning - centos6 compiler is too old, I guess [14:40:24] drv: i fixed that one nit on stephen's cmb patch https://review.gerrithub.io/#/c/401832/ [14:40:38] ok, I'll take another look [14:42:00] jimharris: (meekly) can we re-open the conversation about differentiating positive/negative results from spdk_app_start()? There's one ugly case: it's lib/rocksdb. In my original implementation, detection of spdk_app_start() failing (as opposed to the app stopping it) provides for the ability to fail the init all the way back to the caller. If I don't differentiate, then the constructor SpdkEnv::SpdkEnv() will hang "forever" spinnin [14:42:00] g in a while loop waiting for g_spdk_ready to become true and it never will if spdk_app_start() failed internally. I could, of course, change initialize_spdk() such that after spdk_app_start() returns I set g_spdk_ready to true (regardless) which would enable the SpdkEnv::SpdkEnv() constructor to continue, but we wouldn't be aware of the start failure therein. With knowledge of the start failure, I was able to throw an exception. Instea [14:42:00] d, the constructor will continue onward to invoke SpdkInitializeThread() when that shouldn't happen at all because of the internal start failure. [14:46:27] looking [14:47:28] can the initialize_spdk check just be for rc != 0? [14:50:07] If it's just not-zero, then it could be an spdk_app_stop(non_zero_value) happened. That's valid. But, if it's due to an internal failure in spdk_app_start() we should set a flag so that the constructor fails and throws an exception. This is made more complicated by the fact that this code is running in a pthread. [14:52:52] spdk_app_stop(non_zero_value) could happen due to a subsystem failing to initialize [14:53:07] Yup [14:54:01] don't we want to treat that the same as an internal failure in spdk_app_start()? to me, the caller of spdk_app_start() shouldn't need to differentiate between the two [14:54:36] i definitely want to look at this further, to differentiate between errors during startup vs. fatal errors during runtime or shutdown [14:54:50] i'm just not sure we need to make that decision right now for the rocksdb case [14:55:07] errors during startup is more than just an internal failure in spdk_app_start() [14:56:55] I understand your concerns about wanting to study this further. I think this particular case highlights one of the reasons for making the distinction. If I largely leave rocksdb as is, then the constructor will spin forever in the while loop because g_spdk_ready will never get set to true. [14:57:57] i think some rocksdb changes are fine here - my only suggestion was changing the rc check to != 0 [14:58:07] doesn't that fix this issue? [15:00:36] So, if I only check if rc !=0 (where rc gets the value of spdk_app_start()), what do I do with that? The "g_spdk_start_failure" flag is a new flag that I added. Are you proposing that I keep that and check for it in the SpdkEnv constructor and throw the exception? In that case, we'd be throwing the exception even if the app successfully started, but was spdk_app_stop'd with a non-zero value. [15:06:53] i think that's fine for now - certainly no worse than it is currently [15:07:29] we need to do a full scrub of all of the spdk_app_stop() calls [15:08:24] let me look at your patch one more time [15:09:30] Jim, I appreciate your patience looking into this case and your concerns for wanting more time to perform due diligence on such a change. It's unfortunate that this specific case is so gnarly. [15:11:22] so let's take your case where everything comes up ok, but for some reason we get an error during shutdown - in this case, g_spdk_ready would have been set [15:11:44] could you change your check to throw an exception if !g_spdk_ready && g_spdk_start_failure? [15:15:04] I'm cogitating….. [15:17:56] for rocksdb - a lot of the spdk_app_stops() are tied to applications not associated with rocksdb - for example, anything in app/ examples/ or test/ would have no effect on rocksdb [15:18:38] we should probably clean up any spdk_app_stop() calls in lib next... :) [15:19:02] there really aren't any I don't think [15:19:05] one subsystem failing should probably not kill the whole app [15:19:09] I mean, there are some in lib/event [15:19:12] it does [15:19:23] oh sorry - you said "should not" :) [15:19:37] it probably does currently, but we'll need to rethink that [15:19:53] well i think the subsystems need to be scrubbed to only report failure for truly catastrophic cases [15:19:57] right [15:20:02] if we can't allocate our bdev_io pool, that's bad [15:20:40] you're right, though - there aren't too many app_stops in lib currently [15:20:53] the one in vhost probably needs to be fixed/removed [15:21:30] all the other ones that I can see would only cause problems during startup [15:21:55] yeah, I just want to avoid cases where e.g. one NVMe controller can't start up - that shouldn't prevent the whole app from starting [15:22:34] yep [15:22:57] and lhodev thought this was going to be a simple task :) [15:23:53] Lol, oh no-in-the-3rd-person did he think such a thing (in spite of the associated Trello card in the "low hanging fruit" stack) ;-) [15:25:42] :) [15:25:57] Honestly, I'm actually grateful for having had the opportunity to work on this. It's these kinds of tasks that really help one to become more acquainted with the code base. [15:28:10] I can also very much appreciate the caution and concern for a sweeping change, and from a relatively new SPDK contributor no less. [15:31:18] jimharris: unrelated, I have worked on the lib/net/interface.c exit()'s, grok'd how the subsystem init works and made a patchset I'm about to submit for that. Thankfully, it's not as gnarly as the start/stop one ;-) [16:16:49] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) [16:29:57] jimharris: Because we did chat about this a little, just wanted to give you a heads up I pushed the lib/net patch and it has run through the TP. https://review.gerrithub.io/#/c/402165/ [18:05:26] *** Joins: Param (~Param@157.49.151.247) [18:14:56] ben,Daniel code can be reviewed.. https://review.gerrithub.io/c/401314/ [18:56:33] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.75) [20:20:44] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.75) (Quit: Leaving) [21:11:16] *** Quits: nKumar (sid239884@gateway/web/irccloud.com/x-nfnqldjvawstbntm) (Read error: Connection reset by peer) [21:11:19] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) [21:12:15] I thought the SPDK community meeting was supported be help now [21:12:24] held now [21:12:45] *** Joins: nKumar (sid239884@gateway/web/irccloud.com/x-adferxojhoolvpcd) [21:13:17] *** Quits: Param (~Param@157.49.151.247) (Quit: Param) [21:13:34] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Ping timeout: 260 seconds) [21:15:12] Trello boad says we are meeting on Meeting - (Asia) Thu Mar 1 UTC 04:00-05:00 [21:16:02] So I guess that was Wednesday Feb 28 at 11:00pm ET [21:16:05] yep, it's currently Fri Mar 2 UTC 4:15 [21:16:08] I missed it [21:16:32] Sigh [21:16:32] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) [21:17:01] I think going forward we can arrange some kind of calendar invite or at least a time zone converter widget - it can get confusing [21:27:45] Trello board says the next meeting is Topics: (Euro) Thu 8th UTC 16:00. I thought we agreed to move the Euro series to Tuesday UTC 15:00-16:00? [23:01:17] *** Joins: lhodev1 (~Adium@66-90-218-190.dyn.grandenetworks.net) [23:03:13] *** Quits: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net) (Ping timeout: 240 seconds)