[00:06:48] *** Joins: wzh (~wzh@114.255.44.139)
[00:08:53] Hi, maintainers. Please re-run my patch https://review.gerrithub.io/c/spdk/spdk/+/422203
[00:20:25] Patchset4 and Patchset6 are the same (except for README.md), but the test system gave different results
[00:26:30] *** Joins: tkulasek (~tkulasek@192.55.54.42)
[00:31:32] Oh, I mistyped CHANGELOG.md as README.md in my previous message
[04:31:27] *** Parts: wzh (~wzh@114.255.44.139) ("WeeChat 1.9.1")
[05:54:05] jimharris, two sorta-ugly things with ipsec and the crypto build in CI: (a) we need to get nasm on the CI systems, and (b) the make install line in dpdkbuild/Makefile requires that the caller be su
[07:05:46] jimharris: Good morning. Do you think the shared lib patches, 422307 and 422306, will get merged into master today (Fri)?
[07:23:46] lhodev: i merged them just now - thanks!!!
[07:23:57] Sweet, thanks!
[07:24:31] peluse: ugh - that's a problem if a normal user can't do 'make'
[07:24:49] yeah, any ideas?
[07:25:24] does the ipsec library have parameters you can use to override the install location?
[07:27:27] *** Joins: travis-ci (~travis-ci@ec2-54-205-147-253.compute-1.amazonaws.com)
[07:27:28] (spdk/master) build: add build of individual shared libraries (Lance Hartmann)
[07:27:28] Diff URL: https://github.com/spdk/spdk/compare/a6c89adf09e2...e9424c7e1ca6
[07:27:28] *** Parts: travis-ci (~travis-ci@ec2-54-205-147-253.compute-1.amazonaws.com) ()
[07:31:05] I'll see what I can find
[07:31:34] have a call at 8:30, will head in after that and let ya know if I can come up with anything
[08:42:44] *** Joins: sethhowe (sethhowe@nat/intel/x-dhiwnxtqusfcmqdw)
[09:37:30] *** Joins: waelhalbawi (9457170c@gateway/web/freenode/ip.148.87.23.12)
[09:46:53] Good morning! Can we re-run the CI tests for https://review.gerrithub.io/c/spdk/spdk/+/422474/
[09:47:46] The Ubuntu_16 test is complaining about a line in my patch that is a comment.
[10:00:13] hi waelhalbawi - for some reason the Jenkins CI did not retrigger when you pushed the updated version of your patch
[10:00:30] i've retriggered it now
[10:00:46] so those results you were looking at were from version 1 of your patch, not version 3
[10:05:42] *** Quits: waelhalbawi (9457170c@gateway/web/freenode/ip.148.87.23.12) (Ping timeout: 252 seconds)
[10:09:29] *** Joins: waelhalbawi (9457170c@gateway/web/freenode/ip.148.87.23.12)
[10:18:37] I thought so. Thanks, jimharris!
[10:21:22] i still don't understand why Jenkins isn't picking up the latest version of your patch
[10:22:26] Does it depend on the commit message in any way? I did fix a typo.
[10:23:49] i don't think so - Jenkins still seems to be stuck on version 1 of your patch - let's just leave it as-is for now, I'd like Karol (klateck) to be able to debug this to make sure it doesn't come up again
[10:25:31] Sounds good
[10:34:47] I think sethhowe figured out the problem - it should be passing through the test pool here shortly
[10:37:54] Cool! Was it related to the commit itself?
[10:40:06] It wasn't. You were doing everything correctly on your end.
[10:44:20] Thanks, Seth!
[11:09:00] *** Quits: tkulasek (~tkulasek@192.55.54.42) (Ping timeout: 268 seconds)
[11:30:00] *** Joins: Tracy35 (0cda5282@gateway/web/freenode/ip.12.218.82.130)
[11:41:42] With nvmf, we get a reproducible segfault when running I/O. The crash is in spdk_nvmf_rdma_request_process. The gdb trace is pasted at https://pastebin.com/tZVLGytT. Any pointers on how to debug it further would be much appreciated. Thanks
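
(Aside: the values asked for later in this exchange can be pulled straight out of a core dump with gdb. A minimal session sketch, assuming a core from nvmf_tgt — the binary path and frame number are illustrative, not from the log:)

    $ gdb ./app/nvmf_tgt/nvmf_tgt core    # load the target binary together with the core dump
    (gdb) bt                              # full backtrace of the crash
    (gdb) frame 0                         # select the spdk_nvmf_rdma_request_process frame
    (gdb) print *rdma_req                 # the request being walked through the state machine
    (gdb) print *rqpair                   # its queue pair; garbage here suggests a freed/dropped connection
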
[11:47:16] what version of SPDK?
[11:48:10] This is master, taken around the July time frame.
[11:50:36] how are you running the I/O?
[11:52:15] From a kernel initiator using a home-grown I/O test program. The same issue can be seen using fio as well (but it takes some time).
[11:53:12] and the issue just occurs randomly? during reads? writes? while running or while shutting down?
[11:53:46] With fio, we can reproduce it with 8K blocks, 128 queue depth, and two initiators within 5 minutes.
[11:54:33] random reads?
[11:55:12] With mostly random writes it is OK (runs for 50 hours). With a mix of random writes/reads it happens in the above-mentioned timeframe.
[11:55:45] I see
[11:56:15] will have to try to reproduce - this hasn't been reported previously
[11:57:13] Any specific area in the code you suspect could cause this issue?
[11:57:43] the crash is in the state machine that walks requests through their processing
[11:57:47] in lib/nvmf/rdma.c
[11:58:19] can you print out the value of rqpair, rdma_req
[11:58:22] We see/suspect that the rdma_req and rqpair structures have garbage values or are not initialized.
[11:58:23] where you are broken in at
[11:58:28] :-)
[11:59:04] I suspect the connection actually dropped on you
[11:59:12] and the rqpair was cleaned up
[11:59:17] Give me a minute..
[12:00:22] We see several disconnects. Is this what you are referring to as the connection actually dropping?
[12:01:07] yeah, your RDMA connection probably disconnected and the rqpair memory was released
[12:01:25] but the internal state machine containing all of the requests probably wasn't correctly drained before releasing the resources
[12:01:52] I have updated https://pastebin.com/tZVLGytT with the dump of rdma_req and rqpair.
[12:03:27] Can we know the reason for the disconnect? Is there something we have missed in configuring the initiator or target side correctly?
[12:03:31] yeah, rqpair is junk
[12:03:52] we don't track a reason - the NIC just tells us it dropped
[12:03:58] is your test running through a switch?
[12:04:29] Yes, through a switch. We can put it point-to-point and check this out.
[12:05:00] so two things for you to try
[12:05:31] one is to get rid of the switch - there's all sorts of configuration you usually have to do on switches, or you'll see random disconnects with rdma
[12:05:42] and I'm not enough of an expert to explain how to configure the switch
[12:06:13] the other thing to try is to pull the latest master of spdk - in the last month we've done a ton of bug fixes for quiescing I/O before releasing the resources of a dropped connection
[12:06:18] i.e. exactly the scenario you're hitting
[12:06:37] Understood. We will remove the switch from the equation for now to check if it still happens.
[12:07:20] And I will put today's master on and check it out. Give me 2 hours to complete this.
[12:07:29] sure, sounds great
[12:07:30] A quick thing..
[12:08:54] Is v18.07 more stable than master? Trying to understand the community's stance on the v18.xx.x or v17.xx.x releases when compared to master.
[12:09:16] we do continuous integration, so all of the tests are run against every version (mostly)
[12:09:48] the releases are just tags for bookkeeping purposes - we like people to use the releases because it gives us a common commit to work from when debugging
[12:10:01] and we'd consider backporting fixes to releases if necessary
[12:10:32] however, for the nvmf fixes I'm talking about, many of them went in after 18.07 was released
[12:10:38] so you'd need to grab master for those
[12:11:00] Ok. Thanks. You answered my next question :-).
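
(Aside: a minimal sketch of grabbing and building master as suggested above, assuming an RDMA-capable machine; the clone URL is SPDK's public GitHub mirror, and other configure flags may apply to your setup — see ./configure --help:)

    $ git clone https://github.com/spdk/spdk    # fetch the latest master
    $ cd spdk
    $ git submodule update --init               # pull in the bundled DPDK
    $ ./configure --with-rdma                   # enable the NVMe-oF RDMA transport
    $ make
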
[12:11:30] For this issue, we will pick up master and run it.
[12:11:41] Appreciate your help, @bwalker.
[12:11:45] sure thing
[12:32:00] *** Quits: waelhalbawi (9457170c@gateway/web/freenode/ip.148.87.23.12) (Ping timeout: 252 seconds)
[13:12:49] *** Joins: alekseymmm_ (bcf3adf1@gateway/web/freenode/ip.188.243.173.241)
[13:43:08] *** Joins: waelhalbawi (94571704@gateway/web/freenode/ip.148.87.23.4)
[15:15:54] *** Quits: alekseymmm_ (bcf3adf1@gateway/web/freenode/ip.188.243.173.241) (Ping timeout: 252 seconds)
[17:42:45] *** Quits: waelhalbawi (94571704@gateway/web/freenode/ip.148.87.23.4) (Ping timeout: 252 seconds)
[17:52:16] @bwalker: Without the switch, things are OK.
[17:53:49] Using the current master, I just compiled and ran it, and I get "EAL: WARNING! Base virtual address hint (0x20040008f000 != 0x7f96c762e000) not respected!". Will look at it, and once it's resolved, will run it with the switch. Thanks.
[18:03:00] *** Quits: Tracy35 (0cda5282@gateway/web/freenode/ip.12.218.82.130) (Quit: Page closed)
[23:31:52] *** Quits: drv (daniel@oak.drv.nu) (Remote host closed the connection)
[23:33:02] *** Joins: drv (daniel@oak.drv.nu)
[23:33:02] *** ChanServ sets mode: +o drv
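
(Aside: for reference, a sketch of an fio job matching the reproduction recipe described earlier — 8K blocks, queue depth 128, mixed random reads and writes, run from each kernel initiator. The device path, read/write split, and runtime are illustrative, not from the log:)

    ; repro.fio - run against the NVMe-oF block device on each initiator
    [global]
    ioengine=libaio
    direct=1
    bs=8k               ; 8K blocks, per the recipe above
    iodepth=128         ; queue depth 128
    rw=randrw           ; the read/write mix triggered the crash
    rwmixread=50        ; illustrative split; mostly-write workloads ran clean for 50 hours
    time_based=1
    runtime=600         ; crash reportedly reproduced within ~5 minutes

    [crash-repro]
    filename=/dev/nvme0n1   ; illustrative path to the connected namespace
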