[00:57:39] *** Quits: ziyeyang_ (~ziyeyang@192.102.204.38) (Quit: Leaving)
[02:01:37] *** Joins: mszwed (~mszwed@134.134.139.72)
[02:08:38] *** Joins: Shuhei (7a11110a@gateway/web/freenode/ip.122.17.17.10)
[02:09:56] Hi, in the spdk jenkins test pool, nvmf_phy_autotest #179 looks like it fails within 5 seconds. It may have some issue.
[02:10:21] Will you check the state of nvmf_phy_autotest #179?
[02:11:00] For example, please see https://ci.spdk.io/spdk-jenkins/public_build/autotest-per-patch_16225.html
[02:11:58] I'll send this to the mailing list too.
[02:14:45] Before sending, I will check my patches and log first. Thank you.
[02:16:38] Please ignore my comment. Sorry for the confusion.
[02:25:41] *** Quits: Shuhei (7a11110a@gateway/web/freenode/ip.122.17.17.10) (Ping timeout: 256 seconds)
[03:37:33] peluse, Shuhei thanks for the info! I think the NVMe-oF test server got into some weird state after a series of reboots and could not connect to the network. Should be fine now, I am retriggering the patches.
[03:38:52] There were a number of other failures related to the NVMe-oF tests and I've disabled them from per-patch testing for now.
[03:40:34] It seems that https://review.gerrithub.io/#/c/spdk/spdk/+/434895/ introduced the problem. Perf hangs when run as a secondary process in test/nvmf/host/perf.sh and gets stuck, so the tests time out.
[03:40:59] darsto is currently working on the issue.
[07:07:11] *** Quits: tomzawadzki (uid327004@gateway/web/irccloud.com/x-htenlxpjxywvutsp) (Quit: Connection closed for inactivity)
[07:55:22] *** Joins: tomzawadzki (uid327004@gateway/web/irccloud.com/x-kmzgqacjzlghswye)
[07:57:48] *** Joins: travis-ci (~travis-ci@ec2-54-147-167-118.compute-1.amazonaws.com)
[07:57:49] (spdk/master) lib/trace: show specific usage of trace mask (Liu Xiaodong)
[07:57:49] Diff URL: https://github.com/spdk/spdk/compare/efae4a1baa94...0e7ca66922e4
[07:57:49] *** Parts: travis-ci (~travis-ci@ec2-54-147-167-118.compute-1.amazonaws.com) ()
[09:30:20] *** Joins: travis-ci (~travis-ci@ec2-54-166-112-9.compute-1.amazonaws.com)
[09:30:21] (spdk/master) nvme: fix pci device leak when detaching a controller in primary process (Darek Stojaczyk)
[09:30:21] Diff URL: https://github.com/spdk/spdk/compare/0e7ca66922e4...1d3e0340b411
[09:30:21] *** Parts: travis-ci (~travis-ci@ec2-54-166-112-9.compute-1.amazonaws.com) ()
[09:38:14] Hi guys, remember the topic regarding big RPC responses from the last community meeting?
[09:38:33] I updated the trello card with measurements from within spdk: https://trello.com/c/gg8oXsdo/152-how-to-handle-big-rpc-responses
[09:39:30] It takes ~4.3ms to generate a base64-encoded response and ~18ms to generate a raw json response.
[09:42:18] What do you think? 1) Is that so long that we should look for another mechanism for sending histograms? 2) The difference between base64 and raw json is not that big, so we should go for the simpler one? 3) Go for base64 as the best compromise?
[09:42:33] Or maybe you have some other thoughts?
[09:42:45] the get_bdevs data point is interesting
[09:43:15] without even looking at the other data - 0.87ms to send 17KB of data - that seems ripe for some optimization
[09:44:16] that's like 50us per 1KB of data (or 20MB per second)
[09:44:20] yeah, and this is just iterating over multiple data structures
[09:44:43] where exactly did you instrument for the tsc counts?
[09:45:08] in the base64 approach for histograms I'm using spdk_base64_encode, which I think might take some time...
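As a quick sanity check outside the target, the raw base64 cost and the end-to-end RPC time can be approximated from the shell. This is only a rough sketch: it assumes coreutils base64 and the stock scripts/rpc.py client against a running target, and the rpc.py timing includes the python client and socket transfer on top of the in-target JSON writer.

    # Encode ~80KB of random data with coreutils base64 to get a ballpark for the
    # pure encode cost (this is not spdk_base64_encode, just an approximation).
    dd if=/dev/urandom of=/tmp/hist.bin bs=1K count=80
    time base64 /tmp/hist.bin > /dev/null

    # Time the whole get_bdevs round trip against a running target; this measures
    # much more than the in-target writer, so it only gives an upper bound.
    time ./scripts/rpc.py get_bdevs > /dev/null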
[09:45:41] i'm not so sure - 80KB of data in 4.3ms is a pretty similar rate to the get_bdevs data
[09:46:29] meaning i don't think the encoding is the problem
[09:46:30] I'm just counting the ticks spent in the json parser, so the tsc count starts with spdk_jsonrpc_begin_result and ends with spdk_jsonrpc_end_result
[09:46:41] ^json writer, not parser
[09:47:46] cool
[09:47:58] i think for get_histograms, we just go with the base64 approach for now
[09:48:13] ok, cool
[09:49:51] there are ways we can cut down the amount of data we transfer - for example, provide an encoding scheme in the histogram code so that we condense each bucket (i.e. don't use up 8 bytes for a bucket with count=0)
[09:50:48] I did a similar thing with the JSON response
[09:51:12] But I have to think about how to do that with the int array before encoding it to base64
[13:02:31] *** Joins: travis-ci (~travis-ci@ec2-54-160-241-128.compute-1.amazonaws.com)
[13:02:32] (spdk/master) nvmf: add some instrumentation in error path (Jim Harris)
[13:02:33] Diff URL: https://github.com/spdk/spdk/compare/1d3e0340b411...942e02aa6896
[13:02:33] *** Parts: travis-ci (~travis-ci@ec2-54-160-241-128.compute-1.amazonaws.com) ()
[13:03:45] *** Joins: travis-ci (~travis-ci@ec2-54-198-236-35.compute-1.amazonaws.com)
[13:03:46] (spdk/master) pci: remove error prints on detach failures (Darek Stojaczyk)
[13:03:46] Diff URL: https://github.com/spdk/spdk/compare/942e02aa6896...2bedc03df33f
[13:03:46] *** Parts: travis-ci (~travis-ci@ec2-54-198-236-35.compute-1.amazonaws.com) ()
[13:12:23] Saw tomzawadzki's request to gerrit-review a couple of patches destined for the branch spec'd as v18.10.x. I've not so far ever participated in what I assume is a cherry-pick from master to a release branch. I'm assuming this is the standard process, i.e. even cherry-picks to a release branch involve gerrit reviews?
[13:24:19] I thought I had a google incantation to perform a search against the SPDK irclog, but it doesn't seem to be working. Someone have that handy?
[13:35:40] lhodev: Sorry, I'm not sure about the IRC log from google thing, but I may be able to help with the first question.
[13:36:11] Please do
[13:38:35] lhodev: When you push a commit to gerrithub, you push to a review branch such as refs/for/master. Then when you merge that change, it gets planted on the tip of that actual branch. So when we do a point release, we create another review branch (e.g. refs/for/18.10) with a sister branch in the mainline repository (18.10.1 in this case). When you push to that review branch and the change gets merged, it gets placed on top of the 18.10.1 branch.
[13:39:24] So regardless of whether a change has already been merged to master, it needs to be pushed again to the other branch for review so it can be placed on top of that branch.
[13:43:53] So, in that case, one would be doing something like 'git push origin HEAD:refs/for/18.10.1'?
[13:45:47] I'm also a bit confused: when I examine the branches via 'git branch -a', I see a remotes/origin/v18.10.x -- not a remotes/origin/v18.10.1. That is, I'm noting the 'x' at the end and also the leading 'v'.
[13:46:50] So, do you instead actually do a 'git push origin HEAD:refs/for/v18.10.x'?
[13:47:28] And then create a tag to denote the actual point release, e.g. v18.10.1
[13:57:22] that's exactly right
[14:33:45] jimharris: has either darsto or klateck talked to you about the failures they were seeing on the jenkins physical NVMe-oF tests?
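For reference, the backport flow discussed above boils down to roughly the following. This is a sketch under the assumptions stated in the conversation (mainline branch v18.10.x, reviews pushed to refs/for/v18.10.x, the point release later marked with a tag such as v18.10.1); <master-commit> is a placeholder for the change already merged to master, and the local branch name is arbitrary.

    git fetch origin
    git checkout -b backport-18.10 origin/v18.10.x   # local working branch off the release branch
    git cherry-pick <master-commit>                  # keep the Change-Id trailer so Gerrit links the reviews
    git push origin HEAD:refs/for/v18.10.x           # opens a gerrit review against the release branch
    # once the change merges, the point release itself is denoted by a tag, e.g. v18.10.1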
[14:34:23] i was debugging them the whole day but still don't have a clue
[14:34:46] the failures are still somewhat random
[14:36:06] darsto: I have had some success duplicating the failure locally on my system, but am having a little trouble isolating what could be happening.
[14:37:01] I am running the perf test directly, just adding nvmftestinit to the beginning of perf.sh, and I can get it to fail consistently on master.
[14:37:12] in perf it's the very first I/O that fails
[14:37:29] the one that's sent as part of initializing the i/o queues
[14:38:12] there are iommu errors but I can't see any violations
[14:38:45] i can even reproduce this on a couple-days-old master
[14:40:32] So, it seems to hit sporadically on the chandler test pool as well. But it is odd to me that it started hitting every time in Poland. I can also get it to hit every time on my local machine using the configuration I detailed above.
[14:41:14] If I use the stub with the nvmf_tgt application locally, the failure goes away, but if I just run the nvmf_tgt directly it fails every time.
[14:41:52] So it seems to me that the initial 2G of preallocated memory from the stub makes it less likely to hit.
[14:42:26] I am just brainstorming, but could a race condition between starting I/O and getting callbacks for registered memory cause an error like this?
[14:42:48] nope, spdk_malloc() won't return until the memory is registered
[14:43:42] the nvme disk on that nvmf_phy machine on jenkins doesn't support SGLs
[14:44:22] that could be one difference from the virtual nvme that we used so far
[14:45:32] Which drive is in the nvmf_phy machine? We also have fedora-03 running these same tests with a P3700 in there.
[14:45:58] sethhowe: jimharris: Via an email thread I'm aware there was an issue with nvmf_tgt segfaulting. Would the patch(es) addressing that be good candidates to backport to 18.10.x for the 18.10.1 release? I'm seeking to produce SPDK packages -- our inaugural ones -- internally at my company, and so I'm a little concerned about stability without those fixes.
[14:46:10] +tomzawadzki
[14:46:31] p3700
[14:47:17] lhodev: Those patches have all been tagged with the 18.10.1 hashtag, so they are set to go into that backport.
[14:47:32] Ah, very good. Thanks sethhowe!
[14:48:27] darsto: This is interesting. I am going to play with adjusting the memory allocated to the stub to see if I can find a threshold where we start seeing the error consistently.
[15:17:10] *** Quits: tomzawadzki (uid327004@gateway/web/irccloud.com/x-kmzgqacjzlghswye) (Quit: Connection closed for inactivity)
[15:59:57] darsto: have you already filed this as a bug on github?
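For anyone trying to reproduce this locally, the setup described above (adding nvmftestinit to the beginning of test/nvmf/host/perf.sh so it runs outside the autotest wrapper) looks roughly like the sketch below. The common.sh path and the exact nvmftestinit invocation are assumptions and may differ between SPDK revisions.

    # Sketch only: what "adding nvmftestinit to the beginning of perf.sh" amounts to.
    # Near the top of test/nvmf/host/perf.sh, after rootdir is set:
    source "$rootdir/test/nvmf/common.sh"   # assumed location of nvmftestinit
    nvmftestinit                            # start and configure the NVMe-oF target like the other nvmf tests do
    # ... the rest of perf.sh then runs the host-side perf against the exported subsystem,
    # which is where the very first I/O fails when the bug reproduces.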