[03:08:34] Ok, so it looks like when the problem occurs my fio-driven I/O is stopping. I wrote a script to monitor the number of single-plane reads performed on my storage unit. Initially, when the fio job is running well, that count advances at an appropriately high rate. After the problem, it drops to a low background level reflecting some maintenance activity going on.
[03:09:01] I should say that one of our developers got involved in this a couple of days ago and told me that there are no signs of any error condition over on the product side when this happens.
[03:09:20] I don't know that I have 100% confidence in that, but my confidence level is at least fairly high.
[03:10:58] For all I know that explains the change to the fio output line - maybe the bandwidth and IOPS information vanishes because there's none to report.
[03:11:35] But there's still the fact that before the problem happens I can interrupt the fio/spdk job with Ctrl-C, and after it happens I can't - I have to get the process id and kill -9 it. And the -9 is required.
[03:14:06] My storage unit appears as two separate NVMe drives (it's designed to connect to two separate processing cores in a high-availability system).
[03:14:17] But I've established that this problem will occur using either one alone.
[03:15:26] Oh, and it remains the case that I am 99% sure that this issue has not occurred when using perf. I'm about to test that some more, now that I've got the ability to watch performance reported from the storage side during the job.
[03:21:54] It's hard to get a fully accurate match on the reported I/O rates; our unit supports compression, and what I'm measuring on the product side is physical flash operations.
[03:25:00] Anyway, I need to be quiet here - this doesn't look like it's an spdk issue. I'll ask one last question - a couple of days ago one of you gave me a rough outline of how to get diagnostic / logging information out of spdk. It would be helpful to have spdk tell me what it thinks is wrong when I get fio to roll over, if possible. Can I ask for a repeat of that? It involved a
[03:25:02] configure-time option, I believe.
[03:25:50] I guess the answer may be "nothing" - from spdk's perspective it may just be that fio stops asking for I/O.
[03:25:58] That seems likely.
[08:17:29] KipIngram: most of us have been at a conference yesterday and today, so sorry for the lack of responses
[08:20:39] (spdk/master) pkgdep: Prefer make -C to cd (Ben Walker)
[08:20:39] Diff URL: https://github.com/spdk/spdk/compare/073f2dd8f28d...ba502c17c517
[08:25:10] (spdk/master) autotest/freebsd: fix contigmem.ko path when using external dpdk (Darek Stojaczyk)
[08:25:10] Diff URL: https://github.com/spdk/spdk/compare/ba502c17c517...d68ee5b431ce
[08:26:18] @bwalker: To the contrary, you guys are HUGELY responsive. I think it's doubtful my issue is really an spdk issue, but you've all still tried to help.
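A minimal sketch of the interrupt-then-kill sequence described at [03:11:35], assuming the stuck job shows up as a plain "fio" process (the process name and the pidof lookup are assumptions, not from the log):

```
# Before the hang, Ctrl-C (SIGINT) stops the fio/spdk job normally.
# After the hang, SIGINT is ignored and only SIGKILL works, as reported above.
pid=$(pidof fio)                               # assumed process name
kill -INT "$pid"                               # same signal as Ctrl-C
sleep 5
kill -0 "$pid" 2>/dev/null && kill -9 "$pid"   # still running -> force-kill
```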
[08:37:49] (spdk/master) env/pci: do the sleep-after-reset workaround only in primary process (Darek Stojaczyk)
[08:37:49] Diff URL: https://github.com/spdk/spdk/compare/d68ee5b431ce...83b1102f7b0f
[09:30:55] jimharris: can you take a look at the last comments in https://review.gerrithub.io/#/c/spdk/spdk/+/424973/ just in case you want to add your 5 cents to it :)
[11:47:47] KipIngram - --enable-debug
[11:49:21] and then pass "-L all" to your application
[11:49:35] or use -h to show the different options for -L
[17:10:01] Hmmm. Well, I happened to leave a shell into the NVMe controller of our product open today - I forgot it was open. But when I came back to it later it had a bunch of these in it:
[17:10:04] _pci_nvme_queue_cqe: Dropping ncmd 0xfcc8610 from cq 0 with ctrlr 0 Not En & Rdy
[17:10:25] Seems to be deciding that it needs to drop some commands. That would tie right in to timeout errors on the host side.
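A rough sketch of the debug-logging recipe given at [11:47:47]-[11:49:35], assuming a build from the SPDK source tree; the application name is a placeholder, and exactly how the -L flags get threaded through a fio-plugin run may differ, so check -h first:

```
# Rebuild SPDK with debug logging compiled in.
./configure --enable-debug
make

# Then pass -L to the SPDK application driving the I/O:
#   -L all   enables every registered log flag
# -h lists the individual flag names this build recognizes.
<your_spdk_app> -L all    # placeholder for the actual application
```

As acknowledged at [03:25:50], this may still report nothing if, from spdk's perspective, fio simply stops submitting I/O.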