[00:32:57] *** Quits: darsto (~darsto@89-68-126-89.dynamic.chello.pl) (Ping timeout: 240 seconds)
[00:33:44] *** Joins: darsto (~darsto@89-68-126-89.dynamic.chello.pl)
[01:39:41] *** Joins: tkulasek (~tkulasek@134.134.139.83)
[02:31:09] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 252 seconds)
[07:28:25] <peluse> jimharris, bwalker - sort of a long email on crypto just now but at the end I mention one issue I have yet to get a handle on and wondering if you have any high level ideas on why I can pass bdevperf w/verify on everything except NVME+split even with just one split disk.  Suggestions welcome... gonna work on some other stuff today that I'm falling behind on
[07:29:41] <peluse> I was thinking something maybe to do with my block offset handling but it all looks good and even without split bdevperf wouldn't pass if that was buggy anyway. Also thought about maybe ordering of things but I'm confident I never complete a read or a write until all of the steps are done
[07:52:35] <jimharris> hi @Shuhei - good investigation :-) yes, we offset by 1 so that the bdev module can conveniently uses its context as the io_device, and many times the bdev structure is at the beginning of that context
[07:53:20] <jimharris> then the bdev layer, when it creates the bdev-generic io_device, we are confident it will pick an address that the bdev module will not have used for its io_device
[08:01:07] <jimharris> peluse: ack - have a meeting for next hour then will take a look
[08:01:57] <peluse> jimharris, thanks - no big hurry.  Have a bunch of typing and other busy work to do today anyway :(
[09:52:12] <jimharris> peluse: just posted some comments
[09:52:37] <jimharris> the really key one that you should look at immediately is whether your spdk_dma_malloc operations are ever allocating a buffer that spans a 2MB boundary
[09:52:55] <peluse> jimharris, thanks will do!
[10:14:04] <bwalker> peluse: another thought on crypto
[10:14:14] <bwalker> is there a way to determine what the real internal parallelism is for a device?
[10:14:29] <bwalker> for instance, on an NVMe device you may have queue depth 1024 available in the queue
[10:14:46] <bwalker> and our NVMe driver may allow you to queue up 64k requests in software
[10:15:03] <bwalker> but in reality, you get max performance around ~128 queue depth for reads on most NAND SSDs
[10:15:05] <peluse> yes, we can tell what the max Q depth is an whether or not it supports HW offload or not
[10:15:11] <bwalker> or somewhere between 4 and 8 on Optane
[10:15:40] <bwalker> the max queue depth is quite a bit different though - that's the maximum number of things it can queue up
[10:15:59] <peluse> not sure where you're going with that but keep in mind for a 512B device we can fill up the max Q depth with just 1 1MB IO
[10:16:11] <bwalker> I'm looking for the minimum number of operations that need to be queued up to achieve max performance
[10:16:42] <peluse> we'd have to measure that ourselves - we are not operating in the same model as other users of cryptodev
[10:17:12] <bwalker> the reason I ask is because that's the real number of mbufs you need to allocate per channel
[10:18:32] <bwalker> using more than that is just negatively impacting the CPU cache
[10:18:58] <peluse> actually we need one per crypto IO for reads and 2 for writes and given that all ops are 512B I think I've got the minimum set in the driver now to something pretty close/generic
[10:19:21] <bwalker> it's probably different for every crypto device (just like it's different for every NVMe device), but they're probably all within the same ballpark
[10:19:24] <peluse> but yeah I see what you're saying
[10:19:53] <bwalker> it might be 128ish or something
[10:19:54] <peluse> yes it certainly is but we only support one out of the gate :)  It also depends ont he LBA size of the bdev under us
[10:20:03] <peluse> 128 what?
[10:20:11] <bwalker> 128 queued up crypto requests
[10:20:20] <peluse> that's pretty tiny
[10:20:54] <bwalker> I think it's worth measuring the actual performance of using a very small number of queued crypto requests - like 128
[10:21:07] <peluse> could be though.  could be 64, could be 1024 too.  Also the busrt size for dequeue is variable as well both in terms of how many opswe need for that and how long the poller is allowed to run at once
[10:21:11] <peluse> there's a shitload of knobs
[10:21:17] <bwalker> in some ways, that may end up being better than dumping more of them
[10:21:22] <bwalker> due to caching effects
[10:21:44] <bwalker> what's the queue depth on a QAT device?
[10:24:06] <peluse> can't remember off the top of my head
[10:24:47] <peluse> I'd have to fire it up and look at the device info struct.  I *think* its like 2048, same as the Sw driver.  The SW driver has 8 QP, a QAT device has 32 devices and 2 QP each
[10:25:20] <bwalker> interesting that the sw device doesn't support more queue pairs
[10:25:27] <bwalker> it's all software, so it's entirely arbitrary
[10:26:10] <bwalker> I doubt QAT actually continues to benefit from more queued up operations beyond a fairly small number - probably don't need the full 2048 filled to get max performance
[10:28:20] <peluse> yeah, the issue is that 1 64K IO = 2048 crypto operations on a 512 block device. So we either fire them all at once (and if full drain it so we can spit the rest of them out) or we queue up the crypto operations in some manner and add a lot of latency to that IO as we'd be waiting for the poller to drain them.
[10:29:11] <bwalker> well I don't know where exactly the thresholds sit, but I think you're definitely going to need to be able to handle splitting bdev_io
[10:30:24] <bwalker> also, isn't 64k / 512 just 128?
[10:30:33] <peluse> right now I don't.  If I get a 1MB IO, for example, I'll build of the crypto ops and shove them in 1 Q depth (2048) at a time, process them, then do the next until the IO is done.
[10:30:49] <peluse> shit, didn't mean 64K sorry
[10:31:06] <peluse> 1MB
[10:31:28] <bwalker> as you put them in one at a time - are you spinning in a loop or are you waiting on the poller?
[10:31:47] <peluse> spinning, calling the poller manually from the submission code
[10:32:10] <bwalker> that's not going to work - a VM could submit a 1MB I/O to vhost, which would effectively DDOS the other VMs
[10:32:56] <peluse> hey, sorry, I have some mgmt types waiting on me to craft an email that I need to get done here in the next 30 min but do want to hear more for sure! I have some other ideas but have been focused on getting what's there to work correctly first so I have a solid baseline
[10:33:07] <bwalker> yeah not a problem
[10:33:11] <peluse> gracias man!
[10:53:59] <peluse> jimharris, your thought on the dma alloc seems to be right on the money.  As a quick hack I changed the alignment to 1MB and am running nvme w/split for the last 3-4 min without issue, before it would fail in a few seconds.  Will let it run for a few hours and assuming it still works will be curious as to what was happening :)
[10:55:47] <jimharris> if the buffer spans a 2MB boundary, it may not be physically contiguous
[10:56:42] <jimharris> obviously picking 1MB alignments doesn't work, but that was a good way to prove this theory!
[10:57:05] <jimharris> wait
[10:57:21] <jimharris> is this using QAT or the software PMD?
[10:57:49] <jimharris> my theory only applies if you're using QAT
[11:03:51] *** Quits: tkulasek (~tkulasek@134.134.139.83) (Ping timeout: 240 seconds)
[11:10:09] *** Joins: travis-ci (~travis-ci@ec2-54-224-104-71.compute-1.amazonaws.com)
[11:10:10] <travis-ci> (spdk/master) scripts/rpc.py: pass named args to log.py and net.py (heluwei)
[11:10:11] <travis-ci> Diff URL: https://github.com/spdk/spdk/compare/716ad0e389c3...66a91a04980d
[11:10:11] *** Parts: travis-ci (~travis-ci@ec2-54-224-104-71.compute-1.amazonaws.com) ()
[13:17:32] *** Joins: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241)
[15:58:02] *** Quits: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241) (Quit: Page closed)