Thanks again for the time and effort you are investing in this response. I realize that in all likelihood I am missing something obvious.
My apologies in advance for any bugs in my drawing of the right side (green side) of the diagram. That part is working in my test setup, as is the pfSense router/DHCP VM. It has been a while since I looked at the green parts in detail, having focused on the other colors instead, which I can't model in my test setup.
I have the LAN side of pfSense connected to a regular vSwitch on a single machine, and VMs attached to that LAN vSwitch pick up DHCP leases from pfSense just fine.
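For what it's worth, that test setup is nothing more than a standard vSwitch with a port group that the pfSense LAN vNIC and the test VMs share. Roughly the equivalent of the following pyVmomi sketch (host name, credentials, and the "pfSense-LAN" / "vSwitchLAN" names are placeholders from my lab, not the real environment):

```python
# Rough sketch (pyVmomi) of the test-lab standard vSwitch that the pfSense LAN
# interface and the test VMs share. Host name, credentials, and names are
# placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="esxi-test.example.local", user="root",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = view.view[0]  # single test host

net_sys = host.configManager.networkSystem

# Standard vSwitch; pfSense's LAN vNIC and the test VMs all attach to the
# "pfSense-LAN" port group and get their leases from pfSense.
net_sys.AddVirtualSwitch(
    vswitchName="vSwitchLAN",
    spec=vim.host.VirtualSwitch.Specification(numPorts=128))
net_sys.AddPortGroup(
    portgrp=vim.host.PortGroup.Specification(
        name="pfSense-LAN", vlanId=0, vswitchName="vSwitchLAN",
        policy=vim.host.NetworkPolicy()))

Disconnect(si)
```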
In practice, there will be more than four VMs in total connected to the Distributed Switches on the two ESXi hosts. I drew four to illustrate the use cases, and because I don't have space on the page to draw 20-30 VMs.
You are correct that the VMkernel port on the green side currently only supports management. I suspect there will need to be VMkernel ports connected to DSwitch A and DSwitch B that support vMotion, etc. What I don't know, and have not been able to find on the Internet, is which of the five or six services one can enable on a VMkernel port is the one used to move data between Distributed Switches. Is it "Management"? Is it "vMotion"? Is it something else? /Something/ has to move the Distributed Switch data between two machines in a cluster. What is that something? I suspect that the Distributed Switches on two cluster members are somehow connected via a VMkernel port on each member with one particular service enabled. I would love to know which service that is, because I think it would make things much clearer in my mind.
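To make the question concrete, here is what I mean by the "properties" of a VMkernel port, sketched with pyVmomi (host name, credentials, and "vmk1" are placeholders, and which service tag is the right one is exactly what I'm asking -- "vmotion" below is only an example):

```python
# Sketch (pyVmomi): selecting a service type for an existing VMkernel adapter.
# Which of these service types (if any) carries inter-host traffic for a
# Distributed Switch is the open question; "vmotion" is used only as an example.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="esxi-a.example.local", user="root",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = view.view[0]

vnic_mgr = host.configManager.virtualNicManager

# The selectable service types correspond to the checkboxes in the UI:
# "management", "vmotion", "faultToleranceLogging", "vSphereReplication",
# "vSphereProvisioning", "vsan", ...
vnic_mgr.SelectVnicForNicType("vmotion", "vmk1")

Disconnect(si)
```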
One reason I went with egress via the 2x 40Gbps NICs is that it allows us, for certain load-testing scenarios, to run one workload VM on each ESXi host using the full set of computing resources each host provides, and to most closely model how two systems running the software might be connected in real life. Also, we managed to pick up the 2x 40Gbps NICs for a very reasonable price. Alas, a QSFP+ switch exceeds the budget for this project; hence the DACs.
Reading your response carefully, I believe one area where I may have gone wrong is that I was under the impression that I had to assign IP addresses to the interconnect NICs.
Finally, to answer your question about why we went the DHCP route for the workload VMs in the first place: it was to allow easy migration of the VMs between the two /24s feeding into the two cluster members, and also to allow some of the VMs to run at a DR site that uses yet another IP address space, all without having to manually renumber each VM. If VMs are moved between hosts, sites, and address spaces, all we have to do is load the IP address corresponding to that VM and that site into DNS, and at most an hour later that VM will be up, running, and accessible, be it on the other cluster member or on a server half a world away.
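Concretely, the "renumber via DNS" idea is nothing more sophisticated than keeping a per-site mapping of VM name to IP address and publishing the record for wherever the VM is currently running. A toy illustration (all names and addresses are made up):

```python
# Toy illustration of the DNS-based approach: each VM has one address per
# site, and "moving" a VM just means publishing the record for the site it is
# currently running at. The VM itself keeps getting its address via DHCP.
SITE_ADDRESSES = {
    "workload01": {"site-a": "192.0.2.10", "site-b": "198.51.100.10", "dr": "203.0.113.10"},
    "workload02": {"site-a": "192.0.2.11", "site-b": "198.51.100.11", "dr": "203.0.113.11"},
}

def zone_record(vm_name: str, active_site: str, ttl: int = 300) -> str:
    """Return the A record to publish for a VM running at the given site."""
    ip = SITE_ADDRESSES[vm_name][active_site]
    return f"{vm_name}.example.internal. {ttl} IN A {ip}"

if __name__ == "__main__":
    # workload01 fails over to the DR site; only the DNS record changes.
    print(zone_record("workload01", "dr"))
```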
(We plan to use Veeam for the failover, which was recommended to me in a previous inquiry to this forum and which, based on my limited testing so far, seems to do what we are looking for regarding enabling DR, as long as the VMs don't require manual renumbering when run from another site.)
I will mark your previous answer as the best answer and again thank you for your time. If you can think of any other advice, or happen to know exactly how Distributed Switches move data between cluster members, I would love to hear it.
I am deeply appreciative of your effort and your detailed responses!!!
--Marc