thunderbolt-ibverbs: We have InfiniBand at home

(blog.hellas.ai)

113 points | by zdw 2 days ago

15 comments

  • l1k 9 hours ago

    This is using Thunderbolt networking as transport, which incurs a bit of overhead.

    But starting with the upcoming Linux v7.2, there's a new feature called USB4STREAM to use raw Thunderbolt packets as transport with minimum overhead and a super simple user interface:

    https://lore.kernel.org/r/20260511102744.1867485-1-mika.west...

    Release of v7.2-rc1 is predicted for Jul 5, that's when this will first be available as a tarball. Until then you have to clone from thunderbolt.git/next:

    https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thun...

    Or alternatively linux-next:

    https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-n...

    Press coverage:

    https://www.phoronix.com/news/Intel-Linux-USB4STREAM

    • grw_ 5 hours ago

      author here! it's not on top of USB4NET, no (RXE can already do that, it's compared in the benchmarks). it's built with the same tb primitives as the networking stack obviously, just assembled differently to emulate a verbs device instead of a nic. happy to answer any other q!

      • eqvinox 3 hours ago

        A bit unclear, maybe I missed it in the article: how much of InfiniBand is there? Is it just the verbs interface? Or are there some (higher) layers of InfiniBand actually carried as bits on the cable? It can't be "bridged" into actual InfiniBand, right?

        • grw_ an hour ago

          I actually didn't know there was more to InfiniBand than verbs (at least at this abstraction level, above PHY), so probably the answer is 'not much more'. The device imitates a RoCE v2 device and the higher level abstractions I used on top were GPU-ish libraries like NCCL and JACCL.

          Good q about 'bridging into actual InfiniBand', I don't know the answer there either. My naive understanding would be that: since this is host-initiated RDMA (it's still the host cpu invoking into dma buffers, though they may be device-memory mapped), actually it should work fine, at least between two machines? I'm curious enough to try- I have a couple of machines with thunderbolt AND RoCE-capable NICs- the experiment is to see if we can use this across diverse transports simultaneously? I think this is what it does already (since the MacOS FA57 vs linux native are already 'different transports'), but say if you have a better scenario to demonstrate what 'bridging into actual infiniband' would look like!

    • eqvinox 7 hours ago

      > This is using Thunderbolt networking as transport,

      Are you sure? It doesn't sound like it in some places in the text, e.g.:

      >> a kernel driver that sits alongside thunderbolt-net, allocating DMA rings from the controller's NHI port in the same way

      but I don't have the domain knowledge to tell…

      • adrian_b 6 hours ago

        Yes, the description from TFA does not match the traditional Thunderbolt networking protocol, whose performance may be as low as that of a 10 Gb/s Ethernet interface.

        The description from TFA matches what the poster above you said about a new Linux device driver that allows access to the raw Thunderbolt protocol for transferring data between computers. This appears to be an independent implementation of the same principle as in the device driver that will be merged in the mainline Linux.

        While the official Linux device driver makes the raw Thunderbolt appear like a file, which can be written and read to transfer data, this implementation emulates an Infiniband interface, which presumably was simpler to use for distributing work over multiple GPUs.

        They actually mention that with traditional Thunderbolt networking on the same computers, they had obtained only 9 Gb/s, i.e. more than 5 times slower than what they obtained with raw Thunderbolt.

        • scottlamb 4 hours ago

          > traditional Thunderbolt networking protocol ... performance may be as low as that of a 10 Gb/s Ethernet interface.

          Ouch. Why so much lower than the physical bandwidth (or what they've achieved here)?

          • grw_ 3 hours ago

            A USB4 40Gbps cable consists of two 20G tx/rx pairs. The in-kernel networking implementation is single-stream and just uses one pair, and won't e.g. stripe across both pairs or across multiple cables, which was the main bandwidth unlock in TFA. Doing so would be a much more complicated undertaking, since now you've re-introduced out-of-order delivery, which complicates re-assembly of large packets, retries, handling loss etc. The verbs interface is a lot simpler than that of a full IP stack, so although it was possible to get this working across rails, it may not be so simple for something pretending to be Ethernet.
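            To make the reordering problem concrete, here is a minimal Python sketch of striping a message across rails and putting it back together. It is only an illustration, not how thunderbolt-ibverbs or thunderbolt-net is implemented: `Chunk`, `stripe` and `reassemble` are hypothetical names, and loss and retries are not modelled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    seq: int        # position of this chunk in the original message
    payload: bytes


def stripe(message: bytes, chunk_size: int, rails: int) -> list[list[Chunk]]:
    """Cut the message into chunks and deal them round-robin onto the rails."""
    queues: list[list[Chunk]] = [[] for _ in range(rails)]
    for seq, offset in enumerate(range(0, len(message), chunk_size)):
        queues[seq % rails].append(Chunk(seq, message[offset:offset + chunk_size]))
    return queues


def reassemble(arrivals) -> bytes:
    """Rebuild the message from chunks that may arrive in any order."""
    pending: dict[int, bytes] = {}
    next_seq = 0
    out = bytearray()
    for chunk in arrivals:
        pending[chunk.seq] = chunk.payload
        # Flush every chunk that is now contiguous with what we already have.
        while next_seq in pending:
            out += pending.pop(next_seq)
            next_seq += 1
    if pending:
        raise ValueError("gap in sequence numbers: a chunk never arrived")
    return bytes(out)


message = bytes(range(256)) * 4
rail_a, rail_b = stripe(message, chunk_size=100, rails=2)
# Worst case for ordering: everything on rail B lands before anything on rail A.
restored = reassemble(rail_b + rail_a)
```

            Even in this toy version the receiver has to buffer everything that arrives ahead of the next expected sequence number, which is the extra bookkeeping a single-rail link never needs.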

            • scottlamb 3 hours ago

              > now you've re-introduced out-of-order delivery which complicates re-assembly of large packets, retries, handling loss etc.

              Still confused though. For a standard TCP/IP networking stack, that support is all there anyway, as it's not meant for point-to-point links, and out-of-order delivery is a thing that happens on the Internet. I haven't tried thunderbolt-net, but it says it implements Apple's ThunderboltIP, so I'd expect it's IP-based networking on top, and so it'd all work? Is it that out-of-order delivery is far more common than usual, and this path is so much slower (by impairing LRO/GRO) that it's not worth aggregating at all?

              I'd understand if each pair is logically represented as a separate networking device, and then you have to set up link aggregation on top of that. (And iirc at least with some forms of aggregation a particular flow is bound to one link, so you'd have to have a bunch of streams to actually get bandwidth benefits.) So caveats for sure but I'd expect something to be possible. But does it just not support using both pairs at all?

              Even with using one pair I still don't understand why you'd only get about 10G rather than 20G on a pair. I do see chapter 4 of the (your?) article talks about the single DMA ring maybe imposing the 10 Gbps limit but I don't have any good intuition for why. I don't know say how large the rings are or what latencies to expect on their operations or what packet sizes are supported which might help me understand.

              • grw_ an hour ago

                Yeah, thunderbolt-net is IP on top and it does work as you say, with a few caveats:

                - On a single cable with two rails available, thunderbolt-net grabs one and uses that. Without patching the kernel, there's no way to make it present a second interface using the remaining pair.

                - If you had a second cable between the machines (for 4 total rails), thunderbolt-net will still only grab one rail, because the abstraction across which it's making the links sees an identical peer at the end of both links and so falls into the same trap as above. There is no LRO/GRO anyway (or it's buggy- I forget) on the linux version.

                - Why you only get 10G rather than 20G on single pair- actually, this might be something specific to the Strix Halo SoC that I was testing on- on a different (still AMD) chipset and an Apple TB5 Mac I did see closer to 22G in one direction, but still 8 in the other. The Strix Halo NHI seems to be 'stripped down' (as expected, for mobile) in ways I don't really understand.

                - Intuition on why- I can't point you to the line number, but I think it has to do with a fixed 4 KB page size when communicating with the NHI that ends up becoming a bottleneck, perhaps 16 KB pages on Apple aarch64 help here?
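                A back-of-the-envelope way to look at the page-size hunch, as a Python sketch. It assumes one DMA buffer per page and ignores framing overhead; both are simplifications of mine and not something measured in the article. `buffers_per_second` is a hypothetical helper.

```python
def buffers_per_second(link_gbps: float, buffer_bytes: int) -> float:
    """Buffers the host must hand to the NIC/NHI each second to fill the link."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second / buffer_bytes


# One 20 Gb/s rail, with 4 KiB versus 16 KiB buffers.
rate_4k = buffers_per_second(20, 4096)
rate_16k = buffers_per_second(20, 16384)
print(f"4 KiB buffers:  {rate_4k:,.0f} per second")
print(f"16 KiB buffers: {rate_16k:,.0f} per second")
```

                Under those assumptions, larger buffers cut the per-buffer work fourfold for the same throughput. Whether that per-buffer cost is the actual limit on Strix Halo is still a guess.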

  • mkesper 12 hours ago

    Kudos for the idea and being fully open to the state of this project (AI code, expect breakage)!

    • kjs3 7 hours ago

      Thanks for saying this. I'm as 'get off my lawn' about AI as any oldster at this point, but if all projects were this up front about what and how they're doing things I'd have a lot fewer reasons to grumble.

  • speedbird 2 hours ago

    Nice project, but you should be able to get InfiniBand up between a pair of Linux boxes with cheap adapters / cables off eBay.

    • grw_ an hour ago

      Thanks and yes real infiniband works better I agree, but it's still hundreds of dollars and days (at best!) of time. This gives you 90% of the benefits with a cable you probably already own

  • trumpdong 6 hours ago

    Fascinating. Infiniband is already fascinating, running it on something else is more fascinating.