Cisco Live 2024 Wrap-Up – Cisco Nexus HyperFabric, Disaggregated Scheduled Fabric (DSF) and SONiC

Last week, I attended my 11th Cisco Live in person, in fabulous Las Vegas. This post is my Cisco Live 2024 wrap-up.

I can already tell you that the next edition of Cisco Live US will be held June 8-12, 2025 in San Diego, California. If my company agrees to send me there, I’m already looking forward to it, because San Diego is a wonderful city.

If you’d like to be notified when the registration opens, you can subscribe here: https://www.ciscolive.com/global/cisco-live-2025.html

But that’s not the main purpose of this post. I would like to share with you the two main topics that interested me most at Cisco Live 2024.

If you don’t know me, I am a data center infrastructure group lead at an HPC center in Switzerland, and my background is network engineering. So my main interests are, of course, data centers and AI/ML infrastructures, which explains my choices below. And yes, I know, AI/ML is very trendy today, but keep in mind that a few HPC centers around the world have been doing this for decades, while the rest of the world has only been talking about AI/ML for a year or so, thanks to ChatGPT.

But before giving you my Cisco Live 2024 Wrap-up, I would like to share some doubts I have about the AI/ML hype we hear and see everywhere:

Will enterprises really deploy AI clusters on-premises at scale?

I have to say that Cisco is making a big effort to help enterprises deploy and operate AI clusters in their on-premises data centers, and I think this is a good thing. I’ll talk about this below. But first, I’d like to explain my doubts about this “AI hype” in detail, and why I’m not convinced that so many companies will have the need, the budget, and the skilled personnel to deploy and operate a dedicated medium-sized AI cluster. To illustrate what I mean by medium-sized: I’ve recently seen several use cases and white papers that talk about a 256-GPU cluster. Let’s quickly see what that means in terms of network, power, and cooling requirements:

  • 256 GPUs means 256 NICs at 400Gb to connect to the non-blocking network, also known as the HSN (High-Speed Network) in the HPC world, or the backend network in the various Cisco presentations and blueprints. That means 16 leaf switches with 32 ports at 400Gb, where each leaf connects 16 NICs and has 16 uplinks towards the spines. For this, we could use the Cisco 8101-32FH-O (picture below) or the Cisco Nexus 9332D-H2R / 9332D-GX2B.
    And then, for the 16 x 16 = 256 uplinks, 4 spines with 64 ports at 400Gb, like the Cisco 8111-32EH-O (2nd picture below; 32 x 800G ports, used as 64 x 400G with breakout) or the Cisco Nexus 9364D-GX2A. Or alternatively, 8 spines with 32 ports at 400Gb.

 

  • Not to mention the front-end network, which will certainly be a little smaller, to connect the cluster to the outside world and to substantial storage. If a company needs to process AI workloads, there is certainly a lot of data involved!
  • Finally, in addition to these two, a third, lower-speed network is also needed to manage the compute nodes, the storage, and the network.

  • Then, there are the power consumption and cooling parts: a single NVIDIA DGX H100 (8 x H100 GPUs, picture above) consumes up to 10.2 kW, and 256 GPUs require 32 of these servers, so about 326 kW for the compute nodes alone. Plus, as we saw above, between 30 and 50 switches if we count the three networks, then the storage, and a few nodes for management, logs, etc. This easily goes up to 500 kW. Half a megawatt, no less! And this energy must also be dissipated with an efficient cooling system.
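The sizing above can be reproduced with a few lines of arithmetic. Here is a minimal Python sketch, assuming one 400Gb NIC per GPU and leaf ports split evenly between NICs and uplinks (1:1, non-blocking). The function name and constants are my own illustration, and the 500 kW total is only the rough estimate from the text, not a measured value.

```python
import math

def size_nonblocking_fabric(num_gpus, leaf_ports, spine_ports):
    """Size a two-tier leaf-spine fabric with a 1:1 downlink/uplink ratio."""
    down_per_leaf = leaf_ports // 2            # ports facing the GPU NICs
    up_per_leaf = leaf_ports - down_per_leaf   # ports facing the spines
    leaves = math.ceil(num_gpus / down_per_leaf)
    uplinks = leaves * up_per_leaf
    spines = math.ceil(uplinks / spine_ports)
    return {"leaves": leaves, "uplinks": uplinks, "spines": spines}

GPUS = 256
GPUS_PER_SERVER = 8       # one DGX H100 hosts 8 GPUs
SERVER_KW = 10.2          # power figure quoted above for one DGX H100
ROUGH_TOTAL_KW = 500.0    # rough overall estimate from the text (assumption)

fabric_64 = size_nonblocking_fabric(GPUS, leaf_ports=32, spine_ports=64)
fabric_32 = size_nonblocking_fabric(GPUS, leaf_ports=32, spine_ports=32)

servers = GPUS // GPUS_PER_SERVER
server_kw = servers * SERVER_KW
other_kw = ROUGH_TOTAL_KW - server_kw  # switches, storage, management nodes

print("64-port spines:", fabric_64)
print("32-port spines:", fabric_32)
print(f"{servers} servers: {server_kw:.1f} kW, remaining budget: {other_kw:.1f} kW")
```

With these inputs, the sketch gives 16 leaves and 256 uplinks, served by either 4 spines of 64 ports or 8 spines of 32 ports, and 326.4 kW for the 32 servers alone.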

This is a quick example based on the case studies I saw at Cisco Live. But for your reference, in March 2024 Cisco published the “Cisco Data Center Networking Blueprint for AI/ML Applications” with even bigger examples, for 512- and 1024-GPU clusters: https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-data-center-networking-blueprint-for-ai-ml-applications.html

In this document, they talk about 32 leaf and 16 spine switches, so 48 Cisco Nexus 9364D-GX2A switches for the backend network alone.
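These numbers are consistent with a simple two-tier calculation. The Python sketch below is my own cross-check, not something taken from the blueprint: it assumes the figure refers to the 1024-GPU case, 64-port 400Gb switches at both tiers, one NIC per GPU, and a 1:1 downlink/uplink split on each leaf. The function name is hypothetical.

```python
def two_tier_counts(num_gpus, ports):
    """Return (leaves, spines) for a non-blocking two-tier fabric."""
    down = ports // 2                         # NIC-facing ports per leaf
    up = ports - down                         # spine-facing ports per leaf
    leaves = -(-num_gpus // down)             # ceiling division
    spines = -(-(leaves * up) // ports)       # ceiling division
    return leaves, spines

PORTS = 64
leaves, spines = two_tier_counts(1024, PORTS)
total_switches = leaves + spines

# Upper bound of the two-tier design: every spine port feeds one leaf,
# and each leaf offers half of its ports to GPUs.
max_gpus = PORTS * (PORTS // 2)

print(leaves, spines, total_switches, max_gpus)
```

Under these assumptions, a 1024-GPU backend uses half of what two tiers of 64-port switches can support (2048 GPUs) before a third tier would be needed.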

 

All this to say that I’m not convinced that many companies would invest in such an infrastructure on-premises. It also takes skills to manage the infrastructure: the network, the storage, and the compute parts. Cisco has a solution for this, which I’ll talk about just below. But a company also needs skilled personnel to manage the software side. And when a company has such an investment-intensive infrastructure, it has to be used 24/7. There’s a lot of work to be done on the software side to adapt the different applications to the GPUs, as well as operational work to maintain the scheduler, ensure the workloads are running and well managed, etc.

Cisco Nexus HyperFabric

So, back to my Cisco Live 2024 Wrap-up. The first announcement I heard on Tuesday morning at the keynote was the Cisco Nexus HyperFabric.

In a nutshell, this is a turnkey solution for deploying and managing an infrastructure supporting AI workloads. We’re not very far from HPC here, at a smaller scale. We could call this an enterprise-scale HPC cluster. Maybe I found a new buzzword: E-HPC, for Enterprise High Performance Computing.

But Cisco prefers to speak about an “enterprise-ready AI platform”, which includes the following:

  • Cisco cloud management capabilities to simplify IT operations across all phases of the workflow.
  • Cisco 6000 series switches for spine and leaf that deliver 400G and 800G Ethernet fabric performance.
  • Cisco Optics family of QSFP-DD modules to offer customer choice and deliver super high densities.
  • NVIDIA AI Enterprise software to streamline the development and deployment of production-grade generative AI workloads
  • NVIDIA NIM inference microservices that accelerate the deployment of foundation models while ensuring data security, and are available with NVIDIA AI Enterprise
  • NVIDIA Tensor Core GPUs starting with the NVIDIA H200 NVL, designed from the ground up to supercharge generative AI workloads with game-changing performance and memory capabilities.
  • NVIDIA BlueField-3 data processing unit DPU processor and BlueField-3 SuperNIC for accelerating AI compute networking, data access and security workloads.
  • Enterprise reference design for AI built on NVIDIA MGX, a modular and flexible server architecture.
  • The VAST Data Platform, which offers unified storage, database and a data-driven function engine built for AI.

Please note that this list is a copy/paste from the Cisco press release.

Despite what I said above about the number of companies that would have a use case for running AI workloads on-premises, I think this product is a pretty good idea. It makes it possible to have a turnkey infrastructure that is cloud-managed and well integrated, which is not necessarily easy to achieve, as it’s not a very common architecture.

It will be interesting to look at it in more detail when the product is out, and to see how it evolves later. The Ethernet integration between Cisco switches, NVIDIA SmartNICs (DPUs), and NVIDIA GPU nodes, plus the VAST storage, is already promising.

But I have a question to which I haven’t found an answer yet. The product page says: “Cisco 6000 switches based on Cisco Silicon One”, and then: “800G lossless Ethernet, industry-leading 800G optics, and 51.2 Tbps Cisco Silicon One”. So this is very likely a switch based on the Cisco Silicon One G200. But what is this Cisco 6000 platform?

Link to the Cisco product page: https://www.cisco.com/site/us/en/products/networking/networking-cloud/data-center/nexus-hyperfabric/index.html

  

Disaggregated Scheduled Fabric (DSF), Cisco 8000 with Silicon One ASICs, and SONiC

The other two breakout sessions that were particularly interesting for me were:

  • Disaggregated Scheduled Fabric: How to Make a Network Fabric Scale for AI/ML Clusters – BRKNWT-3406
  • Ethernet-based Fabric for AI cluster – powered by Silicon One & SONiC, an ultra-high performance, scalable & non-blocking Ethernet fabric. – BRKCOC-3005

The two sessions are complementary. In fact, they talk about the same case study: deploying a non-blocking backend network with Cisco 8000 / Silicon One equipment, and SONiC as the network OS (NOS), for a 256-GPU cluster.

If you have access to the Cisco Live portal, I strongly recommend that you take a look at these sessions. They are very interesting, particularly the parts on packet spraying and re-ordering done at the leaf level. This is a promising evolution of Ethernet. And with the arrival of 800Gbps-based fabrics, I think that Ethernet in the data center, along with the upcoming Ultra Ethernet, has a bright future ahead of it.

 

