
Next-gen AI storage: Micron® SSDs, WEKA™, AMD EPYC™ and Supermicro

Ryan Meredith | January 2023

For Supercomputing 2022, the Micron® Data Center Workload Engineering team, WEKA, AMD, and Supermicro joined forces to be the first to test 4th Gen AMD EPYC platforms in a WEKA distributed storage solution for AI workloads.

We deployed a solution that took advantage of the best in bleeding-edge hardware and software and used a new benchmark from the MLPerf™ Storage working group to measure its ability to support demanding AI workloads.

When I first posted this work on LinkedIn, I learned that this group was the first to test MLPerf Storage at scale and the first to test WEKA on AMD Genoa processors. Liran Zvibel (co-founder and CTO at WEKA) commented that he was pleased the process had gone so smoothly, since there is often some difficulty “running for a first time on a completely new platform (new PCIe® bus, new CPU, etc.).”

WEKA version 4 expands its software-defined storage stack to increase scalability and performance per node, which is necessary to take advantage of next-generation systems. According to WEKA, it also:

Is a data platform designed for NVMe™ and modern networks.

Improves bandwidth, IOPS, and metadata performance, with reduced latency.

Supports broad, multiprotocol access to data on-premises or in the cloud.

Is faster than local disks for mixed workloads and small files, without requiring tuning.

Supermicro provided six of its new AS-1115CS-TNR systems for the WEKA cluster nodes. These platforms take advantage of 4th Gen AMD EPYC CPUs along with a PCIe® Gen5 backplane. The systems under test were configured as follows:

4th Gen AMD EPYC 9654P CPU (96 cores)

12x Micron DDR5 4800MT/s RDIMMs

10x Micron 7450 NVMe SSDs

2x NVIDIA® ConnectX®-6 200GbE NICs

We deployed this solution taking advantage of Micron DDR5 DRAM, which provides higher throughput and faster transfer speeds than previous-gen DDR4.

We also used the Micron 7450 NVMe SSD, which is built with Micron 176-layer NAND with CMOS under the Array (CuA). It combines high performance with excellent quality of service, providing superior application performance and response times.

For networking, we used NVIDIA ConnectX-6 200GbE NICs, with two NICs per storage node and one NIC per client. We recommend using the PCIe Gen5 400GbE NVIDIA ConnectX-7 NIC when it becomes available to simplify network configuration and deployment at similar performance.


Baseline results

We tested FIO performance across the 12 load-generating clients to measure maximum system throughput, scaling queue depth (QD) from 1 to 32 per client across all clients.
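To make the methodology concrete, here is a minimal sketch of such a QD sweep, assuming fio is installed on each client and the WEKA filesystem is mounted at /mnt/weka. The hostnames, file size, and runtime below are illustrative placeholders, not our exact job parameters:

```python
#!/usr/bin/env python3
"""Sketch: sweep fio queue depth across clients and sum the throughput."""
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# 12 load-generating clients (hypothetical hostnames)
CLIENTS = [f"client{i:02d}" for i in range(1, 13)]

def run_fio(host: str, qd: int) -> float:
    """Run a 1MB sequential-read job on one client; return bandwidth in GB/s."""
    cmd = ["ssh", host, "fio",
           "--name=seqread", "--rw=read", "--bs=1M",
           "--ioengine=libaio", "--direct=1", f"--iodepth={qd}",
           "--size=100G", "--runtime=60", "--time_based",
           "--filename=/mnt/weka/fio.dat",
           "--output-format=json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    job = json.loads(out)["jobs"][0]
    return job["read"]["bw_bytes"] / 1e9  # bytes/s -> GB/s

for qd in (1, 2, 4, 8, 16, 32):
    # Launch all clients concurrently so the sum reflects aggregate load.
    with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
        total = sum(pool.map(lambda h: run_fio(h, qd), CLIENTS))
    print(f"QD={qd:2d}: {total:6.1f} GB/s aggregate 1MB sequential read")
```

The same sweep applies to the write and random-IOPS cases by swapping the --rw and --bs parameters.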

[Chart: 1MB sequential read throughput]
[Chart: 1MB sequential write throughput]

We reached 142 GB/s for 1MB reads and 103 GB/s for 1MB writes. The write throughput is staggering when accounting for the 4+2 erasure coding scheme that WEKA uses. This is enabled by the extremely high compute performance of the 4th Gen AMD EPYC CPU and the increased performance of Micron DDR5 DRAM.
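As a rough sanity check (assuming the 4+2 scheme writes six blocks to the backend for every four blocks of client data), the NVMe-level write rate behind that 103 GB/s is approximately:

$$103\ \text{GB/s} \times \frac{4+2}{4} \approx 155\ \text{GB/s}$$

or about 2.6 GB/s of sustained writes per SSD across the cluster's 60 drives.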

[Chart: 4KB random read IOPS]
[Chart: 4KB random write IOPS]

On random workloads, we measured 6.3 million 4KB read IOPS and 1.7 million 4KB random write IOPS. These reflect excellent small-block random performance from the cluster, enabled by the performance and latency of the Micron 7450 NVMe SSD along with WEKA’s focus on better-than-local small-block NVMe performance.
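For scale, a back-of-the-envelope estimate (assuming reads spread evenly across the 6 nodes x 10 SSDs) puts the per-drive load at:

$$\frac{6.3\times10^{6}\ \text{read IOPS}}{60\ \text{drives}} \approx 105{,}000\ \text{read IOPS per drive}$$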

AI/ML工作负载:MLPerf 存储

The MLPerf Storage benchmark is designed to test realistic storage performance for AI training across multiple models. It uses a measured sleep time to simulate the time it takes for a GPU to request data, process it, and then request the next batch. These steps create an extremely bursty workload where storage hits its maximum throughput for short periods followed by sleep (a sketch of this access pattern follows the list below). This AI benchmark has some major advantages:

  • Focuses on the storage impact in AI/ML
  • Has a realistic storage and preprocessing setup
  • Does not require GPU accelerators to run
  • Can generate a large data set per model from seed data
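Here is an illustrative sketch of that bursty access pattern: read a batch at full speed, then sleep for the measured per-batch accelerator compute time. The real benchmark also handles sharding, shuffling, and multi-process scaling; the path and timing constants below are assumptions, not values from the benchmark:

```python
#!/usr/bin/env python3
"""Sketch of the MLPerf Storage access pattern: bursty reads plus sleep."""
import os
import time

DATA_DIR = "/mnt/weka/unet3d"  # generated dataset location (assumed path)
BATCH_SIZE = 4                 # samples per simulated training step
COMPUTE_TIME_S = 0.05          # measured per-batch GPU compute time (placeholder)

files = sorted(os.path.join(DATA_DIR, name) for name in os.listdir(DATA_DIR))
for step, i in enumerate(range(0, len(files), BATCH_SIZE)):
    start = time.monotonic()
    # Storage burst: read a full batch as fast as the filesystem allows.
    nbytes = 0
    for path in files[i:i + BATCH_SIZE]:
        with open(path, "rb") as f:
            nbytes += len(f.read())
    elapsed = time.monotonic() - start
    print(f"step {step}: {nbytes / elapsed / 1e9:.2f} GB/s during burst")
    # Simulated GPU compute: storage sits idle, which makes the load bursty.
    time.sleep(COMPUTE_TIME_S)
```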

We tested the following setup:

  • MLPerf Storage v0.4 (preview)
  • Workload: medical image segmentation training
  • Model: Unet3D
  • Seed data: KiTS19 image set
  • Generated dataset size: 2TB (500GB x 4)
  • Framework: PyTorch
  • Simulated GPU: NVIDIA A100
[Chart: throughput vs. number of MLPerf Storage processes]

One important aspect of this benchmark is that each MLPerf process represents a single GPU running the AI training process. Scaling up MLPerf Storage processes reaches a maximum throughput of 45 GB/s; however, per-process performance begins to decrease at around 288 processes. That data point represents 288 NVIDIA A100 GPUs running Unet3D medical image segmentation training simultaneously, which, at eight GPUs per system, is the equivalent of 36 NVIDIA DGX A100 systems!
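For context, simple division (assuming throughput divides evenly across processes) shows how hard each simulated accelerator is still being fed at that saturation point:

$$\frac{45\ \text{GB/s}}{288\ \text{processes}} \approx 156\ \text{MB/s per simulated A100}$$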

Want to know more?

Be sure to check out the following resources:

Director, Storage Solutions Architecture

Ryan Meredith

Ryan Meredith is director of Data Center Workload Engineering for Micron's Storage Business Unit, testing new technologies to help build Micron's thought leadership and awareness in fields like AI and NVMe-oF/TCP, along with all-flash software-defined storage technologies.