The 2022 edition of the annual Rice Energy HPC Conference was back in person. Attendance was still down significantly, but many of the familiar faces from the Energy sector showed up, and the presentations were, for the most part, on par with previous years. One item to note is the change in the name of the event from the Rice Oil and Gas HPC Conference to the Rice Energy HPC Conference, a clear sign of the changing times within the sector. Many of the largest Oil and Gas companies have started to promote their move into supporting other forms of Energy (Wind, Solar, Biomass, etc.), and this name change was a direct result of that. Since many of these Oil and Gas companies sit on the organizing committee, the rename was clearly an opportunity to promote that change in attitude.
The conference followed its familiar two-track formula from past years. The tracks, while not strictly themed across the entire program, did seem to be arranged so that no two presentations of the same type (hardware, software, ML, HPC, etc.) overlapped, helping attendees avoid the common dilemma of having to choose between two talks they wanted to see. As such, I was able to attend all the presentations that were of real interest to me.
The show started without the traditional keynote and instead went straight into two “birds of a feather” sessions that set the tone for the conference. In both, multiple major Energy companies gave their thoughts on computing directions within their respective companies and the industry at large. The first focused on exascale computing, and the other on the introduction and growth of Machine Learning and other non-traditional HPC computing requirements within their companies.
As our company does not focus on exascale computing, I attended the latter, titled “HPC and ML Aspects of the Energy Transition.” The session included BP, Shell, TotalEnergies, and Aramco, each detailing its plans and migration activities to embrace more energy types in its portfolio and the new compute requirements and challenges these shifts were causing. The two most pervasive themes were the challenges of shifting HPC resources from traditional seismic processing workloads to CFD and Machine Learning workloads. The new requirements were causing problems on two fronts: different hardware and staffing. The staffing needs were particularly challenging because the software is so drastically different, making it hard to retrain existing programmers. On top of that, it is nearly impossible to compete against the “sexy” technology companies in hiring younger-generation AI programmers and to convince them to join an Oil and Gas company. Hence the heavy focus on rebranding themselves as “Energy” companies.
IMAGE: Dave Driggers and Ken Arciga at the Cirrascale Cloud Services booth showing technology from Inspur and solutions from its partners Cerebras Systems, Graphcore, and NVIDIA.
A quick comment on the sponsors/exhibitors: the same regular faces were all there, including Intel, AMD, HPE, Dell, DDN, and Lenovo (replacing IBM), along with a host of new technology and cloud providers. AI/HPC accelerator companies were present, such as NEC with its Vector Engine PCIe card, SambaNova with its RDA, and Cerebras Systems with its Wafer Scale Engine. Cloud and AI were also well represented, with Cirrascale, Oracle Cloud, Microsoft Azure (unstaffed tables), Rescale, CloudyCluster, and Run.ai.
The next session of major interest to me came the following morning: a panel on Cloud and the Energy sector titled “Cloud-Native Embarrassingly Parallel Workloads.” It was a very lively panel discussion, with both Cloud Service Providers and Oil and Gas service providers participating, and it included questions from the audience. The clear outcome of the discussion surprised me. The Energy companies and their service providers were heavily leveraging clouds but were fairly unhappy with the offerings in general. Across the board, they felt there was a clear need for cloud and a willingness to invest in altering the way they use compute, but that their hard requirements were not being met or even listened to by the biggest providers. The key takeaway seemed to be that the cloud providers were not offering the scale-out, and especially the storage performance, required to run these workloads at scale. I was shocked by the repeated requests for thousands of tightly connected nodes acting as one cluster, whether for CPU- or GPU-based workloads. I wish I had $20-30 million lying around to stand up a seismic-specific GPU cloud for what these guys are clearly begging for.
The last really compelling session I want to discuss was titled “Benchmarking Considerations for Current Energy HPC Systems.” The synopsis was fairly straightforward: benchmarking new hardware architectures for future procurements. The “hook” was that it would include “evaluation of Machine Learning (ML) applications and ML-oriented architectures.” Clearly, the hook worked, because the session was standing room only. It was in the smaller auditorium, but even the moderator was shocked at how full the room was. I got to the session about five minutes early, but the speaker, Mauricio Araya-Polo, was already presenting; I think he knew his presentation would take longer than the allotted time, so they got going early. Mauricio broke the talk into two primary pieces: how the testing was constructed and the results from “typical” architectures, followed by results from “novel” technologies. He spent substantial time discussing how important it was to create a level playing field for all the testing and to choose a code that could be replicated across all platforms with as few platform-specific “optimizations/alterations” as possible.

Mauricio then showed the results from all the “typical” compute devices: Intel CPU, NVIDIA V100, NVIDIA A100, AMD MI100 and MI200, and finally SambaNova. I was surprised to see SambaNova included on this slide and actually took a picture of the results, which were broken out by batch sizes of 4, 32, and 128. The next slide added even more accelerators, including the NEC and Fujitsu parts, but before I could snap a picture, another attendee was angrily chastised that pictures and videos were not allowed. Apparently, Mauricio had opened the session by telling everyone not to take pictures or video, per an agreement with the vendors, who were willing to let the results be communicated but not recorded. For this same reason, I will not post the picture I took. Mauricio did comment that he has a paper being published that will include the latest results from the vendors after they have had a final opportunity to improve their numbers where possible, and he thanked all the respective vendors for their hard work in making the testing as open as possible. Then he moved on to “novel” technologies, disclosing that they had bought this technology for their development center in Houston. The next slide called out a single “novel” technology, the Wafer Scale Engine from Cerebras Systems: TotalEnergies had purchased a Cerebras CS-2.
The second half of the session focused on the Cerebras CS-2. Mauricio was very detailed in describing the CS-2 architecture and why it is a major departure from previous architectures. He also pointed out that the earlier testing was normalized to a single accelerator (i.e., one GPU, one CPU, one add-in card), whereas the CS-2 is an entire wafer in a single system, effectively 80+ accelerators. Mauricio was particularly happy with the bring-up of the CS-2, commenting that he had never experienced an easier install and bring-up of new technology. He also talked about the work necessary to adapt the code to the novel architecture, commenting that it was straightforward thanks to the “framework,” which I believe was the SDK provided by Cerebras. The session then pivoted to the results. Just as in the press release from earlier in the day, over 100X performance gains were demonstrated with “normalized” testing. Further gains were discussed with platform-specific optimizations that could be done on the CS-2. During these “optimization” activities, he pushed the architecture to its limits and commented that this was the first architecture in 15 years on which memory speed was not a bottleneck. Finally, he closed with how the architecture was doing on the initial AI workloads TotalEnergies has started to need as part of its workflow. He seemed happy with the results even though the CS-2 was barely being utilized. If I remember correctly (sorry, I could not take pictures), he was only able to use 60% of the cores at only ~40% of their capability, roughly a quarter of the machine, but he still got results he was pleased with, since AI workloads are still a small percentage of their workflow.
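To make the normalization he described a bit more concrete: dividing each system's raw throughput by the number of accelerator devices it contains puts a wafer-scale machine and a single GPU card on a per-device footing. The short Python sketch below illustrates only that bookkeeping, under my own assumptions; the function name, system names, and all numbers are invented placeholders, not the results from the talk (which were not recordable).

```python
def normalize_throughput(raw_throughput: float, num_devices: int) -> float:
    """Throughput credited to each accelerator device in the system."""
    return raw_throughput / num_devices

# Invented placeholder figures -- NOT the benchmark numbers from the session.
systems = {
    "single GPU card":    {"throughput": 1.0,   "devices": 1},
    "8-GPU server":       {"throughput": 7.5,   "devices": 8},
    "wafer-scale system": {"throughput": 120.0, "devices": 80},  # "effectively 80+ accelerators"
}

for name, spec in systems.items():
    per_device = normalize_throughput(spec["throughput"], spec["devices"])
    print(f"{name}: {per_device:.2f} relative units per device")
```

The per-device view is what makes a one-card accelerator and a full wafer comparable on the same slide; the unnormalized numbers are what produce headline figures like the 100X gains mentioned above.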
Overall, I was super happy to be face to face with so many of the people I have worked with over the past 20+ years. I missed having many of the international folks there, but most of the domestic companies were well represented. With oil prices as high as they are, I anticipate this will be a great year of technology investment by the Oil and Gas companies themselves, as well as by all the service providers trying to help them become Energy companies.
Cirrascale specializes in providing all the major accelerators to customers via the cloud. This includes NVIDIA A100 80GB HGX systems with 200Gb HDR-attached WekaIO storage, as well as the Cerebras CS-2 via our Cerebras Cloud @ Cirrascale platform. Contact us today for more information.