As a cloud service provider, one of the most common bottlenecks we see customers run into is keeping their hungry GPUs fed with data. As deep learning applications have shifted more and more of their processing from CPUs to GPUs, the problem has only grown. It compounds further when you add up the real cost: data scientists waiting for training runs to complete, and money spent on cloud resources that sit underutilized or idle while data is ingested and prepped for training.
Over the past year, we've focused much of our effort on helping our Autonomous Vehicle and Medical Imaging customers streamline their workflows and ensure their Cirrascale GPU cloud instances are kept busy. In general, here is how we've helped accelerate our customers' training times by keeping their GPUs operating at their peak in our cloud:
Storage System Options
General object or file storage systems offered on most cloud providers won't come close to cutting it in terms of performance for your AV or Medical Imaging workflows. Serving up millions of small, random files to your GPU-based training servers is a science. Cirrascale offers high-throughput hot-tier storage offerings that can push up to 20GB/s (yes, that's GB not Gb) to EACH client. We work to ensure that storage I/O is never the bottleneck keeping your training, simulation, or re-simulation jobs flowing, by offering storage system options that fit your workload. If networked hot-tier storage is not required, all of our servers -- even base models -- have lightning-fast local NVMe storage devices.
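To make the small-random-file problem concrete, here is a minimal sketch of one common mitigation on the client side: overlapping many small reads with a thread pool so the training loop is never stalled on a single synchronous open()/read() round trip per file. The function names and the 16-worker default are illustrative assumptions, not part of any Cirrascale tooling.

```python
import concurrent.futures


def read_sample(path):
    # Read one training sample from fast local NVMe
    # (or a networked hot-tier mount).
    with open(path, "rb") as f:
        return f.read()


def prefetch(paths, workers=16):
    # Issue many small random reads in parallel and yield the
    # results in order, so the GPU-side training loop consumes a
    # steady stream instead of waiting on each file individually.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(read_sample, paths)
```

In practice, frameworks' built-in data loaders (with multiple workers and prefetching enabled) apply the same idea; the point is simply that per-file latency, not raw bandwidth, is usually what starves a GPU on millions-of-small-files datasets.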
Network Infrastructure Options
Moving massive amounts of data can be challenging on some providers because of the way their networks are configured. Our network infrastructure was deployed to deliver high bandwidth and low latency for transporting data. Need 20 servers connected together via 100GbE or EDR InfiniBand? How about 50 servers at 200GbE? Our customers have told us that the large cloud providers just can't deliver it. They struggle with any kind of scale when it comes to networking, whereas Cirrascale has the flexibility to provide fast server interconnects that scale and keep data moving to where it matters most. Additionally, we don't charge any egress fees on data, so multi-cloud deployments won't be a painful experience.
Job Scheduling and Orchestration Options
Sometimes just keeping jobs scheduled and moving through the workflow is a challenge in and of itself. We help customers manage their jobs by deploying open-source container orchestration tools, like Kubernetes, which are well suited to staging data, automating job queues, and scaling and managing GPU servers. With these tools in place, queued jobs are picked up right away, so GPUs stay active even during off-hours when employees may be sleeping. Cirrascale engineers can even manage these tools on a customer's behalf, providing a more hands-on approach for clients who need extra guidance.
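As a rough illustration of the queued-job pattern described above, here is a minimal Kubernetes Job manifest that requests a GPU server and runs unattended as soon as capacity frees up. All names here (the job, image, script, and volume claim) are hypothetical placeholders, not part of any specific customer setup.

```yaml
# Hypothetical example: a queued training Job that Kubernetes runs
# automatically once a GPU node is free -- no human in the loop.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-batch-example        # hypothetical job name
spec:
  backoffLimit: 2                  # retry a failed run up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # hypothetical image
        command: ["python", "train.py", "--data", "/mnt/hot-tier"]
        resources:
          limits:
            nvidia.com/gpu: 8      # claim a full 8-GPU server
        volumeMounts:
        - name: dataset
          mountPath: /mnt/hot-tier
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: hot-tier-dataset   # hypothetical claim on fast storage
```

Submitting a batch of manifests like this lets the scheduler drain the queue overnight: each Job holds its GPUs only while running, then releases them to the next queued job.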
These aren't the only bottlenecks and optimization opportunities we've come across in customer workflows, but they are among the most common. If you're interested in finding out what makes Cirrascale Cloud Services different from other cloud providers, or if you'd like to review your workflows with us to see how we can help, we'd love to hear from you.