AWS NLP Supercomputing: Omnibond® Fuels Clemson's 1.1M vCPU Record on EC2 Spot Instances for Topic Modeling Breakthroughs
Natural language processing (NLP) is revolutionizing how we uncover insights from vast text corpora—from predicting business trends to informing public policy. But training topic models at scale demands computational firepower beyond most on-premises setups. Clemson University's School of Computing, led by Professor Amy Apon and her team, shattered records by launching the largest cloud-based high-performance cluster in a single AWS region: 1,119,196 vCPUs across thousands of EC2 Spot Instances. This wasn't just a flex—it enabled nearly half a million parallel experiments on topic modeling, analyzing 17 years of computer science journal abstracts (533,560 documents, 32.5M words) and NIPS Conference papers (2,484 documents, 3.3M words). Outputs streamed to Amazon S3 for deep analysis on model convergence, topic quality, and parameter impacts.
The result? A peer-reviewed study that optimized topic modeling for real-world applications, proving AWS could extend Clemson's on-prem Palmetto Cluster without disruption. At the heart of this achievement was Omnibond®'s hybrid orchestration technology, turning complex Spot Fleet management into seamless, autoscaling workflows. Here's how we made it happen.
Clemson's Palmetto supercomputer handles routine research, but massive parameter sweeps for topic models—like testing hyperparameters across datasets—require burst capacity that's impossible on-campus. The team needed to:
- Parallelize massively: Run nearly half a million experiments in parallel, far beyond Palmetto's limits.
- Optimize costs: Use Spot Instances for savings of up to 90% versus On-Demand pricing, with per-second billing to avoid paying for idle time.
- Hybridize effortlessly: Integrate with SLURM for familiar on-prem jobs, federating AWS as an extension of their data center.
Manual cloud setups meant weeks of scripting, risk of Spot interruptions, and no hybrid safety net. Clemson required a toolset that automated provisioning, handled preemptions, and scaled elastically—without YAML nightmares or vendor silos.
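To make the idea of a parameter sweep concrete, here is a minimal Python sketch of how a hyperparameter grid expands into independent experiments that can then be batched for a scheduler. The parameter names, values, and batch size are illustrative assumptions, not the settings Clemson used, and this is not PAW's actual code.

```python
from itertools import product

# Hypothetical hyperparameter grid for a topic model (illustrative values only).
GRID = {
    "corpus": ["cs_abstracts", "nips_papers"],
    "num_topics": [50, 100, 200],
    "alpha": [0.01, 0.1, 1.0],
    "seed": [0, 1, 2, 3],
}


def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


experiments = list(expand_grid(GRID))
batches = chunk(experiments, 10)  # assumed batch size of 10 experiments per job
print(len(experiments), "experiments in", len(batches), "batches")
```

Because every combination is independent, each batch could be handed to a separate scheduler job, which is what makes this kind of workload a natural fit for burst capacity.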
Omnibond® partnered with Clemson and AWS to deliver a turnkey framework, leveraging our deep HPC expertise to bridge on-prem and cloud. Key components included:
- Omnibond®'s Advanced Provisioning Engine: Deployed via AWS Marketplace, this technology automated the Spot Fleet supercomputer, dynamically requesting thousands of biddable instances while maintaining target capacity. It handled per-second billing, preemptions, and elastic expansion—ramping from zero to 1.1M vCPUs in minutes. As Boyd Wilson, Omnibond® CEO, noted: "Participating in this project was exciting... seeing how the Clemson team developed a provisioning and workflow automation tool that tied into our technology to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding."
- SLURM Overlay: As the virtual workload manager, SLURM orchestrated data analytics jobs across the fleet. Omnibond® customized it for hybrid federation, allowing Palmetto jobs to burst to AWS without reconfiguration—ensuring zero-trust access and consistent policies.
- Custom PAW Automation: Built with our technology, Clemson's Provisioning And Workflow (PAW) tool—co-developed with our experts—enabled cloud-agnostic parameter sweeps. It integrated with AWS Spot Fleet for autoscaling, preventing overload from 1M+ vCPUs while optimizing for spare capacity. Our team fine-tuned it for NLP workloads, ensuring low-latency data staging to S3.
Omnibond®'s hands-on collaboration was pivotal: We optimized PAW for AWS, providing expertise in Spot management and hybrid integration that let Clemson focus on science, not plumbing.
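The autoscaling behavior described above can be pictured with a small Python sketch: size the fleet from the depth of the job queue, cap it, and fill the target from whichever instance pools have capacity. This is a simplified illustration under stated assumptions. The pool labels, vCPU counts, and availability figures are hypothetical, and the code does not call the AWS Spot Fleet API or reproduce PAW's logic.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str       # hypothetical label for an instance pool
    vcpus: int      # vCPUs per instance in this pool
    available: int  # instances currently obtainable (made-up numbers)


def target_vcpus(pending_tasks, vcpus_per_task, cap):
    """Scale with queue depth, but never above the configured cap."""
    return min(pending_tasks * vcpus_per_task, cap)


def allocate(target, pools):
    """Greedily take instances from the largest pools until the target is met.

    Returns (plan, shortfall): instances to request per pool, and the
    vCPUs still missing if the pools could not cover the target.
    """
    plan, remaining = {}, target
    for pool in sorted(pools, key=lambda p: p.vcpus, reverse=True):
        if remaining <= 0:
            break
        needed = -(-remaining // pool.vcpus)  # ceiling division
        count = min(needed, pool.available)
        if count:
            plan[pool.name] = count
            remaining -= count * pool.vcpus
    return plan, max(remaining, 0)


pools = [Pool("large", 64, 10), Pool("medium", 16, 100), Pool("small", 4, 1000)]
target = target_vcpus(pending_tasks=500, vcpus_per_task=2, cap=2000)
plan, shortfall = allocate(target, pools)
print(target, plan, shortfall)
```

Re-running the same calculation whenever instances are reclaimed or the queue shrinks is one simple way to keep a fleet near its target without leaving nodes idle.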
Launched in a single region, US East (N. Virginia), the cluster peaked at 1,119,196 vCPUs, rivaling the world's top supercomputers. Highlights:
- Scale: Elastic growth to 1.1M vCPUs, processing nearly half a million experiments in parallel, which would be unthinkable on Palmetto alone.
- Speed: Completed sweeps in hours, with SLURM distributing jobs across instance types for optimal utilization.
- Efficiency: Spot Instances plus per-second billing delivered cost reductions of up to 90% versus On-Demand pricing, with PAW autoscaling to avoid idle resources.
- Hybrid Wins: Federated workflows let Clemson test configs on-prem before AWS bursts, accelerating iteration.
Professor Apon raved: "I am absolutely thrilled with the outcome... They used resources from AWS and Omnibond® and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler."
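A short worked example shows why per-second billing matters for a large number of short jobs. The Python sketch below compares billing by whole hours with billing by the second, using a 60-second minimum charge. The price and runtime are hypothetical figures chosen for easy arithmetic, not numbers from the Clemson run.

```python
import math


def hourly_billed_cost(runtime_s, price_per_hour):
    """Cost when every started hour is billed in full."""
    return math.ceil(runtime_s / 3600) * price_per_hour


def per_second_cost(runtime_s, price_per_hour, minimum_s=60):
    """Cost when billing is per second, with a minimum charge of `minimum_s` seconds."""
    return max(runtime_s, minimum_s) * price_per_hour / 3600


runtime_s = 65 * 60     # hypothetical job that runs for 65 minutes
price_per_hour = 1.00   # hypothetical instance price in dollars per hour

by_hour = hourly_billed_cost(runtime_s, price_per_hour)
by_second = per_second_cost(runtime_s, price_per_hour)
print(f"hourly billing: ${by_hour:.2f}, per-second billing: ${by_second:.2f}")
```

In this example the job overruns the first hour by five minutes, so hourly billing charges for two full hours while per-second billing charges only for the 3,900 seconds actually used. Across thousands of instances, that rounding difference adds up quickly.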
The project slashed costs to a fraction of on-demand pricing, freeing NSF-funded resources for innovation. It yielded breakthroughs in topic modeling—quantifying how parameters influence convergence and quality—applicable to AI forecasting, policy analysis, and more. Published in peer-reviewed journals, it showcased AWS + Omnibond® as a blueprint for hybrid HPC.
This AWS NLP triumph informs projectEureka: Our dashboard fuses Omnibond®'s advanced orchestration with data governance and VDI, making 1M vCPU bursts intuitive across AWS, GCP, Azure, and K8s. No silos—just accelerated AI innovation.
Sources: Based on AWS's blog on Clemson's NLP project, highlighting Omnibond®'s technology and cloud integration.