Scalable ML Infrastructure for CV Model Development | SQUAD Tech

SQUAD designed and deployed a cloud-native ML infrastructure for computer vision model development that reduced training time, lowered infrastructure overhead, and improved the efficiency of large-scale data processing.

7x faster model training

from 21 days to 3 days

$3.2M annual cost savings

$2M on GPU compute and $1.2M on storage

55-point drop in preprocessing effort

from 70 percent to 15 percent of sprint capacity

Client at a Glance

Service Type

MLOps and infrastructure for computer vision model development

Industry

Consumer electronics and smart security cameras

Engagement

Long-term collaboration on AI infrastructure and model development

Region

Global

The client is a global consumer electronics brand that produces smart indoor and outdoor security cameras.

Challenge

The client’s computer vision team was working with a very large video dataset and increasingly complex training pipelines, but the supporting ML infrastructure was not scaling efficiently.

This created several issues:

A 14 PB video dataset required heavy preprocessing, which consumed about 70 percent of the AI team’s sprint capacity.

The training and evaluation cycle for computer vision models typically took up to one month, slowing down iteration and validation.

Uncontrolled training runs led to significant time and cost overruns.

Manual cloud resource management caused compute downtime and reduced cost efficiency.

The client needed an infrastructure layer to scale computer vision development, reduce operational overhead, and improve distributed training and cloud resource utilization.

Solution

SQUAD designed and implemented a cloud-native ML infrastructure for scalable computer vision model development and evaluation.

The main elements of the solution were:

Deployment of a Kubernetes-based ML toolkit with automated resource management, supporting PyTorch, NVIDIA DALI, and OpenMMLab.

Development of a specialized data loading library to streamline frame extraction, data transformations, and multi-modal input for HPC training pipelines.

Implementation of an Infrastructure-as-Code stack for automated AWS pipeline management, with cost-optimized instances and automatic termination controls.

Introduction of a standardized approach for distributed training and evaluation, improving repeatability across the team.
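The automatic termination controls mentioned above can be illustrated with a small guard that a training loop or scheduler calls periodically. This is a minimal sketch under stated assumptions, not the client's implementation: the `RunBudget` fields, the `should_terminate` function, and all limit values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RunBudget:
    """Hypothetical per-run limits; the real limits are not stated in the case study."""
    max_hours: float       # wall-clock limit for one training run
    max_cost_usd: float    # spend limit for one training run
    patience_epochs: int   # epochs without improvement before stopping


def should_terminate(
    elapsed_hours: float,
    hourly_rate_usd: float,
    val_losses: Sequence[float],
    budget: RunBudget,
) -> Optional[str]:
    """Return a reason string if the run should be stopped, else None."""
    if elapsed_hours >= budget.max_hours:
        return "time budget exceeded"
    if elapsed_hours * hourly_rate_usd >= budget.max_cost_usd:
        return "cost budget exceeded"
    if val_losses:
        # Stalled: the best loss was reached at least `patience_epochs` epochs ago.
        best_epoch = list(val_losses).index(min(val_losses))
        if len(val_losses) - 1 - best_epoch >= budget.patience_epochs:
            return "validation loss stalled"
    return None


# Illustrative values only.
budget = RunBudget(max_hours=72, max_cost_usd=5000, patience_epochs=3)
print(should_terminate(10, 30.0, [0.9, 0.7, 0.6], budget))
print(should_terminate(10, 30.0, [0.9, 0.5, 0.6, 0.7, 0.8], budget))
```

In a setup like the one described, a check of this kind would sit next to the infrastructure-level controls, so that a run that is over budget or no longer improving releases its instances instead of continuing unattended.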

Technologies and frameworks

The work relied on the following tools and platforms:

Core technologies: AWS, PyTorch, NVIDIA DALI, OpenMMLab, Kubeflow, Python

Data processing: OpenCV, TurboJPEG, Albumentations, Kornia, FFmpeg

Optimization and monitoring: Optuna, CloudWatch

Results & Impact

Technical outcomes

Seven times faster model training

The overall model training cycle was reduced by a factor of seven, from 21 days to 3 days, allowing the client to iterate on computer vision models much faster.

Faster data access and loading

Data fetching became roughly 3.7 times faster, from 14 seconds to 3.8 seconds for 8k images, and sensor pipeline data loading became five times faster, from 150 minutes to 30 minutes.
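The speedup factors follow directly from the before and after durations quoted in this case study. The sketch below recomputes them. The `speedup` helper is an illustrative name, not part of the delivered tooling. The data fetching ratio works out to about 3.7.

```python
def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is than `before` (same units)."""
    if before <= 0 or after <= 0:
        raise ValueError("durations must be positive")
    return before / after


# Before/after figures as quoted in the case study.
reported = {
    "model training cycle (days)": (21, 3),
    "data fetching for 8k images (seconds)": (14, 3.8),
    "sensor pipeline data loading (minutes)": (150, 30),
}

for name, (before, after) in reported.items():
    print(f"{name}: {speedup(before, after):.1f}x faster")
```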

Business outcomes

Lower compute and storage costs

Annual GPU-based EC2 costs were reduced by $2 million, and storage costs were reduced by $1.2 million by optimizing 4 PB of processed data out of the 14 PB dataset.

Better cloud resource utilization

Cloud resource utilization was stabilized at about 80 percent load on GPU-based instances, improving efficiency and reducing waste across training workloads.

Customer outcomes

Faster delivery of AI-based features

By shortening training and evaluation cycles, the client was able to move computer vision features through development more quickly and support a broader product roadmap.

More engineering time focused on model development

Dataset preprocessing effort fell from 70 percent to 15 percent of sprint capacity, giving the team more time for model quality and feature development.
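A change in a share of capacity can be read either in percentage points or as a relative reduction, and the two are easy to confuse. The sketch below works through the preprocessing figures both ways. The `share_change` helper is a hypothetical name used only for this illustration.

```python
def share_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative reduction in percent)."""
    if not (0 < before_pct <= 100 and 0 <= after_pct <= 100):
        raise ValueError("shares must be percentages, with a nonzero starting share")
    point_drop = before_pct - after_pct
    relative_reduction = 100 * point_drop / before_pct
    return point_drop, relative_reduction


points, relative = share_change(70, 15)
print(f"{points} percentage points, {relative:.1f}% relative reduction")

# Sprint capacity left for non-preprocessing work, before and after.
print(f"remaining capacity: {100 - 70}% -> {100 - 15}%")
```

Under these figures, the share of each sprint available for work other than preprocessing rises from 30 percent to 85 percent.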