I’m using this page to maintain a running list of useful links and notes about Amazon SageMaker Processing as I learn. Contents are subject to change.
Resources
References
AWS Blog Posts
Performing simulations at scale with Amazon SageMaker Processing and R on RStudio. This post is especially interesting because it includes a section on adapting an R script for SageMaker Processing.
SageMaker Developer Guide
AWS Samples GitHub repo
YouTube
Amazon SageMaker Processing Notes
Parallel Processing in SageMaker
To process data in parallel using [a container] on Amazon SageMaker Processing, you can shard input objects by S3 key by setting s3_data_distribution_type='ShardedByS3Key' inside a ProcessingInput so that each instance receives about the same number of input objects.
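For example, here is a minimal sketch using the SageMaker Python SDK with an scikit-learn processor. The role ARN, bucket names, framework version, and process.py script are hypothetical placeholders, not values from this page:

```python
# Sketch: shard input objects by S3 key across multiple Processing instances.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="1.2-1",                              # hypothetical version
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical role
    instance_type="ml.m5.xlarge",
    instance_count=4,  # with ShardedByS3Key, each instance gets ~1/4 of the objects
)

processor.run(
    code="process.py",  # hypothetical processing script
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/input-data/",  # hypothetical S3 prefix
            destination="/opt/ml/processing/input",
            # Default is 'FullyReplicated'; 'ShardedByS3Key' splits the
            # input objects across instances instead of copying all of
            # them to every instance.
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/output-data/",  # hypothetical S3 prefix
        )
    ],
)
```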
From the S3DataSource API reference, on the S3DataDistributionType parameter:
If you want Amazon SageMaker to replicate the entire dataset on each ML compute instance that is launched for model training, specify FullyReplicated.
If you want Amazon SageMaker to replicate a subset of data on each ML compute instance that is launched for model training, specify ShardedByS3Key. If there are n ML compute instances launched for a training job, each instance gets approximately 1/n of the number of S3 objects. In this case, model training on each machine uses only the subset of training data.
Don't choose more ML compute instances for training than available S3 objects. If you do, some nodes won't get any data and you will pay for nodes that aren't getting any training data. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.
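For reference, the same parameter appears in the low-level API shape that the quote above documents. A minimal sketch of where S3DataDistributionType sits in a CreateTrainingJob request via boto3; the channel name and S3 prefix are hypothetical:

```python
# Sketch: S3DataDistributionType inside an InputDataConfig entry.
import boto3

sm = boto3.client("sagemaker")

input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",  # hypothetical prefix
                # With n instances, each receives ~1/n of the S3 objects:
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }
]

# This list would be passed as InputDataConfig to
# sm.create_training_job(...), alongside the other required arguments.
```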