
tl;dr: cloud-based object stores, not cloud-based compute, are the game changer.


A commonly repeated but often misguided quip about cloud-based workflows is

send the compute to the data, not the data to the compute.

It may seem intuitive that uploading a few KB of code is better than downloading terabytes of data. However, this logic can be deeply misleading, even if it is highly profitable for cloud providers.

If you are fortunate enough to care nothing about costs, or to have someone else covering them, the technical logic of this statement is accurate, though even then not in the way it is often understood. The rest of us must consider that the financial cost of renting compute from a cloud provider far outweighs the charges for data egress. But wait! Isn’t it just infeasible to download terabytes of data? That is where this quip is particularly misleading. Cloud-native workflows access data using range requests rather than downloading the entire dataset to the disks of a local filesystem – that is, we never really ‘download’ entire files, but instead ‘stream’ only the parts of each file we are actively using – just as we consume Netflix by ‘streaming’ the data to our local computers rather than waiting for a download that must be stored on a hard disk. Anyone with a network connection capable of streaming video can also stream research data.
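
To make the idea concrete: a range request is just an ordinary HTTP GET carrying a Range header. The minimal sketch below uses Python’s requests library against a placeholder object URL (not a real dataset) to fetch only the first kilobyte of a large file.

```python
import requests

# Placeholder public object URL; substitute any HTTP-accessible dataset.
url = "https://example-bucket.s3.us-west-2.amazonaws.com/some/large/file.nc"

# Ask the server for only the first 1024 bytes of the object.
resp = requests.get(url, headers={"Range": "bytes=0-1023"})

# A server that honors range requests replies with 206 (Partial Content).
print(resp.status_code)   # 206 if the range was honored
print(len(resp.content))  # 1024 bytes, not the full file
```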

What about all the computing power? Many users of compute instances from cloud providers are in fact renting virtual machines with less computational power than the laptops or desktops (or even cell phones) they already own. Moreover, well-designed cloud-native data workflows frequently avoid onerous demands on disk speed, memory, or even CPU threads. The limiting resource in most cloud-native workflows is network bandwidth. If you have a network connection capable of streaming high-definition video, you probably already have a setup that will allow cloud-based computing.
Even if you don’t, small virtual machines are offered in a free tier by most major cloud providers. GitHub Codespaces is a particularly easy environment for anyone to get started with. Free-tier machines may have limited resources – only a few cores, a few GB of RAM, and limited disk space – but they almost always have high-bandwidth networking, with speeds easily above 100 Mb/s – ideal for cloud-native workflows.

This is not to say that it never makes sense to ‘send the code to the data’. As network bandwidth is often rate-limiting, anyone in search of the best performance will naturally want to seek out the fastest network connection to the data – the local area network of the specific regional data center housing the data. Inside the data center, the network is fastest and most reliable (dropped packets or timeouts can be significant issues elsewhere). Even modest compute inside the center can deliver impressive performance, and it will be the compute you can afford, rather than the network speed, that holds you back.

Pricing reflects this reality. Inside the data center, there are no bandwidth charges for accessing the data, because data never ‘egresses’ from the data center over the internet. In contrast, sending large amounts of data over the public internet is not free to Amazon or the other cloud providers, and so they pass these charges on to their customers as egress rates (around $0.02/GB in us-west-2). Amazon is happy to levy these charges against either the requester or the provider of the data (and may waive the charges for some public datasets – of course, as a consumer it is not possible to distinguish this from a provider-pays contract).
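
To put that rate in perspective: at roughly $0.02/GB, streaming a full terabyte out of us-west-2 works out to about 1,000 GB × $0.02/GB ≈ $20 in egress charges, and a cloud-native workflow that streams only the byte ranges it actually needs will usually transfer far less than the full archive.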

Earthdata

NASA EarthData has taken a somewhat novel approach to this situation. To allow users to access these publicly funded data resources without paying Amazon for the privilege, NASA has created an EarthDataLogin system that routes public https requests to these data products through a system of redirects in a Cloudflare content distribution network. Adding this extra routing gives NASA a mechanism for imposing rate limits – reducing egress costs by throttling the rate or amount of data any one user can access. (AWS does not offer data providers the ability to set such rate limits directly.) This routing requires users to register and provide an authentication token, both of which are freely available through the EarthDataLogin platform and API. I prefer this mechanism as the default, because code written to use it is portable to any compute where a network connection is available. This approach supports two different implementations. One is to use HTTP Basic Auth, supplying a user name and password (typically via a .netrc file). The other sends a Bearer <token> in the request header, an authentication mechanism introduced by OAuth 2.0.
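
As a rough sketch rather than official NASA client code, both mechanisms can be exercised with plain HTTP tooling. The example below uses Python’s requests library; the granule URL is a placeholder, and EARTHDATA_TOKEN is an assumed environment variable holding an EarthDataLogin token.

```python
import os
import requests

# Placeholder URL for an EarthDataLogin-protected granule; substitute a real one.
url = "https://data.example.nasa.gov/path/to/granule.nc"

# Option 1: HTTP Basic Auth. When no explicit auth is supplied, requests
# consults ~/.netrc for credentials matching the host it is talking to.
resp = requests.get(url, headers={"Range": "bytes=0-1023"})

# Option 2: a Bearer token in the Authorization header.
token = os.environ["EARTHDATA_TOKEN"]
resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}", "Range": "bytes=0-1023"},
)
print(resp.status_code)
```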

If users are willing to pay Amazon for compute inside us-west-2, they can take advantage of an extremely fast network without paying egress charges (since no data egress occurs). Perhaps surprisingly, this internal access still requires authentication to generate AWS access tokens (an access key ID, a secret access key, and a session token). These tokens will only work from compute inside the AWS data center – trying to use them from other machines will throw an Access Denied error. More inconveniently, these tokens are specific to each of NASA’s 12 different DAACs, despite all 12 DAACs using us-west-2 to host their data (this does give the DAACs some insight into usage statistics, though those could also be obtained from the S3 logs anyway). Most inconveniently of all, these tokens expire after one hour and must be renewed, potentially interrupting precisely the kind of intensive, long-running computations a user would want to run inside the local area network.
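
A minimal sketch of that flow, assuming a hypothetical per-DAAC credentials endpoint and the common accessKeyId / secretAccessKey / sessionToken field names: the short-lived keys are fetched with an EarthDataLogin-authenticated request, handed to an S3 client via boto3, and the whole exchange must be repeated once the hour is up.

```python
import boto3
import requests

# Hypothetical per-DAAC endpoint that exchanges an EarthDataLogin session for
# temporary AWS credentials; each DAAC exposes its own such endpoint.
creds_url = "https://example-daac.earthdata.nasa.gov/s3credentials"

# EarthDataLogin credentials are read from ~/.netrc for this request.
creds = requests.get(creds_url).json()

# The temporary credentials only work from compute inside us-west-2,
# and expire after one hour.
s3 = boto3.client(
    "s3",
    region_name="us-west-2",
    aws_access_key_id=creds["accessKeyId"],
    aws_secret_access_key=creds["secretAccessKey"],
    aws_session_token=creds["sessionToken"],
)

# Placeholder bucket/key; substitute a real granule hosted by the same DAAC.
obj = s3.get_object(
    Bucket="example-daac-bucket",
    Key="path/to/granule.nc",
    Range="bytes=0-1023",
)
print(len(obj["Body"].read()))
```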