I did a lot of technical work on the infrastructure underlying my analytics over the past weeks. I am putting my notes here so they don’t get lost and might help someone else. Here are three stories, unrelated to each other.

Transferring a few GBs to Amazon AWS

I lately worked on a number of scenarios that involved data analytics on top of Amazon AWS. In one case, I had to transfer a 16 GB folder of 80,000 files spread across several subfolders. Most of those are TIFF images, which form the training and test sets for neural networks doing image recognition. The classifier is built on Torch (PyTorch), which is a deep learning platform. It uses CUDA, an NVIDIA API for GPU computing, so it must run on a CUDA-enabled machine, which most laptops are not. One option is the Amazon AWS service. In our company we use the Red Hat OpenShift environment on AWS. So I had to transfer 80K files from my laptop to an AWS S3 bucket.

That simple task proved surprisingly difficult.

In the early 2000s, I worked on a project called GridFTP for Argonne National Lab. Essentially, we improved the good old FTP to become more reliable, more robust, and to run parallel streams, so that people could easily transfer a lot of data and restart a broken transfer. Then… why, two decades later, couldn’t AWS provide a similar tool?

  • To transfer the 16 GB from my laptop to AWS, I first tried S3 Browser, but abandoned the idea quickly. The tool proved quite unstable, and most annoyingly, once a transfer broke, you had to start from zero.
  • My second choice was the command-line tool s3cmd, but I hit some obscure installation conflict and abandoned it as well.
  • The third choice was the AWS S3 CLI. It eventually worked fine, and after a few hours all my data was transferred.

However, there are a few obscure things to know when using the S3 CLI for tasks that may take hours.

  • you can use the command aws s3 cp, but if the transfer breaks there is no way to restart it…
  • the command aws s3 sync is slightly better in this respect, with the syntax
    • aws --endpoint-url <url> s3 sync <source> <target>
  • I first tested this on a 1.3 GB folder of 8,000 files, and everything was done within 30 minutes. Then I proceeded to copy the entire set
  • But when it came to verifying whether the transfer succeeded, I ended up scratching my head. How do you check whether the local <source> and remote <target> folders are identical? There is actually no easy way to do this.
  • Some Stack Exchange threads recommend aws s3 sync with the option --dryrun, which is supposed to tell you whether the folders match. This is actually wrong advice. I tried the option, and it is meant to do something else: the output is the same whether or not the folders match.
  • In the end, after the sync completed, I decided to run the same sync again, with the following logic: it will check whether the folders match, and if they do, it will not do anything… unfortunately, it is not so
  • According to some literature here and here, aws s3 sync supposedly compares the file size and the last-modified timestamp to decide whether a file needs to be synced… really?
  • Then… why did copying the folders with aws s3 sync take five hours, and running the sync again over the same folders (just for verification, so it was only supposed to compare timestamps) also take five hours?
  • I don’t understand what aws s3 sync is doing and why it takes so long. I first thought it was building checksums of the local and remote files. But no, because the CPU was almost idle. I then thought it transferred all the data to AWS in order to compare it there… but that is not so either, because the network was idle. I think aws s3 sync really is just comparing timestamps, but in a terribly inefficient way: one file at a time. That is why it took ages.
  • Lesson learned: to compare the source folder against the destination, just run aws s3 ls yourself and store the output in a file (aws s3 ls > out1.txt), process it with sed to normalize the prefixes, and diff it against the local listing from ls -l > out2.txt. Strange that AWS does not provide a tool for something so basic.
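The manual listing comparison from the last bullet can also be scripted. Below is a minimal sketch of the idea: parse the output of aws s3 ls --recursive into relative path/size pairs, build the same shape from a local directory walk, and diff the two. The bucket, prefix, and file names are made-up examples.

```python
# Compare a local directory against a saved S3 listing, e.g. from
#   aws s3 ls s3://my-bucket/data/ --recursive > s3_listing.txt
# (bucket, prefix, and paths here are hypothetical).
import os

def parse_s3_listing(text, prefix):
    """Parse `aws s3 ls --recursive` output into {relative_path: size}."""
    files = {}
    for line in text.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) != 4:
            continue
        _, _, size, key = parts
        if key.startswith(prefix):
            files[key[len(prefix):]] = int(size)
    return files

def local_listing(root):
    """Walk a local folder into the same {relative_path: size} shape."""
    files = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            files[rel] = os.path.getsize(full)
    return files

def diff_listings(local, remote):
    """Report files missing on either side or differing in size."""
    problems = []
    for rel, size in local.items():
        if rel not in remote:
            problems.append(f"missing remotely: {rel}")
        elif remote[rel] != size:
            problems.append(f"size mismatch: {rel}")
    for rel in remote:
        if rel not in local:
            problems.append(f"missing locally: {rel}")
    return problems
```

With the S3 listing saved to a file, diff_listings(local_listing("data"), parse_s3_listing(open("s3_listing.txt").read(), "data/")) returns an empty list when the trees match. Note this only compares names and sizes, not content checksums, which for a bulk image transfer is usually a reasonable trade-off.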

Why Amazon SageMaker sucks

I spent a few weeks building machine learning pipelines with Amazon SageMaker, and I learned to dislike it. And it is not just the really bad documentation. Yes, the SageMaker docs are awful, but that is the case with other products too, so I learned to live with community support. But I think SageMaker has deeper problems.

I had worked with SageMaker before, using it mainly as a Jupyter Lab container, but this was the first time I had a chance to examine its native pipelines API. The pipelines we built were standard MLOps, with the following steps: data integration, data cleansing, feature engineering, test/train split, model training, and so on.

Setting aside the advanced capabilities of the SageMaker API, some basic architectural choices turned out to be surprising. Most importantly, the pipelines took very long to execute. I first thought my code was inefficient, but after some digging to isolate the problem I found that even when I reduced my code to zero (literally, zero lines), the pipeline took 4 minutes per ProcessingStep. This is because SageMaker isolates each step into a separately launched compute instance, and launching an instance takes several minutes.
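To put that overhead in perspective, here is a back-of-the-envelope sketch extrapolating my 4-minutes-per-step observation (an extrapolation, not a benchmark):

```python
# Startup cost of per-step instance launches in a SageMaker-style pipeline,
# extrapolating the ~4 minutes per ProcessingStep I observed (not a benchmark).
LAUNCH_OVERHEAD_MIN = 4

def pipeline_overhead_minutes(n_steps, per_step=LAUNCH_OVERHEAD_MIN):
    """Minutes spent only launching instances, before any real work runs."""
    return n_steps * per_step

# A typical five-step MLOps pipeline (integration, cleansing, features,
# split, training) pays this on every single run:
print(pipeline_overhead_minutes(5))  # 20
```

Twenty minutes of pure startup time per run, paid again on every debugging iteration.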

I must say I cannot comprehend the purpose of this idea. This architecture, in my opinion, introduces zero benefits, and several unnecessary problems:

  • my source code becomes very complex, with a separate script needed per step, and problematic transfer of data and parameters between steps
  • my execution becomes unnecessarily long, in our case 4 minutes of overhead per step
  • my software engineering process becomes nasty and irritating, because every debugging iteration takes minutes instead of seconds

And what is the gain? Better abstraction and isolation of the steps. But why do we need to isolate the steps to such an extent? I am not even sure. After all, the entire pipeline is typically executed by the same AWS Identity and Access Management (IAM) user, so there is no need to protect access to data between steps. I have seen many pipeline/workflow solutions in the past (simple cron, various cluster schedulers, or Airflow) which were all simpler, nicer, and faster to use. In my view, the current Amazon SageMaker pipeline architecture is a step backwards compared to those older solutions.

Why I stopped using Anaconda

That one is interesting… after three years of happily using Anaconda without a glitch, I had my first serious problem, and it was enough. I probably won’t use Anaconda ever again.

I needed to install several packages, such as PyTorch, for the image recognition task. As mentioned, our deep learning classifier built with Torch (PyTorch) uses CUDA for GPU processing, but for basic development you can run a non-CUDA version of PyTorch on a laptop. I installed everything and saw the notebooks crashing my kernel. After much digging I isolated the kernel-killing code to this snippet:

from PIL import Image
im = Image.open('sample01.tif')

Soon it became clear that the problem resulted from an underlying library conflict: pillow, the imaging library used to read the TIFF files, was installed in an incorrect version. Unfortunately the case wasn’t standard, because conda would not easily display the core of the problem.
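In this kind of situation it helps to first confirm which Pillow build the kernel actually sees, and whether it was compiled with libtiff support (needed for many TIFFs). A small guarded check, safe to run even where Pillow is missing:

```python
def pillow_tiff_report():
    """Report the Pillow version visible to this interpreter and
    whether it was compiled with libtiff (needed for many TIFF files)."""
    try:
        import PIL
        from PIL import features
    except ImportError as exc:
        return f"Pillow not importable: {exc}"
    libtiff = "yes" if features.check("libtiff") else "no"
    return f"Pillow {PIL.__version__}, libtiff support: {libtiff}"

print(pillow_tiff_report())
```

Running this in each conda environment quickly shows whether two kernels are silently picking up different Pillow builds.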

This led to many hours of fighting, trying various combinations of package versions, and to messages like this:

(pil) C:\>conda install -c anaconda pillow
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Found conflicts! Looking for incompatible packages. 
UnsatisfiableError: The following specifications were found to be incompatible with each other:

But that was the end of the message: UnsatisfiableError did not provide any further detail, even when forced with conda config --set unsatisfiable_hints True. I was even unable to determine whether the problem was shallow, or deep, resulting from the fact that several versions of Python were present on my system.

In such a case it would be great to have Anaconda inside a virtual environment, so that I could experiment with various dependencies in isolation, without the danger of breaking my other projects! But Anaconda, at least by default, is not installed this way.
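Conda environments, which I switched to below, give exactly this isolation. The same idea also exists in the standard library as venv; a minimal sketch for creating a throwaway sandbox (the directory prefix is just an example):

```python
# Create a throwaway virtual environment for testing dependency
# combinations in isolation, using only the standard library.
import os
import tempfile
import venv

def make_sandbox():
    """Create an isolated venv in a temporary directory and return its path."""
    path = tempfile.mkdtemp(prefix="pil-sandbox-")
    # with_pip=True would also bootstrap pip into the environment
    venv.create(path, with_pip=False)
    return path

sandbox = make_sandbox()
print(os.path.exists(os.path.join(sandbox, "pyvenv.cfg")))  # True
```

Anything installed into such a sandbox stays there; deleting the directory removes the whole experiment.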

I ended up uninstalling all the Pythons from my system, uninstalling Anaconda, and saying goodbye to a configuration that had worked with dozens of projects. Then, on the fresh system, I installed Miniconda. This let me solve the problem easily. I quickly set up a few conda environments with conda env create, this time confident that whatever dependency problem I saw was limited to the scope of a particular environment. This let me isolate the pillow dependency problem quite rapidly. For some reason, the combination that finally worked was the one below: all packages installed from conda-forge, except pillow, which was installed by pip:
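As a sketch, an environment file for conda env create matching that layout could look like the following (the environment name and package selection are illustrative, not my actual list):

```yaml
# environment.yml (illustrative) -- everything from conda-forge, pillow via pip
name: pil-test
channels:
  - conda-forge
dependencies:
  - python=3.9
  - notebook
  - pip
  - pip:
      - pillow
```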

nest-asyncio       conda-forge
notebook-shim      conda-forge
openssl            conda-forge
pandocfilters      conda-forge
pickleshare        conda-forge
prompt-toolkit     conda-forge
pillow             pypi

Lessons learned: use virtual environments and don’t install Python dependencies outside them. I am now a happy user of Miniconda, and I use the command-line conda install instead of the Anaconda package manager GUI. More control, less risk, better sleep.

pushing data to AWS. SageMaker sucks. So does Anaconda