How to Hit AWS Step Functions Limitations…

…and how to overcome them.

Raphael Bottino
Better Programming

--

This is part two of a two-part series on my learnings as a first-time user of AWS Step Functions. You can find part one here.

TL;DR

I implemented a better architecture for my application and got it working. Then it stopped working on buckets with enough objects, because I had written a bad recursion. After fixing that, I discovered that AWS Step Functions and Lambda share yet another limitation I wasn't aware of. But I got that fixed too.

Curious? Keep reading.

Introduction

After hearing feedback from a few readers of the last article, and finally having some spare time on hand, I set out to implement a proper solution to my challenge. As a refresher: I had a Lambda function that ran for too long and, as a consequence, would time out fairly often depending on the input. It was a prime candidate for reimplementing the same logic with AWS Step Functions, and a great excuse to finally use the service.

Revisiting the Proposed Architecture

What I had in mind by the end of the last article was an architecture similar to the one below, with two different workflows in AWS Step Functions:

The architecture proposed in the previous article.

The first one would list all objects inside a bucket and, for every hundred objects, invoke the second workflow. The second workflow would then generate a pre-signed URL for each object in its input array and push it to a queue.

However, when I started to implement it, I decided to go with a different approach. There would still be two Workflows, but they would work slightly differently than originally proposed.

First Workflow: Starter

Above you can see the first workflow. It is quite simple, actually. I call it the Starter workflow, since it's the first one to run, and all it does is list all the keys in a bucket and then start the second workflow with that array of keys as input.

Second Workflow: The loop.

This is where things get interesting. To avoid running into the earlier problem of reaching the maximum number of execution history events (see the first article), the first step of this workflow selects the first (up to) 500 keys from the original array, since previous tests showed I wouldn't hit that limit with that many keys.

Then, in parallel, two distinct branches execute. On the left side of the diagram, for each of those up to 500 keys, we have exactly what we had before: a Lambda function that takes a key, generates a pre-signed URL, and pushes it to a queue. On the right side, a Choice state checks whether the original array, minus the 500 keys being processed, still has any keys left. If it doesn't, that's pretty much it. If it does, it starts this second workflow all over again with the remaining keys. This means no single execution will ever hit the execution history limit and, as a bonus, there is plenty of concurrency, which speeds up the whole process.
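Stripped of the Step Functions plumbing, the control flow of this loop workflow can be sketched in plain Python as below. The names are hypothetical stand-ins (the article doesn't show the state machine definition): `process_key` plays the role of the pre-signed-URL Lambda, and `start_execution` plays the role of the StartExecution call back into the same state machine. The point of the sketch is why the pattern avoids the history limit: each "iteration" is a brand-new execution with its own, small event history.

```python
BATCH_SIZE = 500  # the batch size the article settled on from earlier tests

def run_loop_workflow(keys, process_key, start_execution):
    """One execution of the loop workflow, sketched in plain Python.

    `process_key` stands in for the pre-signed-URL Lambda; `start_execution`
    stands in for starting a fresh execution of this same state machine.
    Both are hypothetical stand-ins, not real Step Functions API calls.
    """
    # First step: take (up to) the first 500 keys, keep the rest aside.
    batch, rest = keys[:BATCH_SIZE], keys[BATCH_SIZE:]

    # Left branch: fan out over the current batch.
    for key in batch:
        process_key(key)

    # Right branch (the Choice state): if keys remain, recurse into a
    # *new* execution, so no single execution's history ever grows large.
    if rest:
        start_execution(rest)
```

Each recursive hop costs one extra StartExecution, but in exchange every execution stays well under the event-history quota regardless of how many keys the bucket holds.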

While implementing the two new workflows, I ran into some challenges, such as getting the Choice state wrong so that the workflow always called itself again, sending the state machine's recursion into an infinite loop. But after some coding, and a few more mistakes, I got it done.

I did it! Or did I?

That was it. I did it. I was excited that I had finally done it. I immediately sent my code to Felipe, a friend with a big interest in this, so he could test it in his account and I could be sure I wouldn't run into a new take on the old classic: "But it runs on my AWS account!"

But I did. It didn't work in his account.

I knew the bucket he used for testing had more objects than mine, but I couldn't understand why that would make it fail. After all, my new workflow filters out just the first 500 keys to work with, and that was the only issue I had found beforehand. Troubleshooting the execution, I realized the second workflow was never triggered, so the problem had to be in my simple Lambda that lists all the keys.

Original code to list all keys

The code is fairly straightforward, but I did something wrong here: I used recursion poorly. As I mentioned in the previous article, each API call returns a page of up to a thousand objects. If I need more, I have to make the same call again, passing the previous call's NextContinuationToken as the ContinuationToken parameter. So I was calling the function over and over, stacking calls on top of each other, and for a bucket with enough objects that used up all the memory allocated to my Lambda function, which stopped it from moving along.
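The article's screenshot of the original code isn't reproduced here, but the recursive anti-pattern it describes looks roughly like this hedged reconstruction (function and parameter names are my own; `s3` is assumed to be a boto3 S3 client, or anything with the same `list_objects_v2` shape):

```python
def list_keys(s3, bucket, token=None):
    """Anti-pattern: paginate via recursion.

    Every page of results adds another stack frame, and each frame keeps
    its own `page` response and partial `keys` list alive until the
    deepest call returns, so memory use grows with the number of pages.
    """
    params = {"Bucket": bucket}
    if token:
        params["ContinuationToken"] = token
    page = s3.list_objects_v2(**params)
    keys = [obj["Key"] for obj in page.get("Contents", [])]
    if page.get("IsTruncated"):
        # Recurse for the next page -- this is where the stacking happens.
        keys += list_keys(s3, bucket, page["NextContinuationToken"])
    return keys
```

For a small bucket this works fine, which is exactly why the bug only surfaced on Felipe's larger bucket.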

Part of the new code to list all keys

After changing the original code to the above, removing the recursion, Felipe tested it again. And this time it didn't fail! At least, not because the function was using all the available memory…
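The fixed version can be sketched as a simple loop: only one page is held in memory at a time and the call stack stays flat. Again, this is a hedged reconstruction rather than the article's exact code (same assumed `s3` client interface as above):

```python
def list_keys(s3, bucket):
    """Iterative pagination: follow ContinuationToken in a loop.

    Memory use is bounded by one page plus the accumulated key list,
    no matter how many pages the bucket has.
    """
    keys = []
    token = None
    while True:
        params = {"Bucket": bucket}
        if token:
            params["ContinuationToken"] = token
        page = s3.list_objects_v2(**params)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]
```

boto3 also ships a built-in paginator (`s3.get_paginator("list_objects_v2")`) that encapsulates exactly this token-following loop.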

Yet another limitation

And there it was: yet another limitation. Both Step Functions and Lambda (for asynchronous invocations) have a 256 KB limit on their payloads.

This payload is too big for this lambda function

The array of keys generated by the function had so many entries that it exceeded 256 KB, breaking the continuity of the workflow. Again, just like all of my challenges so far, that's on me. RTFM.

AWS has a recommendation for Step Functions workflows that need to pass large payloads around: just don't. Instead, AWS recommends saving the payload to S3 and passing the object's ARN to the next step in the workflow, which reads the payload from there. I quickly changed my code to use this approach and, finally, the code works as expected!
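That recommendation is essentially the claim-check pattern, and a minimal sketch of it looks like the following. Function names, the bucket/key layout, and the shape of the reference object are all my own illustrative choices, not the article's code; the sketch passes the bucket and key (from which the ARN is trivially derived) instead of the raw payload:

```python
import json
import uuid

def stash_payload(s3, bucket, keys):
    """Write the big key list to S3 and return a small reference object
    that fits comfortably under the 256 KB payload limit."""
    ref_key = f"payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=ref_key, Body=json.dumps(keys).encode())
    return {"payload_bucket": bucket, "payload_key": ref_key}

def load_payload(s3, ref):
    """The next step in the workflow dereferences the claim check
    and reads the actual payload back from S3."""
    obj = s3.get_object(Bucket=ref["payload_bucket"], Key=ref["payload_key"])
    return json.loads(obj["Body"].read())
```

The state machine then only ever passes the tiny reference dict between states, so the payload limit stops mattering no matter how many keys the bucket yields.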

Conclusion

I learned a lot getting this code to a state I'm comfortable sharing with my peers, but I also wasted a good amount of time simply because I dove straight in instead of first reading at least a bit of the service documentation.

I still highly recommend AWS Step Functions, if you are comfortable with these limitations and with the workarounds needed to make your code work and keep it maintainable long-term. I also recommend reading its best practices before you write your first line of code. And, as I was writing this article, AWS released a Step Functions workshop that looks really promising.

Are you feeling more comfortable with AWS Step Functions after this two-part series? Are you ready to start using it, or do you think you should stick with plain Lambda? Let me know in the comments section!

--

Cook as a hobby, married, father of 2 dogs, and Product Manager for Developer Experience at Trend Micro. My opinions are my own.