I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message:
Reason: Container killed by YARN for exceeding memory limits.
5.5 GB of 5.5 GB …
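For context on that error: YARN kills a container when the executor's JVM heap plus its off-heap overhead exceeds the container's limit. By default Spark requests an overhead of max(384 MB, 10% of executor memory) on top of the heap, which is how a ~5 GB executor turns into a 5.5 GB container. A small sketch of that arithmetic (the function name is mine, not a Spark API):

```python
def yarn_container_request_mb(executor_memory_mb, memory_overhead_mb=None):
    """Approximate the YARN container size Spark requests per executor."""
    if memory_overhead_mb is None:
        # Spark's default overhead: max(384 MB, 10% of executor memory)
        memory_overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + memory_overhead_mb

# A 5 GB executor heap yields a ~5.5 GB container request,
# matching the "5.5 GB of 5.5 GB" in the error above.
print(yarn_container_request_mb(5 * 1024))  # 5632 MB, i.e. 5.5 GB
```

If the off-heap usage (Python workers, serialization buffers) is what's growing, raising `spark.yarn.executor.memoryOverhead` (or lowering `spark.executor.memory` to leave more headroom) is the usual first knob to try.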
I want to do something really basic: fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
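One common route is an EMR bootstrap action: a shell script stored in S3 that EMR runs on every node when the cluster is created, so the package is available to both the driver and the executors. A minimal sketch, assuming you upload the script to S3 and select it under "Bootstrap Actions" in the console (the bucket path and filename here are hypothetical):

```shell
#!/bin/bash
# install_deps.sh -- upload to e.g. s3://my-bucket/install_deps.sh and
# reference it as a bootstrap action when creating the cluster.
set -e
# Install the dependency for the Python that pyspark uses on each node.
sudo pip install arrow
```

When launching from the console, add the script's S3 path in the Bootstrap Actions section of the cluster-creation form; once the cluster is up, any pyspark script submitted to it can `import arrow`.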