I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.
Here's an example I'm using in production:
s3://mybucket/bootstrap/install_python_modules.sh
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip install -U \
  awscli            \
  boto              \
  ciso8601          \
  ujson             \
  workalendar
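# psycopg2 needs PostgreSQL client libraries, so use the prebuilt yum package instead of pip: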
sudo yum install -y python-psycopg2
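Since your question mentions Arrow specifically, the exact same pattern covers it. Here's a minimal sketch (the bucket and script names are placeholders; pyarrow is the PyPI package that provides Apache Arrow for Python):

s3://mybucket/bootstrap/install_pyarrow.sh

#!/bin/bash -xe
# Runs on every node at bootstrap time, before applications start
sudo pip install -U pyarrow

If you later want to script cluster creation instead of clicking through the console, the same bootstrap action can be attached with the AWS CLI. The cluster settings below are illustrative, not a recommendation:

aws emr create-cluster --name "SparkCluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://mybucket/bootstrap/install_pyarrow.sh

Note that bootstrap actions run on every node before your cluster is ready, so this is the right place for anything your Spark executors need to import.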