Top "EMR" questions

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some nested JSON data and run queries on it. Is this …

json hadoop hive amazon-emr emr
"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146…

apache-spark emr amazon-emr bigdata
Compress file on S3

I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't …

amazon-s3 compression hive file-transfer emr
How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive over EMR and need to copy some files to all EMR instances. One way, as I understand …

amazon-s3 hadoop hive hdfs emr
Pyspark --py-files doesn't work

I use this as the documentation suggests (http://spark.apache.org/docs/1.1.1/submitting-applications.html), Spark version 1.1.0: ./spark/bin/spark-submit --py-files /home/…

python hadoop apache-spark emr
Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file …

apache-spark pyspark emr amazon-emr pyspark-sql
Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: …

amazon-s3 hive elastic-map-reduce emr
How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart YARN to …

hadoop yarn emr
How to bootstrap installation of Python modules on Amazon EMR?

I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a …

python amazon-web-services apache-spark emr
SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

I am trying to create a simple SQL query on S3 events using Spark. I am loading ~30GB of JSON …

sql apache-spark amazon-ec2 emr