Top "Apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

How to query JSON data column using Spark DataFrames?

I have a Cassandra table that for simplicity looks something like: key: text jsonData: text blobData: blob I can create …

scala apache-spark dataframe apache-spark-sql spark-cassandra-connector
What are the various join types in Spark?

I looked at the docs and it says the following join types are supported: Type of join to perform. Default …

scala apache-spark apache-spark-sql spark-dataframe apache-spark-2.0
Flattening Rows in Spark

I am doing some testing for Spark using Scala. We usually read JSON files, which need to be manipulated like …

scala apache-spark apache-spark-sql distributed-computing
Where do you need to use lit() in Pyspark SQL?

I'm trying to make sense of where you need to use a lit value, which is defined as a literal …

python apache-spark pyspark apache-spark-sql
Unpacking a list to select multiple columns from a Spark data frame

I have a Spark data frame df. Is there a way of sub-selecting a few columns using a list …

apache-spark apache-spark-sql spark-dataframe
Parse CSV as DataFrame/DataSet with Apache Spark and Java

I am new to Spark, and I want to use group-by & reduce to find the following from CSV (one …

java apache-spark hadoop apache-spark-sql hdfs
Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?

I am using Spark 1.5. I have two dataframes of the form: scala> libriFirstTable50Plus3DF res1: org.apache.spark.…

scala apache-spark join apache-spark-sql
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

I'm using HiveContext with SparkSQL and I'm trying to connect to a remote Hive metastore; the only way to set …

apache-spark hive apache-spark-sql
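One commonly cited approach for the question above is to set the metastore thrift URI directly on the Spark configuration instead of shipping a hive-site.xml. A hedged config sketch — the host and port are placeholders:

```
# spark-defaults.conf -- hypothetical metastore host/port
spark.hadoop.hive.metastore.uris  thrift://metastore-host:9083
```

The `spark.hadoop.` prefix forwards the property into the Hadoop/Hive configuration that the HiveContext reads.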
"INSERT INTO ..." with SparkSQL HiveContext

I'm trying to run an insert statement with my HiveContext, like this: hiveContext.sql('insert into my_table (id, score) …

apache-spark apache-spark-sql pyspark apache-spark-1.5 hivecontext
Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs, since they appear to be higher-level than RDDs and tend …

apache-spark pyspark apache-spark-sql