Top "Apache-spark-sql" questions

Apache Spark SQL is a module for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

How to use regexp_replace in Spark

I am pretty new to Spark and would like to perform an operation on a column of a DataFrame so …

scala apache-spark apache-spark-sql regexp-replace
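
A minimal Scala sketch of the usual answer: apply org.apache.spark.sql.functions.regexp_replace with withColumn. The column name and pattern below are illustrative, not taken from the question.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.regexp_replace

    val spark = SparkSession.builder.appName("regexp-replace-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical column: strip every run of digits from "address".
    val df = Seq("12 Main St", "3rd Ave 45").toDF("address")
    df.withColumn("address_clean", regexp_replace($"address", "[0-9]+", "")).show(false)
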
How to avoid duplicate columns after join?

I have two DataFrames with the following columns: df1.columns // Array(ts, id, X1, X2) and df2.columns // Array(ts, …

scala apache-spark apache-spark-sql
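
The standard fix, sketched in Scala: join on a Seq of column names instead of a column expression, so Spark keeps a single copy of each key column. The toy frames below are illustrative.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dedup-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1L, "a", 10)).toDF("ts", "id", "X1")
    val df2 = Seq((1L, "a", 20)).toDF("ts", "id", "Y1")

    // A Seq of column names makes Spark emit "ts" and "id" only once,
    // unlike an expression join (df1("ts") === df2("ts")), which keeps both copies.
    val joined = df1.join(df2, Seq("ts", "id"))
    joined.printSchema()
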
SparkSQL - Read Parquet file directly

I am migrating from Impala to Spark SQL, using the following code to read a table: my_data = sqlContext.read.parquet(…

scala apache-spark hive apache-spark-sql hdfs
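
The question uses the older sqlContext entry point; a minimal sketch with the modern SparkSession, reading an illustrative Parquet path:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("read-parquet").getOrCreate()

    // The path is illustrative; any local or hdfs:// path to a Parquet directory works.
    val myData = spark.read.parquet("hdfs:///warehouse/my_table")
    myData.printSchema()
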
Filtering DataFrame using the length of a column

I want to filter a DataFrame using a condition related to the length of a column; this question might be …

python apache-spark dataframe pyspark apache-spark-sql
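
The question is tagged pyspark, but the idea is the same in Scala (used for all sketches here): filter on functions.length. The column name and threshold are illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.length

    val spark = SparkSession.builder.appName("filter-length").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("abc", "abcdef").toDF("name")

    // Keep only rows whose "name" value is longer than 3 characters.
    df.filter(length($"name") > 3).show()
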
Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every …

apache-spark apache-spark-sql distinct-values
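
One common approach, sketched in Scala: build a countDistinct aggregate per column and run them all in a single agg pass; approx_count_distinct is the usual swap-in when an estimate is good enough.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, countDistinct}

    val spark = SparkSession.builder.appName("count-distinct").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "a"), (2, "b")).toDF("k", "v")

    // One countDistinct aggregate per column, computed in a single pass.
    val aggs = df.columns.map(c => countDistinct(col(c)).alias(c))
    df.agg(aggs.head, aggs.tail: _*).show()
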
Join two ordinary RDDs with/without Spark SQL

I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to the database join …

scala join apache-spark rdd apache-spark-sql
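
Without Spark SQL, the usual route is the pair-RDD API: key each RDD by the join column(s) and call join. A minimal sketch with illustrative data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("rdd-join").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Key each RDD by the join column(s); PairRDDFunctions then provides join.
    val left  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val right = sc.parallelize(Seq((1, "NYC"), (3, "LA")))

    // Inner join on the key; leftOuterJoin / fullOuterJoin cover the outer variants.
    left.join(right).collect().foreach(println)  // prints (1,(alice,NYC))
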
How to flatten a struct in a Spark DataFrame?

I have a DataFrame with the following structure:

 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- keyNote: …

java apache-spark pyspark apache-spark-sql
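
A common trick, sketched in Scala: select the struct with a .* wildcard to promote its fields to top-level columns. The nested shape below is a stand-in for the question's schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.struct

    val spark = SparkSession.builder.appName("flatten-struct").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical nested shape roughly matching the question's schema.
    val df = Seq((1L, "note"))
      .toDF("id", "keyNote")
      .select(struct($"id", $"keyNote").alias("data"))

    // "data.*" expands every field of the struct into a top-level column.
    val flat = df.select($"data.*")
    flat.printSchema()  // id and keyNote are now top-level columns
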
How do I detect if a Spark DataFrame has a column?

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column …

scala apache-spark dataframe apache-spark-sql
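
A minimal Scala sketch: df.columns is a plain Array[String], so a membership test answers the question for top-level fields. The file name is illustrative.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("has-column").getOrCreate()

    // Illustrative JSON source; the check is the same for any DataFrame.
    val df = spark.read.json("people.json")

    // Covers top-level fields only; nested fields require walking df.schema instead.
    val hasAge = df.columns.contains("age")
    println(hasAge)
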
Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. …

python apache-spark dataframe pyspark apache-spark-sql
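
The usual workaround, sketched in Scala (the question itself is tagged pyspark): add a typed null literal with withColumn, since a bare lit(null) carries NullType.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder.appName("empty-column").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2).toDF("id")

    // lit(null) alone has NullType, so cast it to the type the column should have.
    val withEmpty = df.withColumn("note", lit(null).cast(StringType))
    withEmpty.printSchema()
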
Spark DataFrame drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark DataFrames? …

dataframe apache-spark pyspark apache-spark-sql duplicates
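
dropDuplicates takes a column subset but makes no ordering guarantee about which row survives; a deterministic keep-first needs a window, as in this Scala sketch with illustrative columns:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val spark = SparkSession.builder.appName("keep-first").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "ts")

    // dropDuplicates("k") keeps an arbitrary row per key; for a deterministic
    // "first", rank the rows within each key and keep rank 1.
    val w = Window.partitionBy($"k").orderBy($"ts")
    df.withColumn("rn", row_number().over(w))
      .filter($"rn" === 1)
      .drop("rn")
      .show()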