Apache Spark SQL is Spark's module for "SQL and structured data processing"; Spark itself is a fast, general-purpose cluster-computing system.
I am pretty new to Spark and would like to perform an operation on a column of a dataframe so …
(tagged: scala, apache-spark, apache-spark-sql, regexp-replace)

I have two dataframes with the following columns: df1.columns // Array(ts, id, X1, X2) and df2.columns // Array(ts, …
(tagged: scala, apache-spark, apache-spark-sql)

I am migrating from Impala to SparkSQL, using the following code to read a table: my_data = sqlContext.read.parquet(…
(tagged: scala, apache-spark, hive, apache-spark-sql, hdfs)

I want to filter a DataFrame using a condition related to the length of a column; this question might be …
(tagged: python, apache-spark, dataframe, pyspark, apache-spark-sql)

The question is pretty much in the title: Is there an efficient way to count the distinct values in every …
(tagged: apache-spark, apache-spark-sql, distinct-values)

I need to join two ordinary RDDs on one/more columns. Logically this operation is equivalent to the database join …
(tagged: scala, join, apache-spark, rdd, apache-spark-sql)

I have a dataframe with the following structure:
|-- data: struct (nullable = true)
|    |-- id: long (nullable = true)
|    |-- keyNote: …
(tagged: java, apache-spark, pyspark, apache-spark-sql)

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column …
(tagged: scala, apache-spark, dataframe, apache-spark-sql)

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. …
(tagged: python, apache-spark, dataframe, pyspark, apache-spark-sql)

Question: in pandas, when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark DataFrames? …
(tagged: dataframe, apache-spark, pyspark, apache-spark-sql, duplicates)