Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.
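For orientation, a minimal Scala sketch of what the tag covers: registering a DataFrame as a temporary view and querying it with plain SQL. The data and view name here are illustrative, not taken from any particular question.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory DataFrame (illustrative data)
    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

    // Register as a temporary view, then query it with SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()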
I am using pyspark 2.0 to create a DataFrame object by reading a csv using: data = spark.read.csv('data.csv', …
apache-spark spark-dataframe apache-spark-2.0
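The read call is truncated, so the asker's exact options are unknown. As a sketch, here is the CSV reader built into Spark 2.0 (shown in Scala; the pyspark call in the question mirrors it), with header and inferSchema as assumed options rather than the asker's actual arguments:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val data = spark.read
      .option("header", "true")       // treat the first line as column names (assumption)
      .option("inferSchema", "true")  // sample the file to guess column types (assumption)
      .csv("data.csv")

    data.printSchema()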
I am trying to create a DataFrame using an RDD. First I am creating an RDD using the code below - val …
scala apache-spark spark-dataframe apache-spark-dataset
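The val … is cut off, so the asker's RDD is unknown. A minimal sketch of the two standard routes from an RDD to a DataFrame; the Person case class and sample rows are hypothetical stand-ins:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical record type standing in for the asker's data
    case class Person(name: String, age: Int)

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Ben", 25)))

    // Route 1: implicit conversion (needs spark.implicits._)
    val df1 = rdd.toDF()

    // Route 2: explicit conversion
    val df2 = spark.createDataFrame(rdd)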
I'm working on a Spark MLlib algorithm. The dataset I have is in this form: Company":"XXXX","CurrentTitle":"XYZ","Edu_…
apache-spark apache-spark-sql spark-dataframe apache-spark-mllib
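The records look like JSON objects whose opening brace was lost to truncation. Assuming the file really is JSON Lines, a sketch of loading it into a DataFrame before handing it to MLlib; the path is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Assumes one JSON object per line, e.g.
    // {"Company":"XXXX","CurrentTitle":"XYZ", ...}
    val df = spark.read.json("records.json")

    df.select("Company", "CurrentTitle").show()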
How do you replace single quotes with double quotes in Scala? I have a data file that has some records …
scala dataframe spark-dataframe double-quotes single-quotes
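For the string manipulation itself, plain String.replace is enough in Scala. A sketch with a hypothetical record and, commented out, the same fix mapped over a file read as an RDD of lines:

    // Replace every single quote with a double quote in one record
    val record = "{'name':'Thomas','age':30}"
    val fixed  = record.replace('\'', '"')
    // fixed is now {"name":"Thomas","age":30}

    // Applied to a whole file (sc is an existing SparkContext; the path is a placeholder):
    // val cleaned = sc.textFile("data.txt").map(_.replace('\'', '"'))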
Given the following DataSet values as inputData:

    column0  column1  column2  column3
    A        88       text     99
    Z        12       test     200
    T        120      foo      12

In Spark, what …
scala apache-spark spark-dataframe apache-spark-dataset
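The question is cut off after "In Spark, what", so the actual task is unknown. For reference, a sketch that just materializes the sample rows as a typed Dataset; the case class is invented to match the four columns shown:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical record type for the four columns shown above
    case class Record(column0: String, column1: Long, column2: String, column3: Long)

    val inputData = Seq(
      Record("A", 88, "text", 99),
      Record("Z", 12, "test", 200),
      Record("T", 120, "foo", 12)
    ).toDS()

    inputData.show()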
Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, …
apache-spark apache-spark-sql spark-dataframe parquet snappy
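Going by the parquet and snappy tags, the codec is set through spark.sql.parquet.compression.codec; snappy is the Spark 2.x default and favors speed, while gzip usually yields a better ratio at the cost of slower writes. A sketch, with a generated stand-in for the dataset:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Switch Parquet output from the snappy default to gzip
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

    // Hypothetical data; sorting on a low-cardinality column before writing
    // can also help Parquet's run-length and dictionary encoding
    val df = spark.range(1000000).toDF("id")
    df.sort("id").write.parquet("/tmp/out_gzip")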
I am saving my Spark data frame output as a CSV file in Scala with partitions. This is how I do …
scala apache-spark amazon-s3 spark-dataframe multipleoutputs
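The write call is truncated; as a sketch, Spark 2.0's CSV writer combined with partitionBy, targeting S3 through the s3a connector. The column names and bucket path are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical frame; "year" stands in for whatever column drives the layout
    val df = Seq(("a", 2016), ("b", 2017)).toDF("value", "year")

    df.write
      .partitionBy("year")             // one subdirectory per distinct year value
      .option("header", "true")
      .csv("s3a://my-bucket/output/")  // placeholder bucket and path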
I have a Spark DataFrame as shown below:

    # Create DataFrame
    df <- data.frame(name = c("Thomas", "William", "Bill", "…

pyspark spark-dataframe sparkr
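The snippet is SparkR, building a local R data.frame before (presumably) converting it, and the vector of names is cut off. For comparison only, a Scala sketch creating an equivalent single-column Spark DataFrame from the names that are visible:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Only the three visible names are from the question; the rest were truncated
    val df = Seq("Thomas", "William", "Bill").toDF("name")
    df.show()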
I thought that with the integration of Project Tungsten, Spark would automatically use off-heap memory. What for are spark.…
apache-spark apache-spark-sql spark-dataframe apache-spark-2.0 off-heap
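The setting names are truncated; presumably they are spark.memory.offHeap.enabled and spark.memory.offHeap.size, which default to off, so Tungsten's off-heap execution memory is opt-in rather than automatic. A sketch, assuming that reading of the question:

    import org.apache.spark.sql.SparkSession

    // Off-heap execution memory must be enabled explicitly and given
    // a fixed size in bytes; it is disabled by default
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", 2L * 1024 * 1024 * 1024)  // 2 GiB
      .getOrCreate()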
I created a DataFrame using sqlContext and I have a problem with the datetime format, as it is identified as …
datetime apache-spark pyspark spark-dataframe python-datetime
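The datetime column is presumably arriving as a plain string. In Spark 2.0 (before to_timestamp was added in 2.2), the usual fix is unix_timestamp with an explicit pattern plus a cast; the question is pyspark, but the same functions exist there. A Scala sketch with an assumed format:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.unix_timestamp

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical frame whose datetime column arrived as a string
    val df = Seq("2017-03-01 12:30:00").toDF("dt")

    // Parse with an explicit pattern, then cast to a real timestamp type
    val parsed = df.withColumn(
      "dt_ts",
      unix_timestamp($"dt", "yyyy-MM-dd HH:mm:ss").cast("timestamp")
    )
    parsed.printSchema()  // dt_ts is now a timestamp column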