PySpark DataFrames - way to enumerate without converting to Pandas?

Maria Koroliuk · Sep 24, 2015 · Viewed 31.8k times

I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record with a certain index (or select a group of records within a range of indexes).

In pandas, I could simply do:

indexes=[2,3,6,7] 
df.iloc[indexes]

Here I want something similar, without converting the dataframe to pandas.

The closest I can get is:

  • Enumerating all the objects in the original dataframe with:

    indexes=np.arange(df.count())
    df_indexed=df.withColumn('index', indexes)

  • Searching for the values I need using the where() function.

QUESTIONS:

  1. Why doesn't it work, and how can I make it work? How can I add a row number to a dataframe?
  2. Would it later work to do something like:

     indexes=[2,3,6,7] 
     df1.where("index in indexes").collect()
    
  3. Is there a faster or simpler way to deal with this?

Answer

zero323 · Sep 24, 2015

It doesn't work because:

  1. the second argument for withColumn should be a Column, not a collection; np.array won't work here (a minimal sketch follows this list)
  2. when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier
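
A minimal sketch of the first point, using the built-in lit function (the column name "constant" is only an illustration):

    from pyspark.sql.functions import lit

    # OK: lit(0) is a Column expression, so withColumn accepts it
    df.withColumn("constant", lit(0))

    # Not OK: a NumPy array is not a Column, so withColumn rejects it
    # df.withColumn("index", np.arange(df.count()))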

PySpark >= 1.4.0

You can add row numbers using the rowNumber window function, and then query using the Column.isin method or a properly formatted query string:

from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window

# Window with no partitioning or ordering; row numbers follow whatever
# order Spark happens to process the rows in
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))

# Using DSL
indexed.where(col("index").isin(set(indexes)))

# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))