PySpark DataFrames - way to enumerate without converting to Pandas?

Maria Koroliuk · Sep 24, 2015 · Viewed 31.8k times

I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record with a certain index (or select a group of records within a range of indexes).

In pandas, I could simply do:

indexes=[2,3,6,7] 
df.iloc[indexes]

Here I want something similar, without converting the dataframe to pandas.

The closest I can get is:

  • Enumerating all the objects in the original dataframe with:

    indexes=np.arange(df.count())
    df_indexed=df.withColumn('index', indexes)

  • Searching for the values I need using the where() function.

QUESTIONS:

  1. Why doesn't it work, and how can I make it work? How can I add a row number to a dataframe?
  2. Would it later work to do something like:

     indexes=[2,3,6,7] 
     df1.where("index in indexes").collect()
    
  3. Is there a faster or simpler way to deal with this?

Answer

zero323 · Sep 24, 2015

It doesn't work because:

  1. the second argument for withColumn should be a Column, not a collection; np.array won't work here (a minimal sketch follows this list)
  2. when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier
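
A minimal sketch of the first point, using the built-in lit function (the column name "constant" is only an illustration):

    from pyspark.sql.functions import lit

    # OK: lit(0) is a Column expression, so withColumn accepts it
    df.withColumn("constant", lit(0))

    # Not OK: a NumPy array is not a Column, so withColumn rejects it
    # df.withColumn("index", np.arange(df.count()))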

PySpark >= 1.4.0

You can add row numbers using the rowNumber window function, and then query using the Column.isin method or a properly formatted query string:

from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window

# Window with no partitioning or ordering; row numbers follow whatever
# order Spark happens to process the rows in
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))

# Using DSL
indexed.where(col("index").isin(set(indexes)))

# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))