AnalysisException: u"cannot resolve 'name' given input columns: [ list] in sqlContext in spark

Elm662 · Aug 18, 2016 · Viewed 42.6k times

I tried a simple example like:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

It works well, but when I try something very similar:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)

it raises the error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [ data_channel_is_tech,...

Of course, I imported LabeledPoint and LinearRegression.

What could be wrong?

Even the simpler case

df_cleaned = df_cleaned.select("shares")

raises the same AnalysisException.

Please note: df_cleaned.printSchema() works well.
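This is consistent with the eventual diagnosis: printSchema echoes the stored names verbatim, so a name with a stray leading space looks normal at a glance. A quick sketch with hypothetical column names shows why the lookup still fails:

```python
# Hypothetical column names as Spark stored them from the CSV header
# (note the leading spaces, which printSchema does not make obvious).
columns = [" timedelta", " shares"]

# select("timedelta") must match the stored name exactly,
# so the unspaced name does not resolve while the spaced one does.
print("timedelta" in columns)   # False
print(" timedelta" in columns)  # True
```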

Answer

Elm662 · Aug 18, 2016

I found the issue: some of the column names contain leading whitespace before the name itself. So

data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

worked. I could detect the whitespace with

assert " " not in ''.join(df.columns)  

Now I am thinking of a way to remove the whitespace. Any idea is much appreciated!
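One approach (a sketch, not tested against the dataset in the question) is to strip every column name in one pass and reapply the cleaned names with DataFrame.toDF:

```python
# Hypothetical header names with the leading spaces described above.
raw_columns = [" timedelta", " shares", "data_channel_is_tech"]

# Strip surrounding whitespace from each name.
clean_columns = [c.strip() for c in raw_columns]
print(clean_columns)  # ['timedelta', 'shares', 'data_channel_is_tech']

# Applied to the DataFrame itself, this would be roughly:
#   data = data.toDF(*[c.strip() for c in data.columns])
```

After that, select("timedelta", "shares") should resolve without quoting the leading spaces.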