"No Filesystem for Scheme: gs" when running spark job locally

Question 1

apache-spark hadoop google-cloud-storage google-cloud-dataproc google-hadoop

Yaniv Donenfeld · Jan 5, 2015 · Viewed 9.8k times

Answer

In Scala, set the following properties on the hadoopConfiguration of your SparkContext (sc below) before reading any gs:// path:

// sc is the active SparkContext
val conf = sc.hadoopConfiguration
// Map the gs:// scheme to the GCS connector classes
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
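For context, here is a minimal sketch of where those lines might sit in a job that runs in local mode. It assumes the GCS connector jar is already on the application classpath. The object name GcsLocalExample, the helper registerGcs, and the app name are placeholders made up for illustration; the path is the one from the question. Credential and project settings are not shown and will most likely be needed as well.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}

object GcsLocalExample {
  // Hypothetical helper: maps the gs:// scheme to the connector classes
  def registerGcs(conf: Configuration): Unit = {
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, so no Hadoop service is involved
    val sparkConf = new SparkConf().setAppName("gcs-local-example").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // Apply the settings to the Hadoop configuration that Spark itself uses
    registerGcs(sc.hadoopConfiguration)

    // Path from the question; authentication settings are omitted in this sketch
    val lines = sc.textFile("gs://mybucket/folder")
    println(lines.count())

    sc.stop()
  }
}
```

Setting the properties in code like this avoids depending on which core-site.xml, if any, the local JVM happens to pick up.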

Question 2

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).

When running the job locally on my Mac machine, I am getting the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done for gs paths to be supported. One is to install the GCS connector, and the other is to have the following setup in the core-site.xml of the Hadoop installation:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>

I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project, I am using Maven, so I imported the Spark library as follows:

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and Hadoop 1.2.1 as follows:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>

The thing is, I am not sure where the Hadoop location is configured for Spark, or where the Hadoop conf is configured. Therefore, I may be adding the settings to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I can see, there is no Hadoop service running on my machine.
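One way to narrow this down is sketched below, under stated assumptions. In local mode Spark runs inside the application's JVM, and Hadoop's Configuration class loads core-site.xml as a classpath resource, so the settings that apply are the ones visible on the project's classpath rather than those of a separately installed Hadoop. The object name GcsClasspathCheck and the helper classAvailable are hypothetical names used only for this check.

```scala
import org.apache.hadoop.conf.Configuration
import scala.util.Try

object GcsClasspathCheck {
  // True if this JVM can load the named class (without initializing it)
  def classAvailable(name: String): Boolean =
    Try(Class.forName(name, false, getClass.getClassLoader)).isSuccess

  def main(args: Array[String]): Unit = {
    // Loads core-default.xml and core-site.xml from the classpath
    val conf = new Configuration()

    // null here means no core-site.xml is visible on the classpath
    println("core-site.xml found at: " + conf.getResource("core-site.xml"))

    // null here means the gs scheme has not been mapped to any class
    println("fs.gs.impl = " + conf.get("fs.gs.impl"))

    // false here means the GCS connector jar is not on the classpath
    println("connector on classpath: " +
      classAvailable("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"))
  }
}
```

If the first two values print as null, the core-site.xml that was edited is probably not the one this JVM reads. If the last line prints false, the connector jar itself is missing from the project's dependencies.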
