I am trying to run a simple GraphFrame example using PySpark.
Spark version: 2.0
GraphFrames version: 0.2.0
I am able to import graphframes in Jupyter:
from graphframes import GraphFrame
GraphFrame
graphframes.graphframe.GraphFrame
I get this error when I try and create a GraphFrame object:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-23-2bf19c66804d> in <module>()
----> 1 gr_links = GraphFrame(df_web_page, df_parent_child_link)
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in __init__(self, v, e)
60 self._sc = self._sqlContext._sc
61 self._sc._jvm.org.apache.spark.ml.feature.Tokenizer()
---> 62 self._jvm_gf_api = _java_api(self._sc)
63 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
64
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in _java_api(jsc)
32 def _java_api(jsc):
33 javaClassName = "org.graphframes.GraphFramePythonAPI"
---> 34 return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
35 .newInstance()
36
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling o138.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
The Python code tries to load the Java class (from the jar), I guess, but can't seem to find it. Any suggestions on how to fix this?
First, download the GraphFrames jar that matches your Spark version from https://spark-packages.org/package/graphframes/graphframes.
Then copy the downloaded jar into your Spark jars directory (here it is downloaded straight into that directory):
root@93d8398b53f2:/usr/local/spark/jars# wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
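The jar name above appears to follow the pattern <GraphFrames version>-spark<Spark version>-s_<Scala version>. Assuming that pattern holds for the release you need, a small helper (the function name is hypothetical, not part of GraphFrames) can build both the --packages coordinate and the jar file name so the two stay in sync:

```python
def graphframes_artifact(gf_version, spark_version, scala_version):
    """Return (packages_coordinate, jar_filename) for a GraphFrames release.

    Assumes the naming pattern seen in this answer:
    <gf_version>-spark<spark_version>-s_<scala_version>
    """
    tag = "{0}-spark{1}-s_{2}".format(gf_version, spark_version, scala_version)
    coordinate = "graphframes:graphframes:" + tag
    jar_name = "graphframes-" + tag + ".jar"
    return coordinate, jar_name


coordinate, jar_name = graphframes_artifact("0.3.0", "2.0", "2.11")
print(coordinate)
print(jar_name)
```

Check the spark-packages page for the exact tag of your release before relying on the generated names.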
Here is the little trick: launch pyspark with these arguments the first time, so that it downloads all of the jar dependencies GraphFrames needs:
root@93d8398b53f2:/usr/local/spark/bin# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
This should come up:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
downloading http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar ...
[SUCCESSFUL ] graphframes#graphframes;0.3.0-spark2.0-s_2.11!graphframes.jar (269ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2!scala-logging-api_2.11.jar (53ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2!scala-logging-slf4j_2.11.jar (66ms)
downloading https://repo1.maven.org/maven2/org/scala-lang/scala-reflect/2.11.0/scala-reflect-2.11.0.jar ...
[SUCCESSFUL ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1409ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.7/slf4j-api-1.7.7.jar ...
[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (53ms)
:: resolution report :: resolve 6161ms :: artifacts dl 1877ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
5 artifacts copied, 0 already retrieved (4713kB/39ms)
Warning: Local jar /usr/local/spark-2.0.0-bin-hadoop2.7/bin/graphframes-0.3.0-spark2.0-s_2.11.jar does not exist, skipping.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:43:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:43:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
This means all the required dependencies have been downloaded. The important line is "Ivy Default Cache set to: /root/.ivy2/cache"; more precisely, the jars themselves are stored in /root/.ivy2/jars.
You can exit right away. If you carry on at this point and run Python code that creates a GraphFrame, it will raise this error:
Py4JJavaError: An error occurred while calling o561.newInstance.
: java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.GraphFrame.
Let's see what's inside the directory /root/.ivy2/jars:
root@93d8398b53f2:/usr/local/spark/bin# ls /root/.ivy2/jars/
com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar org.scala-lang_scala-reflect-2.11.0.jar org.slf4j_slf4j-api-1.7.7.jar
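If you want to confirm that a given jar really contains the class named in the ClassNotFoundException, you can inspect it from Python, since a jar is just a zip archive. This is a minimal sketch using only the standard library; the helper name jar_contains_class is made up for illustration:

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar (a zip archive) has an entry for class_name."""
    # org.graphframes.Foo is stored as org/graphframes/Foo.class inside the jar
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, call jar_contains_class("/root/.ivy2/jars/graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar", "org.graphframes.GraphFramePythonAPI") to check the jar downloaded above.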
Now you'll want to copy all the jars appearing in /root/.ivy2/jars to your spark's jars directory:
root@93d8398b53f2:/usr/local/spark/jars# cp /root/.ivy2/jars/* .
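The same copy step can be scripted in Python, which is handy if you rebuild the environment often. This is only a sketch: the function name is hypothetical, and the two directories are the ones used in this answer, so adjust them to your installation:

```python
import glob
import os
import shutil


def copy_jars(src_dir, dest_dir):
    """Copy every .jar in src_dir into dest_dir; return the copied file names."""
    copied = []
    for jar in sorted(glob.glob(os.path.join(src_dir, "*.jar"))):
        shutil.copy(jar, dest_dir)
        copied.append(os.path.basename(jar))
    return copied
```

With the paths from above, that would be copy_jars("/root/.ivy2/jars", "/usr/local/spark/jars").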
Launch pyspark for the second time:
root@93d8398b53f2:/usr/local/spark/jars# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
This should come up:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
:: resolution report :: resolve 748ms :: artifacts dl 27ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 0 | 0 | 0 || 5 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/24ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:53:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
You can now enjoy GraphFrame:
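>>> # Create a Vertex DataFrame with unique ID column "id"
... v = sqlContext.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> # Create an Edge DataFrame with "src" and "dst" columns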
... e = sqlContext.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> # Create a GraphFrame
... from graphframes import *
>>> g = GraphFrame(v, e)
>>>
>>> # Query: Get in-degree of each vertex.
... g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
>>>
>>> # Query: Count the number of "follow" connections in the graph.
... g.edges.filter("relationship = 'follow'").count()
2
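>>> # Run the PageRank algorithm and show the results.
... results = g.pageRank(resetProbability=0.01, maxIter=20)
>>> results.vertices.select("id", "pagerank").show()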
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9059:
[rdd_337_0]
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9060:
[rdd_337_1]
+---+-------------------+
| id| pagerank|
+---+-------------------+
| a| 0.01|
| b| 0.2808611427228327|
| c|0.27995525261339177|
+---+-------------------+
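Since the question was asked from Jupyter, where you do not type the pyspark command yourself, one common way to pass the same --packages argument is the PYSPARK_SUBMIT_ARGS environment variable. This is a sketch, assuming the notebook kernel creates the SparkContext itself and that the variable is set before that happens; I have not tested it against the exact setup in the question:

```python
import os

# Assumption: no SparkContext (and therefore no JVM) exists yet in this kernel.
packages = "graphframes:graphframes:0.3.0-spark2.0-s_2.11"

# The trailing "pyspark-shell" token is required when setting this variable.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(packages)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Create the SparkSession or SparkContext only after this cell has run, otherwise the setting is ignored.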