Issue
I have Spark 3.2 and Vertica 9.2.
spark = SparkSession.builder.appName("Ukraine").master("local[*]")\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar')\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar')\
.getOrCreate()
table = "test"
db = "myDB"
user = "myUser"
password = "myPassword"
host = "myVerticaHost"
part = "12"
opt = {"host": host, "table": table, "db": db, "numPartitions": part, "user": user, "password": password}
df = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()
gives
Py4JJavaError: An error occurred while calling o77.load.
: java.lang.ClassNotFoundException:
Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html
~/shivamenv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/shivamenv/venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Before this step, I used wget to download the two jars (the ones referenced in the SparkSession config) into the Spark jars folder. I obtained them from
https://libraries.io/maven/com.vertica.spark:vertica-spark and https://www.vertica.com/download/vertica/client-drivers/
I'm not sure what I'm doing wrong here. Is there an alternative to the spark.jars option?
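One detail worth noting about the snippet above: chaining .config("spark.jars", ...) twice keeps only the last value, so both jars are normally passed as a single comma-separated list. A minimal sketch with the same local paths, for illustration only:

spark = SparkSession.builder.appName("Ukraine").master("local[*]") \
    .config("spark.jars",
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar,"
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar") \
    .getOrCreate()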
In the link below, they mention:
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib. The JDBC client library is /opt/vertica/java/vertica-jdbc.jar
Should I try to replace the jars in my local folder with these?
Solution
There is no need to replace the local folder jars. Once you copy them to the Spark cluster, you would run the spark-shell command with the following options; please find a sample example below. As a side note, Vertica officially supports only Spark 2.x with Vertica 9.2. I hope this helps.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar
date 18:26:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://dholmes14:4040
Spark context available as 'sc' (master = local[*], app id = local-1597170403068).
Spark session available as 'spark'.
Welcome to Spark version 2.4.6
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated. Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.storage._
val df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").option("host", "").option("port", 5433).option("db", "").option("user", "dbadmin").option("dbschema", "").option("table", "").option("numPartitions", 3).option("LogLevel", "DEBUG").load()
val df2 = df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count()
spark.time(df2.show())
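For reference, a rough PySpark equivalent of the Scala sample above (a sketch only; the host, database, credentials, schema, and table values are placeholders, and the jar names depend on the versions you actually downloaded):

# From the shell: start PySpark with both jars on the classpath
pyspark --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar

# Then, inside the PySpark shell (connection values below are placeholders):
opt = {"host": "myVerticaHost", "port": 5433, "db": "myDB",
       "user": "dbadmin", "password": "myPassword",
       "dbschema": "public", "table": "test", "numPartitions": 3}
df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()
df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count().show()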
Answered By - geekofgeeks