Issue
I have Spark 3.2 and Vertica 9.2.
spark = SparkSession.builder.appName("Ukraine").master("local[*]")\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar')\
.config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar')\
.getOrCreate()
table = "test"
db = "myDB"
user = "myUser"
password = "myPassword"
host = "myVerticaHost"
part = "12"
opt = {"host": host, "table": table, "db": db, "numPartitions": part, "user": user, "password": password}
df = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()
gives
Py4JJavaError: An error occurred while calling o77.load.
: java.lang.ClassNotFoundException:
Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html
~/shivamenv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/shivamenv/venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Before this step, I used wget to download the two jars (the ones referenced in the SparkSession config) into the Spark jars folder. I obtained them from
https://libraries.io/maven/com.vertica.spark:vertica-spark and https://www.vertica.com/download/vertica/client-drivers/
I'm not sure what I'm doing wrong here. Is there an alternative to the spark.jars option?
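One detail worth noting about the snippet above: chaining .config("spark.jars", ...) twice keeps only the last value, so both jars are normally passed as a single comma-separated list. A minimal sketch with the same local paths, for illustration only:

spark = SparkSession.builder.appName("Ukraine").master("local[*]") \
    .config("spark.jars",
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar,"
            "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar") \
    .getOrCreate()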
In the link below, they mention:
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib. The JDBC client library is /opt/vertica/java/vertica-jdbc.jar
Should I try to replace the jars in my local folder with these?
Solution
There is no need to replace the local folder jars. Once you copy them to the Spark cluster, you would run the spark-shell command with the following options; please find a sample example below. As a side note, Vertica officially supports only Spark 2.x with Vertica 9.2. I hope this helps.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar
date 18:26:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://dholmes14:4040
Spark context available as 'sc' (master = local[*], app id = local-1597170403068).
Spark session available as 'spark'.
Welcome to Spark version 2.4.6
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated. Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.storage._
val df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").option("host", "").option("port", 5433).option("db", "").option("user", "dbadmin").option("dbschema", "").option("table", "").option("numPartitions", 3).option("LogLevel", "DEBUG").load()
val df2 = df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count()
spark.time(df2.show())
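For reference, a rough PySpark equivalent of the Scala sample above (a sketch only; the host, database, credentials, schema, and table values are placeholders, and the jar names depend on the versions you actually downloaded):

# From the shell: start PySpark with both jars on the classpath
pyspark --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar

# Then, inside the PySpark shell (connection values below are placeholders):
opt = {"host": "myVerticaHost", "port": 5433, "db": "myDB",
       "user": "dbadmin", "password": "myPassword",
       "dbschema": "public", "table": "test", "numPartitions": 3}
df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()
df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count().show()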
Answered By - geekofgeeks