Issue
I have PersonBean which has City, Bday and a MetadataJson member variables.
I want to write partitioned data by bday and city. Partitioning by City and bday can be toggle on/off.
All works good if I am partitioning by both bday and city together. I can write MetadataJson in a text format.
But in cases where lets say City is toggled OFF, City is blank in my PersonBean(as expected) so I get an error -
org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 2 columns.;
When I write as CSV format, the same dataset, writes a blank 2nd column. Is there a way to remove the column for the writing as "text" format?
I don't want to create 3 separate beans for all combinations of partitions in my expected format.
1Bean- bday and MetadataJson
2Bean- City and MetadataJson
3Bean- bday and City and MetadataJson
JavaRDD<PersonBean> rowsrdd = jsc.parallelize(dataList);
SparkSession spark = new SparkSession(
JavaSparkContext.toSparkContext(jsc));
Dataset<Row> beanDataset = spark.createDataset(data.rdd(), Encoders.bean(PersonBean.class));;
String[] partitionColumns = new String[]{"City"}
beanDataset.write()
.partitionBy(partitionColumns)
.mode(SaveMode.Append)
.option("escape", "")
.option("quote", "")
.format("text")
.save("outputpath");
Solution
I used a "beanDataset.select("bday","MetadataJson") call before writing the bean. This way I can use same bean for different combinations of partitioning columns.
Answered By - blue01
Answer Checked By - Cary Denson (JavaFixing Admin)