Issue
I have created a Google Dataflow streaming job to read from Pub/Sub and insert into BigQuery. I am using the STREAMING_INSERTS API to insert JSON data into a BigQuery table. I am facing an insertion issue stating that the request size is more than the permissible limit of 10 MB. The Dataflow error is shown below. The size per record is 1-2 MB, and based on my understanding, Dataflow inserts streaming data as micro-batches, which is causing this error.
Could you please provide a resolution for this?
Error message from worker:
java.lang.RuntimeException: We have observed a row that is 24625273 bytes in size. BigQuery supports request sizes up to 10MB, and this row is too large. You may change your retry strategy to unblock this pipeline, and the row will be output as a failed insert.
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1088)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1242)
    at org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.flushRows(BatchedStreamingWrite.java:403)
    at org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.access$900(BatchedStreamingWrite.java:67)
The code snippet for the BigQuery insert is as follows:
// Relevant imports for this snippet:
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

.apply(
    "WriteSuccessfulRecords",
    BigQueryIO.writeTableRows()
        .withAutoSharding()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo()
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .to(options.getOutputTableSpec()));
Solution
Your BigQuery write operation is running into a known limitation of the BigQuery streaming inserts API: each request sent from Dataflow to BigQuery via streaming inserts must be less than 10 MB in size.
Dataflow tries to keep batches under this limit, but a single row larger than 10 MB cannot be split into smaller requests, so the write fails. The error above reports a single row of 24625273 bytes (about 24 MB), which means that even if most of your records are 1-2 MB, at least one row exceeds the limit on its own. Can that be the case?
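As the error message suggests, one way to unblock the pipeline is to divert oversized rows to a dead-letter output before the write ever sees them. The sketch below is illustrative and not from the original answer: the rows PCollection, the tag names, and the 9 MB safety threshold are all assumptions.

// Hypothetical imports for the sketch below.
import java.nio.charset.StandardCharsets;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Route rows whose approximate encoded size exceeds a safety threshold to a
// side output instead of sending them to streaming inserts.
final TupleTag<TableRow> validRowsTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> oversizedRowsTag = new TupleTag<TableRow>() {};

PCollectionTuple split =
    rows.apply(
        "FilterOversizedRows",
        ParDo.of(
                new DoFn<TableRow, TableRow>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    TableRow row = c.element();
                    // toString() renders the row as JSON; this is only a rough
                    // proxy for the serialized size of the insertAll payload.
                    long approxBytes =
                        row.toString().getBytes(StandardCharsets.UTF_8).length;
                    if (approxBytes < 9L * 1024 * 1024) {
                      c.output(row); // main output: goes to validRowsTag
                    } else {
                      c.output(oversizedRowsTag, row);
                    }
                  }
                })
            .withOutputTags(validRowsTag, TupleTagList.of(oversizedRowsTag)));

// split.get(validRowsTag) then feeds the BigQueryIO write above, while
// split.get(oversizedRowsTag) can go to a dead-letter sink for inspection.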
Another option is to use the BigQuery file loads-based write method (FILE_LOADS) instead of streaming inserts, since load jobs are not subject to the 10 MB streaming request limit.
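For reference, here is a minimal sketch of the same write switched to file loads, assuming a streaming pipeline; the triggering frequency and shard count are placeholder values, not recommendations.

// Additional import needed: org.joda.time.Duration
.apply(
    "WriteSuccessfulRecords",
    BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // In a streaming pipeline, FILE_LOADS requires a triggering
        // frequency: how often buffered rows are flushed via a load job.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Streaming file loads also need a shard count (or auto-sharding
        // in newer Beam versions).
        .withNumFileShards(10)
        .to(options.getOutputTableSpec()));

Note that withFailedInsertRetryPolicy and withExtendedErrorInfo apply only to streaming inserts, so they are dropped here. Load jobs have their own quotas, but not the 10 MB per-request cap.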
The Storage Write API-based write method unfortunately has the same per-request size limitation at the moment.
Answered By - chamikara