Issue
I am using below code to get the data from BigQuery.
public class BQTEST {
public static void main(String... args) throws Exception {
String datasetName = "mydataset";
String tableName = "mytable";
String projectId = "gcs";
String query =
"SELECT id, " +
"qtr, " +
"sales, " +
"year " +
"FROM `gcs.mydataset.mytable` " +
;
BigQuery bigquery = BigQueryOptions.newBuilder().setProjectId(projectId)
.setCredentials(
ServiceAccountCredentials.fromStream(new
FileInputStream("20a3c78f8388.json"))
)
.build().getService();
TableId tableId = TableId.of(projectId, datasetName, tableName);
QueryJobConfiguration queryConfig = QueryJobConfiguration
.newBuilder(query)
.build();
try {
bigquery.query(queryConfig);
} catch (InterruptedException e) {
e.printStackTrace();
throw new RuntimeException(e.getMessage());
}
TableResult results = bigquery.listTableData(
tableId,
BigQuery.TableDataListOption.pageSize(1),
BigQuery.TableDataListOption.startIndex(5)
);
for (FieldValueList row : results.iterateAll()) {
System.out.printf(
"ID: %s qtr: %s sales: %s year: %s\n", row.get(0).getValue(), row.get(1).getValue(), row.get(2).getValue(), row.get(3).getValue());
}
}
}
BigQuery.TableDataListOption.startIndex(5)
-- This will help to read data from 5th index, index starts from 0.
But I want to read data in chunks like the first 10 records in one process, then the next 20 records in another process and etc.
Trying to read data from a very big table in a parallel process. It's a Truncate/Load table. So no partitions and date range to read data.
I am not able to find a way to give LAST_INDEX
value in any of the BigQuery Methods.
Could someone help with this?
Solution
There is no LAST_INDEX
method you can you for the pagination, as you can check in the documentation.
About your request:
I want to read data in chunks like the first 10 records in one process, then the next 20 records in another process and etc.,
Using python you can use some parameters as max_results
and start_index
to perform it, but in java the only way will be paginating on your query, and change it for each process. So for each process in parallel you will have a different query.
So, each process will have to:
- Order by some field (or by all the fields) to guarantee every query will return the data in the same order
- Paginate using
limit
andoffset
:
i.e:
String query = "SELECT id, qtr, sales, year FROM `gcs.mydataset.mytable` " +
"ORDER BY id, qtr, sales, year " +
"LIMIT 10 " +
"OFFSET " + String.valueOf(process_number * 10)
;
Process 0 will have lines 0-9 (limit 10 offset 0);
Process 1 will have lines 10-19 (limit 10 offset 10)
Answered By - ewertonvsilva