Wednesday, January 12, 2022

[FIXED] BigQuery Select data based on Start Index and Last Index

January 12, 2022 google-bigquery, google-cloud-platform, java

Issue

I am using below code to get the data from BigQuery.

public class BQTEST {
    public static void main(String... args) throws Exception {
        String datasetName = "mydataset";
        String tableName = "mytable";
        String projectId = "gcs";
        String query =
                "SELECT id, " +
                        "qtr, " +
                        "sales, " +
                        "year " +
                        "FROM `gcs.mydataset.mytable` " +
                        ;
        BigQuery bigquery = BigQueryOptions.newBuilder().setProjectId(projectId)
                .setCredentials(
                        ServiceAccountCredentials.fromStream(new
                                FileInputStream("20a3c78f8388.json"))
                )
                .build().getService();
        TableId tableId = TableId.of(projectId, datasetName, tableName);
        QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder(query)
                .build();
        try {
            bigquery.query(queryConfig);

        } catch (InterruptedException e) {
            e.printStackTrace();
            throw new RuntimeException(e.getMessage());
        }
        TableResult results = bigquery.listTableData(
                tableId,
                BigQuery.TableDataListOption.pageSize(1),
                BigQuery.TableDataListOption.startIndex(5)
        );

        for (FieldValueList row : results.iterateAll()) {
            System.out.printf(
                    "ID: %s qtr: %s sales: %s year: %s\n", row.get(0).getValue(), row.get(1).getValue(), row.get(2).getValue(), row.get(3).getValue());
        }


    }
}

BigQuery.TableDataListOption.startIndex(5) -- This will help to read data from 5th index, index starts from 0.

But I want to read data in chunks like the first 10 records in one process, then the next 20 records in another process and etc.

Trying to read data from a very big table in a parallel process. It's a Truncate/Load table. So no partitions and date range to read data.

I am not able to find a way to give LAST_INDEX value in any of the BigQuery Methods.

Could someone help with this?

Solution

There is no LAST_INDEX method you can you for the pagination, as you can check in the documentation.

About your request:

I want to read data in chunks like the first 10 records in one process, then the next 20 records in another process and etc.,

Using python you can use some parameters as max_results and start_index to perform it, but in java the only way will be paginating on your query, and change it for each process. So for each process in parallel you will have a different query.

So, each process will have to:

Order by some field (or by all the fields) to guarantee every query will return the data in the same order
Paginate using limit and offset:

i.e:

String query = "SELECT id, qtr, sales, year FROM `gcs.mydataset.mytable` " +
               "ORDER BY id, qtr, sales, year " +
               "LIMIT 10 " +
               "OFFSET " + String.valueOf(process_number * 10)
;

Process 0 will have lines 0-9 (limit 10 offset 0);
Process 1 will have lines 10-19 (limit 10 offset 10)

Answered By - ewertonvsilva

This Answer collected from stackoverflow and tested by JavaFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, January 12, 2022

[FIXED] BigQuery Select data based on Start Index and Last Index

Issue

Solution

Popular Posts

Labels