
BigQuery: max_results is ignored if bqstorage_client is used in to_dataframe or to_arrow #9174


Closed
tswast opened this issue Sep 4, 2019 · 3 comments · Fixed by #9178

Comments

@tswast
Contributor

tswast commented Sep 4, 2019

Steps to reproduce

  1. Call list_rows with max_results set.
  2. Call to_dataframe or to_arrow.
  3. Observe that more rows were returned than were requested.

Code example

from google.cloud import bigquery
from google.cloud import bigquery_storage

bqclient = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryStorageClient()

# Download with the tabledata.list API (no BQ Storage client): max_results is honored.
df_tabledata_list = bqclient.list_rows(
    "bigquery-public-data.utility_us.country_code_iso",
    selected_fields=[bigquery.SchemaField("country_name", "STRING")],
    max_results=100,
).to_dataframe()
print("tabledata.list: {} rows".format(len(df_tabledata_list.index)))

# Download with the BQ Storage API: max_results is silently ignored.
df_bqstorage = bqclient.list_rows(
    "bigquery-public-data.utility_us.country_code_iso",
    selected_fields=[bigquery.SchemaField("country_name", "STRING")],
    max_results=100,
).to_dataframe(bqstorage_client=bqstorage_client)
print("bqstorage: {} rows".format(len(df_bqstorage.index)))

Output

tabledata.list: 100 rows
bqstorage: 278 rows

Possible fixes

  1. (Harder) Keep track of how many rows you've downloaded in a BQ Storage session so far. Once you've downloaded enough rows, close all streams (is this even possible?).
  2. (Easier, but acceptable) If max_results is set, always download data with tabledata.list.

I think we should implement fix (2): when max_results is set, it's unlikely that we are downloading enough rows for the BQ Storage API to be worthwhile.
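A minimal sketch of what fix (2) could look like inside to_dataframe; the helper names (_to_dataframe_bqstorage, _to_dataframe_tabledata_list) and the max_results attribute are illustrative assumptions, not the library's actual internals:

# Hypothetical sketch of fix (2); the helper names below are assumptions,
# not the real google-cloud-bigquery internals.
def to_dataframe(self, bqstorage_client=None, dtypes=None):
    if bqstorage_client is not None and self.max_results is not None:
        # max_results cannot be enforced over BQ Storage read streams,
        # so fall back to the tabledata.list download path.
        bqstorage_client = None

    if bqstorage_client is not None:
        return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
    return self._to_dataframe_tabledata_list(dtypes)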

@tswast tswast added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. api: bigquerystorage Issues related to the BigQuery Storage API. labels Sep 4, 2019
@plamut
Contributor

plamut commented Sep 4, 2019

@tswast After implementing the second option, would it make sense to spend some time researching option 1? It might turn out that streams can reliably be closed at the right time.

@tswast
Contributor Author

tswast commented Sep 4, 2019

After implementing the second option, would it make sense to spend some time researching option 1?

I worry that the additional logic of closing streams may be more complicated than it's worth, but it's worth investigating to see if it's possible.

@plamut
Contributor

plamut commented Sep 8, 2019

There is the FinalizeStream method, but it is not a silver bullet:

  • Finalizing a stream does not delete it; it just ensures that no additional data is assigned to the stream. As per the docs, the client must still keep reading the stream to process any data that has already been allocated to it.
  • While we could read that additional data and discard it, users would then (I presume) be charged for more rows / data than they requested. Additionally, even a single response block might contain enough rows to exceed the max_results limit, again making the cost higher than expected.
  • Ignoring any remaining data on the stream (by stopping reading from it) could leak resources on the backend, especially if done excessively by the client. We probably don't want to take this route anyway.

TL;DR: trying to mimic max_results on the client side seems like it could get us into more trouble than it's worth.
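For illustration only, a rough sketch of what fix (1) would entail on the client side; read_blocks() and finalize() below are hypothetical stand-ins for the actual read / FinalizeStream calls, not real library APIs:

# Purely illustrative; read_blocks() and finalize() are hypothetical
# stand-ins for the real BQ Storage read and FinalizeStream calls.
def collect_rows(streams, max_results):
    rows, discarded = [], 0
    for stream in streams:
        finalized = False
        for block in read_blocks(stream):
            if len(rows) < max_results:
                rows.extend(block.rows)
                if len(rows) >= max_results and not finalized:
                    # Stop new data from being assigned to this stream...
                    finalize(stream)
                    finalized = True
            else:
                # ...but blocks already allocated must still be read; their
                # rows are simply thrown away (and still paid for).
                discarded += len(block.rows)
    return rows[:max_results], discarded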
