Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

scimas · 2024-09-09T20:39:28Z

Describe the bug

Exactly as stated in the title. If you're querying Athena using a qmark style query and use the query cache, and you change the actual parameter values between two query runs, aws-sdk-pandas/awswrangler retrieves the result of the previous query run with incorrect parameter values.

How to Reproduce

import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id1", "id2"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df)

df2 = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id3", "id4"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df2)

this with a database and table should be enough to reproduce the issue.

Expected behavior

Since the parameters to the query have changed, a fresh query should be run.

Your project

No response

Screenshots

No response

OS

Amazon Linux 2023

Python version

3.12

AWS SDK for pandas version

layer:AWSSDKPandas-Python312-Arm64:12

Additional context

My guess is that this is happening due to aws-sdk-pandas using a hand-rolled cache implementation rather than just letting Athena handle the cache through the ResultReuseConfiguration of StartQueryExecution. As far as I know, Athena does not populate the query parameter values in the GetQueryExecution output. So you get the parametrized query without the parameter values and check the current query against it, which is incorrect.

The text was updated successfully, but these errors were encountered:

scimas added the bug Something isn't working label Sep 9, 2024

LeonLuttenberger linked a pull request Sep 9, 2024 that will close this issue

fix: Athena read_sql_query cache errors for qmark style parametrized queries #2957

Draft

LeonLuttenberger self-assigned this Sep 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

scimas commented Sep 9, 2024

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

Comments

scimas commented Sep 9, 2024

Describe the bug

How to Reproduce

Expected behavior

Your project

Screenshots

OS

Python version

AWS SDK for pandas version

Additional context