Iceberg commit errors thrown when using overwrite partition somehow fails to clean up temp table #2925

Open
snakingfire opened this issue Aug 7, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@snakingfire

Describe the bug

When concurrent processes attempt to update the same Iceberg table and one hits an ICEBERG_COMMIT_ERROR, the temp table created by delete_from_iceberg_table sometimes fails to get cleaned up. Reading through the source code, I don't see a clear path for how this would be possible, since the code seems to correctly catch the exception and clean up in a finally block, but the temp tables are nonetheless getting left behind.
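To make the cleanup pattern described above concrete, here is a minimal sketch of a create/use/drop flow with cleanup in a finally block. It is not awswrangler's actual implementation: `temp_resource`, `create`, and `drop` are hypothetical names used only for illustration. The point it shows is that a finally block guarantees the drop is attempted, not that it succeeds, so logging the outcome of the drop makes a leaked table visible.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterator

logger = logging.getLogger(__name__)


@contextmanager
def temp_resource(create: Callable[[], str], drop: Callable[[str], bool]) -> Iterator[str]:
    """Create a temporary resource and always attempt to drop it afterwards.

    `create` returns the resource name; `drop` returns True if something was
    actually deleted. Both are hypothetical callables supplied by the caller.
    """
    name = create()
    try:
        yield name
    finally:
        # This runs even when the body raises, but the drop itself can still
        # report "nothing deleted" or raise. Log both cases so that a leaked
        # resource is not silent, and let any original exception propagate.
        try:
            if not drop(name):
                logger.warning("temp resource %s was not dropped", name)
        except Exception:
            logger.exception("cleanup of temp resource %s failed", name)
```

If the drop call returns False or raises for a table that does exist, the log line gives a timestamp to correlate with the commit error.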

I have not been able to create a reliable minimal code sample that replicates this behavior consistently, but in production we occasionally hit commit errors and accumulate temp tables in the Glue catalog as a result:

awswrangler.exceptions.QueryFailed: ICEBERG_COMMIT_ERROR: Failed to commit Iceberg update to table
[all log lines below are timestamped 2024-08-07 08:06:24]
    wr.athena.to_iceberg(**write_args)
  File "/usr/local/lib/python3.11/site-packages/awswrangler/_config.py", line 715, in wrapper
    return function(**args)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/awswrangler/_utils.py", line 178, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/awswrangler/athena/_write_iceberg.py", line 452, in to_iceberg
    delete_from_iceberg_table(
  File "/usr/local/lib/python3.11/site-packages/awswrangler/_config.py", line 715, in wrapper
    return function(**args)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/awswrangler/_utils.py", line 178, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/awswrangler/athena/_write_iceberg.py", line 680, in delete_from_iceberg_table
    wait_query(query_execution_id=query_execution_id, boto3_session=boto3_session)
  File "/usr/local/lib/python3.11/site-packages/awswrangler/_config.py", line 715, in wrapper
    return function(**args)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/awswrangler/athena/_executions.py", line 237, in wait_query
    raise exceptions.QueryFailed(response["Status"].get("StateChangeReason"))
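Until the root cause is found, the temp tables that accumulate in the Glue catalog could be swept periodically. The sketch below only selects candidates; it assumes (not confirmed in this thread) that the leftover tables share a name prefix such as `temp_table_` and that each table entry is a dict with `Name` and `CreateTime` keys, as in the Glue table listings that `wr.catalog.get_tables` yields. `find_stale_temp_tables` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List, Optional


def find_stale_temp_tables(
    tables: Iterable[Dict[str, Any]],
    prefix: str = "temp_table_",  # assumed naming convention, verify before use
    min_age: timedelta = timedelta(hours=1),
    now: Optional[datetime] = None,
) -> List[str]:
    """Return names of tables that match the prefix and are older than min_age."""
    now = now or datetime.now(timezone.utc)
    stale: List[str] = []
    for tbl in tables:
        name = tbl.get("Name", "")
        created = tbl.get("CreateTime")
        # Skip tables that do not look like temp tables or carry no timestamp.
        if not name.startswith(prefix) or created is None:
            continue
        # Treat naive timestamps as UTC so the subtraction below is valid.
        if created.tzinfo is None:
            created = created.replace(tzinfo=timezone.utc)
        # The age threshold avoids deleting a temp table of a write in flight.
        if now - created >= min_age:
            stale.append(name)
    return stale
```

The returned names could then be passed one at a time to `wr.catalog.delete_table_if_exists(database=..., table=name)`; the matching temp_path data in S3 would need separate cleanup.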

How to Reproduce

Broadly:
Have two processes call to_iceberg on the same catalog table simultaneously, with write params similar to:

{
    "df": to_write.copy(),
    "mode": "overwrite_partitions",
    "partition_cols": self.partition_cols,
    "database": self.glue_database_name,
    "table": self.table_name,
    "table_location": self.table_s3_path(),
    "temp_path": tmp_table_path,
    "boto3_session": self.boto3_client_wrapper.get_session(),
    "schema_evolution": True,
    "fill_missing_columns_in_df": True,
    "keep_files": False, 
}

If a commit error occurs, sometimes a temp table is left behind.
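A small harness along these lines might help when trying to trigger the race. It is only a sketch: `run_concurrent_writes` is a hypothetical helper, and it uses threads in one process, which may not reproduce exactly what separate production processes do. The write function is injected so the harness itself has no AWS dependency.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional


def run_concurrent_writes(
    write_fn: Callable[..., Any],
    write_args_list: List[Dict[str, Any]],
) -> List[Optional[BaseException]]:
    """Call write_fn(**kwargs) once per kwargs dict, all in parallel.

    Returns one entry per call, in input order: None on success, otherwise
    the exception that call raised.
    """

    def _call(kwargs: Dict[str, Any]) -> Optional[BaseException]:
        try:
            write_fn(**kwargs)
            return None
        except Exception as exc:  # collect instead of aborting the other writers
            return exc

    # One worker per write so that all calls start at roughly the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(write_args_list))) as pool:
        return list(pool.map(_call, write_args_list))
```

Usage would be something like `run_concurrent_writes(wr.athena.to_iceberg, [write_args_a, write_args_b])`, where each dict follows the write parameters above and targets the same table with a different temp_path. After each run, the Glue catalog can be checked for leftover temp tables.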

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.11

AWS SDK for pandas version

3.7.3

Additional context

No response

snakingfire added the bug label Aug 7, 2024
@jaidisido
Contributor

This seems similar to #2826; it looks like it was hard to replicate back then too.

@snakingfire
Author

Interesting, yes, it seems very similar. It makes me wonder whether there is a consistency issue with the Glue API, such that deleting an existing table can fail with "table not found" if the delete happens too soon after the table was created. I can't find anything in the docs specific to that, though.
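If that consistency hypothesis holds (it is unconfirmed), one possible mitigation is to verify the delete and retry with backoff. The sketch below is an assumption-laden illustration: `delete_with_retry` is a hypothetical helper, and both callables are injected. It would not help if the existence check is subject to the same lag as the delete.

```python
import time
from typing import Callable


def delete_with_retry(
    delete_fn: Callable[[], object],
    exists_fn: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Delete, then confirm the table is gone; retry with exponential backoff.

    Returns True once exists_fn() reports the table is gone, or False if it
    still exists after all attempts.
    """
    for attempt in range(attempts):
        delete_fn()
        if not exists_fn():
            return True
        # Back off before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

With awswrangler, the callables could be `lambda: wr.catalog.delete_table_if_exists(database=db, table=tmp)` and `lambda: wr.catalog.does_table_exist(database=db, table=tmp)`, where `db` and `tmp` are placeholders for the database and temp table names.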
