athena.to_iceberg fails when mode=overwrite_partitions and partition_cols contains something like hour(timestamp_col). #2845

Open
useredsa opened this issue Jun 4, 2024 · 3 comments
Labels
backlog bug Something isn't working

Comments

@useredsa

useredsa commented Jun 4, 2024

Describe the bug

When using athena.to_iceberg with mode="overwrite_partitions" to update an Iceberg table that is partitioned by a time interval or a timestamp "attribute" (such as year, month, hour, etc.), the function fails. For this mode, the implementation assumes that the values of partition_cols are names of table columns, so it does not find something like hour(column) among the dataframe columns.

I think the problem is this line, which uses the function delete_from_iceberg_table, which expects column names.
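
To illustrate the mismatch described above, here is a minimal sketch. It is not the library's code: `missing_partition_columns` and the column names are hypothetical stand-ins, and the only assumption is that the overwrite path looks each `partition_cols` entry up as a literal column name.

```python
# Hypothetical stand-in for the lookup; this is NOT awswrangler code.
def missing_partition_columns(df_columns, partition_cols):
    """Return the partition_cols entries that are not literal column names."""
    known = set(df_columns)
    return [col for col in partition_cols if col not in known]


# Example (made-up) dataframe columns and a transform-style partition spec.
df_columns = ["id", "event_time", "value"]
partition_cols = ["hour(event_time)"]

# "hour(event_time)" is a transform expression, not a column name,
# so a plain name lookup reports it as missing.
missing = missing_partition_columns(df_columns, partition_cols)
print(missing)
```

A plain column such as `event_time` passes the same check, which would explain why only transform expressions trigger the failure.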

How to Reproduce

Expected behavior

I expect the partition_cols option to accept anything that can be used to partition the table. In particular, it should accept anything that is accepted when mode is append or overwrite instead of overwrite_partitions.
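
As a rough sketch of what accepting such values could involve, the snippet below splits a partition expression into its transform and its source column, so that the source column can be looked up in the dataframe. All names here (`PartitionField`, `parse_partition_field`, `source_columns`) are hypothetical, and the set of transforms is an assumption based on the Iceberg partition transforms Athena documents (year, month, day, hour, bucket, truncate). It does not address how the matching partitions would then be deleted.

```python
import re
from typing import List, NamedTuple, Optional


class PartitionField(NamedTuple):
    column: str               # source column the partition is derived from
    transform: Optional[str]  # e.g. "hour"; None for a plain column
    arg: Optional[int]        # e.g. 16 in bucket(16, id); None otherwise


# Matches forms like hour(ts), bucket(16, id), truncate(10, name).
_TRANSFORM_RE = re.compile(
    r"^\s*(?P<name>year|month|day|hour|bucket|truncate)\s*\(\s*"
    r"(?:(?P<arg>\d+)\s*,\s*)?(?P<column>\w+)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_partition_field(expr: str) -> PartitionField:
    """Split a partition expression into column, transform, and argument."""
    match = _TRANSFORM_RE.match(expr)
    if match is None:
        # Not a recognized transform: treat the value as a plain column name.
        return PartitionField(column=expr.strip(), transform=None, arg=None)
    arg = match.group("arg")
    return PartitionField(
        column=match.group("column"),
        transform=match.group("name").lower(),
        arg=int(arg) if arg is not None else None,
    )


def source_columns(partition_cols: List[str]) -> List[str]:
    """Return the underlying column name for every partition expression."""
    return [parse_partition_field(expr).column for expr in partition_cols]


print(source_columns(["hour(event_time)", "bucket(16, id)", "category"]))
```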

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response

@LeonLuttenberger
Contributor

Hey,

Unfortunately, because this implementation of to_iceberg relies on a mix of pandas operations and Athena queries, we can't currently support using a partition transform function with mode="overwrite_partitions". However, we are exploring other APIs for refactoring to_iceberg, such as PyIceberg or other AWS Glue APIs, which would allow us to support this in the future.


Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

@useredsa
Author

Hi @LeonLuttenberger,

I understand that it would be difficult to solve the issue, but should it be closed due to inactivity while it is still unsolved?
