Ability to add metadata to parquet/orc schemas directly #2901

Open
walter9388 opened this issue Jul 18, 2024 · 1 comment

@walter9388

Is your feature request related to a problem? Please describe.
There is a new requirement at my workplace to embed metadata directly in all our cloud data (this is due to the need to move data between different hosting solutions). This means that our fallback data formats have become Avro/Parquet, since both let you attach metadata directly to the schema. However, there is currently no direct way to do this using the s3.to_parquet function, so I wonder if it is possible to add this capability?

Just FYI, I think the s3.to_parquet functionality is brilliant and saves a lot of effort when creating Glue tables with partitions etc., so I would really like to keep using it in our workflows rather than writing custom boto3/pyarrow logic.

Describe the solution you'd like
Extra metadata can be added to the parquet schema using the metadata parameter of pa.schema (https://arrow.apache.org/docs/python/generated/pyarrow.schema.html).
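
For context, a minimal sketch of what this looks like in plain pyarrow (the file name, field names and metadata values are just placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# Attach key/value metadata at the schema level; pyarrow persists it in the Parquet file footer
schema = pa.schema(
    [pa.field("id", pa.int64()), pa.field("value", pa.string())],
    metadata={"owner": "data-platform", "classification": "internal"},
)
table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int64()), pa.array(["a", "b"], type=pa.string())],
    schema=schema,
)
pq.write_table(table, "example.parquet")

# The custom keys come back (as bytes) when reading the schema from the file
print(pq.read_schema("example.parquet").metadata)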

Currently, the pyarrow schema is created in the write method of _S3WriteStrategy via the _data_types.pyarrow_schema_from_pandas function, which looks approximately like this:

def pyarrow_schema_from_pandas(
    df: pd.DataFrame,
    index: bool,
    ignore_cols: list[str] | None = None,
    dtype: dict[str, str] | None = None
) -> pa.Schema:
    ...
    return pa.schema(fields=columns_types)

What I propose is that we add a new metadata key to the existing pyarrow_additional_kwargs dictionary. This avoids any changes to the API, so only a minor version bump would be needed.
The same pyarrow_additional_kwargs argument in the s3.to_orc function would also give ORC files this capability.

From there, the metadata can be extracted and validated in the _S3WriteStrategy class (or separately in the _S3ParquetWriteStrategy/_S3ORCWriteStrategy child classes if the two formats have different metadata constraints; I haven't researched this part yet). We can then pass the metadata to an amended _data_types.pyarrow_schema_from_pandas function:

def pyarrow_schema_from_pandas(
    df: pd.DataFrame,
    index: bool,
+   metadata: dict[str, str],
    ignore_cols: list[str] | None = None,
    dtype: dict[str, str] | None = None
) -> pa.Schema:
    ...
-   return pa.schema(fields=columns_types)
+   return pa.schema(fields=columns_types, metadata=metadata)
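
To illustrate, a call could then look roughly like the following (a hypothetical sketch: the bucket path and metadata values are placeholders, and the "metadata" key is the proposed behaviour, not something that exists today):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Proposed (not yet implemented): "metadata" would be popped out of pyarrow_additional_kwargs
# and forwarded to pyarrow_schema_from_pandas rather than to pyarrow.parquet.ParquetWriter
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",
    dataset=True,
    pyarrow_additional_kwargs={"metadata": {"owner": "data-platform"}},
)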

Describe alternatives you've considered
After digging into the code a bit more, I can see that you can attach your own schema directly via pyarrow_additional_kwargs, which then overwrites the schema made by awswrangler here (sketched below).
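
A minimal sketch of that workaround, assuming the schema key is forwarded as described above (the bucket path and metadata values are placeholders):

import awswrangler as wr
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Build the schema yourself, attach the metadata, and let it override the generated schema
schema = pa.Schema.from_pandas(df, preserve_index=False).with_metadata({"owner": "data-platform"})
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",
    pyarrow_additional_kwargs={"schema": schema},
)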

However, I would still argue that there is a need for the feature described above: I want awswrangler to build the schema for me, and there should be a way to simply pass a dictionary of file metadata to the schema generator function.
Maybe pyarrow_additional_kwargs isn't the best place for it, though, as I can see it is expanded directly into pyarrow.parquet.ParquetWriter, so the metadata key would have to be popped out of the dictionary before that point (a rough sketch of the pop follows).
Let me know your thoughts.
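
For concreteness, a rough sketch of how that pop could work (purely illustrative; the helper name is made up):

def _pop_schema_metadata(
    pyarrow_additional_kwargs: dict | None,
) -> tuple[dict[str, str] | None, dict]:
    """Split the proposed "metadata" key from the kwargs that still go to ParquetWriter."""
    kwargs = dict(pyarrow_additional_kwargs or {})
    return kwargs.pop("metadata", None), kwargs


# The metadata would feed pyarrow_schema_from_pandas; the remaining kwargs go to ParquetWriter
metadata, writer_kwargs = _pop_schema_metadata(
    {"metadata": {"owner": "data-platform"}, "coerce_timestamps": "ms"}
)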

Additional considerations
I know that there are several other functions in this library for handling parquet/orc metadata (e.g. read_parquet_metadata, read_orc_metadata, store_parquet_metadata), so we would need to check that these still work correctly. I would expect them to be fine, though, as they are designed to work with the parquet/orc specifications.

I am willing to submit a PR for this feature if approved.

@jaidisido
Contributor

@walter9388, contributions are always welcome. We can discuss on your PR whether pyarrow_additional_kwargs is indeed the best input argument to hold metadata.
