[Feature Store] Feature group creation: provide a DataCatalogConfig while enabling glue table creation #2916

Closed
simonvdk opened this issue Feb 8, 2022 · 7 comments
Labels: component: feature store (Relates to the SageMaker Feature Store Platform), type: feature request

Comments

@simonvdk
simonvdk commented Feb 8, 2022

Use case
Create a feature group with automatic glue table creation for the offline store metadata, while configuring the glue data catalog database and table names

Issue encountered
It seems that providing a DataCatalogConfig and setting disable_glue_table_creation to false are mutually exclusive:

  • I can either not configure the glue database and table names and enable the glue table creation, so that the glue table with the default name and database is created upon feature group creation
  • OR I can provide a DataCatalogConfig but then I have to disable the glue table creation, so that the requested glue table is not created upon feature group creation

But I cannot provide a DataCatalogConfig and enable the glue table creation. Error encountered:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: DataCatalogConfig is not permitted in the request unless AutoCreateGlueTable is turned off. Please either set AutoCreateGlueTable to false or remove DataCatalogConfig from the request.
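
The rule behind this error can be written down as a small client-side pre-check. This is only a sketch of the behaviour observed in this issue, not the service's actual validation logic; the function name `check_offline_store_config` is made up for illustration, and it assumes that `DisableGlueTableCreation` defaults to false when omitted.

```python
from typing import Optional


def check_offline_store_config(offline_store_config: dict) -> Optional[str]:
    """Return a message if the config would hit the rule above, else None.

    Hypothetical helper: mirrors the behaviour reported in this issue only.
    """
    has_catalog = "DataCatalogConfig" in offline_store_config
    # Assumption: auto-creation is on unless DisableGlueTableCreation is True.
    auto_create = not offline_store_config.get("DisableGlueTableCreation", False)
    if has_catalog and auto_create:
        return (
            "DataCatalogConfig requires DisableGlueTableCreation=True; "
            "disable Glue table creation or drop DataCatalogConfig."
        )
    return None
```

Running this check before calling CreateFeatureGroup would surface the conflict locally instead of as a ValidationException.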

Why this seems to be an issue:

  • this behaviour (mutually exclusive) is not mentioned in the documentation. Also, there is no further mention or example of how to configure the offline store data catalog in the documentation
  • given the current state of the documentation, a user may want to configure the name of the glue database and table where the offline store metadata will be stored, while benefiting from the glue table creation upon feature group creation (with all the configuration - schema, storage descriptor etc - coming from the feature group information)
  • this extract from the java SDK documentation seems to indicate that the DataCatalogConfig should not be mutually exclusive with the automatic table creation

Ways to reproduce issue
Reproduced with AWS SDK (2.50.0) and AWS CLI.
Providing an OfflineStoreConfig with both DisableGlueTableCreation=False and a DataCatalogConfig that names an existing Glue database and a Glue table that does not yet exist raises the above error. Providing the same DataCatalogConfig with DisableGlueTableCreation=True raises no error, but the Glue table is not created either.

Example with AWS CLI:

aws sagemaker create-feature-group --cli-input-json '{"EventTimeFeatureName": "timestamp", "Description": "", "RecordIdentifierFeatureName": "record_id", "FeatureDefinitions": [{"FeatureName": "record_id", "FeatureType": "Integral"}, {"FeatureName": "timestamp", "FeatureType": "String"}], "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": "s3://my_bucket/my_prefix", "KmsKeyId": "arn:aws:kms:region:account_id:key/key_id"}, "DataCatalogConfig": {"TableName": "my_table", "Catalog": "account_id", "Database": "my_db"}, "DisableGlueTableCreation": false}, "FeatureGroupName": "my-feature-group"}'
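
For easier reproduction, the same payload can be built in Python rather than hand-written as one line of JSON. This is a minimal sketch that reuses the placeholder values from the CLI example; the function name is hypothetical.

```python
import json


def build_create_feature_group_request(disable_glue_table_creation: bool) -> dict:
    """Return the payload from the CLI example with the flag made explicit."""
    return {
        "FeatureGroupName": "my-feature-group",
        "RecordIdentifierFeatureName": "record_id",
        "EventTimeFeatureName": "timestamp",
        "Description": "",
        "FeatureDefinitions": [
            {"FeatureName": "record_id", "FeatureType": "Integral"},
            {"FeatureName": "timestamp", "FeatureType": "String"},
        ],
        "OfflineStoreConfig": {
            "S3StorageConfig": {
                "S3Uri": "s3://my_bucket/my_prefix",
                "KmsKeyId": "arn:aws:kms:region:account_id:key/key_id",
            },
            "DataCatalogConfig": {
                "TableName": "my_table",
                "Catalog": "account_id",
                "Database": "my_db",
            },
            "DisableGlueTableCreation": disable_glue_table_creation,
        },
    }


# False reproduces the ValidationException reported above; True is accepted,
# but no Glue table is created.
cli_input_json = json.dumps(build_create_feature_group_request(False))
```

The resulting string can be passed to `--cli-input-json`, or the dict can be passed as keyword arguments to the boto3 SageMaker client's `create_feature_group`.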

Expected output
Clearer documentation on how to configure the offline store data catalog (e.g. with an example in a notebook), and ideally the ability to configure the data catalog while still benefiting from automatic Glue table creation

NB: A similar issue has been opened on the aws-cli repository

@clausagerskov
clausagerskov commented Oct 26, 2022

so is this supported or not?

@psnilesh
Contributor

@clausagerskov Apologies for the delay.

> It seems that providing a DataCatalogConfig and setting disable_glue_table_creation to false are mutually exclusive:

Your conclusions are correct. Currently, we do not allow customers to provide DataCatalogConfig if they want Glue table to be auto-created. I'll discuss with the service team about the feasibility of supporting this.

In the meantime, we'll be updating both the API documentation and the notebook examples to make service expectations clear. Thank you for bringing this to our attention.

@brifordwylie
Contributor

brifordwylie commented Mar 19, 2023

Thanks @simonvdk for the explanation of details. I'm running into the same problem. Basically I want to specify my database and table when the Feature Group is created AND I want the Glue Catalog table created for me.

So currently what's the workaround?

  • Create the feature group
  • Manually create the Glue Catalog Table

If I manually create the glue table.. I'm guessing I can use the Feature Definitions within the Feature Group to 'help me' create the types for the glue table.

Here's pseudo code of how I might approach this.. I'm happy for someone to suggest an easier/better alternative. :)

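from sagemaker.feature_store.feature_group import FeatureGroup

# Pseudo code: the self.* attributes, s3_storage_path and my_config
# (a DataCatalogConfig) are defined elsewhere in the surrounding class.
my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)
my_feature_group.load_feature_definitions(data_frame=self.input_df)
my_feature_group.create(
    s3_uri=s3_storage_path,
    record_identifier_name=self.id_column,
    event_time_feature_name=self.event_time_column,
    role_arn=self.sageworks_role_arn,
    enable_online_store=True,
    data_catalog_config=my_config,
    # Must be True whenever data_catalog_config is provided
    disable_glue_table_creation=True,
)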

<grab the feature definitions>
my_feature_group.feature_definitions
Out[4]: 
[FeatureDefinition(feature_name='id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='name', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='age', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='score', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='date', feature_type=<FeatureTypeEnum.STRING: 'String'>)]

<take the above output, extract names, types, and fill in 'StorageDescriptor':  'Columns'>
<manually create Glue catalog table with boto3>

Reference: the boto3 Glue create_table docs

Yes? .. this seems like a lot of work just so that we can place Feature Groups where we want them...
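
The last two steps of the pseudo code above (turning feature definitions into 'StorageDescriptor': 'Columns' and creating the table with boto3) could look roughly like the following. This is a hedged sketch, not a verified recipe: the type mapping, the Parquet storage settings, and the helper names are assumptions, and the feature definitions are assumed to be plain dicts with `FeatureName` and `FeatureType` keys. It covers only the user-defined features; any extra columns or partitioning the service adds to an auto-created table are not reproduced.

```python
# Assumed mapping from Feature Store feature types to Glue/Hive column types.
GLUE_TYPE_BY_FEATURE_TYPE = {
    "Integral": "bigint",
    "Fractional": "double",
    "String": "string",
}


def glue_columns(feature_definitions):
    """Convert [{'FeatureName': ..., 'FeatureType': ...}] into Glue columns."""
    return [
        {"Name": d["FeatureName"], "Type": GLUE_TYPE_BY_FEATURE_TYPE[d["FeatureType"]]}
        for d in feature_definitions
    ]


def glue_table_input(table_name, location, feature_definitions):
    """Build a TableInput dict for glue.create_table (Parquet is assumed)."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": glue_columns(feature_definitions),
            "Location": location,  # S3 location of the offline store data
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def create_glue_table(glue_client, database, table_input):
    """Create the table; glue_client is expected to be boto3.client('glue')."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)
```

With the feature group from the example, `glue_columns` would map `id` and `age` to `bigint`, `score` to `double`, and `name` and `date` to `string`, under the assumed mapping.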

@toderesa97
We are facing something similar, but with the Iceberg table format. If we set:

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import DataCatalogConfig
from sagemaker.feature_store.inputs import TableFormatEnum

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-feature-store'

customers_df = pd.read_csv('.././data/transformed/customers.csv')
customers_feature_group_name = 'customers-whatever'
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
customers_feature_group.load_feature_definitions(data_frame=customers_df)

customers_feature_group.create(
    s3_uri=f's3://{default_bucket}/{prefix}', 
    record_identifier_name='customer_id', 
    event_time_feature_name='event_time', 
    role_arn=role, 
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,
    disable_glue_table_creation=True,
    data_catalog_config=DataCatalogConfig(
        catalog='AwsDataCatalog',
        database='dev_engineering_provisioned',
        table_name='customers_feature_group'
    )
)

we get:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: Iceberg table format is only supported when DisableGlueTableCreation is turned off. Please either set DisableGlueTableCreation to false or use Default table format.

And then, if we indeed turn it off (i.e., changing disable_glue_table_creation=True to disable_glue_table_creation=False), we end up with the error described in this issue:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: DataCatalogConfig is not permitted in the request unless AutoCreateGlueTable is turned off. Please either set AutoCreateGlueTable to false or remove DataCatalogConfig from the request.

Which is surprising, because according to the doc, the AutoCreateGlueTable parameter named in the error does not exist.

In our case, we don't want the table created in the default database with the default name, but in a specific database with a specific table name.
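
Taken together, the two error messages leave no valid value of disable_glue_table_creation for Iceberg plus a DataCatalogConfig. The sketch below encodes just those two observed rules to make the deadlock explicit; it is not the service's validation code, the function name is hypothetical, and "Iceberg" is used as a plain string stand-in for the table format.

```python
from typing import List


def rejection_reasons(
    table_format: str,
    has_data_catalog_config: bool,
    disable_glue_table_creation: bool,
) -> List[str]:
    """List which of the two rules seen in this thread a request would violate."""
    reasons = []
    # Rule from the first error: Iceberg needs auto-creation enabled.
    if table_format == "Iceberg" and disable_glue_table_creation:
        reasons.append("Iceberg requires DisableGlueTableCreation=False")
    # Rule from the second error: DataCatalogConfig needs auto-creation disabled.
    if has_data_catalog_config and not disable_glue_table_creation:
        reasons.append("DataCatalogConfig requires DisableGlueTableCreation=True")
    return reasons


# Both settings of the flag are rejected when Iceberg and DataCatalogConfig are combined.
for flag in (True, False):
    print(flag, rejection_reasons("Iceberg", True, flag))
```

Under these two rules, an Iceberg feature group can only land in the default database and table.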

@francescocamussoni
Hello! I'm facing the same problem, is there a solution for this?

@yanivg10
yanivg10 commented Aug 1, 2023

I encountered the same issue. Due to data governance protocols, I cannot place the features in the default database (sagemaker_featurestore), so I would like to select a different database. However, I ran into the same errors described by other users in this thread. A prompt solution would be greatly appreciated!

@martinRenou martinRenou added type: feature request component: feature store Relates to the SageMaker Feature Store Platform labels Sep 22, 2023
@mollyheamazon mollyheamazon self-assigned this Feb 4, 2025
@mollyheamazon
Contributor

The partner team confirmed that they don't see a path to improving the behavior or documentation at this time.
