# Connect to Spark
This guide will help you set up FeatureByte with a Spark cluster.
## Before You Begin
Gather the credentials for connecting to your Spark cluster, specifically:
- Hostname of the Spark Thrift Server you're connecting to
- Credentials for the storage service used to stage files for the Spark cluster
**Thrift Server**

FeatureByte only supports connecting to Spark clusters that have a Thrift Server running. Refer to Spark Thrift Server for more details.

Ensure that the Thrift Server has Delta Lake support available. Refer to Set up Apache Spark with Delta Lake for more details.
In addition, some Spark configuration must be set for the Thrift Server to work with FeatureByte.
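The exact options depend on your deployment; as a minimal sketch, Delta Lake support is typically enabled with the standard settings from the Delta Lake documentation in `spark-defaults.conf`:

```
# Enable Delta Lake's SQL extension and make the session catalog Delta-aware
spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
```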
## Setup Guide
**FeatureByte Installation**
Make sure that you have FeatureByte installed. See installation for more details.
### Step 1: Test that your connection works
Verify that your connection works by creating a new feature store. You can do so by running the following commands, either in a notebook or in a Python interactive shell.
- If you know that a feature store exists already, you can list the feature stores registered in your local environment, as shown in the first snippet below.
- Alternatively, create a feature store using the example that matches your storage type.
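For example, to list the registered feature stores (a minimal sketch; it assumes the FeatureByte service is running locally):

```python
import featurebyte as fb

# List the feature stores registered with the local FeatureByte service
fb.FeatureStore.list()
```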
**S3**

```python
feature_store = fb.FeatureStore.create(
    # Name of the feature store that we want to create/connect to
    name="<feature_store_name>",
    source_type=fb.SourceType.SPARK,
    details=fb.SparkDetails(
        host="<host>",
        port=10000,
        featurebyte_catalog="spark_catalog",
        featurebyte_schema="<schema_name>",
        storage_type=fb.StorageType.S3,
        # URL FeatureByte uses to stage files in the bucket
        storage_url="https://<s3_endpoint>/<bucket_name>/<schema_name>",
        # URL the Spark cluster uses to access the same staging location
        storage_spark_url="s3://<bucket_name>/<schema_name>",
    ),
    storage_credential=fb.S3StorageCredential(
        s3_access_key_id="<s3_access_key_id>",
        s3_secret_access_key="<s3_secret_access_key>",
    ),
)
```
**GCS**

```python
feature_store = fb.FeatureStore.create(
    # Name of the feature store that we want to create/connect to
    name="<feature_store_name>",
    source_type=fb.SourceType.SPARK,
    details=fb.SparkDetails(
        host="<host>",
        port=10000,
        featurebyte_catalog="spark_catalog",
        featurebyte_schema="<schema_name>",
        storage_type=fb.StorageType.GCS,
        # URL FeatureByte uses to stage files in the bucket
        storage_url="gs://<bucket_name>/<schema_name>",
        # URL the Spark cluster uses to access the same staging location
        storage_spark_url="gs://<bucket_name>/<schema_name>",
    ),
    storage_credential=fb.GCSStorageCredential(
        service_account_info="<service_account_info>",
    ),
)
```
**Azure Blob Storage**

```python
feature_store = fb.FeatureStore.create(
    # Name of the feature store that we want to create/connect to
    name="<feature_store_name>",
    source_type=fb.SourceType.SPARK,
    details=fb.SparkDetails(
        host="<host>",
        port=10000,
        featurebyte_catalog="spark_catalog",
        featurebyte_schema="<schema_name>",
        storage_type=fb.StorageType.AZURE,
        # URL FeatureByte uses to stage files in the container
        storage_url="azure://<container_name>/<schema_name>",
        # URL the Spark cluster uses to access the same staging location
        storage_spark_url="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<schema_name>",
    ),
    storage_credential=fb.AzureBlobStorageCredential(
        account_name="<account_name>",
        account_key="<account_key>",
    ),
)
```
**WebHDFS**

```python
feature_store = fb.FeatureStore.create(
    # Name of the feature store that we want to create/connect to
    name="<feature_store_name>",
    source_type=fb.SourceType.SPARK,
    details=fb.SparkDetails(
        host="<host>",
        port=10000,
        featurebyte_catalog="spark_catalog",
        featurebyte_schema="<schema_name>",
        storage_type=fb.StorageType.WEBHDFS,
        # URL FeatureByte uses to stage files over WebHDFS
        storage_url="https://<webhdfs_host>:9871/user/hive/staging/<schema_name>",
        # URL the Spark cluster uses to access the same staging location
        storage_spark_url="hdfs://user/hive/staging/<schema_name>",
    ),
    # Kerberos credentials are used to authenticate with the cluster
    database_credential=fb.KerberosKeytabCredential.from_file(
        keytab_filepath="/etc/security/keytabs/hive.service.keytab",
        principal="hive/<host>@<realm>",
    ),
)
```
### Step 2: Connect to your Spark feature store
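Once the feature store has been created, it can be retrieved by name in later sessions. As a minimal sketch (assuming the feature store name used in Step 1):

```python
import featurebyte as fb

# Retrieve the feature store created in Step 1 by its name
feature_store = fb.FeatureStore.get("<feature_store_name>")

# Obtain a data source handle for browsing the databases,
# schemas and tables available in the Spark cluster
data_source = feature_store.get_data_source()
```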
Congratulations! If you are able to run these commands without any errors, you have successfully connected to your Spark cluster.
## Next Steps
Now that you've connected to your data, feel free to try out some tutorials!