
Connect to Spark

This guide will help you set up a FeatureByte feature store on a Spark cluster.

Before You Begin

Gather the credentials for connecting to your Spark cluster, specifically:

  • Hostname of the Spark Thrift Server you're connecting to
  • Credentials for the storage service used to stage files for the Spark cluster

Thrift Server

FeatureByte only supports connecting to Spark clusters that have a Thrift server running. Refer to Spark Thrift Server for more details.

Ensure that the Thrift server has Delta Lake support available. Refer to Set up Apache Spark with Delta Lake for more details.

The following configuration needs to be set for the Thrift server to work with FeatureByte:

spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation=hive
spark.hadoop.metastore.catalog.default=spark
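
If you want to sanity-check that a Spark installation has Delta Lake support before pointing FeatureByte at it, you can apply the same settings to a local session. This is a minimal sketch, assuming the pyspark and delta-spark packages are installed (configure_spark_with_delta_pip pulls in the matching Delta jars):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Apply the same Delta Lake settings used by the Thrift server
builder = (
    SparkSession.builder.appName("delta-smoke-test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Creating a Delta table succeeds only if the Delta extension is active
spark.sql("CREATE TABLE IF NOT EXISTS delta_smoke_test (id INT) USING DELTA")
spark.sql("DROP TABLE delta_smoke_test")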

Setup Guide

FeatureByte Installation

Make sure that you have FeatureByte installed. See installation for more details.
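
FeatureByte is distributed on PyPI, so a typical installation is:

pip install featurebyte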

Step 1: Test that your connection works

You can now check that your connection works by creating a new feature store. Run the following commands either in a notebook or in a Python interactive shell.

  • If you know that a feature store already exists, you can list the existing feature stores in your local Spark environment:
    import featurebyte as fb
    
    fb.FeatureStore.list()
    
    # Get a feature store if it exists
    feature_store = fb.FeatureStore.get(name="<feature_store_name>")
    
  • Alternatively, create a feature store. Pick the example below that matches the storage service used to stage files (S3, GCS, Azure Blob Storage, or WebHDFS).

    Using S3 storage:

    feature_store = fb.FeatureStore.create(
        # Name of the feature store that we want to create/connect to
        name="<feature_store_name>",
        source_type=fb.SourceType.SPARK,
        details=fb.SparkDetails(
            host="<host>",
            port=10000,
            catalog_name="spark_catalog",
            schema_name="<schema_name>",
            storage_type=fb.StorageType.S3,
            storage_path="s3://<bucket_name>/<schema_name>",
            storage_url="s3://<bucket_name>/<schema_name>",
        ),
        storage_credential=fb.S3StorageCredential(
            s3_access_key_id="<s3_access_key_id>",
            s3_secret_access_key="<s3_secret_access_key>",
        ),
    )
    
    Using Google Cloud Storage (GCS):

    feature_store = fb.FeatureStore.create(
        # Name of the feature store that we want to create/connect to
        name="<feature_store_name>",
        source_type=fb.SourceType.SPARK,
        details=fb.SparkDetails(
            host="<host>",
            port=10000,
            catalog_name="spark_catalog",
            schema_name="<schema_name>",
            storage_type=fb.StorageType.GCS,
            storage_path="gs://<bucket_name>/<schema_name>",
            storage_url="gs://<bucket_name>/<schema_name>",
        ),
        storage_credential=fb.GCSStorageCredential(
            service_account_info="<service_account_info>",
        ),
    )
    
    Using Azure Blob Storage:

    feature_store = fb.FeatureStore.create(
        # Name of the feature store that we want to create/connect to
        name="<feature_store_name>",
        source_type=fb.SourceType.SPARK,
        details=fb.SparkDetails(
            host="<host>",
            port=10000,
            catalog_name="spark_catalog",
            schema_name="<schema_name>",
            storage_type=fb.StorageType.AZURE,
            storage_path="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<schema_name>",
            storage_url="azure://<container_name>/<schema_name>",
        ),
        storage_credential=fb.AzureBlobStorageCredential(
            account_name="<account_name>",
            account_key="<account_key>",
        ),
    )
    
    Using WebHDFS (with Kerberos authentication):

    feature_store = fb.FeatureStore.create(
        # Name of the feature store that we want to create/connect to
        name="<feature_store_name>",
        source_type=fb.SourceType.SPARK,
        details=fb.SparkDetails(
            host="<host>",
            port=10000,
            catalog_name="spark_catalog",
            schema_name="<schema_name>",
            storage_type=fb.StorageType.WEBHDFS,
            storage_path="hdfs://user/hive/staging/<schema_name>",
            storage_url="https://<webhdfs_host>:9871/user/hive/staging/<schema_name>",
        ),
        database_credential=fb.KerberosKeytabCredential.from_file(
            keytab_filepath="/etc/security/keytabs/hive.service.keytab",
            principal="hive/<host>@<realm>",
        ),
    )
    

Step 2: Connect to your Spark feature store

# List all databases in the feature store
feature_store.get_data_source().list_databases()

Congratulations! If you were able to run these commands without any errors, you have successfully connected to your Spark feature store.
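
Once connected, the same data source object lets you explore the catalog further. A short sketch (the database and schema names below are placeholders for your own):

data_source = feature_store.get_data_source()

# List the schemas available in a database
data_source.list_schemas(database_name="spark_catalog")

# List the source tables available in a schema
data_source.list_source_tables(database_name="spark_catalog", schema_name="<schema_name>")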

Next Steps

Now that you've connected to your data, feel free to try out some tutorials!