Connect to Spark¶
This guide will help you set up FeatureByte with a Spark cluster.
Before You Begin¶
Gather the credentials for connecting to your Spark cluster, specifically:
- Hostname of the Spark Thrift Server you're connecting to
- Credentials for the storage service that is used to stage files for the Spark cluster
Thrift Server
FeatureByte only supports connecting to Spark clusters that have a Thrift Server running. Refer to Spark Thrift Server for more details.
Ensure that the Thrift Server has Delta Lake support enabled. Refer to Set up Apache Spark with Delta Lake for more details.
The Thrift Server must also be configured with Delta Lake support before it can work with FeatureByte.
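As a sketch, a Delta-Lake-enabled Thrift Server typically sets the standard Delta Lake options below (taken from the Delta Lake setup documentation; your cluster's exact configuration may differ):

```properties
# spark-defaults.conf (or equivalent --conf flags when starting the Thrift Server)
spark.sql.extensions                io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog     org.apache.spark.sql.delta.catalog.DeltaCatalog
```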
Setup Guide¶
FeatureByte Installation
Make sure that you have FeatureByte installed. See installation for more details.
Step 1: Test that your connection works¶
You can verify that your connection works by creating a new feature store. Run the following commands either in a notebook or in a Python interactive shell.
- If a feature store already exists, try listing the feature stores registered with your local Spark environment.
- Alternatively, create a new feature store.
```python
feature_store = fb.FeatureStore.create(
    # Name of the feature store that we want to create/connect to
    name="<feature_store_name>",
    source_type=fb.SourceType.SPARK,
    details=fb.SparkDetails(
        host="<host>",
        port=10000,
        featurebyte_catalog="spark_catalog",
        featurebyte_schema="<schema_name>",
        storage_type=fb.StorageType.S3,
        storage_url="<storage_url>",
        storage_spark_url="gs://dataproc-cluster-staging/<schema_name>",
    ),
    storage_credential=fb.S3StorageCredential(
        s3_access_key_id="<s3_access_key_id>",
        s3_secret_access_key="<s3_secret_access_key>",
    ),
)
```
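For the first option, a minimal sketch of listing existing feature stores (assuming the `featurebyte` SDK is installed, imported as `fb`, and connected to a running FeatureByte service):

```python
import featurebyte as fb

# Returns a DataFrame of feature stores already registered
# with the FeatureByte service you are connected to
fb.FeatureStore.list()
```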
Step 2: Connect to your Spark feature store¶
Congratulations! You have successfully connected to your Spark server if you are able to run these commands without any errors!
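To work with the feature store in later sessions, you can retrieve it by name and obtain a data source handle. A minimal sketch, assuming the feature store created above and a running FeatureByte service (`<feature_store_name>` is a placeholder):

```python
import featurebyte as fb

# Retrieve the feature store registered earlier by name
feature_store = fb.FeatureStore.get("<feature_store_name>")

# Obtain a data source handle for browsing the schemas and
# tables available in the connected Spark catalog
data_source = feature_store.get_data_source()
data_source.list_schemas(database_name="spark_catalog")
```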
Next Steps¶
Now that you've connected to your data, feel free to try out some tutorials!