Skip to content

Databricks

The Databricks data provider allows ChartFactor to interact with Databricks.

Setup

To use the Databricks data provider you need include its library:

1
<script src="./cft-databricks-provider.min.js"></script>

The Provider JSON object requires a name, the provider type (e.g. 'databricks'), host, warehouse id, and token parameters. Example:

1
2
3
4
5
6
7
8
9
// define providers
var providers = [{
    name:'Databricks',
    provider:'databricks',
    host: 'DATABRICKS_HOST',
    warehouseId: 'DATABRICKS_WAREHOUSE_ID',
    token: 'DATABRICKS_TOKEN',
    metadata: {}
}]
Optionally, the configuration for Databricks allows to specify headers in case header elements need to be provided (e.g. User-Agent).

1
2
3
4
var providers = [{
    ...
    headers: { "User-Agent": "myapp/1.0.0"},
}]

Supported provider properties:

  • host: The Server Hostname value
  • warehouseId: The Databricks warehouse id value
  • token: The Databricks personal access token value
  • headers: The HTTP headers to be sent to the Databricks server, if the Authorization header is provided it will be overwritten by the token value
  • metadata: Please refer to the Custom Metadata section for a description on how to configure custom metadata

Finally, use the setProviders() method of ChartFactor to set your data provider definitions. Example:

1
cf.setProviders(providers);

Partitioning and Clustering Support

Databricks tables can be partitioned and clustered (Liquid Clustering) to improve query performance and efficiency. To take advantage of these optimizations, queries must include the partitioned or clustered column in their WHERE clause.

In some cases, the value of this column is derived from a hash or computed using one or more functions. Within ChartFactor, this behavior can be configured using Custom Metadata by adding a routing property to the corresponding data source.

The routing Property

The routing property accepts an array of objects, where each object must define the following fields:

  • field: The name of the field used in the filter.
  • partitionField: The name of the column used for partitioning or clustering in Databricks.
  • Either function or query: Defines how to transform or resolve the filter value.

For query based routing, you should provide a SQL expression in the query property that uses the special variable {filter_value} to represent the value of the filter applied on the field property. An example is presented below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
// define providers
var providers = [{
    name:'Databricks',
    provider:'databricks',
    ...
    metadata: {
      "catalog_name:schema_name.your_table": {
        routing: [
            {
                field: 'company',
                partitionField: 'company_hash',
                query: "MOD( ABS( hash({filter_value}) ), 10 )"
            }
        ]
      }
    }
}]

For function based routing, the function property must receive a function that takes as input the value of the filter applied on the field property and returns the computed value for the partitionField. An example is presented below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// define providers
var providers = [{
    name:'Databricks',
    provider:'databricks',
    ...
    metadata: {
      "catalog_name:schema_name.your_table": {
        routing: [
            {
                field: 'company',
                partitionField: 'company_hash',
                function: (value) => { // the value of the filter when a column filter is applied
                    // assuming that the method  to generate a hash from a string is the same when ingesting data
                    let hash = 0;
                    for (let i = 0; i < value.length; i++) {
                        const char = value.charCodeAt(i);
                        hash = ((hash << 5) - hash) + char;
                        hash = hash & hash; // Convert to 32bit integer
                    }
                    return hash;
                }
            }
        ]
      }
    }
}]

Note

If you provide both function and query, the query property will take precedence.

Supported Aggregations Out-Of-The-Box

SUM

1
    var metric = cf.Metric("amount","sum");

AVG

1
    var metric = cf.Metric("amount","avg");

MIN

1
    var metric = cf.Metric("amount","min");

MAX

1
    var metric = cf.Metric("amount","max");

PERCENTILES

1
    var metric = cf.Metric('commission', 'percentiles');

COUNT DISTINCT

1
    var metric = cf.Metric("my_attribute","unique");

APPROXIMATE COUNT DISTINCT

For large datasets, the approximate count distinct aggregation function unique_approx can be used to improve performance when exact counts are not required.

1
    var metric = cf.Metric("my_attribute","unique_approx");

Dependencies

  • None