Spark SQL¶

Setup¶

The Spark SQL data provider allows ChartFactor to interact with Spark SQL. To use it, include the provider script in your page:

<script src="./CFT-sparksql-provider.min.js"></script>

The provider JSON object requires the url parameter in addition to the name and provider parameters. Example:

// define providers
var providers = [{
    name:'Spark SQL',
    provider:'sparksql',
    url:'https://34.233.78.155'
}]

Then, use the setProviders() method of ChartFactor to set your data provider definitions. Example:

cf.setProviders(providers);

Additionally, custom HTTP headers can be passed in the provider definition through the headers property. By default, the provider uses the following headers:

{
  'Content-Type': 'application/json',
  'Accept': 'application/json'
}

When adding extra headers, also include the default headers shown above. For example:

var providers = [{
    ...
    headers: {
      'Authorization': 'Bearer ' + authToken,
      'Content-Type': 'application/json',
      'Accept': 'application/json'
    }
}]

cf.setProviders(providers)
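
One way to avoid dropping the default headers is to merge them with the extra ones before building the provider definition. The following is a minimal sketch, not part of ChartFactor: the helper name withDefaultHeaders and the token value are assumptions made for illustration.

```javascript
// Default headers listed in this section.
const DEFAULT_HEADERS = {
  'Content-Type': 'application/json',
  'Accept': 'application/json'
};

// Hypothetical helper (not a ChartFactor API): returns the defaults plus
// any extra headers. Extra headers win when a name is repeated.
function withDefaultHeaders(extraHeaders) {
  return Object.assign({}, DEFAULT_HEADERS, extraHeaders);
}

const authToken = 'my-token'; // placeholder value for illustration only

const providers = [{
  name: 'Spark SQL',
  provider: 'sparksql',
  url: 'https://34.233.78.155',
  headers: withDefaultHeaders({ 'Authorization': 'Bearer ' + authToken })
}];

// In a page that loads ChartFactor:
// cf.setProviders(providers);
```

Building the headers object this way keeps the defaults in one place, so adding or removing an extra header does not require repeating them.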

This data provider assumes your Spark SQL server is fronted with an HTTP REST server with the following operations:

GET /tables: Returns the list of tables
GET /tables/{id}: Returns all fields and their types for the table specified by the {id} parameter
POST /query: Executes a SQL query and returns the results in JSON format

The spark-sql-rest project is a reference implementation that provides a REST front end to Spark SQL.
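
The three operations listed above can be sketched as plain request descriptors, which may help when checking that a custom REST server exposes the expected paths. This is only an illustration: the function name buildRequests is hypothetical, and the shape of the POST /query body ({ query: ... }) is an assumption that is not specified in this section, so verify it against your REST server.

```javascript
// Builds request descriptors for the three operations listed above.
// Nothing is sent over the network here.
function buildRequests(baseUrl, tableId, sql) {
  const headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json'
  };
  return {
    // GET /tables
    listTables: { method: 'GET', url: baseUrl + '/tables', headers: headers },
    // GET /tables/{id}
    describeTable: {
      method: 'GET',
      url: baseUrl + '/tables/' + encodeURIComponent(tableId),
      headers: headers
    },
    // POST /query
    runQuery: {
      method: 'POST',
      url: baseUrl + '/query',
      headers: headers,
      // ASSUMPTION: this body shape is hypothetical.
      body: JSON.stringify({ query: sql })
    }
  };
}

const requests = buildRequests('https://34.233.78.155', 'your_table', 'SELECT 1');
console.log(requests.describeTable.url);
```

Each descriptor could be passed to an HTTP client of your choice to confirm that the server responds on that path.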

Partitioning and Clustering Support¶

Spark SQL tables can be partitioned and clustered (Liquid Clustering) to improve query performance and efficiency. To take advantage of these optimizations, queries must include the partitioned or clustered column in their WHERE clause.

In some cases, the value of this column is derived from a hash or computed using one or more functions. Within ChartFactor, this behavior can be configured using Custom Metadata by adding a routing property to the corresponding data source.

The `routing` Property¶

The routing property accepts an array of objects, where each object must define the following fields:

field: The name of the field used in the filter.
partitionField: The name of the column used for partitioning or clustering in Spark SQL.
Either function or query: Defines how to transform or resolve the filter value.

For query-based routing, provide a SQL expression in the query property. The expression uses the special variable {filter_value} to represent the value of the filter applied on the field property. An example is presented below:

// define providers
var providers = [{
    name:'SparkSQL',
    provider:'sparksql',
    ...
    metadata: {
      "your_table": {
        routing: [
            {
                field: 'company',
                partitionField: 'company_hash',
                query: "MOD( ABS( hash({filter_value}) ), 10 )"
            }
        ]
      }
    }
}]
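
To make the role of {filter_value} concrete, the sketch below shows how such a rule could expand into a predicate on the partition column. It is an illustration only and not ChartFactor's implementation: the function name expandQueryRouting is hypothetical, and inserting the value as a quoted SQL string literal is an assumption.

```javascript
// Hypothetical illustration only; this is not ChartFactor's implementation.
function expandQueryRouting(rule, filterValue) {
  // ASSUMPTION: the filter value is inserted as a quoted SQL string literal,
  // with embedded single quotes doubled.
  const literal = "'" + String(filterValue).split("'").join("''") + "'";
  // Replace every occurrence of the special variable with the literal.
  const expression = rule.query.split('{filter_value}').join(literal);
  return rule.partitionField + ' = ' + expression;
}

const rule = {
  field: 'company',
  partitionField: 'company_hash',
  query: "MOD( ABS( hash({filter_value}) ), 10 )"
};

const predicate = expandQueryRouting(rule, 'Acme');
console.log(predicate);
// company_hash = MOD( ABS( hash('Acme') ), 10 )
```

Under these assumptions, a filter on company would add a condition on company_hash, which is what allows Spark SQL to prune partitions or clustered files.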

For function-based routing, the function property must be a function that takes the value of the filter applied on the field property as input and returns the computed value for the partitionField. An example is presented below:

// define providers
var providers = [{
    name:'SparkSQL',
    provider:'sparksql',
    ...
    metadata: {
      "your_table": {
        routing: [
            {
                field: 'company',
                partitionField: 'company_hash',
                function: (value) => { // value is the filter value applied on the field
                    // Assumes the same string-hashing method was used when ingesting the data
                    let hash = 0;
                    for (let i = 0; i < value.length; i++) {
                        const char = value.charCodeAt(i);
                        hash = ((hash << 5) - hash) + char;
                        hash = hash & hash; // Convert to 32bit integer
                    }
                    return hash;
                }
            }
        ]
      }
    }
}]
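
The hash in the example above can be checked in isolation before wiring it into a provider definition. The sketch below copies the same logic into a standalone function; the name stringHash is chosen here for illustration and is not a ChartFactor API.

```javascript
// Same string hash as in the routing example above, as a standalone function.
function stringHash(value) {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    const char = value.charCodeAt(i);
    hash = ((hash << 5) - hash) + char; // equivalent to hash * 31 + char
    hash = hash & hash;                 // keep the result in 32-bit integer range
  }
  return hash;
}

console.log(stringHash('abc')); // 96354
console.log(stringHash(''));    // 0
```

Because the result is a signed 32-bit integer, it can be negative for some inputs. Whatever function is used must produce the same values that were written to the partitionField column at ingestion time.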

Note

If you provide both function and query, the query property will take precedence.

Supported Aggregations Out-Of-The-Box¶

SUM¶

    var metric = cf.Metric("amount","sum");

AVG¶

    var metric = cf.Metric("amount","avg");

MIN¶

    var metric = cf.Metric("amount","min");

MAX¶

    var metric = cf.Metric("amount","max");

PERCENTILES¶

    var metric = cf.Metric('commission', 'percentiles');

COUNT DISTINCT¶

    var metric = cf.Metric("my_attribute","unique");

APPROXIMATE COUNT DISTINCT¶

For large datasets, the approximate count distinct aggregation function unique_approx can be used to improve performance when exact counts are not required.

    var metric = cf.Metric("my_attribute","unique_approx");

Dependencies¶

None