Spark SQL¶
Setup¶
The Spark SQL data provider allows ChartFactor to interact with Spark SQL.
1 | |
The Provider JSON object requires the url parameter in addition to name and provider parameters. Example:
1 2 3 4 5 6 | |
Then, use the setProviders() method of ChartFactor to set your data provider definitions. Example:
1 | |
Additionally, custom headers can be passed to the spark configuration. By default it uses the following:
1 2 3 4 | |
So, if we need to add extra headers, we should also provide the above configuration. For example:
1 2 3 4 5 6 7 8 9 10 | |
This data provider assumes your Spark SQL server is fronted with an HTTP REST server with the following operations:
- GET /tables: Returns the list of tables
- GET /tables/{id}: Returns all fields and their types for the table specified by the {id} parameter
- POST /query: Executes a SQL query and returns the results in JSON format
The spark-sql-rest project is a reference implementation that provide a REST front to SparkSQL.
Partitioning and Clustering Support¶
SparkSQL tables can be partitioned and clustered (Liquid Clustering) to improve query performance and efficiency. To take advantage of these optimizations, queries must include the partitioned or clustered column in their WHERE clause.
In some cases, the value of this column is derived from a hash or computed using one or more functions. Within ChartFactor, this behavior can be configured using Custom Metadata by adding a routing property to the corresponding data source.
The routing Property¶
The routing property accepts an array of objects, where each object must define the following fields:
field: The name of the field used in the filter.partitionField: The name of the column used for partitioning or clustering in SparkSQL.- Either
functionorquery: Defines how to transform or resolve the filter value.
For query based routing, you should provide a SQL expression in the query property that uses the special variable {filter_value} to represent the value of the filter applied on the field property. An example is presented below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | |
For function based routing, the function property must receive a function that takes as input the value of the filter applied on the field property and returns the computed value for the partitionField. An example is presented below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | |
Note
If you provide both function and query, the query property will take precedence.
Supported Aggregations Out-Of-The-Box¶
SUM¶
1 | |
AVG¶
1 | |
MIN¶
1 | |
MAX¶
1 | |
PERCENTILES¶
1 | |
COUNT DISTINCT¶
1 | |
APPROXIMATE COUNT DISTINCT¶
For large datasets, the approximate count distinct aggregation function unique_approx can be used to improve performance when exact counts are not required.
1 | |
Dependencies¶
- None