Reading Note - Presto The Definitive Guide

2020-06-06

A book about Presto, "Presto: The Definitive Guide" has been published by Starburst. We can get PDF for free from here O'Reilly Presto Guide Download – Starburst Data

The book covers a broader themes of Presto; from basic concepts to production ready deployment. This post is a part of my reading note that I especially wanted to learn deeper.

Presto Architecture

This is described especially in Chapter 4.

Connector

Presto can submit SQL queries on variaous external data sources, such as RDBMS, KVS, and object storages(S3, etc.). To access data sources, it uses "connector" to manage data and shape of them. Presto configures connectors by configuration files with specifiying properties, for example, URL, username, password, JDBC driver, etc.

Quick look terminologies:

Connector
- An adapter to each data source
Catalog
- Definition of how to access to a data source via connector
- Associated with connector
- Has more than one schemas
Schema
- A set of tables
Table
- A set of rows
- A set of named and typed columns

To see such metadata of each catalog/schema/table, we can use SQL statements like:

SHOW CATALOGS;
SHOW SCHEMAS FROM 'catalog_name';
SHOW TABLES FROM 'catalog_name.schema_name';
DESCRIBE 'catalog_name.schema_name.table_name';

Since it brings a separation of storage and computation, Connector-based architecture is an important part of Presto. If there is a connector to a data source which you want to access, Presto can execute queries to it via the connector. Each connectors need to implement SPI(service provider interface) which standardizes operations to data sources. There are thre parts of connector implementation:

Operations to fetch metadata of table/view/schema
Operations to produce logical units of data partitioning
- Presto uses partitions to read and write in parallel
Operations to convert data into in-memory format, and vice versa

In the viewpoint of Presto's SPI, these are categorized as Metadata SPI, Data Statistics SPI, and Data Location SPI which are used by coordinator, and Data Stream SPI which is used by worker.

Internal Architecture

As it segragates storage and computations thanks to connector-based archiecture, Presto is a distributed query engine and forms MPP(Massively Parallel Processing) style so that it can add computation nodes as it needs power of processing. Presto cluster is composed of single coordinator and multiple worker nodes. Presto clients such as JDBC driver and CLI, connect to coordinator, and then coordinator will handle submitted queries by cooperating with workers. Roles of coordinator and worker are:

Coordinator
- Parse and analyze SQL
- Build query plan
- Scheduling to workers
- Collect data processed by workers
- Expose results to clients through output buffer
- Discovery service
  - Manage all nodes living in the cluster
  - Get rid of failure worker by failure detector triggered by heartbeat failure
  - Typically, deployed as a embedded service in the coordinator
Worker
- Execute tasks assigned from coordinator
- Process actual data
- Send back results to coordinator
- Join to cluster via discovery service

Every communications between these components are done by HTTP(s).

Query Execution Model

When a SQL text is submitted to coordinator, coordinator start query planning by parsing and analyzing at first. Each phase of query planning depends on SPI like:

Parser and Analyzer use Metadata SPI
- SQL validation
- type checking
- security checking
Planner and Optimizer use Data Statistics SPI
- Retrieve statistics such as number of records, table size, etc. to optimize query based on its cost
Scheduler uses Data Location SPI
- Make a distributed query plan
- Produces logical splits for data

Terminologies during Query execution

To describe the query execution, it needs several terminologies. The top level concept is query, of course.

query
- SQL that client submits to coordinator
- Coordinator creates query plan and split it into fragments through the parsing/analyzing phases
fragment
- A part of query plan
- Treated as a stage in the runtime
stage
- Runtime incarnation of a fragment
- Make a dependency tree if a query has multiple stages
  - As complexity of the query increase, the number of stages get increased
- Consist of more at least tasks
task
- A unit of work during query execution
- Scheduled by coordinator across the workers
- Source task
  - A task to fetch data using data source SPI through connector
  - Data fetched by source task is treated as pages in Presto
split
- A unit of data that a task processes
- A unif of parallelism and work assignment
- e.g. split in Hive connector is a form of a path to a file with offset and length
page
- Data that source stage produces
- A collection of rows in columnar format
- Transferred between stages by some operators
  - Read data from upstream stage's tasks
operator
- Process input data and produce pages for a downstream operator according to its semantics
- e.g. scan, joins, aggregations, filter, projection, produce derived columns
pipeline
- A sequence of operators within a task
- The last operator is usually a operator to put pages into output buffer of the task
  - Exchange operator in downstream tasks consume the pages from an upstream tasks' output buffer
- After creating a task, instantiate a driver for each split
driver
- A implementation of a pipeline process the data in the split
- A task uses at least one drivers
  - depending on configurations and environments
- When all drivers are finished and data is passed to the next split, drivers and the task are finished and destroyed

Flow of query processing

Coordinator cooperate with workers to process query. A simplified flow of processing a query is like following:

Coordinator creates a list of splits with the metadata from the connector
Coordinator schedules tasks on the workers to fetch data in the splits using the list of splits
Coordinator tracks all splits available for processing and locations where tasks are running on workers
When tasks finish processing and producing more splits for downstream, coordinator continues to schedule tasks
- Continue scheduling recursively until no splits remain for processing
Coordinator expose the result data to the client once all splits are processed and all result data is available

Cost Based Optimizer

Cost based optimizer(CBO) is introduced as a state-of-the-art of Presto in this book. Cost based optimizer optimized query plan based on cost as the name implies. On the other hand, we also have syntactic optimizer that optimizes query plan based on query syntax. This means that cost based optimizer considers not only query syntax, but also cost to execute the query. The cost of query can be calculated using statistics of data, and shape of query.

If query optimizer does not care of cost, a query which has several predicates in where clause can get significantly slower because of the order of predicates even if its results are same. A person who trained enough to understand how syntactic optimizer works as well as trends of data is possible to build well efficient query. However, other tools like BI tool don't know that, thus they produces easily unefficient queries.

For example, these following queries must produce the same results.

SELECT a.id, b.id, c.id FROM a, b, c WHERE a.foo_id = b.foo_id AND b.bar_id = c.bar_id
SELECT a.id, b.id, c.id FROM a, b, c WHERE b.bar_id = c.bar_id AND a.foo_id = b.foo_id In this case, the join orders are different so that if 'a' table is much smaller than 'c' table, the first query can be significantly faster than the second. This is the reason why the cost based optimizer matters. Even if results of queries are the same but shapes of queries are different, cost based optimizer tries to make the most effective query plan for all.

You can see official document here: https://prestosql.io/docs/current/optimizer/cost-based-optimizations.html

Cost of Join

Presto implements an extended hash join to join two tables. One of the joined tables is called as build side which is used to build a lookup hash table, and the another is probe side which is processed using the lookup hash table to find matching build side rows in constant time. The build side is a base of the join, and Presto store all of build side in memory so that it's required to be able to fit into available memory capacity. To decide which is going to be build and probe side, Presto uses several statistics provided from connector.

Number of records
Number of unique values in each column
NULL ratio
Min and Max values in each column
Avarage data size of each column

Depending on connector implementation, some statistics can be missing, however, Presto try to use available statistics as much as possible.

Speaking of processing efficiency, Presto does 3-level hashing as default in order to parallelize as much as possible.

Records of join tables are distributed over workers based of hashed values of join condition columns.
- Records that have same values are assigned to the same nodes
- Node-level data assignment
Distribute records of build side to worker threads within a node based of hashed values
- Building a lookup hash table can be considered as CPU bound processing so that this distribution will improve throughput
Each worker thread creates a partition of a lookup hash table
- Two level lookup hash table prevent probe side records get scattered over multiple threads

Broadcast join vs Partitioned join

Presto offers 2 types of join, broadcast join and partitioned join which was formerly(or even now?) called as distributed join. Both strategies have trade-offs like:

Broadcast join
- All records of build side are broadcasted(copied) to all workers
- Must assign probe side records without duplication to avoid duplicated results
- This can save network traffic, but requires enough memory to hold all build side rows
Partitioned join
- Both of build side and probe side are redistributed across the cluster
- Each worker receives unique set of records over network
- Restribution uses a partition algorithm to send unique set of records properly
  - Records that have the same join condition key should be sent to the same worker
- This costs extra network traffic, but requires smaller memory than broadcast does

Summary

Presto is a fantastic software to let us run SQL on anything. So, it should get difficult to understand the whole of Presto. This book is really nice to grab overview of Presto's internal stuff, however, there are lots of details even this book does not cover. Therefore we might need to dig into Prestos' source codes and actually operate it to learn more. Noting that in the book, there are lots of topics this post does not cover, such as connector, deployment, query optimization rules, etc. If you are interested in, PDF is available in starburstdata.com for free as I mentioned.