Title
A declarative approach to optimizing massively parallel data
processing (Research)
Abstract
Database research has witnessed a renewed interest in parallel data processing. While distributed
and parallel data management systems have been around for quite some time, it is the rise of
cloud computing and the advent of big data that present new challenges. Nowadays, parallelism is
not restricted to a handful of servers but is massive, ranging from hundreds to tens of thousands
of computing nodes. Queries are not limited to simple keyword search but involve complex join
queries over multiple database tables in support of large-scale data analytics. Furthermore,
performance is no longer dominated by the number of I/O requests to external memory as in
traditional systems but by the communication cost for reshuffling data over the network during
query execution. The latter calls for novel techniques for analyzing and optimizing complex queries
in the massively parallel setting. Unfortunately, the rise of many different systems, each with its
own characteristics, has led to a proliferation of ad hoc specialized techniques that are difficult to
transfer between different systems. In this work, I aim to develop a uniform approach to
query optimization in massively parallel systems. In particular, my research proposal has the
following objectives: (1) develop a declarative framework for massively parallel data processing;
(2) study decision problems in support of static analysis of queries in this framework; (3) develop
general techniques for multi-query optimization.
Period of project
01 October 2017 - 30 September 2019