Title
Optimizing advanced analytic tasks over distributed data (Research)
Abstract
In the era of big data, companies and scientific institutions are facing data that comes in varieties
and volumes never encountered before. At the same time, new needs and expectations have arisen regarding
the insight and intelligence that can be derived from these datasets using predictive analytics via
statistical and machine-learning models and algorithms. While sampling has been a commonly used
technique to bridge the gap between large datasets and deep analytics via expert tools, today,
driven by cheap storage and processing capacity, there is a strong desire to use the entire dataset to
extract value in the most refined and holistic way possible.
In this proposal, we focus on supporting advanced big data analytics with a new generation of
distributed query engines. Here, the term "big data analytics" is used as an umbrella term for complex
tasks that combine traditional query operations, like table joins, with operations from linear algebra,
like matrix multiplication. In particular, we aim to support big data analytics from a database
perspective, where a distributed query engine provides a solid supporting environment for effective
computation and optimization of typical advanced analytic tasks.
The overall goal of this project is to contribute to a better fundamental understanding of how
complex data analytic workflows can be executed in a big data setting, where distribution and
parallelization are key.
Period of project
01 January 2019 - 31 December 2022