The Data Science Institute performs research, supports education and provides consultancy across the data science cycle.
[Figure to be added: the data science cycle]
- understand user needs: As the first step in any data science project, it is paramount to thoroughly understand the needs and goals of the user, but also the context: is the user the real user, are other stakeholders involved, which decisions need to be made and supported, and in which process do these decisions take place? Equally important are the constraints and the degrees of freedom to act in order to satisfy the needs or meet the goals of the user. These questions can range from very specific to more general.
- wrangle & manage data: Data wrangling refers to the preparation of data for analysis, that is, the process of gathering, extracting, cleaning, structuring, and enriching raw data into a desired format that serves as input for further analysis downstream in the data science life cycle. The knowledge acquired within the DSI is based on different types of data, ranging from unstructured (e.g., text), through semi-structured (e.g., JSON), to structured data (e.g., relational), and includes event data, hierarchical data, spatio-temporal data, time-series data, etc. Good data management is paramount in support of data wrangling and concerns the planning, development, and administration of systems for storage, security, retrieval, dissemination, and archiving of data. The institute has a focus on scalable data storage (e.g., NoSQL solutions such as document-oriented databases, graph databases, key/value stores, and RDF) and scalable data processing and querying (e.g., MapReduce/Hadoop, Spark, SPARQL). Methodologies that incorporate these include the lambda architecture, distributed data management and computing, and cloud computing. Other important aspects include provenance, data cleaning, data integration, OLAP and OLTP, and data architecture (logical & physical).
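As a minimal illustration of data wrangling, the sketch below cleans and structures semi-structured JSON records into uniform rows suitable for downstream analysis. The field names and records are purely hypothetical, not tied to any DSI dataset:

```python
import json

# Hypothetical semi-structured input: raw JSON records with stray
# whitespace, missing values, and inconsistently typed fields.
raw = """[
  {"id": 1, "name": "  Alice ", "visits": "3"},
  {"id": 2, "name": "Bob", "visits": null},
  {"id": 3, "name": "Carol", "visits": 5, "extra": "ignored"}
]"""

def wrangle(records):
    """Clean and structure raw records into uniform (id, name, visits) rows."""
    rows = []
    for rec in records:
        rows.append({
            "id": int(rec["id"]),
            "name": str(rec.get("name", "")).strip(),  # trim stray whitespace
            # Coerce visits to int, defaulting missing values to 0.
            "visits": int(rec["visits"]) if rec.get("visits") is not None else 0,
        })
    return rows

clean = wrangle(json.loads(raw))
```

In practice this step is carried out with dedicated tooling at scale, but the structure is the same: parse, clean, coerce, and normalise before any analysis.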
- analyse domain knowledge: Data is a very raw and imperfect container of knowledge. Recent developments in computing power, storage, and algorithms have allowed us to retrieve the knowledge hidden inside data to tackle various problems. However, data is only part of the solution. A lot of relevant knowledge is also present in a more or less implicit form, such as tacit knowledge inside the heads of experts, knowledge communicated in scientific papers, and knowledge represented by legislation or the laws of physics. The true potential of data science lies in the combination of data and domain knowledge. Therefore, an important step in the data science cycle is the proper representation of the domain knowledge at hand. For example, when analysing data on new legislation in criminal law, one can try to learn a prediction model from the available data on how legislation will influence crime numbers. However, despite good accuracy scores, such a prediction model can contain counter-intuitive relationships. By modelling knowledge from domain experts on how crime numbers are impacted and using such models as part of the prediction model, these counter-intuitive relationships are often eliminated, and prediction models become more explainable while maintaining predictive power. Many different formalisms exist to represent domain knowledge, each with its own strengths and weaknesses, ranging from different forms of logic and rule bases to fuzzy cognitive maps, Bayesian networks, and the wide variety of statistical models.
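One of the formalisms mentioned above, a fuzzy cognitive map, can be sketched in a few lines: concepts carry activation values in [-1, 1], signed weights encode expert beliefs about causal influence, and the map is iterated until it settles. The concepts and weights below are purely hypothetical, chosen only to echo the crime-legislation example:

```python
import math

# Hypothetical fuzzy cognitive map with three concepts.
# weights[i][j] is the expert-assessed causal influence of concept i on j.
concepts = ["legislation", "enforcement", "crime"]
weights = [
    [0.0,  0.7, -0.2],  # stricter legislation strengthens enforcement, dampens crime
    [0.0,  0.0, -0.6],  # stronger enforcement reduces crime
    [0.0,  0.0,  0.0],  # no modelled back-influence of crime in this toy map
]

def step(activations, weights):
    """One FCM update: each concept combines its own state with weighted inputs, squashed by tanh."""
    n = len(activations)
    return [
        math.tanh(activations[j] + sum(activations[i] * weights[i][j] for i in range(n)))
        for j in range(n)
    ]

state = [1.0, 0.0, 0.5]  # scenario: new legislation enacted, some baseline crime
for _ in range(20):      # iterate the map until it (approximately) converges
    state = step(state, weights)
```

After iteration, the crime concept settles at a negative activation: the expert-encoded causal structure, not a learned correlation, drives the predicted decrease, which is exactly what makes such hybrid models easier to explain.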
- model: Data modelling concerns the machine learning and statistics phase in the data science cycle and refers both to the construction of models from data and domain knowledge, and to the construction and improvement of the algorithms needed to learn such models from the data. At the institute, research is performed using and developing different methodologies, including (but not limited to) clinical trial analysis, Long Short-Term Memory models, fuzzy cognitive maps and deep learning, time-series prediction, decision trees, and spatial data analysis. In addition, advances are being made in dimensionality reduction and clustering (e.g., feature selection, distance metrics for complex data, fuzzy/collaborative clustering), as well as optimisation (including parameter tuning and meta-heuristics).
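To make the modelling step concrete, here is a deliberately minimal time-series prediction sketch: fitting a least-squares trend line and forecasting one step ahead. This is a generic textbook technique and a hypothetical series, not a stand-in for any specific DSI method:

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = a*t + b over t = 0..n-1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    a = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) \
        / sum((t - t_mean) ** 2 for t in ts)
    b = y_mean - a * t_mean
    return a, b

def predict_next(ys):
    """One-step-ahead forecast from the fitted trend line."""
    a, b = fit_line(ys)
    return a * len(ys) + b

series = [2.0, 4.0, 6.0, 8.0]    # hypothetical series with a linear trend
forecast = predict_next(series)  # → 10.0
```

Real modelling work (LSTMs, decision trees, deep learning) replaces the trend line with far richer model families, but the shape is the same: estimate parameters from observed data, then apply the fitted model to unseen inputs.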
- explore: In close connection with modelling, data exploration concerns the human-in-the-loop investigation of data. This involves, among others, visual analytics, topological data analysis and graph mining, uncertainty quantification and visualisation, rough sets, process conformance checking, and exploratory event analytics. Such an exploratory process has a short feedback loop between the input of the analyst and the output of an algorithm, and generally follows an "overview first, zoom & filter, details on demand" paradigm: an overview of all data after transformation (typically dimensionality reduction or clustering) leads to the identification of points or regions of interest, which can then be investigated further. The ultimate aim of combining data exploration with data modelling is to augment the analyst in reaching data-driven conclusions.
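The "overview first, zoom & filter, details on demand" paradigm can be illustrated with a toy sketch; the sample identifiers, values, and filter criterion below are hypothetical:

```python
# Hypothetical measurements: (sample_id, value) pairs.
data = [("s1", 0.2), ("s2", 3.1), ("s3", 0.4), ("s4", 2.8), ("s5", 0.3)]

# 1. Overview: summary statistics over all data.
values = [v for _, v in data]
overview = {"n": len(values), "min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}

# 2. Zoom & filter: select points of interest, e.g. values above the mean.
of_interest = [(sid, v) for sid, v in data if v > overview["mean"]]

# 3. Details on demand: inspect one selected point individually.
detail = dict(data)["s2"]
```

In a real visual-analytics tool the overview would be a plot of dimensionality-reduced data and the filtering would be interactive brushing, but the three-stage loop between analyst and algorithm is the same.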
- evaluate: Throughout data modelling and exploration, the quality of the developed methods needs to be assured by extensive testing and evaluation. In most cases this involves quantitative measures well described in the literature on experiment design and on the selection of validation and test data. In some cases, particularly in data exploration, this can involve qualitative evaluation (e.g., a think-aloud protocol) when, due to the nature of the data, only a limited number of expert users are available or even exist.
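A minimal sketch of quantitative evaluation on a held-out test set follows. The "model" here is a trivial hypothetical threshold classifier on synthetic data, chosen only to show the split-train-evaluate pattern:

```python
import random

# Hypothetical labelled data: value x with label 1 iff x >= 0.5.
random.seed(42)
data = [(x, int(x >= 0.5)) for x in [random.random() for _ in range(100)]]

# Split into training data and a held-out test set.
random.shuffle(data)
train, test = data[:80], data[80:]

# A trivial "model": the decision threshold is the mean of the training inputs.
threshold = sum(x for x, _ in train) / len(train)

def predict(x):
    return int(x >= threshold)

# Quantitative evaluation: accuracy on the held-out test set only.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
```

The essential point is that the test set plays no role in fitting the model; any measure computed on it (accuracy, precision, error) is therefore an honest estimate of quality on unseen data.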
- understand/disseminate/deploy: Data modelling and data exploration techniques that have proven to be of high quality have limited value unless they are put to use. This can happen in different ways. First, the ultimate aim of much research is to gain insight into and understanding of the world. This can range from a better understanding of a biological system and the dynamics of epidemic spread, to deeper insight into the effect of a data-driven economy on business. Second, results of the data modelling/exploration phase are disseminated to colleagues and/or the wider public. Again, this can take different forms depending on the audience, ranging from scientific papers and conference presentations to articles in venues aimed at a wider public. Finally, the results can be deployed in new tools and services, or implemented in guidelines and tutorials, in order to support both academia and business in their research and decision-making processes.