To focus its research and identify where and how to invest, the Data Science Institute has identified three areas of specific importance for the coming years: contextualisation, event data analytics, and data integration.
The focus areas flow from the expertise currently present in the institute, the priorities of its researchers, developments in the data science field, and the institute's view of data science as a whole.
The institute employs these components of the Data Science Cycle:
- understand user needs: As the first step in any data science project, it is paramount to thoroughly understand the needs and goals of the user, as well as the context they operate in.
- wrangle and manage data: Data wrangling refers to the preparation of data for analysis, that is, the process of gathering, extracting, cleaning, structuring, and enriching raw data into a desired format serving as input for further analysis.
- analyse domain knowledge: Much relevant knowledge exists in a more or less implicit form, such as tacit knowledge in the heads of experts, knowledge communicated in scientific papers, and knowledge encoded in legislation or the laws of physics.
- model: Data modelling concerns the machine learning and statistics phase in the data science loop and refers both to the construction of models from data and domain knowledge, and to the construction and improvement of the algorithms needed to learn such models from the data.
- explore: In close connection with modelling, data exploration concerns the human-in-the-loop investigation of data. In such an exploratory process, there is a short feedback loop between the analyst's input and the algorithm's output.
- evaluate: Throughout data modelling and exploration, the institute ensures the quality of the developed methods by extensive testing and evaluation.
- understand/disseminate/deploy: The ultimate aim of research is to gain insight into and understanding of the world (e.g. of biological systems), to disseminate this to colleagues and/or the wider public, to deploy it in new tools and services, or to implement it in guidelines and tutorials.
Contextualisation
Numerous choices are made in any data analysis project: what data is collected and how much, which database schema is used to store it, how the data is transformed, which cutoffs are used during cleaning to remove low-quality datapoints, which algorithms (with which parameters) are run on the data, etc. These choices can have an immense effect on the final results and conclusions, but - even when provenance is registered - they are often not made explicit and thereby introduce (hidden) bias.
DSI's first research line focuses on exploring how data, analyses, and results can be put into context so that such bias can be exposed.
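To make this concrete, the following sketch runs the same toy analysis under two different cleaning cutoffs and records each choice as provenance, so the source of any discrepancy is explicit rather than hidden. The data, cutoff values, and provenance fields are all invented for illustration:

```python
# Hypothetical example: the same analysis under two quality cutoffs, with
# each choice recorded as provenance so the resulting bias can be exposed.
from statistics import mean

measurements = [0.2, 0.4, 0.5, 0.6, 3.1, 3.3, 3.4, 9.9]

def analyse(data, quality_cutoff):
    """Drop 'low quality' points above the cutoff, then take the mean."""
    kept = [x for x in data if x <= quality_cutoff]
    provenance = {
        "n_input": len(data),
        "n_kept": len(kept),
        "quality_cutoff": quality_cutoff,  # the otherwise-hidden choice
    }
    return mean(kept), provenance

strict_mean, strict_prov = analyse(measurements, quality_cutoff=1.0)
loose_mean, loose_prov = analyse(measurements, quality_cutoff=5.0)

# The two cutoffs lead to very different conclusions; the provenance
# records make the origin of the difference visible.
print(strict_mean, strict_prov)
print(loose_mean, loose_prov)
```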
Event data analytics
Event data belongs to the family of complex data: data that by nature does not come in a neat rectangular format in which each row represents an independent observation described by the same set of variables. Instead, this type of data typically contains observations that are correlated in some way, that may be described by varying sets of variables, and that may have a partial order between the variables, defined along temporal, spatial, or other dimensions (or any combination thereof).
As event data pose various challenges from a data science perspective, the second research line within DSI focuses on techniques and models to analyse such data.
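For illustration, a minimal event log might look as follows. The field names (`case`, `activity`, `ts`) and the idea of recovering one time-ordered trace per case are assumptions made for this sketch, not a prescribed format:

```python
# Hypothetical event-log sketch: events are correlated within a case,
# carry varying sets of attributes, and are partially ordered by time.
from collections import defaultdict

events = [
    {"case": "order-1", "activity": "create", "ts": 1},
    {"case": "order-2", "activity": "create", "ts": 2},
    {"case": "order-1", "activity": "pay",    "ts": 3, "amount": 42.0},
    {"case": "order-1", "activity": "ship",   "ts": 5, "carrier": "X"},
    {"case": "order-2", "activity": "cancel", "ts": 4},
]

# Group correlated events per case and order them in time, recovering
# one trace (sequence of activities) per case.
traces = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    traces[e["case"]].append(e["activity"])

print(dict(traces))
# -> {'order-1': ['create', 'pay', 'ship'], 'order-2': ['create', 'cancel']}
```

Note how the events violate the rectangular assumption: they share a varying attribute set (`amount`, `carrier`) and only make sense relative to other events in the same case.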
Data integration
Researchers, industry, and governmental organisations have access to ever more data sources. Not only do they generate data themselves (e.g. using modern high-throughput technologies), but large data repositories are also publicly available. One could argue that the availability of large data sources is sufficient to solve issues of sample size and significance in empirical research. However, without insight into the domain and without proper semantic integration, results obtained from such a brute-force approach often fail to meet expectations during the confirmation phase. More data increases the chance that small, irrelevant effects appear highly significant, deflecting attention and resources from more pressing issues. When fundamental methodological principles are ignored or misused, we poison our knowledge with an unprecedented rate of false discoveries, or needlessly slow down the acquisition of new knowledge.
In short, we need not just more data but more types of data, which by definition raises the issues of data integration and data fusion - the third DSI research line.
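As a minimal sketch of what semantic integration involves, consider two hypothetical sources that describe overlapping samples under different identifiers and units. The source names, schemas, conversion, and the averaging rule for conflicting measurements are all invented for this example:

```python
# Hypothetical data-fusion sketch: normalise identifiers, harmonise units,
# and merge two sources into a single schema, tracking the origin of each
# fused record.

lab_a = {"S1": {"temp_celsius": 21.0}, "S2": {"temp_celsius": 19.5}}
lab_b = {"s1": {"temp_f": 70.2}, "s3": {"temp_f": 68.0}}

def integrate(a, b):
    """Fuse two sources into one schema (IDs upper-cased, temperatures in C)."""
    fused = {}
    for sid, rec in a.items():
        fused[sid.upper()] = {"temp_c": rec["temp_celsius"], "sources": ["a"]}
    for sid, rec in b.items():
        key = sid.upper()
        temp_c = (rec["temp_f"] - 32) * 5 / 9  # unit harmonisation
        if key in fused:
            fused[key]["sources"].append("b")
            # simple fusion rule: average conflicting measurements
            fused[key]["temp_c"] = (fused[key]["temp_c"] + temp_c) / 2
        else:
            fused[key] = {"temp_c": temp_c, "sources": ["b"]}
    return fused

merged = integrate(lab_a, lab_b)
print(sorted(merged))           # the union of samples from both sources
print(merged["S1"]["sources"])  # S1 was observed by both labs
```

Real integration problems are of course harder: identifiers rarely match after a simple normalisation, and conflict resolution requires domain knowledge rather than a blanket averaging rule.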