To focus its research and decide where and how to invest, the Data Science Institute (DSI) has identified three areas of specific importance for the next few years: contextualisation, event data analytics and data integration.
Numerous choices are made in any data analysis project: what and how much data is collected, what database schema is used to store the data, how the data is transformed, what cutoffs are used in the data cleaning process to remove low-quality data points, what algorithms (with which parameters) are run on the data, etc. These choices can have an immense effect on the final results and conclusions but, even when provenance is registered, they are often not made explicit and thereby introduce (hidden) bias.
The first DSI research line focuses on exploring how data, analyses and results can be put into context so that such bias can be exposed.
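As a minimal sketch of how a single cleaning choice can shift a result, the example below reruns the same analysis under three cutoffs and keeps each choice next to the result it produced. The measurements, the `analyse` function and the `max_value` parameter are hypothetical and serve only as an illustration; they are not taken from any DSI project.

```python
from statistics import mean

# Hypothetical measurements; the last two values may or may not be outliers.
measurements = [4.8, 5.1, 5.0, 4.9, 5.2, 9.5, 10.1]

def analyse(data, max_value):
    """Drop values above max_value and return the result with its context."""
    kept = [x for x in data if x <= max_value]
    return {
        "choices": {"max_value": max_value},  # the analysis choice, made explicit
        "n_kept": len(kept),
        "mean": round(mean(kept), 2),
    }

# Same data, same method, three different cleaning cutoffs.
results = [analyse(measurements, cutoff) for cutoff in (6.0, 10.0, 11.0)]

for r in results:
    print(r)
```

Reporting the `choices` entry together with each result is one simple way to put a number into context: a reader can see that the reported mean depends on a cutoff that would otherwise stay hidden.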
Event data belongs to the family of complex data, which is data that by nature doesn’t come in a nice rectangular format where each row represents an independent observation described by the same set of variables. Instead, this type of data typically contains observations which are correlated in some way, which can be described by a varying set of variables and which can have a partial order between the variables defined in temporal, spatial or other dimensions (or any combination thereof).
As event data pose various challenges from a data science perspective, the second research line within DSI will focus on techniques and models to analyse such data.
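The properties of event data listed above can be illustrated with a small sketch. It assumes a hypothetical hospital log in which events belong to a case, carry a varying set of attributes, and are only partially ordered in time. The `Event` class, its field names and the sample records are illustrative assumptions, and the partial order is interpreted here as an ordering between events.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Event:
    case_id: str       # events within one case are correlated, not independent
    activity: str
    timestamp: int     # hypothetical integer time units
    attributes: dict = field(default_factory=dict)  # varies per event

# Hypothetical sample log; not real data.
events = [
    Event("p1", "admission", 1, {"ward": "A"}),
    Event("p1", "blood test", 3, {"lab": "L2", "urgent": True}),
    Event("p1", "x-ray", 3),
    Event("p2", "admission", 2, {"ward": "B"}),
    Event("p1", "discharge", 7),
]

def group_by_case(log):
    """Group events by the case they belong to."""
    cases = defaultdict(list)
    for event in log:
        cases[event.case_id].append(event)
    return dict(cases)

def precedes(a, b):
    """Strict partial order: comparable only within a case and if strictly earlier."""
    return a.case_id == b.case_id and a.timestamp < b.timestamp

cases = group_by_case(events)

# The distinct sets of attribute names show there is no single rectangular schema.
attribute_sets = {frozenset(e.attributes) for e in events}
```

In this sketch the blood test and the x-ray share a timestamp, so neither precedes the other, and events from different cases are never comparable. Flattening such a log into one row per observation would lose exactly this structure.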
Many researchers, industries and governmental organisations have access to an ever-growing number of data sources. Not only do they generate data themselves (e.g. using modern high-throughput technologies), large data repositories are also publicly available. One could argue that the availability of large data sources is sufficient to solve issues of sample size and significance in empirical research. However, without insight into the domain and without proper semantic integration, results obtained from such a brute-force approach often fail to meet expectations during the confirmation phase of analysis results. More data increases the chance that small, irrelevant effects turn out to be highly significant, which may deflect attention and resources from more pressing issues. When fundamental methodological principles are ignored or misused, we poison our knowledge with an unprecedented rate of false discoveries or needlessly slow down our acquisition of new knowledge.
In short, we need not just more data but more types of data, which by definition raises the issue of data integration and data fusion: the third DSI research line.
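The point that more data makes small, irrelevant effects highly significant can be made concrete with a short calculation. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and known unit variance, and it assumes that the observed difference equals a fixed standardised effect of 0.01 standard deviations. The function name and the chosen numbers are illustrative only.

```python
import math

def two_sample_z_p_value(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test.

    Assumes known unit variance in both groups and an observed mean
    difference equal to effect_size (in standard deviations).
    """
    # Standard error of the difference of two means is sqrt(2 / n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.01  # a difference most domains would consider irrelevant

for n in (100, 10_000, 1_000_000):
    p = two_sample_z_p_value(tiny_effect, n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

With 100 observations per group the effect is nowhere near significant, while with a million per group the same negligible effect yields a p-value far below any conventional threshold. The effect itself has not changed, only the sample size, which is why significance alone says little about relevance.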