Research

Anomaly Discovery Paradigm


Staggering volumes of data collected by modern applications, from financial transactions to IoT sensor readings, contain critical insights, from rare phenomena to anomalies indicative of fraud or failure. To separate the valuable from the spurious, analysts need to interactively sift through and explore this data deluge. By detecting anomalies early, analysts may prevent fraud or avert catastrophic sensor failures. While prior research offers a treasure trove of stand-alone algorithms for detecting particular types of outliers, these algorithms tend to be variations on a theme. There is no end-to-end paradigm that brings this wealth of alternative algorithms to bear in an integrated infrastructure supporting anomaly discovery over potentially huge data sets while keeping the human in the loop.
This project is the first to design an integrated paradigm for end-to-end anomaly discovery. It aims to support all stages of anomaly discovery by seamlessly integrating outlier-related services within one platform. The result is a database-system-inspired solution that models these services as first-class citizens of outlier discovery. It integrates outlier detection with data sub-spacing, explanations of outliers with respect to their context in the original data set, feedback on the relevance of outlier candidates, and metric learning to refine the effectiveness of the outlier detection process. The resulting system enables the analyst to steer the discovery process with human ingenuity, empowered by near real-time interactive responsiveness during exploration. Our solution promises to be the first to integrate outlier explanation services and human feedback directly into the discovery process, giving analysts the power of sense making.
This project is supported by NSF IIS and NSF CSSI.
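For illustration only, here is a minimal sketch of how the services described above (detection under a learned metric, contextual explanation, analyst feedback, and metric refinement) could be composed into one discovery loop. The function names, the kNN-based scoring, and the z-score explanation are placeholders of our own, not the project's actual algorithms or APIs.

```python
import numpy as np

def detect_outliers(X, weights, k=5, threshold=2.0):
    """Score each point by its mean distance to its k nearest neighbors
    under a weighted (diagonal-metric) Euclidean distance."""
    Xw = X * weights                       # apply learned per-feature weights
    dists = np.linalg.norm(Xw[:, None, :] - Xw[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    scores = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    flagged = scores > scores.mean() + threshold * scores.std()
    return scores, flagged

def explain(X, idx, weights):
    """Rank features by how far the candidate deviates from the data set mean,
    a stand-in for a richer contextual explanation service."""
    z = np.abs((X[idx] - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)) * weights
    return np.argsort(-z)

def update_metric(weights, X, relevant, irrelevant, lr=0.1):
    """Crude metric learning: upweight features that separate analyst-confirmed
    outliers from the bulk, downweight features behind rejected candidates."""
    center = X.mean(axis=0)
    for i in relevant:
        weights = weights + lr * np.abs(X[i] - center)
    for i in irrelevant:
        weights = weights - lr * np.abs(X[i] - center)
    return np.clip(weights, 0.1, None)

# One round of the human-in-the-loop discovery cycle on synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(6, 1, (3, 3))])
weights = np.ones(X.shape[1])

scores, flagged = detect_outliers(X, weights)
candidates = np.where(flagged)[0]
print("candidates:", candidates, "top features:", explain(X, candidates[0], weights))

# Analyst feedback (here simulated) refines the metric for the next iteration.
weights = update_metric(weights, X, relevant=candidates[:2], irrelevant=[])
```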

AI for Data Curation and Data Curation for AI


Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g., writing domain-specific code or training machine learning models on a sufficient number of annotated examples. This process is notoriously difficult and time-consuming.
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs). Once the user describes a task, input data, and expected output, the SEED compiler produces a hybrid pipeline that combines LLM querying with more cost-effective alternatives, such as vector-based caching, LLM-generated code, and small models trained on LLM-annotated data. SEED features an optimizer that automatically selects from these four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand. To validate this approach, we conducted experiments on 9 datasets spanning 5 data curation tasks. Compared to solutions that invoke the LLM on every data record, SEED achieves state-of-the-art or comparable few-shot performance while significantly reducing the number of LLM calls.
This project is supported by Amazon Research Award and NSF.
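As a rough, purely illustrative sketch of the compile-then-route idea: each record is answered by the cheapest module that can handle it, and only the leftovers reach the LLM. The module implementations below (a string-similarity cache standing in for vector-based caching, a hand-written rule standing in for LLM-generated code, and stubbed small-model and LLM calls) are hypothetical placeholders, not SEED's actual modules.

```python
from difflib import SequenceMatcher

def cache_lookup(record, cache, threshold=0.9):
    """Reuse an earlier LLM answer if a sufficiently similar record was seen.
    (Stand-in for the vector-based cache; real similarity would use embeddings.)"""
    for seen, answer in cache.items():
        if SequenceMatcher(None, record, seen).ratio() >= threshold:
            return answer
    return None

def generated_code(record):
    """A rule the compiler might synthesize for an entity-standardization task.
    Returns None when the rule does not apply, deferring to costlier modules."""
    abbreviations = {"intl": "international", "corp": "corporation"}
    return abbreviations.get(record.strip().lower())

def small_model(record):
    """Placeholder for a small model distilled from LLM-annotated examples."""
    return None  # e.g., a fine-tuned classifier would run here

def llm(record):
    """Most expensive fallback: query the LLM directly (stubbed out here)."""
    return f"<LLM answer for {record!r}>"

def seed_pipeline(record, cache):
    """Route each record through the cheapest module that can answer it."""
    for module in (lambda r: cache_lookup(r, cache), generated_code, small_model):
        answer = module(record)
        if answer is not None:
            return answer
    answer = llm(record)
    cache[record] = answer          # future similar records become cache hits
    return answer

cache = {}
for rec in ["Intl", "intl", "Acme Ltd"]:
    print(rec, "->", seed_pipeline(rec, cache))
```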

Unstructured Data Analysis


Modern organizations produce large volumes of unstructured documents. With the rise of Large Language Models (LLMs), new systems aim to make analyzing this unstructured data feel like working with a relational database. Typically, these systems employ LLMs to generate structured tables from documents and analyze the generated tables with relational operations, such as selection, join, and aggregation. However, LLMs, which play a central role in such systems, are expensive both computationally and economically. Therefore, the LLM cost incurred during data generation constitutes the performance bottleneck of these systems.
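For concreteness, here is a toy sketch of that baseline architecture under our own assumptions (the schema, the llm_extract stub, and the query are illustrative): every document is sent to the LLM to populate a table row, and the relational work happens only afterward, which is why data generation dominates the cost.

```python
def llm_extract(document: str) -> dict:
    """Stand-in for an LLM call that turns a document into one table row
    with a fixed schema; in the baseline this runs on *every* document."""
    ...  # prompt the LLM with the schema and the full document text
    return {"company": "...", "year": 0, "revenue": 0.0}

def analyze(documents, min_year):
    # Data generation: the expensive part, proportional to total input tokens.
    table = [llm_extract(doc) for doc in documents]
    # Relational part: cheap selection and aggregation over the generated rows.
    selected = [row for row in table if row["year"] >= min_year]
    return sum(row["revenue"] for row in selected) / max(len(selected), 1)
```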
To minimize this cost, we propose to build QUEST, a query optimizer that produces optimized execution plans for a given user query. Due to the properties of LLMs and unstructured data analysis, this query optimization problem raises challenges not found in relational databases. First, unlike database optimizers that merely minimize query execution time, the optimization objectives of QUEST are twofold: (1) minimize the LLM cost and (2) guarantee high accuracy of data generation. Second, database optimizers collect table statistics beforehand to estimate the selectivities and costs of operators, but in unstructured data analysis the tables do not exist until the data generation operations have run; generating these tables first and optimizing queries afterward would defeat the primary objective of QUEST. Moreover, the LLM cost of each data generation operation depends on the number of input tokens, so reducing the LLM cost means minimizing the amount of text fed into the LLM. Relational databases use indexes to filter tuples that do not satisfy a predicate, but in unstructured data analysis it is unclear, for a given piece of text, whether an LLM can generate the desired data from it.
These challenges motivate our proposal to revisit the architecture and optimization principles of traditional database optimizers to achieve the objectives of QUEST. Rather than producing optimized plans before query execution, QUEST adopts an optimize-during-execution architecture that interleaves query optimization with execution, progressively estimating the cost of operations and dynamically optimizing plans without assuming that the tables are available beforehand. It minimizes the LLM cost of data generation by offering an index-style solution that reduces the amount of text fed to the LLM, while guaranteeing generation accuracy by interleaving generation with filtering. It also avoids generating tuples that cannot appear in the final query results, effectively minimizing the number of data generation calls. The key ideas include reordering filters and joins, transforming joins into filters, and answering aggregation queries approximately over sampled pieces of text.
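The snippet below loosely illustrates two of these ideas under assumptions of our own: an index-style pre-filter that shrinks the text handed to the LLM, and predicate reordering driven by selectivities observed during execution rather than collected upfront. The names and heuristics are ours, not QUEST's implementation.

```python
def snippet_index(document: str, keywords, window=200):
    """Keep only text near keywords relevant to the target attributes,
    so the LLM sees a fraction of the document's tokens."""
    spans, lower = [], document.lower()
    for kw in keywords:
        pos = lower.find(kw)
        if pos != -1:
            spans.append(document[max(0, pos - window): pos + window])
    return "\n".join(spans)

class AdaptivePredicateOrder:
    """Evaluate predicates in order of estimated selectivity (most selective
    first); selectivities are estimated from the tuples processed so far
    rather than from statistics collected before execution."""
    def __init__(self, predicates):
        self.stats = {p: [1, 2] for p in predicates}   # [passed, seen], smoothed

    def ordered(self):
        return sorted(self.stats, key=lambda p: self.stats[p][0] / self.stats[p][1])

    def evaluate(self, row):
        for pred in self.ordered():
            self.stats[pred][1] += 1
            if not pred(row):
                return False          # later predicates (and LLM calls) are skipped
            self.stats[pred][0] += 1
        return True

# Example: cheap predicates run and reorder themselves before any LLM-backed work.
is_recent   = lambda row: row["year"] >= 2020
is_relevant = lambda row: "acquisition" in row["snippet"].lower()
order = AdaptivePredicateOrder([is_recent, is_relevant])
rows = [{"year": 2021, "snippet": "Announced an acquisition of ..."},
        {"year": 2015, "snippet": "Quarterly earnings were ..."}]
print([order.evaluate(r) for r in rows])
```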

Optimizing Agentic Workflows for Data Management


Many applications compose agentic workflows to solve data management tasks, such as data preparation or text-based data analysis. These workflows are hard to compose and tune, and they are often slow and expensive to run. We therefore propose to develop an end-to-end system that optimizes agentic workflows.
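As a hypothetical illustration of what treating an agentic workflow as an optimizable object could mean, the sketch below lets each step declare alternative configurations (e.g., model choice) with estimated cost and quality, and a simple optimizer picks a configuration under a cost budget. The step names, cost and quality numbers, and the exhaustive search are assumptions for illustration, not our system's design.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Step:
    name: str
    options: dict          # candidate configuration -> (est. cost, est. quality)

workflow = [
    Step("profile_data", {"small_model": (1, 0.7), "large_model": (5, 0.9)}),
    Step("clean_values", {"rule_based": (0, 0.6), "llm_repair": (4, 0.95)}),
    Step("summarize",    {"small_model": (1, 0.8), "large_model": (5, 0.97)}),
]

def optimize(workflow, budget):
    """Pick the per-step choices that maximize total estimated quality
    within the cost budget (exhaustive search is fine for a few steps)."""
    best, best_quality = None, -1.0
    for choice in product(*[s.options.items() for s in workflow]):
        cost = sum(c for _, (c, _) in choice)
        quality = sum(q for _, (_, q) in choice)
        if cost <= budget and quality > best_quality:
            best, best_quality = choice, quality
    return [name for name, _ in best], best_quality

print(optimize(workflow, budget=8))
```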

Distributed/Cloud Databases


We are building new cloud database architectures that fully resolve performance and scaling issues, such as load imbalance caused by affinity-based scheduling and poor data locality.