I am an Assistant Professor in the Computer Science Department at the University of Arizona. I also hold a research affiliation at MIT CSAIL, where I spent several years as a Postdoctoral Associate and then a Research Scientist, actively collaborating with Prof. Samuel Madden, Prof. Michael Stonebraker, Prof. Tim Kraska, and Dr. Michael Cafarella. Before that, I worked at the IBM T.J. Watson Research Center as a Research Staff Member. I have conducted research in the broad areas of data systems and data science, ranging from low-level core database performance optimization to the design of high-level, application-specific machine learning techniques. My recent research falls in the emerging area of "Systems for AI and AI for Systems", focused on building data management and analytics tools that satisfy the SAUL properties: Scalable, Automatic, Human-in-the-loop.

[CV] [Google Scholar]

Research Interests

  • Anomaly Detection Algorithms and Systems
  • AI (LLM) for Data Management, Data Management for AI (LLM)
  • Distributed/Cloud Databases
  • Data Lakes

Students

  • PhD students
    Junyong Zhao (2022 Fall)
    Ruoshan Lan (2023 Spring)
    Han Han (2023 Fall)
  • Master's students
    Zhenyu Qi
  • Other students I am working with
    Zui Chen, PhD MIT
    Ferdi Kossmann, PhD MIT
    Ziniu Wu, PhD MIT
    Jiaming Liang, PhD PENN
    Lei Ma, PhD WPI
    Dennis Hofmann, PhD WPI

News

  • 2024-03, I received the Amazon Research Award (Fall 2023). Together with our MIT collaborators and Amazon researchers, we will continue to work on SEED: an LLM for Data system.
  • 2024-03, Our paper "LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes" was accepted by VLDB 2024.
  • 2024-02, Our paper "Outlier Summarization via Human Interpretable Rules" was accepted by VLDB 2024.
  • 2024-02, We will organize a data-centric AI workshop at VLDB 2024 (Guangzhou, China).
  • 2024-01, Our NSF proposal "An Automated High-Content Imaging Platform for Caenorhabditis elegans" was awarded (1.3M in total).
  • 2024-01, Our paper "MetaStore: Deep Learning Meta-Data Analytics at Scale" was accepted by VLDB 2024.
  • 2023-12, I will serve as an Associate Editor for VLDB 2025.
  • 2023-12, Check out our most recent work on using LLMs to solve data curation problems: SEED: Domain-Specific Data Curation With Large Language Models
  • 2023-11, Our paper "RITA: Group Attention is All You Need for Timeseries Analytics" was accepted by SIGMOD 2024.
  • 2023-11, Our paper "MisDetect: Iterative Mislabel Detection using Early Loss" was accepted by VLDB 2024.
  • 2023-10, Our paper "VerifAI: Verified Generative AI" was accepted by CIDR 2024.
  • 2023-06, Our demo paper "Lingua Manga: A Generic Large Language Model Centric System for Data Curation" was accepted by PVLDB 2023. It is about using LLMs to build a generic data curation system.
  • 2023-04, paper "Extract-Transform-Load for Video Streams" was accepted by PVLDB 2023.
  • 2023-02, paper "Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning" was accepted by SIGMOD 2023.
  • 2023-01, Zui Chen, the undergraduate I supervise, was admitted to the PhD program at MIT CSAIL. His research is about NLP for Data Curation.
  • 2022-12, our joint team with MIT (Dr. Michael Cafarella) and PENN (Jiaming Liang, PhD) won third place in the Georeferencing challenge organized by DARPA.
  • 2022-11, paper "SYMPHONY: Towards Natural Language Query Answering over Multi-modal Data Lakes" was accepted by CIDR 2023.
  • 2022-09, joined the Computer Science Department at the University of Arizona.
  • 2022-08, our work "AutoOD: Automatic Outlier Detection" was accepted by SIGMOD 2023 and PVLDB 2022 (Demo).
  • 2022-04, paper "Scalable Motif Counting for Large-scale Temporal Graphs" was accepted by ICDE 2022.
  • 2022-02, serving as a PC member of PVLDB 2023, ICDE 2023, EDBT 2023, CIKM 2022, and WSDM 2023.
  • 2021-07, our paper "ATLANTIC: Making Database Differentially Private and Faster with Accuracy Guarantee" was accepted by PVLDB 2021 (Demo).
  • 2021-06, our paper "LANCET: Labeling Complex Data at Scale" was accepted by PVLDB 2021.
  • 2021-05, our proposal "Elements: A Self-tuning Anomaly Detection Service" was funded by NSF CSSI.
  • 2021-02, our paper "Elite: Robust Deep Anomaly Detection" was accepted by KDD 2021.
  • 2021-01, our project "LANCET: Labeling Complex Data at Scale" was funded by CSAIL Alliances.
  • 2021-01, our paper "Epoch-based commit and replication in distributed OLTP databases" was accepted by PVLDB 2021.