The University of Arizona

Research

Data-Centric Computing

The Data-Centric Computing cluster integrates data semantics and application behavior to address critical challenges in managing large-scale data for data-intensive applications and to achieve increased performance, accuracy, availability, and reliability. The scale of the data means that the entire data set will often not fit in main memory or even on a single disk. The overarching goal is to manage data at all points along the memory hierarchy, from main memory to disks to clusters of disks to archival tape to data spread across the web. Examples of large-scale data relevant to this cluster include image data, multi-dimensional data, spatial data, temporal data, and spatio-temporal data from astronomical, sensor-network, wireless, and pervasive applications.

Faculty

Kobus Barnard
Alon Efrat
John Hartman
Stephen Kobourov
Bongki Moon
Richard Snodgrass

Projects

Graphael: Generalized Force-Directed Layouts
PRIX: XML Indexing and Twig Query Processing
Simultaneous Graph Drawing
TAU: Management of Time-Oriented Data
TGRIP: Temporal Graph Drawing with Intelligent Placement
TimeCenter: Support of Temporal Database Applications
XISS: XML Indexing and Storage System

Extended Description

The data-centric computing research cluster integrates data semantics and application behavior to address critical challenges in managing large-scale data for data-intensive applications. Our goals are increased performance, accuracy, availability, and reliability.

The scale of the data means that the entire data set will often not fit in main memory or even on a single disk. Hence, a key issue in our research is the incorporation of the memory hierarchy into the computing models we design. A second key issue is exploiting application-dependent access patterns in this context.
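The first issue can be made concrete with a minimal sketch (purely illustrative; the file format, function name, and block size are assumptions, not part of any project above): an aggregate is computed over a binary file too large to load at once by reading one bounded block at a time, so memory use is independent of data size and the I/O remains strictly sequential.

```python
import struct

def blocked_sum(path, block_size=1 << 20):
    """Sum little-endian 4-byte integers from a file, one block at a time.

    Memory use is bounded by block_size no matter how large the file is,
    and the strictly sequential reads are friendly to disk and tape alike.
    block_size is assumed to be a multiple of 4.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            n = len(block) // 4
            total += sum(struct.unpack(f"<{n}i", block[: n * 4]))
    return total
```

The same blocked-access pattern underlies external sorting and out-of-core joins; what changes from application to application is which access order the algorithm can tolerate, which is exactly where application-dependent patterns enter.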

The value of large data sets lies in the capability to extract information and knowledge, which typically requires additional infrastructure for organization, storage, indexing, and querying. Increasingly, this infrastructure must be tailored to the kinds of analyses that will be performed. In particular, the underlying data-delivery infrastructure should be driven by an understanding of the characteristics of the application-specific workload. Thus an important defining characteristic of our approach is making these systems explicitly aware of, and adaptable to, a static or changing workload.
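A hypothetical sketch of what workload awareness can mean in the small (class and method names are illustrative, not drawn from any system described here): the structure below serves both point lookups and range scans, records the observed mix, and reports which physical layout that mix favors. A real system would go further and reorganize storage accordingly.

```python
import bisect

class AdaptiveIndex:
    """Toy index that tracks its workload and reports a preferred layout."""

    def __init__(self, items):
        self.items = dict(items)               # hash index: O(1) point lookups
        self.sorted_keys = sorted(self.items)  # sorted view: efficient ranges
        self.point_queries = 0
        self.range_queries = 0

    def lookup(self, key):
        self.point_queries += 1
        return self.items.get(key)

    def range(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi, in key order."""
        self.range_queries += 1
        i = bisect.bisect_left(self.sorted_keys, lo)
        j = bisect.bisect_right(self.sorted_keys, hi)
        return [(k, self.items[k]) for k in self.sorted_keys[i:j]]

    def preferred_layout(self):
        # A real system would reorganize storage; here we only report the choice.
        return "hash" if self.point_queries >= self.range_queries else "sorted"
```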

The overarching goal is thus to manage data at all points along the memory hierarchy, from main memory to disks to clusters of disks to archival tape to data spread across the web. Critically, this management must exploit the specifics of the applications being supported. Operationally, we must provide methodologies and protocols for the layers to communicate the relevant information regarding application needs. These are important and challenging problems.

Examples of large-scale data we are interested in include image data, multi-dimensional data, spatial data, temporal data, and spatio-temporal data from astronomical, sensor-network, wireless, and pervasive applications. Examples of application-level semantics we hope to exploit are random sampling, spatial and temporal locality, access to different kinds of slices, and more complex access patterns. Ultimately, we envision a data system that makes certain data-mining endeavors more tractable by using knowledge of workloads to drive mechanisms that improve availability and reliability through prediction of future data access patterns.
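One simple form such prediction could take is a first-order model of block accesses: after observing which block tends to follow block b, a system can prefetch that successor on the next access to b. The sketch below is illustrative only, under the assumption of a frequency-based first-order predictor; no project above specifies this design.

```python
from collections import Counter, defaultdict

class AccessPredictor:
    """First-order access-pattern model: predicts the most likely next block."""

    def __init__(self):
        self.successors = defaultdict(Counter)  # block -> Counter of next blocks
        self.last = None

    def observe(self, block):
        """Record one access, updating successor counts for the previous block."""
        if self.last is not None:
            self.successors[self.last][block] += 1
        self.last = block

    def predict_next(self, block):
        """Most frequently observed successor of `block`, or None if unseen."""
        if not self.successors[block]:
            return None
        return self.successors[block].most_common(1)[0][0]
```

A prefetcher built on such a model would issue reads for the predicted block ahead of demand; replication of frequently co-accessed data is one way the same statistics could serve availability and reliability.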