Joining complex data types

Temporal joins

Description

The interval overlap join is among the most widely used operators in temporal databases. Essentially, temporal databases store relations of explicit attributes that conform to a schema while data objects carry a validity interval. In this context, an interval overlap join identifies object pairs (or combinations) with overlapping validity. Besides temporal databases, interval joins also find application in other domains; in multi-dimensional spaces where an object can be represented as a set of intervals in a space-filling curve or in uncertain data management where uncertain values are represented as intervals (which can be paired with confidence values).

In this context, plane sweep is investigated as a competitive approach for interval joins; in brief, we proposed an optimized version of plane sweep which has the potential of performing fewer comparisons than the actual join output size. In addition, we devised a novel orthogonal parallel paradigm to benefit from modern hardware. In our follow up work, we studied temporal aggregation based on interval joins by introducing a semi-join operation.

 

Publications

 

Source code

 

Set joins

Description

Sets are ubiquitous in computer science; they model transactions, scientific data, Web data, text etc. Contemporary data management systems allow the definition of set-valued data attributes and support operations such as containment queries. Join operations are also extended to include predicates on sets such as containment and similarity.

Regarding the former, we proposed (i) an adaptive methodology that alleviates the shortcomings of the state-of-the-art method which indexes the inputs using a prefix tree and an inverted index, and (ii) a novel join paradigm which interleaves indexing with the actual join process. We also studied the parallel computation of set containment joins. Regarding the similarity predicate, we proposed the grouping technique which builds on top of the widely-adopted prefix filtering allowing to batch process objects with identical prefix. Recently, another line of our work focuses on experimentally comparing set similarity join evaluation methods both on centralized and distributed environments.

 

Publications

 

Source code

 

Spatio-textual joins

Description

We introduced the Spatio-Textual Similarity Join operation (ST-SJOIN) to identify object pairs (or combinations in case of multiple inputs) which are closely located in space and have a similar textual description. This is the first work on joining objects carrying spatial and text information. Intuitively, ST-SJOIN comes as a hybrid query operator which applies a join predicate in two dimensions of the data at the same time, and therefore, it finds application in a wide range of domains. In social network analysis, ST-SJOIN can be used for social recommendations; e.g., to recommend potential friend relationships between people with common interests (modeled as a keyword set), that live or socialize in nearby areas. As another application, ST-SJOIN can improve the effectiveness of data de-duplication, i.e., the task of retrieving near-identical objects; e.g., identifying photos that picture the same object in the photo-sharing services of Flickr or Instagram.

 

Publications

 

Ranking joins

Description

The amount of join results are often overwhelming; as only a subset would be further examined, a top-k join operator which retrieves the best k results comes of greater practical value. Top-k joins have been extensively studied in the past but only for relational equijoins, i.e., for primitive (numerical) join attributes and the equality predicate.

To fill this gap, we introduced a generalized ranking join operator on complex data types. For example, spatial locations as join attributes can be used to recommend to the visitors of a city the best k pairs of restaurants and hotels within short distance from each other that have the highest combined ratings. Top-k join for string join attributes finds application in tasks like data integration where the score ratings from different data sources are combined, or data cleaning and de-duplication. The evaluation of relational ranking joins is challenged by the cost of accessing the data objects; however, in case of complex data types the computational cost can easily become the bottleneck. In view of this, I proposed a novel evaluation paradigm which accesses objects from the inputs in blocks to reduce the CPU cost but without compromising at the same time, the access or I/O cost.

 

Publications