Joining complex data types

Interval joins

Description

The interval overlap join is among the most widely used operators in temporal databases. Essentially, temporal databases store relations of explicit attributes that conform to a schema while data objects carry a validity interval. In this context, an interval overlap join identifies object pairs (or combinations) with overlapping validity. Besides temporal databases, interval joins also find application in other domains; in multi-dimensional spaces where an object can be represented as a set of intervals in a space-filling curve or in uncertain data management where uncertain values are represented as intervals (which can be paired with confidence values).

In this context, plane sweep is investigated as a competitive approach for interval joins; in brief, we proposed an optimized version of plane sweep which has the potential of performing fewer comparisons than the actual join output size. In addition, we devised a novel orthogonal parallel paradigm to benefit from modern hardware. In our follow up work, we studied temporal aggregation based on interval joins by introducing a semi-join operation.

Publications

Source code

 

Set joins

Description

Sets are ubiquitous in computer science; they model transactions, scientific data, Web data, text etc. Contemporary data management systems allow the definition of set-valued data attributes and support operations such as containment queries. Join operations are also extended to include predicates on sets such as containment and similarity.

Regarding the former, we proposed (i) an adaptive methodology that alleviates the shortcomings of the state-of-the-art method which indexes the inputs using a prefix tree and an inverted index, and (ii) a novel join paradigm which interleaves indexing with the actual join process. We also studied the parallel computation of set containment joins. Regarding the similarity predicate, we proposed the grouping technique which builds on top of the widely-adopted prefix filtering allowing to batch process objects with identical prefix. Recently, another line of our work focuses on experimentally comparing set similarity join evaluation methods both on centralized and distributed environments.

Publications

Source code

 

Spatio-textual joins

Description

We introduced the Spatio-Textual Similarity Join operation (ST-SJOIN) to identify object pairs (or combinations in case of multiple inputs) which are closely located in space and have a similar textual description. This is the first work on joining objects carrying spatial and text information. Intuitively, ST-SJOIN comes as a hybrid query operator which applies a join predicate in two dimensions of the data at the same time, and therefore, it finds application in a wide range of domains. In social network analysis, ST-SJOIN can be used for social recommendations; e.g., to recommend potential friend relationships between people with common interests (modeled as a keyword set), that live or socialize in nearby areas. As another application, ST-SJOIN can improve the effectiveness of data de-duplication, i.e., the task of retrieving near-identical objects; e.g., identifying photos that picture the same object in the photo-sharing services of Flickr or Instagram.

Publications

 

Spatial joins

Description

The spatial join is a popular operation in spatial database systems and its evaluation is a well-studied problem. Despite the vast amount of research output on the subject, the complexity of the data and the consideration modern commodity hardware, which supports parallel processing, opens the road to a number of interesting directions for future research. We briefly survey of the background in spatial join computation and outline these directions. In an effort to cover some of modern challenges, we investigates the in-memory and parallel evaluation of spatial joins, by tuning a classic partitioning based algorithm. The study shows that, compared to a straightforward implementation of the algorithm, performance can be significantly improved by properly selecting partitioning parameters based on data statistics, in order to tune the algorithm for the given join inputs. The proposed parallel implementation scales gracefully with the number of threads reducing the cost of the join to at most one second even for join inputs with tens of millions of spatial objects.

Publications

  • Dimitrios Tsitsigkos, Panagiotis Bouros, Nikos Mamoulis and Manolis Terrovitis:
    Parallel In-Memory Evaluation of Spatial Joins
    Proceedings of the 27th ACM International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL), Chicago, Illinois, USA, November 5-8, 2019
  • Panagiotis Bouros and Nikos Mamoulis:
    Spatial Joins: What's next?
    SIGSPATIAL Special, Vol 11, No 1, March 2019

 

Ranking joins

Description

The amount of join results are often overwhelming; as only a subset would be further examined, a top-k join operator which retrieves the best k results comes of greater practical value. Top-k joins have been extensively studied in the past but only for relational equijoins, i.e., for primitive (numerical) join attributes and the equality predicate.

To fill this gap, we introduced a generalized ranking join operator on complex data types. For example, spatial locations as join attributes can be used to recommend to the visitors of a city the best k pairs of restaurants and hotels within short distance from each other that have the highest combined ratings. Top-k join for string join attributes finds application in tasks like data integration where the score ratings from different data sources are combined, or data cleaning and de-duplication. The evaluation of relational ranking joins is challenged by the cost of accessing the data objects; however, in case of complex data types the computational cost can easily become the bottleneck. In view of this, I proposed a novel evaluation paradigm which accesses objects from the inputs in blocks to reduce the CPU cost but without compromising at the same time, the access or I/O cost.

Publications