TDV Join Algorithm and how to optimize joins

Products	Versions
TIBCO Data Virtualization	ALL VERSIONS

Description

Resolution

Studio Join Properties editor offers the following join algorithm options:

Automatic Option
Hash Join Option
NESTEDLOOP Join Option
SORTMERGE Join Option
Semijoin Optimization Option
Star Schema Semijoin

Please refer documentation for me details.
When we think avout OPTIMIZATION, It is always a combination of the hardware, memory, query optimization and the overall project design.

This KB article will give guidelines to optimize JOINs. Below are some of broad level “Best Practices Guidelines” are given your convenience:

1. Capacity planning according to your project needs.

2. Optimize query plans by eliminating the nodes to prevent local crunching at CIS that can be pushed to the native data source instead.
See the ranking in terms of general efficiency. Preference may vary in specific scenarios.

Nested Loop (Slowest)
Hash
Sort-Merge
Semi-Join (Fastest)

Manage Join order, by observing table sizes, result-set sizes, join cardinalities, timing of execution with respect to overall load at that time of the day.

Validate that the TDV is using the proper Join Order

The node on left hand side should always be of smaller cardinality than that on the right.
The first node under the JOIN node in “Show Execution Plan” is the left side.
All outer joins should be automatically converted to right outer joins,
Join ordering is important for three reasons

The smaller table will take up less memory.
Less likely to swap to disk (10x performance hit).
It is more efficient to create and load a smaller hash table into memory.

Implement Caching for frequently fetched data.

Use Database managed cache if queries will apply selections against cache (leverage indexed columns).
Use Database managed cache if cached views to be joined so that the join may be pushed down.
Use Database managed to cache source tables from 2 different data sources to allow pushdown of joins.
Cache source table to get snapshots for consistent data on volatile sources.
Cache lookup (small, static) tables.

Semi-Join ( semijoin) configuration recommendations:
It very much depends on the queries that are going to be run, and the cardinalities of the result sets.

The big downside to semi-join is batching. Running multiple batches will slow you down, and becomes the threshold at which you might want to consider dataship.

I would typically suggest semi-join is okay for handling row cardinalities of less than 1 million on the source side. That will depend on the number of batches that get generated on the target side. But at least it is a starting point.

In terms on the min target to source ratio, it really depends. And it will require up to date datasource stats within CIS. I personally would have a low ratio, as long as there is an index on the filtered columns in the target source. My rationale here is that any filter is better than no filter.

I would say semijoin is preferred over sort-merge for target Oracle sources with large-ish result sets (100k plus rows). I believe Oracle uses a sort-buffer for the order by clause, and may re-run the original query multiple times to stuff data in the buffer. Avoiding the order by clause (and hence the sort merge join in CIS) can be beneficial in these cases.

Issue/Introduction

TDV Join Algorithm and how to optimize joins

Welcome to "KB Articles"