Whenever there is a scalar sub-query implemented in our query, the total query execution takes significantly longer than if the query is directly executed on the Cloudera Impala/Hive database via any 3rd party client tool.
An example of such a subquery is as below:
SELECT * FROM /shared/scalar/test/"default"/tab1
WHERE dt = (SELECT dt2 FROM /shared/scalar/test/"default"/tab2
WHERE dt_key2= '2014-08-20')
In the above example, the 2nd SELECT statement forms a scalar sub-query.
If we check the query execution plan, it mentions a No Push Reason as "
No Push Reason: Cannot push sub query because data source Impala does not support scalar sub queries".
And since the query is not pushed onto the database and all the execution takes place inside TIBCO Data Virtualization (TDV) itself, TDV fetches the data twice, once per SELECT statement. This results in high execution time.