Can all implemented Hadoop methods run on multiple nodes in a distributed manner?

Article ID: KB0076896

Updated On:

Products	Versions
TIBCO Spotfire Data Science	6.5.0

Description

Hadoop operators in Team Studio implement several algorithms whose execution depends on the condition and configuration of the cluster. Can all of these implemented methods run on different Hadoop cluster configurations?

Specifically:
1) Can these operators run on a single node as well as on multiple nodes?
2) Can an operator be forced to run in a specific way, for example by copying all the data to one node and computing there, without distributing the work across the other nodes in the cluster?

Resolution

Hadoop operators in Team Studio do not take the configuration of the Hadoop cluster into account. For a workflow, Team Studio builds the code and submits it to the cluster, and the cluster then determines the best way to process the job at run time. That decision is made entirely by the Resource Manager unless the user who submits the job specifies otherwise.

A workflow or job can run on a single node if any of the following is true:

The cluster has a single node.
Resource usage is restricted, either through configuration files or in the source code.
The code is written for sequential execution and does not use a parallel or distributed framework (MapReduce or Spark).

Users can also write operators that run on the server where Team Studio is installed (in-memory PCA is an example), but this is not recommended because the Team Studio application already consumes a lot of that server's resources.

In Spark, you can force all the computation onto a single node by using a single partition for your DataFrame (df.repartition(1)), because each partition is processed by one task.
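
As a rough illustration of the single-partition approach, the PySpark sketch below collapses a DataFrame to one partition and prints the partition count before and after. The helper name force_single_partition, the local[4] master, the application name, and the sample data from spark.range are assumptions made for this example and are not part of Team Studio.

```python
def force_single_partition(df):
    """Collapse a DataFrame to one partition so later stages run as a single task."""
    return df.repartition(1)


def demo():
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    # Hypothetical local session; on a real cluster the master comes from spark-submit.
    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("single-partition-sketch")
        .getOrCreate()
    )
    df = spark.range(0, 1000)  # illustrative sample data
    print("partitions before:", df.rdd.getNumPartitions())
    single = force_single_partition(df)
    print("partitions after:", single.rdd.getNumPartitions())
    spark.stop()
```

Calling demo() requires a working PySpark installation. Note that repartition(1) triggers a shuffle, and any stages that run before it can still be distributed.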

In MapReduce (MR), you can force all the computation onto a single node by using a single mapper and a single reducer.
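
One way to picture the single-reducer case is a Hadoop Streaming job, sketched below under stated assumptions: the word-count logic, the script name wc.py, the jar name, and the input and output paths are all hypothetical. The property mapreduce.job.reduces=1 requests one reducer; the number of mappers normally follows the number of input splits, so a single mapper usually also requires input that fits in one split.

```python
import sys
from itertools import groupby

# Hypothetical submission command (jar name and paths are placeholders):
#   hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=1 \
#     -files wc.py -mapper "python3 wc.py map" -reducer "python3 wc.py reduce" \
#     -input <input_dir> -output <output_dir>


def map_lines(lines):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_pairs(pairs):
    """Reducer: sum counts per key. Pairs must arrive sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def main(role):
    if role == "map":
        for word, count in map_lines(sys.stdin):
            print(f"{word}\t{count}")
    else:
        # Streaming delivers tab-separated key/value lines, sorted by key.
        parsed = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, total in reduce_pairs((w, int(c)) for w, c in parsed):
            print(f"{word}\t{total}")


# Only act as a streaming script when a role argument is supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

With one reducer, every key is aggregated in the same task, which is what keeps the reduce phase on a single node.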

As for the operators that ship with the product: no, we do not artificially restrict anything to run on a single node, but the code is structured so that it will run on a single node if it has to.

Issue/Introduction

Are all the Hadoop operators and their methods able to run on multiple nodes in a distributed manner?
