How to remove duplicate rows in Spotfire

Products	Versions
Spotfire Analyst	All Versions

Description

There may be situations when your data table gets duplicate records either due to the source data, insertion of columns in data tables, joins in queries, or merging rows from various sources, etc. When working the resulting data you may want to remove those duplicate rows completely from data table or just from specific visualizations.

Resolution

You can choose from the following approaches based on your requirements:

Approach 1) Pivot transformation
This is the easiest approach since you just need to insert a "Pivot" transformation in each data table having duplicates (see Pivoting Data reference for details). You can add a pivot transformation at the time of importing data or after the data is imported. Add all the columns of your data table as "Row Identifiers" and leave the rest empty so all exact duplicate rows will be removed.

See the attached Pivot.dxp which demonstrates this. In the attached analysis, a new data table is created based on pivot transformation in order to show duplicates present in original table, but this transformation can be performed in the original data table as well.

Approach 2) Calculated Column and Filtering

Step 1: Create Calculated column:

Create a calculated column as described in the KB article Identifying duplicate rows using custom expressions. For example:
```
 If(Rank(Baserowid(),"asc",[myGroup1])=1,"First row of group","Duplicate")
```
or

Create a calculated column with a TERR expression. For example:

 TERR_Boolean("output <- duplicated(data.frame(input1,input2))", [col_input1], [col_input2]))

For example, this column could be named "RemoveRows".

Step 2) Set up Filtering:

Removing duplicates from a Data Table:
Option 1) In this approach, you can set filtering on the RemoveRows column to "1st Record" and then hide the filter from the filter panel so it does not get altered accidentally. This needs to be done for each data table where duplicates are present.

Option 2) Create a new table and hide the duplicates with data relations and Filtering. See the attached Relations.dxp for details.

Ensure the calculated column from "Step 1: Create Calculated column" called "RemoveRows" is created in the original data table, called for example "Table1".
In Notepad or Excel, create a column with a single row and value like below (matching the value from the "RemoveRows" column) and copy it:

Record
1st Record

In the Spotfire installed client, go to Add Data table > Paste from clipboard, and name this for example "Table2"
Create a data relation between Table1 and Table2 on columns Table1.RemoveRows and Table2.Record. See Details on Data Table Properties – Relations for more details.
In the Filter Panel, navigate to Table1 > Filtering in Related Tables icon > Include Filtered rows only:

This needs to be applied to each data table from where the duplicates need to be removed. As compared to the first approach, this way the duplicates will not be displayed again even if you reset filters. See Filtering in Related Data Tables for more details.

Note: Data tables that you are relating to Table2 should not form a cyclic dependency if it does then create separate copies of Table2 and then relate those copies to the required target tables. See KB 000037934 How to avoid cyclic dependencies between data tables for more details.

Removing duplicates from specific Visualizations:
Use the "RemoveRows" calculated column from above in the data limit expression of the visualization. Go to the visualization Properties > Data > Limit data using expression > Edit and then insert an expression like

 [RemoveRows]="1st Record"

which means only "1st Record" rows will be present in the visualization.

If you want to remove duplicate values on the fly based on a different set of columns, then in the limit data using expression of the visualization, you can create a custom expression like the following example, as described in KB Identifying duplicate rows using custom expressions:

 Rank(baserowid(),"asc",[Col1])=1

Approach 3) Filter Rows Transformation (version 10.1 and higher)

Starting with Spotfire version 10.1, you can insert calculated column and filter rows transformation at data table level to remove duplicates. More details about Calculated column expression are provided here. For example:

 Select Data > Transform data...     Added transformations         Transformation name: Calculate new column         Column name: RemoveRows         Expression: If(Rank(Rowid(),"asc",[Store ID])=1,"1st Record","Duplicate Record")         Transformation name: Filter rows         Expression: [RemoveRows]="1st Record"