Cleaning

cleaning_icon Drop Column

Description

The Drop Column Node drops the selected column(s) from a data table.

Hint

For a detailed walkthrough see the step-by-step guide.

Parameters

The Drop Column Node requires three parameters: an input dataframe, the column(s) to be deleted and a name for the new variable. The Dataframe entry expects a variable rectangle holding a dataframe, the columns are selected from the combobox and the new variable expects a string.

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Column(s)

Combobox option

A name of the column or name of columns to be deleted. More than one column can be selected from the combobox.

New variable

String entry

A name for the new Dataframe variable
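As an illustration only, and assuming the node behaves like a pandas column drop (its implementation is not documented here), the three parameters map onto a call like the one below. The dataframe `df`, its columns and the name `df_dropped` are made up for this sketch.

```python
import pandas as pd

# Hypothetical input dataframe; the column names are illustrative only.
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 45], "Salary": [5200, 4100]})

# "Column(s)": one or more columns to delete.
columns_to_drop = ["Age", "Salary"]

# "New variable": the result is stored under a new name; df itself is unchanged.
df_dropped = df.drop(columns=columns_to_drop)

print(list(df_dropped.columns))
```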

Step-by-step guide

cleaning_icon Rename Column

Description

Rename Column Node sets a new name (header) to the selected column.

Hint

For a detailed walkthrough see the step-by-step guide.

Parameters

The Rename Column Node requires 2 parameters (other than input dataframe and new dataframe name), a column whose name is to be changed and the new name.

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Column

Comboentry

The column to be renamed. The name of the column can be either written or selected from the combobox.

New name

String entry

A new name which will be used as a header for the selected column, e.g. “New_column_name”.

New variable

String entry

A name for the new Dataframe variable
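A minimal sketch of the same operation in pandas, assuming the node is equivalent to a single-column rename. The dataframe and the variable name `df_renamed` are hypothetical.

```python
import pandas as pd

# Hypothetical input dataframe.
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 45]})

# "Column" is the existing header, "New name" is the header that replaces it.
column = "Age"
new_name = "New_column_name"

# The renamed dataframe is stored under a new variable; df keeps its headers.
df_renamed = df.rename(columns={column: new_name})

print(list(df_renamed.columns))
```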

Step-by-step guide

cleaning_icon Select Columns

Description

The Select Columns Node extracts the selected column(s) from a data table into a new dataframe.

Hint

For a detailed walkthrough see the step-by-step guide.

Parameters

The Select Columns Node requires at least one parameter: the column(s) to be selected from the data table.

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Columns to select

Combobox option

A name of the column or name of columns to be selected from the data. More than one column can be chosen in the combobox.

New variable

String entry

A name for the new Dataframe variable
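In pandas terms, and assuming the node simply returns the chosen columns as a new dataframe, the selection could look like this. All names in the sketch are illustrative.

```python
import pandas as pd

# Hypothetical input dataframe.
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 45], "Salary": [5200, 4100]})

# "Columns to select": one or more column names.
columns_to_select = ["Name", "Salary"]

# Indexing with a list of names returns a dataframe with only those columns.
df_selected = df[columns_to_select].copy()

print(list(df_selected.columns))
```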

Step-by-step guide

cleaning_icon Add Constant Column

Description

Add Constant Column Node creates a new constant column of length equal to that of the data frame.

Hint

For a detailed walkthrough see the step-by-step guide.

Parameters

Add Constant Column Node requires 2 parameters, a value, i.e. a number which will fill in the constant column, and a new column name.

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Value

Integer/Float entry

A constant value filling the new column.

Column name

String entry

A name (header) of the newly created constant column, e.g. “A_new_constant_column”.

New variable

String entry

A name for the new Dataframe variable
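The effect of the node can be sketched in pandas as follows, assuming the constant is broadcast to every row. The dataframe, the value `1.5` and the variable name `df_const` are made up for the example.

```python
import pandas as pd

# Hypothetical input dataframe with two rows.
df = pd.DataFrame({"Name": ["Ann", "Bob"]})

# "Value" fills the new column, "Column name" is its header.
value = 1.5
column_name = "A_new_constant_column"

# assign() returns a new dataframe; the scalar is repeated for every row.
df_const = df.assign(**{column_name: value})

print(df_const)
```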

Step-by-step guide

cleaning_icon Round To Higher Frequency

cleaning_icon Column Math Operation (Outdated)

Description

Column Math Operation Node performs a mathematical operation on the dataframe's columns, e.g. sums the values of two or more columns.

Parameters

The number of parameters depends on the chosen operation. The user chooses the operation in the first combobox, which then reveals the other combobox(es). All operations contain the Result name parameter, which defines the name of the resulting column.

Parameter

Type

Description

Choose math operation

combobox

A desired math operation.

Result name

string

A name of the resulting column which will hold the values obtained from the performed operation.

cleaning_icon Search String (Outdated)

Description

Search String Node returns all occurrences of strings that satisfy a given pattern.

Parameters

Search String Node requires 2 parameters:

Parameter

Type

Description

Column

combobox

The column selected for a string search.

Pattern

string

A string pattern used for search.

cleaning_icon Replace String

Description

Replace String Node replaces all strings (or sub-strings) in the selected column(s) that either contain a pattern or exactly match it.

Parameters

Replace String Node takes the following parameters:

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Replace in columns

Combobox option

Column(s) selected for the string replacement.

Match type

Combobox option

pattern: all strings containing the pattern will be replaced; exact: only strings exactly matching the pattern will be replaced.

Replace substring

Combobox option

Enables the replacement of substrings.

Pattern

String entry

The pattern used in the replacement process.

Replacement

String entry

The string that will be substituted for all matched values.

New variable

String entry

A name for the new Dataframe variable
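The difference between the match types can be sketched in pandas as below. This is an interpretation of the parameter descriptions, not the node's actual code: it assumes the pattern is treated as plain text (not a regular expression) and that "Replace substring" swaps only the matching part of a value. All names and data are hypothetical.

```python
import pandas as pd

# Hypothetical input dataframe.
df = pd.DataFrame({"city": ["New York", "York", "Boston"]})
pattern, replacement = "York", "Leeds"

col = df["city"]

# Match type "exact": only values equal to the pattern are replaced.
exact = col.where(col != pattern, replacement)

# Match type "pattern": every value containing the pattern is replaced as a whole.
contains = col.str.contains(pattern, regex=False)
whole = col.where(~contains, replacement)

# "Replace substring" enabled: only the matching part of each value is replaced.
substring = col.str.replace(pattern, replacement, regex=False)

print(exact.tolist(), whole.tolist(), substring.tolist())
```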

cleaning_icon Filter Data

Description

Takes a dataset and keeps or removes the rows that satisfy a given expression. For example, suppose we have a phonebook of all employees of an international company and want to select only the contacts of those who work in Germany. We would then pass an expression such as `country` == ‘Germany’.

Parameters

Filter Data Node requires at least 1 parameter:

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Column(s)

Combobox option

Column(s) selected for the string filtering.

Filter by string

String entry

The string to filter by.

Keep matched or drop

Combobox option

Set the node to either keep or drop the rows satisfying the expression.

New variable

String entry

A name for the new Dataframe variable
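The phonebook example can be sketched in pandas as follows, assuming the filter is an equality test on the selected column. The data and variable names are invented; whether the node evaluates expressions this way is an assumption.

```python
import pandas as pd

# Hypothetical phonebook.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "country": ["Germany", "France", "Germany"],
})

# Rows where the selected column matches the filter string.
mask = df["country"] == "Germany"

# "Keep matched or drop": keep the matching rows, or drop them instead.
df_kept = df[mask]
df_dropped = df[~mask]

# The same selection written as an expression, in the style used above.
df_query = df.query("`country` == 'Germany'")

print(df_kept["name"].tolist(), df_dropped["name"].tolist())
```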

cleaning_icon Split String

Description

Splits the strings in a given column on a given substring and keeps only the element at a given position of the resulting split list.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Column

Combobox option

Column selected for the string splitting.

Split on

String entry

The string on which the node splits the strings in the given column.

Select index

String entry

Index position (starting from 0) of the element in the resulting split list that should be stored as the result.

Keep old column

Combobox option

Decide whether the old column is kept or dropped.

New column

String entry

A name for the new Dataframe column

New variable

String entry

A name for the new Dataframe variable

For example, suppose a column holds dates in the dd/mm/yyyy format. Splitting on ‘/’ with select index 0 then gives the dd value in a column named {new_column}.
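The date example can be reproduced in pandas roughly as follows. This is a sketch under the assumption that the node splits on a literal string; the column names `date` and `day` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with dates stored as dd/mm/yyyy strings.
df = pd.DataFrame({"date": ["24/12/2021", "01/02/2022"]})

split_on = "/"       # "Split on"
select_index = 0     # "Select index", counted from 0
new_column = "day"   # "New column"

# Split every string into a list and keep the element at the chosen position.
parts = df["date"].str.split(split_on)
df_split = df.assign(**{new_column: parts.str[select_index]})

# "Keep old column" switched off: drop the original column afterwards.
df_split_only = df_split.drop(columns=["date"])

print(df_split)
```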

cleaning_icon Sort Data

Description

Sorts the data by the selected columns in either ascending or descending order.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle

Sort column (2x)

Comboentry

Names of the columns to sort by. For example, to sort a dataframe containing the columns Name, Surname, Age and Salary in ascending order by Age and Salary, enter: Age, Salary

Ascending (2x)

Checkbox

Tick the checkbox for ascending sort.

New variable

String entry

A name for the new Dataframe variable
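A pandas sketch of a two-column sort, assuming each sort column is paired with its own Ascending checkbox. The data is invented for the example.

```python
import pandas as pd

# Hypothetical dataframe using the column names from the example above.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [30, 25, 30],
    "Salary": [5000, 4000, 3000],
})

# Both checkboxes ticked: ascending by Age, ties broken by ascending Salary.
df_sorted = df.sort_values(by=["Age", "Salary"], ascending=[True, True])

# First checkbox unticked: descending by Age, ascending by Salary within ties.
df_mixed = df.sort_values(by=["Age", "Salary"], ascending=[False, True])

print(df_sorted["Name"].tolist(), df_mixed["Name"].tolist())
```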

cleaning_icon Detect Or Remove Outliers

Description

Detects or removes a given number or percentage of outliers in the given column(s).

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Outlier mode

Combobox option

Decide whether to detect or remove outliers.

Columns

Combobox option

Choose in which columns the outliers should be detected.

Ratio as outliers

Float entry

Choose percentage of outliers to be found. Choose either ratio or top n.

Top N as outliers

Integer entry

Choose number of outliers to be found. Choose either ratio or top n.

New variable

String entry

A name for the new Dataframe variable.

cleaning_icon Remove Duplicates

Description

Removes duplicate values in the selected columns, keeping the first occurrence, the last occurrence or none.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Keep value

Combobox option

Decide how duplicates are handled: keep only the first, only the last, or none of the duplicates.

Considered Columns

Comboentry

Choose in which columns the duplicates should be detected.

New variable

String entry

A name for the new Dataframe variable.
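The three "Keep value" options correspond closely to the `keep` argument of pandas `drop_duplicates`. The sketch below assumes the node works the same way; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with a repeated e-mail address.
df = pd.DataFrame({
    "email": ["a@x.org", "a@x.org", "b@x.org"],
    "visits": [1, 2, 3],
})

considered_columns = ["email"]  # "Considered Columns"

# Keep the first occurrence, the last occurrence, or none of the duplicated rows.
keep_first = df.drop_duplicates(subset=considered_columns, keep="first")
keep_last = df.drop_duplicates(subset=considered_columns, keep="last")
keep_none = df.drop_duplicates(subset=considered_columns, keep=False)

print(keep_first["visits"].tolist(), keep_last["visits"].tolist(), keep_none["visits"].tolist())
```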

cleaning_icon Remove Empty Rows

Description

Removes empty rows. If some ID columns are filled, but other columns are empty, the filled columns can be ignored.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Mode

Combobox option

Decide whether to detect or remove empty rows.

ID Columns

Comboentry

Choose the ID columns, i.e. columns that may be filled in but are ignored when deciding whether a row is empty.

New variable

String entry

A name for the new Dataframe variable.
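A sketch of the idea in pandas, assuming a row counts as empty when every column other than the ID columns is missing. The column names and data are made up, and the node's own definition of "empty" may differ.

```python
import pandas as pd

# Hypothetical dataframe: row with id 2 has an ID but no other values.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "height": [1.8, None, None],
    "city": ["Brno", None, "Prague"],
})

id_columns = ["id"]  # "ID Columns": filled, but ignored by the emptiness check
value_columns = [c for c in df.columns if c not in id_columns]

# Detect mode: flag rows in which all non-ID columns are missing.
is_empty = df[value_columns].isna().all(axis=1)

# Remove mode: keep only the rows that are not flagged.
df_clean = df[~is_empty]

print(is_empty.tolist(), df_clean["id"].tolist())
```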

cleaning_icon Find Difference in Data

Description

Compare two dataframes with similar columns and return their difference.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

First dataframe variable rectangle.

Subtract Dataframe

Dataframe entry

Variable rectangle holding the dataframe to be subtracted.

New variable

String entry

A name for the new Dataframe variable.
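One possible reading of "difference" is the set of rows present in the first dataframe but not in the subtracted one. Under that assumption, which is not confirmed by this documentation, a pandas sketch could look like this. All names are illustrative.

```python
import pandas as pd

# Hypothetical dataframes with the same columns.
df_first = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df_subtract = pd.DataFrame({"id": [2], "name": ["b"]})

# Left-merge on all shared columns and record where each row was found.
merged = df_first.merge(df_subtract.drop_duplicates(), how="left", indicator=True)

# Rows marked "left_only" exist only in the first dataframe.
df_diff = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(df_diff)
```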

cleaning_icon Column Wise Shift

Description

Inspects two dataframe columns and shifts their values so that corresponding values (e.g. name and domain) are located on the same row. Inserts an empty cell or deletes a filled cell where necessary.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

First dataframe variable rectangle.

Mode

Combobox option

Choose whether rows that do not match should be removed, or kept with the other column shifted and filled with NaN.

Reference column

Combobox option

Choose the first column for comparison

Incomplete column

Combobox option

Choose the second column for comparison

New variable

String entry

A name for the new Dataframe variable.

cleaning_icon KNN Imputation

Description

Apply numeric KNN Imputation on selected column.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Column

Comboentry

Choose in which column values should be imputed.

New variable

String entry

A name for the new Dataframe variable.

cleaning_icon Imputation

Description

Apply numeric or categorical imputation on selected column.

Parameters

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Imputed Column

Comboentry

Choose in which columns values should be imputed.

Impute choice

Comboentry

Choose how the values should be generated. Either choose a function (or zero) from the combobox, or type a value into the entry.

New variable

String entry

A name for the new Dataframe variable.
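The impute choices can be illustrated with pandas `fillna`. The sketch assumes that the functions offered include a mean for numeric columns and a mode (most frequent value) for categorical ones; the actual list in the combobox is not given here. Data and names are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with one missing value per column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Brno", None, "Brno"],
})

df_imputed = df.copy()

# Numeric imputation with a function: fill gaps with the column mean.
df_imputed["age"] = df["age"].fillna(df["age"].mean())

# Categorical imputation: fill gaps with the most frequent value.
df_imputed["city"] = df["city"].fillna(df["city"].mode().iloc[0])

# Imputation with zero or a typed-in value works the same way.
age_zero = df["age"].fillna(0)

print(df_imputed)
```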

cleaning_icon Concatenate

Description

Concatenates values in the two selected dataframes into a new one.

Parameters

Common parameter

Parameter

Type

Description

Dataframe (2x)

Dataframe entry

Dataframe variable rectangle.

Append

Combobox option

Choose whether rows or columns should be appended

Join

Combobox option

Choose whether join executed should be inner or outer

New variable

String entry

A name for the new Dataframe variable.
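The Append and Join options resemble the `axis` and `join` arguments of pandas `concat`. The sketch below assumes that correspondence; the dataframes and variable names are invented.

```python
import pandas as pd

# Hypothetical dataframes that share only the column "y".
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df_b = pd.DataFrame({"y": [5, 6], "z": [7, 8]})

# Append rows, inner join: only columns present in both dataframes survive.
rows_inner = pd.concat([df_a, df_b], axis=0, join="inner", ignore_index=True)

# Append rows, outer join: all columns are kept, gaps are filled with NaN.
rows_outer = pd.concat([df_a, df_b], axis=0, join="outer", ignore_index=True)

# Append columns: rows are aligned on the index, so "y" appears twice here.
cols = pd.concat([df_a, df_b], axis=1)

print(rows_inner.shape, rows_outer.shape, cols.shape)
```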

cleaning_icon Join Dataframes

Description

Joins two dataframes on one or more columns.

Parameters

Common parameter

Parameter

Type

Description

Dataframe (2x)

Dataframe entry

Dataframe variable rectangle.

On columns

Combobox option

Choose columns on which join should be executed

How

Combobox option

Choose join mode - left, right, outer, inner, cross

New variable

String entry

A name for the new Dataframe variable.
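The join modes listed above match the `how` options of pandas `merge`, so a sketch under the assumption that the node behaves like `merge` could be the following. The tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataframes sharing the key column "dept_id".
employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [10, 20, 30]})
departments = pd.DataFrame({"dept_id": [10, 20], "dept": ["Sales", "IT"]})

on_columns = ["dept_id"]  # "On columns"

# Left join: every employee is kept; unmatched departments become NaN.
df_left = employees.merge(departments, on=on_columns, how="left")

# Inner join: only employees with a matching department are kept.
df_inner = employees.merge(departments, on=on_columns, how="inner")

print(len(df_left), len(df_inner))
```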

cleaning_icon Apply Mapping

cleaning_icon Aggregate Groups

Description

Optionally groups the data and executes numerical and/or categorical aggregations on the given column(s).

Parameters

Common parameter

Parameter

Type

Description

Dataframe

Dataframe entry

Dataframe variable rectangle.

Columns to Group by

Combobox option

Choose columns on which group by should be executed

Columns to Aggregate

Combobox option

Choose columns on which the aggregations should be executed

Numeric aggregations

Combobox option

Choose numerical aggregations that should be executed on numerical columns - sum, mean, median, max, min, count, mode

Categorical aggregations

Combobox option

Choose categorical aggregations that should be executed on categorical columns - mode

New variable

String entry

A name for the new Dataframe variable.
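Grouping with mixed aggregations can be sketched with pandas `groupby` and named aggregation. This assumes the node produces one result column per column and aggregation pair; the output column names below are invented, as is the data.

```python
import pandas as pd

# Hypothetical dataframe with one numeric and one categorical column.
df = pd.DataFrame({
    "dept": ["A", "A", "B"],
    "salary": [10, 20, 30],
    "city": ["Brno", "Brno", "Prague"],
})

# Group by "dept", then apply numeric aggregations to "salary"
# and the categorical aggregation "mode" to "city".
df_agg = df.groupby("dept", as_index=False).agg(
    salary_sum=("salary", "sum"),
    salary_mean=("salary", "mean"),
    city_mode=("city", lambda s: s.mode().iloc[0]),
)

print(df_agg)
```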