Outlier removal
- cleansing.outlier_removal.run(data, ref, max_std_dist=2.0, min_samp_cnt=5, axis=0)
Iteratively removes every sample that lies more than max_std_dist standard deviations from the mean. Stops when all remaining samples are within the threshold or when the minimal sample count defined by min_samp_cnt is reached.
- Parameters:
data (np.ndarray, shape(samples, )) – Data to be cleaned.
ref (np.ndarray, shape(samples, )) – Reference data used to decide which samples of data are outliers. In most cases equal to data.
max_std_dist (float) – Threshold for outlier detection. Number of standard deviations permissible.
min_samp_cnt (int) – Minimal viable sample count. Terminates if the sample count falls below this number or if the current iteration would reduce it below this number.
axis (int) – Axis on which to evaluate the data object.
- Returns:
Filtered array from which all samples farther than max_std_dist standard deviations from the mean have been removed.
- Return type:
np.ndarray(samples - filtered_samples,)
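The termination behaviour described above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function name remove_outliers_sketch is hypothetical, the axis parameter is omitted (1-D input only), and the use of the population standard deviation and of the "would reduce below" stopping rule are assumptions based on the parameter descriptions.

```python
import numpy as np

def remove_outliers_sketch(data, ref, max_std_dist=2.0, min_samp_cnt=5):
    """Hypothetical illustration of iterative outlier removal (1-D only)."""
    data = np.asarray(data, dtype=float)
    ref = np.asarray(ref, dtype=float)
    while ref.size > 0:
        # Samples of ref within the permitted distance from the current mean.
        keep = np.abs(ref - ref.mean()) <= max_std_dist * ref.std()
        # Stop once every remaining sample is within the threshold.
        if keep.all():
            break
        # Stop if this iteration would reduce the sample count below the minimum.
        if keep.sum() < min_samp_cnt:
            break
        # Drop the flagged samples from both arrays and re-evaluate.
        data = data[keep]
        ref = ref[keep]
    return data

samples = np.array([0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 50.0])
cleaned = remove_outliers_sketch(samples, samples, max_std_dist=2.0, min_samp_cnt=5)
print(cleaned.size)
```

Because the mean and standard deviation are recomputed after each removal, a single extreme value cannot mask milder outliers in later iterations. Passing a different array as ref would filter data by the outliers found in ref, assuming both arrays have the same length.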
The following code example shows how to apply statistical outlier removal.
import numpy as np
import random
import matplotlib
matplotlib.use("Qt5agg")
import matplotlib.pyplot as plt
import finn.cleansing.outlier_removal as om

def main():
    #Configure sample data
    channel_count = 32
    data_range = 100

    #Configure noise
    noise_count = int(data_range * 0.05)

    #Generate sample data
    raw_data = [None for _ in range(channel_count)]
    for ch_idx in range(channel_count):
        raw_data[ch_idx] = np.random.normal(0, 2, data_range)
        #Overwrite a few randomly chosen samples with noise values
        for noise_idx in [random.randint(0, data_range - 1) for _ in range(noise_count)]:
            raw_data[ch_idx][noise_idx] = np.random.randint(1, 10)

    #Filter data
    filtered_data_2 = [None for _ in range(channel_count)]
    for ch_idx in range(channel_count):
        filtered_data_2[ch_idx] = om.run(raw_data[ch_idx], raw_data[ch_idx], max_std_dist = 2, min_samp_cnt = 0)
    filtered_data_3 = [None for _ in range(channel_count)]
    for ch_idx in range(channel_count):
        filtered_data_3[ch_idx] = om.run(raw_data[ch_idx], raw_data[ch_idx], max_std_dist = 3, min_samp_cnt = 0)

    #Visualize results
    plot_channel_idx = 0
    plt.scatter(raw_data[plot_channel_idx], raw_data[plot_channel_idx], color = "blue", label = "original samples", s = 150)
    plt.scatter(filtered_data_3[plot_channel_idx], filtered_data_3[plot_channel_idx], color = "red", label = "3 standard deviations", s = 50)
    plt.scatter(filtered_data_2[plot_channel_idx], filtered_data_2[plot_channel_idx], color = "black", label = "2 standard deviations", s = 10)
    plt.legend()
    plt.show(block = True)

main()
Outliers more than three (or two) standard deviations from the mean may be isolated and removed from the data set.