eazieda package

Submodules

eazieda.corr_plot module

eazieda.corr_plot.corr_plot(data, features=None, method='pearson', plot_width=500, plot_height=400)[source]

Generates a correlation plot for a list of features in a given dataframe

Parameters
  • data (pandas.core.frame.DataFrame) – The input dataframe

  • features (list, default = None) – A list of at least two strings naming the features to plot (len(features) >= 2). If None, all numeric features are plotted.

  • method (str, default = "pearson") – The correlation method. One of "pearson", "spearman" or "kendall"

  • plot_width (int, default = 500) – The width of the plot

  • plot_height (int, default = 400) – The height of the plot

Returns

An interactive altair correlation plot

Return type

altair plot

Examples

>>> from eazieda.corr_plot import corr_plot
>>> from vega_datasets import data
>>> df = data.iris()
>>> corr_plot(df, ["petalLength", "petalWidth", "sepalLength"],
...           "pearson")
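The numbers behind a correlation plot like this can be sketched with pandas alone. The helper below is hypothetical and is not taken from eazieda; it assumes the plot is driven by a long-form table of pairwise correlations, which is the shape a heatmap encoding usually expects.

```python
import pandas as pd


def correlation_long(df, features=None, method="pearson"):
    """Hypothetical helper: pairwise correlations in long (tidy) form."""
    # Keep numeric columns only, then narrow to the requested features.
    numeric = df.select_dtypes(include="number")
    if features is not None:
        numeric = numeric[features]
    # Square correlation matrix; method may be "pearson", "spearman" or "kendall".
    corr = numeric.corr(method=method)
    # One row per (feature_x, feature_y) pair.
    return (
        corr.rename_axis("feature_x")
        .reset_index()
        .melt(id_vars="feature_x", var_name="feature_y", value_name="correlation")
    )


toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "y": [2, 4, 6, 8],
        "z": [4, 3, 2, 1],
        "label": ["a", "b", "c", "d"],  # non-numeric, dropped by the helper
    }
)
long_corr = correlation_long(toy)
print(long_corr)
```

Each row of the result holds one cell of the correlation matrix, so a 3-feature input yields 9 rows.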

eazieda.histograms module

eazieda.histograms.histograms(data, features, plot_width=100, plot_height=100, num_cols=2)[source]

Generates histograms for numeric features and bar plots for categorical features

Parameters
  • data (pandas.core.frame.DataFrame) – A Pandas Dataframe

  • features (list) – A list of strings that represents feature names

  • plot_width (int) – The width of each feature's subplot. Default = 100

  • plot_height (int) – The height of each feature's subplot. Default = 100

  • num_cols (int) – The number of columns in the final grid of plots. Default = 2

Returns

A combined altair plot of histograms and bar plots

Return type

altair plot

Examples

>>> from eazieda.histograms import histograms
>>> from vega_datasets import data
>>> df = data.iris()
>>> histograms(df, ['petalLength', 'petalWidth', 'sepalLength'],
...            num_cols=2)

eazieda.missing_detect module

eazieda.missing_detect.missing_detect(data)[source]

Return the number/percentage of missing values for each column in the dataframe

Parameters

data (pandas.core.frame.DataFrame) – A Pandas Dataframe for which the missing values need to be detected

Returns

A dataframe containing two columns: the number of missing values and the percentage of missing values for each column

Return type

pandas.core.frame.DataFrame

Examples

>>> from eazieda.missing_detect import missing_detect
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
...                   columns=['a', 'b'])
>>> missing_detect(df)
    n_missing       percent
a   1               25%
b   1               25%
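For readers who want to see what such a summary involves, here is a minimal sketch built on pandas' isna(). The helper name missing_summary is hypothetical and this is not eazieda's implementation; it returns the percentage as a number rather than a formatted string.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    is_missing = df.isna()
    return pd.DataFrame(
        {
            "n_missing": is_missing.sum(),         # number of missing cells per column
            "percent": is_missing.mean() * 100.0,  # share of missing cells, in percent
        }
    )


df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
                  columns=["a", "b"])
summary = missing_summary(df)
print(summary)
```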

eazieda.missing_impute module

eazieda.missing_impute.missing_impute(data, method_num='mean', method_non_num='most_frequent')[source]

Return the imputed version of data based on the methods selected

Parameters
  • data (pandas.core.frame.DataFrame) – A Pandas Dataframe for which the missing values need to be imputed

  • method_num (str, default = "mean") – The method used for imputing numerical missing values. One of ‘drop’, ‘mean’, ‘median’

  • method_non_num (str, default = "most_frequent") – The method used for imputing non-numerical missing values. One of ‘drop’, ‘most_frequent’

Returns

An imputed dataframe

Return type

pandas.core.frame.DataFrame

Examples

>>> from eazieda.missing_impute import missing_impute
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
...                   columns=['a', 'b'])
>>> missing_impute(df)
    a       b
0   1       x
1   2       y
2   2       y
3   3       y
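The default behaviour described above (mean for numeric columns, most frequent value for the rest) can be approximated in plain pandas. This is a sketch under stated assumptions, not eazieda's code: impute_simple is a hypothetical name, the ‘drop’ and ‘median’ options are not covered, and every column is assumed to hold at least one non-missing value.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_simple(df):
    """Hypothetical helper: mean for numeric columns, mode for the others."""
    out = df.copy()  # leave the caller's dataframe untouched
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            fill_value = out[col].mean()          # mean skips NaN by default
        else:
            fill_value = out[col].mode().iloc[0]  # most frequent non-missing value
        out[col] = out[col].fillna(fill_value)
    return out


df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
                  columns=["a", "b"])
imputed = impute_simple(df)
print(imputed)
```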

eazieda.outliers_detect module

eazieda.outliers_detect.outliers_detect(s, method='zscore')[source]

Detects outliers in a pandas series

Parameters
  • s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found

  • method (str, default = "zscore") – The algorithm/method used for outlier detection. One of ‘zscore’, ‘iforest’, ‘iqr’

Returns

Boolean array with the same length as the input; outliers are marked True.

Return type

numpy.array

Examples

>>> from eazieda.outliers_detect import outliers_detect
>>> import pandas as pd
>>> s = pd.Series([1,1,1,1,1,1,1,1,1,1,1e14])
>>> outliers_detect(s)
array([False, False, False, False, False, False, False, False, False,
       False,  True])

eazieda.outliers_detect.outliers_detect_iforest(s)[source]

Detects outliers in a pandas series using isolation forests

Parameters

s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found

Returns

Boolean array with the same length as the input; outliers are marked True.

Return type

numpy.array

Examples

>>> from eazieda.outliers_detect import outliers_detect_iforest
>>> import pandas as pd
>>> s = pd.Series([1,2,1,2,1, 1000])
>>> outliers_detect_iforest(s)
array([False, False, False, False, False,  True])

eazieda.outliers_detect.outliers_detect_iqr(s, factor=1.5)[source]

Detects outliers in a pandas series using the interquartile range (IQR)
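As an illustration of the usual IQR rule, the sketch below flags values that fall more than factor × IQR outside the first or third quartile. The function name iqr_outlier_mask is hypothetical, and the exact quantile interpolation used by eazieda is an assumption here (pandas' default linear interpolation).

```python
import pandas as pd


def iqr_outlier_mask(s, factor=1.5):
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr  # lower fence
    upper = q3 + factor * iqr  # upper fence
    return ((s < lower) | (s > upper)).to_numpy()


s = pd.Series([1, 2, 1, 2, 1, 1000])
mask = iqr_outlier_mask(s)
print(mask)
```

For this series the quartiles are 1 and 2, so the fences are -0.5 and 3.5 and only 1000 is flagged.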

Parameters
  • s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found

  • factor (float, default = 1.5) – Multiple of the IQR beyond the quartiles at which a value is treated as an outlier

Returns

Boolean array with the same length as the input; outliers are marked True.

Return type

numpy.array

Examples

>>> from eazieda.outliers_detect import outliers_detect_iqr
>>> import pandas as pd
>>> s = pd.Series([1,2,1,2,1, 1000])
>>> outliers_detect_iqr(s)
array([False, False, False, False, False,  True])

eazieda.outliers_detect.outliers_detect_zscore(s, threshold=3)[source]

Detects outliers in a pandas series using z-scores
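A z-score rule of this kind can be sketched as follows. The function name zscore_outlier_mask is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption; eazieda may compute the score differently.

```python
import pandas as pd


def zscore_outlier_mask(s, threshold=3):
    """Hypothetical helper: True where |z-score| exceeds the threshold."""
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    z = (s - s.mean()) / s.std(ddof=0)
    return (z.abs() > threshold).to_numpy()


s = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1e14])
mask = zscore_outlier_mask(s)
print(mask)
```

With ten identical values and one extreme value, the extreme point has a z-score of about 3.16, which is just above the default threshold of 3.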

Parameters
  • s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found

  • threshold (int, default = 3) – Absolute z-score above which a value is treated as an outlier

Returns

Boolean array with the same length as the input; outliers are marked True.

Return type

numpy.array

Examples

>>> from eazieda.outliers_detect import outliers_detect_zscore
>>> import pandas as pd
>>> s = pd.Series([1,1,1,1,1,1,1,1,1,1,1e14])
>>> outliers_detect_zscore(s)
array([False, False, False, False, False, False, False, False, False,
       False,  True])

eazieda.outliers_detect.remove_outliers(s, outliers, inplace=False)[source]

Drops outliers from the given series

Parameters
  • s (pandas.core.series.Series) – Pandas Series from which the outliers are removed

  • outliers (numpy.array) – Boolean numpy array with the same length as s. Outliers should be marked with True.

  • inplace (bool, default = False) – Whether to remove the outliers in place

Returns

Series with outliers removed, or None if inplace=True.

Return type

None or pd.Series

Examples

>>> from eazieda.outliers_detect import remove_outliers
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 1e14])
>>> outliers = np.array([False, True])
>>> remove_outliers(s, outliers)
0    1.0
dtype: float64
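The non-inplace case corresponds to ordinary boolean-mask indexing in pandas. The snippet below is a sketch of that idea, not eazieda's implementation, and assumes the mask is a boolean NumPy array aligned position by position with the series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1e14])
outliers = np.array([False, True])

# Keep only the entries whose mask value is False; the original series is unchanged.
kept = s[~outliers]
print(kept)
```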

Module contents