eazieda package
Submodules
eazieda.corr_plot module
-
eazieda.corr_plot.corr_plot(data, features=None, method='pearson', plot_width=500, plot_height=400)
Generates a correlation plot for a list of features in a given dataframe.
- Parameters
data (pandas.core.frame.DataFrame) – The input dataframe
features (list, default = None) – A list of strings naming the features to plot. It must contain at least two names. If None, all numeric features are plotted.
method (str, default = "pearson") – The correlation method. One of "pearson", "spearman", or "kendall".
plot_width (int, default = 500) – The width of the plot
plot_height (int, default = 400) – The height of the plot
- Returns
An interactive altair correlation plot
- Return type
altair plot
Examples
>>> from eazieda.corr_plot import corr_plot
>>> from vega_datasets import data
>>> df = data.iris()
>>> corr_plot(df, ["petalLength", "petalWidth", "sepalLength"],
...           "pearson")
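The correlation values behind such a plot can be computed with pandas alone. The sketch below is an assumption about the general approach, not the package's actual implementation: `corr_long_form` is a hypothetical helper that returns the pairwise correlations in long (tidy) form, the shape that altair encodings usually consume.

```python
import pandas as pd


def corr_long_form(data, features=None, method="pearson"):
    """Hypothetical helper: pairwise correlations in long (tidy) form."""
    if features is None:
        # Mirror features=None by falling back to every numeric column
        features = data.select_dtypes(include="number").columns.tolist()
    if len(features) < 2:
        raise ValueError("at least two features are required")

    # Square correlation matrix; method is "pearson", "spearman" or "kendall"
    corr = data[features].corr(method=method)

    # Reshape to one row per (feature_1, feature_2) pair
    return (
        corr.rename_axis("feature_1")
        .reset_index()
        .melt(id_vars="feature_1", var_name="feature_2", value_name="correlation")
    )


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
pairs = corr_long_form(df, ["x", "y", "z"])
```

Each row of `pairs` could then drive one cell of a heatmap, with `feature_1` and `feature_2` on the axes and `correlation` as the color.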
eazieda.histograms module
-
eazieda.histograms.histograms(data, features, plot_width=100, plot_height=100, num_cols=2)
Generates histograms for numeric features and bar plots for categorical features.
- Parameters
data (pandas.core.frame.DataFrame) – A Pandas Dataframe
features (list) – A list of strings that represents feature names
plot_width (int) – The width of each feature's subplot. Default = 100
plot_height (int) – The height of each feature's subplot. Default = 100
num_cols (int) – The number of columns in the final grid of plots. Default = 2
- Returns
A combined altair plot with one histogram or bar plot per feature, arranged in a grid
- Return type
altair plot
Examples
>>> from eazieda.histograms import histograms
>>> from vega_datasets import data
>>> df = data.iris()
>>> histograms(df, ['petalLength', 'petalWidth', 'sepalLength'],
...            num_cols=2)
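The two decisions described above, which chart type each feature gets and how the subplots fill a grid of `num_cols` columns, can be sketched without any plotting code. `plan_grid` is a hypothetical helper written for illustration; the package may organize this differently.

```python
import pandas as pd


def plan_grid(data, features, num_cols=2):
    """Hypothetical helper: choose a chart kind per feature and lay out rows."""
    kinds = [
        # Numeric columns get a histogram, everything else a bar plot
        (name, "histogram" if pd.api.types.is_numeric_dtype(data[name]) else "bar")
        for name in features
    ]
    # Chunk the (feature, kind) pairs into rows of at most num_cols entries
    return [kinds[i:i + num_cols] for i in range(0, len(kinds), num_cols)]


df = pd.DataFrame({"a": [1.0, 2.5, 3.0], "b": ["x", "y", "x"], "c": [1, 2, 3]})
grid = plan_grid(df, ["a", "b", "c"], num_cols=2)
```

With three features and `num_cols=2`, the last row holds a single subplot.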
eazieda.missing_detect module
-
eazieda.missing_detect.missing_detect(data)
Returns the number and percentage of missing values for each column in the dataframe.
- Parameters
data (pandas.core.frame.DataFrame) – A Pandas Dataframe for which the missing values need to be detected
- Returns
A dataframe containing two columns: the number of missing values and the percentage of missing values for each column
- Return type
pandas.core.frame.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from eazieda.missing_detect import missing_detect
>>> df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
...                   columns=['a', 'b'])
>>> missing_detect(df)
   n_missing percent
a          1     25%
b          1     25%
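A per-column summary of this kind can be built from `DataFrame.isna()`. The following is a minimal sketch, not the package's implementation: `missing_summary` is a hypothetical name, and it keeps the percentage as a number rather than a formatted string such as `25%`.

```python
import numpy as np
import pandas as pd


def missing_summary(data):
    """Hypothetical helper: count and share of missing values per column."""
    n_missing = data.isna().sum()            # missing cells per column
    percent = n_missing / len(data) * 100    # share of rows, in percent
    return pd.DataFrame({"n_missing": n_missing, "percent": percent})


df = pd.DataFrame(
    [[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
    columns=["a", "b"],
)
summary = missing_summary(df)
```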
eazieda.missing_impute module
-
eazieda.missing_impute.missing_impute(data, method_num='mean', method_non_num='most_frequent')
Returns an imputed version of the data, based on the selected methods.
- Parameters
data (pandas.core.frame.DataFrame) – A Pandas Dataframe whose missing values need to be imputed
method_num (str, default = "mean") – The method used for imputing numerical missing values. One of ‘drop’, ‘mean’, ‘median’
method_non_num (str, default = "most_frequent") – The method used for imputing non-numerical missing values. One of ‘drop’, ‘most_frequent’
- Returns
An imputed dataframe
- Return type
pandas.core.frame.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from eazieda.missing_impute import missing_impute
>>> df = pd.DataFrame([[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
...                   columns=['a', 'b'])
>>> missing_impute(df)
   a  b
0  1  x
1  2  y
2  2  y
3  3  y
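The methods listed above map onto a few pandas calls. The sketch below shows one plausible reading of them; `impute_sketch` is a hypothetical helper, and details such as how ties in the most frequent value are broken are assumptions rather than documented behavior.

```python
import numpy as np
import pandas as pd


def impute_sketch(data, method_num="mean", method_non_num="most_frequent"):
    """Hypothetical helper: impute numeric and non-numeric columns separately."""
    num_cols = list(data.select_dtypes(include="number").columns)
    other_cols = [c for c in data.columns if c not in num_cols]

    fill_values = {}
    if method_num in ("mean", "median"):
        for col in num_cols:
            # Series.mean() / Series.median() skip NaN by default
            fill_values[col] = getattr(data[col], method_num)()
    elif method_num != "drop":
        raise ValueError(f"unknown method_num: {method_num!r}")

    if method_non_num == "most_frequent":
        for col in other_cols:
            modes = data[col].mode()  # NaN is ignored; ties sort ascending
            if not modes.empty:
                fill_values[col] = modes.iloc[0]
    elif method_non_num != "drop":
        raise ValueError(f"unknown method_non_num: {method_non_num!r}")

    out = data.fillna(fill_values) if fill_values else data.copy()

    # Columns handled by "drop" still contain NaN: remove those rows
    drop_cols = (num_cols if method_num == "drop" else []) + (
        other_cols if method_non_num == "drop" else []
    )
    if drop_cols:
        out = out.dropna(subset=drop_cols)
    return out


df = pd.DataFrame(
    [[1, "x"], [np.nan, "y"], [2, np.nan], [3, "y"]],
    columns=["a", "b"],
)
imputed = impute_sketch(df)
dropped = impute_sketch(df, method_num="drop", method_non_num="drop")
```

With the defaults, the missing value in `a` becomes the column mean and the missing value in `b` becomes the most frequent label; with `drop` for both, only the complete rows remain.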
eazieda.outliers_detect module
-
eazieda.outliers_detect.outliers_detect(s, method='zscore')
Detects outliers in a pandas series.
- Parameters
s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found
method (str, default = "zscore") – The algorithm/method used for outlier detection. One of ‘zscore’, ‘iforest’, ‘iqr’
- Returns
Boolean array with the same length as the input; outliers are marked True.
- Return type
numpy.ndarray
Examples
>>> import pandas as pd
>>> from eazieda.outliers_detect import outliers_detect
>>> s = pd.Series([1,1,1,1,1,1,1,1,1,1,1e14])
>>> outliers_detect(s)
array([False, False, False, False, False, False, False, False, False,
       False,  True])
-
eazieda.outliers_detect.outliers_detect_iforest(s)
Detects outliers in a pandas series using isolation forests.
- Parameters
s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found
- Returns
Boolean array with the same length as the input; outliers are marked True.
- Return type
numpy.ndarray
Examples
>>> import pandas as pd
>>> from eazieda.outliers_detect import outliers_detect_iforest
>>> s = pd.Series([1,2,1,2,1, 1000])
>>> outliers_detect_iforest(s)
array([False, False, False, False, False,  True])
-
eazieda.outliers_detect.outliers_detect_iqr(s, factor=1.5)
Detects outliers in a pandas series using the interquartile range (IQR).
- Parameters
s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found
factor (float, default = 1.5) – Multiple of the IQR beyond the quartiles at which a value is treated as an outlier
- Returns
Boolean array with the same length as the input; outliers are marked True.
- Return type
numpy.ndarray
Examples
>>> import pandas as pd
>>> from eazieda.outliers_detect import outliers_detect_iqr
>>> s = pd.Series([1,2,1,2,1, 1000])
>>> outliers_detect_iqr(s)
array([False, False, False, False, False,  True])
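The usual IQR rule flags values that fall more than `factor` times the IQR below the first quartile or above the third. The sketch below assumes that rule and pandas' default linear interpolation for quantiles; `iqr_outliers` is a hypothetical name, and the package's own function may compute quartiles differently.

```python
import pandas as pd


def iqr_outliers(s, factor=1.5):
    """Hypothetical helper: flag values outside the IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    # True where the value lies outside [lower, upper]
    return ((s < lower) | (s > upper)).to_numpy()


s = pd.Series([1, 2, 1, 2, 1, 1000])
mask = iqr_outliers(s)
```

For this series the quartiles are 1 and 2, so the fences sit at -0.5 and 3.5 and only the value 1000 is flagged.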
-
eazieda.outliers_detect.outliers_detect_zscore(s, threshold=3)
Detects outliers in a pandas series using z-scores.
- Parameters
s (pandas.core.series.Series) – Pandas Series for which the outliers need to be found
threshold (int, default = 3) – Absolute z-score above which a value is treated as an outlier
- Returns
Boolean array with the same length as the input; outliers are marked True.
- Return type
numpy.ndarray
Examples
>>> import pandas as pd
>>> from eazieda.outliers_detect import outliers_detect_zscore
>>> s = pd.Series([1,1,1,1,1,1,1,1,1,1,1e14])
>>> outliers_detect_zscore(s)
array([False, False, False, False, False, False, False, False, False,
       False,  True])
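A z-score check standardizes each value by the series mean and standard deviation, then compares the absolute result with the threshold. The sketch below assumes the population standard deviation (`ddof=0`); `zscore_outliers` is a hypothetical name and the package may use a different estimator.

```python
import pandas as pd


def zscore_outliers(s, threshold=3):
    """Hypothetical helper: flag values whose |z-score| exceeds threshold."""
    # Assumption: population standard deviation (ddof=0)
    z = (s - s.mean()) / s.std(ddof=0)
    # A constant series gives NaN z-scores, which compare as False
    return (z.abs() > threshold).to_numpy()


s = pd.Series([1] * 10 + [1e14])
mask = zscore_outliers(s)
```

With ten ordinary values and one extreme value, the extreme value's z-score is about 3.16, just above the default threshold. A single outlier among fewer points cannot reach a z-score of 3, which is why the example needs eleven elements.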
-
eazieda.outliers_detect.remove_outliers(s, outliers, inplace=False)
Drops outliers from the given series.
- Parameters
s (pandas.core.series.Series) – Pandas Series from which the outliers are to be removed
outliers (numpy.ndarray) – Boolean numpy array with the same length as s. Outliers should be marked with True.
inplace (bool, default = False) – Whether to remove the outliers in place
- Returns
series with outliers removed. None if inplace=True.
- Return type
None or pd.Series
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from eazieda.outliers_detect import remove_outliers
>>> s = pd.Series([1,1e14])
>>> outliers = np.array([False, True])
>>> remove_outliers(s, outliers)
0    1.0
dtype: float64
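The difference between the two return modes can be illustrated with a boolean mask. This is a sketch under assumptions, not the package's code: `drop_flagged` is a hypothetical helper, and its in-place branch drops by index label, so it expects unique labels.

```python
import numpy as np
import pandas as pd


def drop_flagged(s, outliers, inplace=False):
    """Hypothetical helper: remove the entries flagged True in `outliers`."""
    flagged = np.asarray(outliers, dtype=bool)
    if inplace:
        # Mutates s; assumes index labels are unique
        s.drop(index=s.index[flagged], inplace=True)
        return None
    # Boolean indexing returns a new Series and leaves s untouched
    return s[~flagged]


s = pd.Series([1, 1e14])
flags = np.array([False, True])
cleaned = drop_flagged(s, flags)
```

Here `cleaned` holds only the first value while `s` still has both; calling `drop_flagged(s, flags, inplace=True)` would return None and shorten `s` itself.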